worker-vllm cannot download private model
I built the image successfully, and it was able to download the model during the build. However, when I deploy it on RunPod Serverless, it fails to start up upon request because it cannot download the model.
@Alpay Ariyak any idea?
Try also specifying the model name as an environment variable within the endpoint template
I just followed the guide on the worker-vllm repository
Shouldn't this work?
One of the serverless nodes printed this, which seems correct (model name removed)
But somehow, it is still not able to find the model
Could you share the error message you get, please, @casper_ai?
I think I got it, will push a fix soon
Hi @Casper., just pushed the update to main
Thanks for making the update! I will test later today
Of course! Pushed custom jinja chat templates as well
@Alpay Ariyak I'm still getting the same error, although I see the path has changed in the vLLM config
It seems like the issue is that your model doesn’t have a tokenizer
That can't be right, the tokenizer is there
It says 401 unauthorized
So is the issue that you are not downloading the tokenizer into the directory perhaps?
I thought the idea of downloading the model into the image was to 1) reduce startup time, and 2) have a secure environment with no access to your Huggingface token
GitHub: Download tokenizer upon build by casper-hansen · Pull Request #39 ·...
This downloads the tokenizer when building the worker-vllm image. This has the following benefit: You do not have to send any network request to Huggingface during initialization. This means you d...
Please review @Alpay Ariyak 🙂
Looks good, have you built an image with it and tested it?
I tested that it downloads the tokenizer to the directory specified, but I have not run a deployment yet
I can run a deployment to test it
Any tips on how to push images with models inside to Docker Hub faster? Takes like half an hour even though my internet is speedy
I can test it in a bit, no worries
Unfortunately not
Is there a reason you’re baking the model in vs using the pre-built image? I haven’t seen too much of a difference in load times thus far
It's just the best way currently. You avoid all outgoing traffic this way
I think a better alternative could be to use the snapshot_download functionality from Huggingface Hub instead
That way, you make sure you download everything
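(For reference, a minimal sketch of what a build-time snapshot_download could look like; the env var names and target path below are placeholders, not the actual worker-vllm code.)

```python
import os

from huggingface_hub import snapshot_download

# Download the full repository (weights, tokenizer, configs) at image build time,
# so the running worker never needs to reach Hugging Face or hold a token.
# MODEL_NAME, MODEL_BASE_PATH, and HF_TOKEN are placeholder env var names here.
model_name = os.environ["MODEL_NAME"]
local_dir = os.environ.get("MODEL_BASE_PATH", "/model")

snapshot_download(
    repo_id=model_name,
    local_dir=local_dir,
    token=os.environ.get("HF_TOKEN"),  # only needed at build time for private repos
)
```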
Tbh I don't think this PR will solve the issue because it also needs the config and everything else
Okay, I replaced the current download with snapshot_download. Building and deploying in a moment
Will get back to you, unavailable for the next 2 hours
I tried using the PR and I think it must be something else too
I'm not sure why, but engine.py is not able to find the tokenizer
Yeah, the tokenizer needs to point to the downloaded path, I'll fix this up shortly
Wait I might have just compiled from the wrong branch
I think I compiled this from the main branch instead of my PR
Thanks, that would be great! I'm still testing my PR, as the problem is also that worker-vllm currently does not download the tokenizer or any config files, which will inevitably lead to an error when dealing with private repositories
Of course! Will fix that for sure
Although what would be the issue with just specifying the env variable HF_TOKEN in the endpoint template? It would allow for the tokenizer download, and tokenizers are always tiny, so it should be a quick download
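(For context, the suggestion here amounts to roughly the sketch below: pass HF_TOKEN at runtime so the tokenizer of a private repo can be fetched at startup. MODEL_NAME is a placeholder env var, and the token= argument assumes a recent transformers release; this is also the approach the next message argues against.)

```python
import os

from transformers import AutoTokenizer

# With HF_TOKEN set on the endpoint, the tokenizer of a private repo can be
# fetched at startup; the weights themselves are already baked into the image.
tokenizer = AutoTokenizer.from_pretrained(
    os.environ["MODEL_NAME"],          # placeholder env var for the private repo id
    token=os.environ.get("HF_TOKEN"),  # runtime token, only needed for private repos
)
```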
I do not want to add an HF_TOKEN to a production environment since it grants access to everything meant to be kept private
And it will also help with reducing delay as much as possible
So I think we should keep the model download the same and separately snapshot download the tokenizer
This is because if we snapshot download the entire repo, it will download all formats of the model (e.g. both .bin and .safetensors), which would bloat the image size
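(A sketch of that split, using huggingface_hub's allow_patterns to fetch only the tokenizer and config files at build time; the repo id, target path, and pattern list are illustrative, not the exact ones in the PR.)

```python
from huggingface_hub import snapshot_download

# Grab only the small metadata files; the weights are downloaded separately with
# the existing logic, so we avoid pulling both .bin and .safetensors copies.
snapshot_download(
    repo_id="org/private-model",   # placeholder repo id
    local_dir="/model",            # placeholder target directory
    allow_patterns=["*.json", "*.txt", "*.model", "tokenizer*"],
    token="hf_...",                # build-time token for private repos
)
```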
It's not only the tokenizer that should cause issues, right?
vLLM also needs to load the model config
Otherwise, I do agree that we can try to minimize bloating the image size
Hmm, I didn't notice any issues with the lack of a model config yet, but it could potentially be an issue, as I did not test private models
If anything, we can just have a priority list of formats (plus the ability to specify one directly with a LOAD_FORMAT env var) and download this way, similarly to vLLM, but also allowing JSON files and whatever else is needed
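(One way such a priority list could look, purely as a sketch; the LOAD_FORMAT handling and pattern lists below are assumptions, not what the worker actually does.)

```python
import os

from huggingface_hub import list_repo_files

# Weight-file patterns per load format, in priority order (safetensors first,
# similar to vLLM's own preference). LOAD_FORMAT overrides the automatic choice.
FORMAT_PATTERNS = {
    "safetensors": ["*.safetensors"],
    "bin": ["*.bin"],
    "pt": ["*.pt"],
}
PRIORITY = ["safetensors", "bin", "pt"]
# Small metadata files are always downloaded regardless of weight format.
METADATA_PATTERNS = ["*.json", "*.txt", "*.model", "tokenizer*"]


def pick_allow_patterns(repo_id: str, token: str | None = None) -> list[str]:
    fmt = os.environ.get("LOAD_FORMAT")
    if fmt is None:
        # Pick the first format in the priority list that the repo actually contains.
        files = list_repo_files(repo_id, token=token)
        fmt = next(
            (f for f in PRIORITY if any(name.endswith("." + f) for name in files)),
            PRIORITY[0],
        )
    return FORMAT_PATTERNS[fmt] + METADATA_PATTERNS
```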
I would suggest you just upload opt-125m or something small to test with
I modified the PR to only download tokenizer and config
I updated the PR and testing again now
https://github.com/runpod-workers/worker-vllm/pull/39
GitHub: Download full repository upon build by casper-hansen · Pull Request...
This downloads the full repository when building the worker-vllm image. This has the following benefit: You do not have to send any network request to Huggingface during initialization. This means...
A feature could be ALLOW_PATTERNS that you specify as a comma-separated list if you want to overrule the defaults
Sorry, got really busy, just pushed a commit to the PR, but I won't be able to test until tomorrow, let me know if you have any thoughts
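(To picture the ALLOW_PATTERNS idea suggested above: a minimal sketch where a comma-separated env var overrules the default download patterns. The env var handling and defaults are illustrative, not merged behaviour.)

```python
import os

from huggingface_hub import snapshot_download

# Default to safetensors weights plus the metadata files; a comma-separated
# ALLOW_PATTERNS env var overrules the defaults entirely.
default_patterns = ["*.safetensors", "*.json", "*.txt", "*.model", "tokenizer*"]
env_patterns = os.environ.get("ALLOW_PATTERNS")
allow_patterns = (
    [p.strip() for p in env_patterns.split(",") if p.strip()]
    if env_patterns
    else default_patterns
)

snapshot_download(
    repo_id=os.environ["MODEL_NAME"],  # placeholder env var for the repo id
    local_dir="/model",                # placeholder target directory
    allow_patterns=allow_patterns,
    token=os.environ.get("HF_TOKEN"),  # build-time only
)
```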
Pushed another commit and merged into main
Thanks! Got it working and it runs pretty smoothly now
Happy to hear that!
@Alpay Ariyak so what was the issue?