worker-vllm cannot download private model
I built the image successfully, and it was able to download the model during the build. However, when I deploy it on RunPod Serverless, it fails to start up upon request because it cannot download the model.
@Alpay Ariyak any idea?
Try also specifying the model name as an environment variable within the endpoint template
I just followed the guide on the worker-vllm repository
Shouldn't this work?
One of the serverless nodes printed this, which seems correct (model name removed)
But somehow, it is still not able to find the model
Could you share the error message you get, please, @casper_ai?
I think I got it, will push a fix soon
Hi @Casper., just pushed the update to main
Thanks for making the update! I will test later today
Of course! Pushed custom jinja chat templates as well
@Alpay Ariyak I'm still getting the same error, although I see the path has changed in the vLLM config
It seems like the issue is that your model doesn’t have a tokenizer
That can't be right, the tokenizer is there
It says 401 unauthorized
So is the issue that you are not downloading the tokenizer into the directory perhaps?
I thought the idea of downloading the model into the image was to 1) reduce startup time, and 2) have a secure environment with no access to your Huggingface token
GitHub: Download tokenizer upon build by casper-hansen · Pull Request #39 ·...
This downloads the tokenizer when building the worker-vllm image. This has the following benefit: You do not have to send any network request to Huggingface during initialization. This means you d...
Please review @Alpay Ariyak 🙂
Looks good, have you built an image with it and tested it?
I tested that it downloads the tokenizer to the directory specified, but I have not run a deployment yet
I can run a deployment to test it
Any tips on how to push images with models inside to Docker Hub faster? Takes like half an hour even though my internet is speedy
I can test it in a bit, no worries
Unfortunately not
Is there a reason you’re baking the model in vs using the pre-built image? I haven’t seen too much of a difference in load times thus far
It's just the best way currently. You avoid all outgoing traffic this way
I think a better alternative could be to use the snapshot_download functionality from Huggingface Hub instead
That way, you make sure you download everything
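(For reference, a minimal sketch of what a build-time snapshot_download could look like; the env var names and target path below are placeholders, not the actual worker-vllm code.)

```python
import os

from huggingface_hub import snapshot_download

# Download the full repository (weights, tokenizer, configs) at image build time,
# so the running worker never needs to reach Hugging Face or hold a token.
# MODEL_NAME, MODEL_BASE_PATH, and HF_TOKEN are placeholder env var names here.
model_name = os.environ["MODEL_NAME"]
local_dir = os.environ.get("MODEL_BASE_PATH", "/model")

snapshot_download(
    repo_id=model_name,
    local_dir=local_dir,
    token=os.environ.get("HF_TOKEN"),  # only needed at build time for private repos
)
```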
Tbh I don't think this PR will solve the issue because it also needs the config and everything else
Okay, I replaced the current download with snapshot_download. Building and deploying in a moment
Will get back to you, unavailable for the next 2 hours
I tried using the PR and I think it must be something else too
I'm not sure why, but engine.py is not able to find the tokenizer
Yeah, the tokenizer needs to point to the downloaded path, I'll fix this up shortly
Wait I might have just compiled from the wrong branch
I think I compiled this from the main branch instead of my PR
Thanks, that would be great! I'm still testing my PR, as the problem is also that worker-vllm currently does not download the tokenizer or any config files, which will inevitably lead to an error when dealing with private repositories
Of course! Will fix that for sure
Although what would be the issue with just specifying the env variable HF_TOKEN in the endpoint template? It would allow for the tokenizer download, and tokenizers are always tiny, so it should be a quick download
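(For context, the suggestion here amounts to roughly the sketch below: pass HF_TOKEN at runtime so the tokenizer of a private repo can be fetched at startup. MODEL_NAME is a placeholder env var, and the token= argument assumes a recent transformers release; this is also the approach the next message argues against.)

```python
import os

from transformers import AutoTokenizer

# With HF_TOKEN set on the endpoint, the tokenizer of a private repo can be
# fetched at startup; the weights themselves are already baked into the image.
tokenizer = AutoTokenizer.from_pretrained(
    os.environ["MODEL_NAME"],          # placeholder env var for the private repo id
    token=os.environ.get("HF_TOKEN"),  # runtime token, only needed for private repos
)
```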
I do not want to add an HF_TOKEN to a production environment since it grants access to everything meant to be kept private
And it will also help with reducing delay as much as possible
So I think we should keep the model download the same and separately snapshot download the tokenizer
This is because if we snapshot download the entire repo, it will download all formats of the model (e.g. both .bin and .safetensors), which would bloat the image size
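(A sketch of that split, using huggingface_hub's allow_patterns to fetch only the tokenizer and config files at build time; the repo id, target path, and pattern list are illustrative, not the exact ones in the PR.)

```python
from huggingface_hub import snapshot_download

# Grab only the small metadata files; the weights are downloaded separately with
# the existing logic, so we avoid pulling both .bin and .safetensors copies.
snapshot_download(
    repo_id="org/private-model",   # placeholder repo id
    local_dir="/model",            # placeholder target directory
    allow_patterns=["*.json", "*.txt", "*.model", "tokenizer*"],
    token="hf_...",                # build-time token for private repos
)
```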
It's not only the tokenizer that should cause issues, right?
vLLM also needs to load the model config
Otherwise, I do agree that we can try to minimize bloating the image size
Hmm, I didn't notice any issues with the lack of a model config yet, but it could potentially be an issue, as I did not test private models
If anything, we can just have a priority list of formats (plus the ability to specify one directly with a LOAD_FORMAT env var) and download this way, similarly to vLLM, but also allowing JSON files and whatever else is needed
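(One way such a priority list could look, purely as a sketch; the LOAD_FORMAT handling and pattern lists below are assumptions, not what the worker actually does.)

```python
import os

from huggingface_hub import list_repo_files

# Weight-file patterns per load format, in priority order (safetensors first,
# similar to vLLM's own preference). LOAD_FORMAT overrides the automatic choice.
FORMAT_PATTERNS = {
    "safetensors": ["*.safetensors"],
    "bin": ["*.bin"],
    "pt": ["*.pt"],
}
PRIORITY = ["safetensors", "bin", "pt"]
# Small metadata files are always downloaded regardless of weight format.
METADATA_PATTERNS = ["*.json", "*.txt", "*.model", "tokenizer*"]


def pick_allow_patterns(repo_id: str, token: str | None = None) -> list[str]:
    fmt = os.environ.get("LOAD_FORMAT")
    if fmt is None:
        # Pick the first format in the priority list that the repo actually contains.
        files = list_repo_files(repo_id, token=token)
        fmt = next(
            (f for f in PRIORITY if any(name.endswith("." + f) for name in files)),
            PRIORITY[0],
        )
    return FORMAT_PATTERNS[fmt] + METADATA_PATTERNS
```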
I would suggest you just upload opt-125m or something small to test with
I modified the PR to only download tokenizer and config
I updated the PR and testing again now
https://github.com/runpod-workers/worker-vllm/pull/39
GitHub: Download full repository upon build by casper-hansen · Pull Request...
This downloads the full repository when building the worker-vllm image. This has the following benefit: You do not have to send any network request to Huggingface during initialization. This means...
A feature could be ALLOW_PATTERNS that you specify as a comma-separated list if you want to overrule the defaults
Sorry, got really busy, just pushed a commit to the PR, but I won't be able to test until tomorrow, let me know if you have any thoughts
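(To picture the ALLOW_PATTERNS idea suggested above: a minimal sketch where a comma-separated env var overrules the default download patterns. The env var handling and defaults are illustrative, not merged behaviour.)

```python
import os

from huggingface_hub import snapshot_download

# Default to safetensors weights plus the metadata files; a comma-separated
# ALLOW_PATTERNS env var overrules the defaults entirely.
default_patterns = ["*.safetensors", "*.json", "*.txt", "*.model", "tokenizer*"]
env_patterns = os.environ.get("ALLOW_PATTERNS")
allow_patterns = (
    [p.strip() for p in env_patterns.split(",") if p.strip()]
    if env_patterns
    else default_patterns
)

snapshot_download(
    repo_id=os.environ["MODEL_NAME"],  # placeholder env var for the repo id
    local_dir="/model",                # placeholder target directory
    allow_patterns=allow_patterns,
    token=os.environ.get("HF_TOKEN"),  # build-time only
)
```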
Pushed another commit and merged into main
Thanks! Got it working and it runs pretty smoothly now
Happy to hear that!
@Alpay Ariyak so what was the issue?