DeepSeek Coder on serverless
Hello, new serverless user here. I'd be using the vLLM worker, so whenever it gets spun up from a cold start, does it have to download the model every time? I'd be running it in fp16, which means it would be about 14 GB of data to download.
If your script says so, then yes.
So you can either bake the model into your Docker image or use network storage to persist your model between runs.
Network storage has some impact on speed since it's essentially an external drive, but it can still be decent.
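The network-storage route roughly looks like this from the worker's side (just a sketch, assuming the volume mounts at /runpod-volume and that you point the Hugging Face cache at it; the model name is only an example):

```python
import os

# Assumption: the network volume is mounted at /runpod-volume on serverless workers.
# Pointing the Hugging Face cache there means the weights persist across cold starts.
os.environ["HF_HOME"] = "/runpod-volume/huggingface"  # must be set before importing HF/vLLM

from vllm import LLM

# Placeholder model name; in fp16 this is roughly the ~14 GB download mentioned above.
llm = LLM(model="deepseek-ai/deepseek-coder-6.7b-instruct", dtype="float16")
```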
What you can do to make it easy on yourself: if you have a Dockerfile, write a simple bash script that triggers a tiny Python script to do a vLLM job like "hello world", and it will "automatically" download the model and everything into the Docker image during build time 🙂 (rough sketch below)
or again, network volume
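If you try the bake-it-in route, the "hello world" warm-up could be as tiny as this (a sketch with a placeholder model name; see the caveat further down about vLLM possibly needing a GPU at build time):

```python
# warmup.py -- run once during `docker build` so the weights end up cached inside the image.
from vllm import LLM, SamplingParams

# Placeholder model; swap in whatever you actually deploy.
llm = LLM(model="deepseek-ai/deepseek-coder-6.7b-instruct", dtype="float16")
outputs = llm.generate(["# hello world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```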
oh really
neat
thanks
** I could be wrong on this for vLLM actually lol. I wonder if vLLM will crash because there's no GPU at build time; I remember it has done that.
There might be other ways to do it, like you could probably just download the model yourself to where vLLM expects it, but I don't know exactly how vLLM downloads/prepares models, whether it's an HF download or curl or whatever.
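For that, huggingface_hub's snapshot_download would probably do it without ever loading vLLM (so no GPU needed at build time); a rough sketch with a placeholder model name:

```python
# Pre-fetch the weights into the standard Hugging Face cache, which vLLM should pick up at runtime.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/deepseek-coder-6.7b-instruct",  # placeholder model
    # token="hf_...",  # only needed for gated or private repos
)
```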
I’ve been following the instructions for ‘option 2’ on this page: https://github.com/runpod-workers/worker-vllm
It's basically: open a folder, clone the repo (e.g. with Git Bash), open the command line, and put in that one line.
Windows doesn't need sudo. The model name is copied using the Hugging Face copy button. username/image:tag needs to be your username and chosen image name/tag (I'm sure you know this already), all lowercase, and RunPod requires a tag (I've mostly just been using 0.1 so far).
It’s been working.
edit: I put the name in for the DeepSeek Coder AWQ quant. I haven't tried that one personally. Note that GGUF quants won't work with vLLM, AFAIK.
If you attach a network volume to the endpoint, then the model will only be downloaded once, as long as you’re using our vLLM worker.