Incredibly long startup time when running 70b models via vllm

I have been trying to deploy 70b models as a serverless endpoint and observe startup times of almost an hour, if the endpoint becomes available at all. The attached screenshot shows an example of an endpoint that deploys cognitivecomputations/dolphin-2.9.1-llama-3-70b. I find it even weirder that the request ultimately succeeds. Logs and a screenshot of the endpoint and template config are attached - if anyone can spot an issue, or knows how to deploy 70b models so that they work reliably, I would greatly appreciate it.

Some other observations:
- In support, someone told me that I need to manually set the env BASE_PATH=/workspace, which I am now always doing.
- I sometimes, but not always, see this in the logs: AsyncEngineArgs(model='facebook/opt-125m', served_model_name=None, tokenizer='facebook/opt-125m'..., even though I am deploying a completely different model.
- I sometimes, but not always, get issues when I don't specify the chat template:
2024-11-12 12:59:15.351
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 73, in __init__
[rank0]:     self.chat_template = load_chat_template(chat_template)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/chat_utils.py", line 335, in load_chat_template
[rank0]:     with open(chat_template, "r") as f:
[rank0]: TypeError: expected str, bytes or os.PathLike object, not dict
10 Replies
nielsrolf (OP) · 2mo ago
GitHub
Randomly the machine get stuck on loading model · Issue #112 · runp...
Hi, as the title suggests completely random, the machine gets stuck on Using model weights format ['*.safetensors'] and I have to manually terminate the worker and restart it. Do you have a...
Poddy · 2mo ago
@nielsrolf
Escalated To Zendesk
The thread has been escalated to Zendesk!
nerdylive · 2mo ago
I'll try to open a ticket; you can check from that button.
nielsrolf (OP) · 2mo ago
Thanks, it now says Ticket created
Madiator2011 (Work)
Usually you don't want to download models when a request comes in.
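For scale, a rough back-of-the-envelope on why downloading at request time dominates cold starts (the effective download speeds below are assumptions, not measurements):

# A 70B-parameter model in bf16 is roughly 70e9 params * 2 bytes ~= 140 GB of
# safetensors; if the worker fetches that from the Hub on every cold start,
# the download alone can account for most of the observed startup time.
size_gb = 70e9 * 2 / 1e9  # ~140 GB
for speed_mb_s in (50, 200, 500):  # assumed effective download speeds
    minutes = size_gb * 1000 / speed_mb_s / 60
    print(f"{size_gb:.0f} GB at {speed_mb_s} MB/s ~= {minutes:.0f} min")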
nielsrolf (OP) · 2mo ago
Yes, it would indeed be better if that weren't necessary, but this is how the vllm-worker appears to be implemented. I could live with a long startup time because I mostly want to do batch requests, but if you know how to deploy the vllm template with a preloaded model, then I'd gladly use that.
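One approach that might give a preloaded model, sketched under assumptions: bake the weights into a custom image at build time and point the worker at the local directory. This assumes you can extend the worker image; the target path below is made up, and how exactly the template takes a local path should be checked against its docs rather than assumed.

# Run during the image build (e.g. in a build step), not at request time,
# so a cold worker already has the ~140 GB of weights on local disk.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cognitivecomputations/dolphin-2.9.1-llama-3-70b",
    local_dir="/models/dolphin-2.9.1-llama-3-70b",  # hypothetical path baked into the image
    # token="hf_...",  # only needed for gated/private repos
)

The worker would then load from that local directory instead of downloading from the Hub on the first request.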
Madiator2011 (Work)
Also, serverless does not use /workspace.
nielsrolf (OP) · 2mo ago
OK, that's what I was told when I opened a support ticket yesterday, but then I'll remove it again.
nielsrolf (OP) · 2mo ago
The other thing it frequently gets stuck on is:
INFO 11-12 14:35:02 model_runner.py:1060] Starting to load model cognitivecomputations/dolphin-2.9.1-llama-3-70b...
(VllmWorkerProcess pid=229) INFO 11-12 14:35:02 model_runner.py:1060] Starting to load model cognitivecomputations/dolphin-2.9.1-llama-3-70b...
(VllmWorkerProcess pid=229) INFO 11-12 14:35:02 weight_utils.py:243] Using model weights format ['*.safetensors']
INFO 11-12 14:35:03 weight_utils.py:243] Using model weights format ['*.safetensors']
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Yesterday I was told that this might be due to issues with the model itself, but it has now happened with different models, and sometimes the models later worked.