Incredibly long startup time when running 70B models via vLLM

I have been trying to deploy 70B models as a serverless endpoint and observe startup times of almost an hour, if the endpoint becomes available at all. The attached screenshot shows an example of an endpoint that deploys cognitivecomputations/dolphin-2.9.1-llama-3-70b. What I find even stranger is that the request ultimately succeeds. Logs and screenshots of the endpoint and template config are attached; if anyone can spot an issue or knows how to deploy 70B models so that they reliably work, I would greatly appreciate it. Some other observations:
- In support, someone told me that I need to manually set the env var BASE_PATH=/workspace, which I am now always doing.
- I sometimes, but not always, see this in the logs, even though I am deploying a completely different model: AsyncEngineArgs(model='facebook/opt-125m', served_model_name=None, tokenizer='facebook/opt-125m'...
- I sometimes, but not always, get issues when I don't specify the chat template:
2024-11-12 12:59:15.351
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 73, in __init__
[rank0]:     self.chat_template = load_chat_template(chat_template)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/chat_utils.py", line 335, in load_chat_template
[rank0]:     with open(chat_template, "r") as f:
[rank0]: TypeError: expected str, bytes or os.PathLike object, not dict
nielsrolf (OP) · 7d ago
GitHub
Randomly the machine get stuck on loading model · Issue #112 · runp...
Hi, as the title suggests completely random, the machine gets stuck on Using model weights format ['*.safetensors'] and I have to manually terminate the worker and restart it. Do you have a...
Poddy · 7d ago
@nielsrolf
Escalated To Zendesk
The thread has been escalated to Zendesk!
nerdylive · 7d ago
I'll try to open a ticket; you can check it from that button.
nielsrolf (OP) · 7d ago
Thanks, it now says Ticket created
Madiator2011 (Work)
Usually you don't want to download models when a request comes in.
nielsrolf (OP) · 7d ago
Yes, it would indeed be better if that weren't necessary, but this is how the vllm-worker appears to be implemented. I could live with a long startup time because I mostly want to do batch requests, but if you know how to deploy the vLLM template with a preloaded model, I'd gladly use that.
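For reference, one way to get a preloaded model would be to fetch the weights once at image build time with huggingface_hub and point the worker at the local path instead of the Hub ID. A minimal sketch; the local_dir below is just a placeholder, not the path any particular worker expects:
```python
# Hedged sketch: pull the weights once while building the image (e.g. in a
# Dockerfile RUN step) so the serverless worker never downloads per request.
# The local_dir is an assumption; point it wherever your worker looks.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cognitivecomputations/dolphin-2.9.1-llama-3-70b",
    local_dir="/models/dolphin-2.9.1-llama-3-70b",
    allow_patterns=["*.safetensors", "*.json", "*.txt", "*.model"],  # skip duplicate .bin weights
)
```
With the weights baked into the image, a cold start only has to load from disk into GPU memory instead of downloading ~140 GB first.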
Madiator2011 (Work)
Also, serverless is not using /workspace.
nielsrolf (OP) · 7d ago
OK, this is what I was told when I opened a support ticket yesterday, but then I'll remove that again.
nielsrolf (OP) · 7d ago
The other thing it frequently gets stuck on is:
INFO 11-12 14:35:02 model_runner.py:1060] Starting to load model cognitivecomputations/dolphin-2.9.1-llama-3-70b...
(VllmWorkerProcess pid=229) INFO 11-12 14:35:02 model_runner.py:1060] Starting to load model cognitivecomputations/dolphin-2.9.1-llama-3-70b...
(VllmWorkerProcess pid=229) INFO 11-12 14:35:02 weight_utils.py:243] Using model weights format ['*.safetensors']
INFO 11-12 14:35:03 weight_utils.py:243] Using model weights format ['*.safetensors']
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Yesterday I was told that this might be due to issues with the model itself, but it has now happened with different models, and sometimes the models later worked.
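For a sense of scale, a quick back-of-the-envelope on why downloading 70B weights at request time can dominate startup; the throughput numbers below are assumptions, not measurements:
```python
# Rough estimate of weight-download time for a 70B model; throughputs are assumed.
params = 70e9
bytes_per_param = 2                             # bf16/fp16
weights_gb = params * bytes_per_param / 1e9     # ~140 GB of safetensors

for mbps in (50, 200, 1000):                    # assumed download throughput in MB/s
    minutes = weights_gb * 1000 / mbps / 60
    print(f"{mbps:>5} MB/s -> ~{minutes:.0f} min just to download the weights")
```
At the low end that is already close to the hour-long startups above, before the worker even begins loading the weights into GPU memory.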