RunPod • 4mo ago
ditti

Terminating local vLLM process while loading safetensor checkpoints

I started using Llama 3.1 70B as a serverless function recently. I got it to work, and the setup is rather simple:
- 2 x A100 GPUs
- 200GB network volume
- 200GB container storage
- Model: https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct
The serverless function started successfully a few times, but there are two recurring issues, and the first probably leads to the second.
Issue #1: Loading the safetensors checkpoint shards is almost always very slow (> 15 s/it).
Issue #2: The container is terminated before it loads all the checkpoints and is then restarted in a loop, with no reason given.
2024-09-11 14:08:47.002 | info | 52g6dgmu0eudyb | (VllmWorkerProcess pid=148) INFO 09-11 13:08:37 utils.py:841] Found nccl from library libnccl.so.2
2024-09-11 14:08:47.002 | info | 52g6dgmu0eudyb | (VllmWorkerProcess pid=148) INFO 09-11 13:08:37 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
2024-09-11 14:08:01.350 | info | 52g6dgmu0eudyb | Loading safetensors checkpoint shards: 80% Completed | 24/30 [06:17<01:38, 16.35s/it]
2024-09-11 14:07:59.935 | info | 52g6dgmu0eudyb | INFO 09-11 13:07:59 multiproc_worker_utils.py:136] Terminating local vLLM worker processes
2024-09-11 14:07:42.177 | info | 52g6dgmu0eudyb | Loading safetensors checkpoint shards: 77% Completed | 23/30 [05:57<01:46, 15.15s/it]
2024-09-11 14:07:28.855 | info | 52g6dgmu0eudyb | Loading safetensors checkpoint shards: 73% Completed | 22/30 [05:44<02:07, 15.93s/it]
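For reference, the configuration above corresponds roughly to the following engine setup. This is only a minimal sketch of the equivalent offline vLLM call, not the serverless worker's actual code, and the download_dir pointing at the network volume mount is an assumption:

```python
# Minimal sketch of the equivalent offline vLLM load (not the worker's actual code).
# Assumes the weights are cached on the 200GB network volume, mounted at /runpod-volume.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=2,                     # 2 x A100, matching the endpoint config
    download_dir="/runpod-volume/huggingface",  # assumed cache location on the network volume
)
```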
Any ideas why that could happen?
9 Replies
octopus • 3w ago
@ditti were you able to find a solution to this?
yhlong00000 • 3w ago
It might be caused by the network volume.
octopus • 3w ago
you mean cuz we have a network volume attached?
yhlong00000 • 3w ago
If your model is stored on a network volume, loading can be slow during a cold start, and occasionally it can cause the worker to fail to start.
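For illustration, the cold start being described here is essentially vLLM streaming ~140GB of safetensors shards off the mounted volume. A quick way to check whether the volume read speed is the bottleneck is to time a raw read of a single shard; this is just a sketch, and the shard path is an assumption (point it at wherever the worker actually caches the model):

```python
# Sketch: time a sequential read of one checkpoint shard from the network volume
# to see whether the ~15 s/it shard loading is I/O-bound.
# The path below is an assumption -- adjust it to the actual cached shard.
import time

shard = "/runpod-volume/huggingface/model-00001-of-00030.safetensors"
start = time.time()
read = 0
with open(shard, "rb") as f:
    while chunk := f.read(64 * 1024 * 1024):  # read in 64 MiB chunks
        read += len(chunk)
elapsed = time.time() - start
print(f"Read {read / 1e9:.1f} GB in {elapsed:.1f} s ({read / 1e9 / elapsed:.2f} GB/s)")
```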
Titanium-Monkey
Was able to get this to work locally without a network volume - did not test with it. But I had to use 4x48GB GPUs, as it kept crashing on 2x80GB GPUs
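A rough back-of-the-envelope check of why 2x80GB is tight: in bf16 the 70B weights alone are about 140GB, which leaves very little of the 160GB total for KV cache and activations, while 4x48GB gives 192GB. A small sketch of that arithmetic, assuming 2 bytes per parameter (bf16/fp16 weights):

```python
# Rough memory estimate for Llama 3.1 70B under tensor parallelism.
# Assumes bf16/fp16 weights (2 bytes/parameter); KV cache and activation
# overhead are ignored here, which is exactly what eats the remaining headroom.
params = 70e9
weights_gb = params * 2 / 1e9  # ~140 GB of weights

for gpus, gb_each in [(2, 80), (4, 48)]:
    total = gpus * gb_each
    print(f"{gpus} x {gb_each}GB = {total}GB total, ~{total - weights_gb:.0f}GB left after weights")
```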
nerdylive • 3w ago
Soo is there a way around this if the model is large? We couldn't just build an image with the LLM in it and upload it to the registry
yhlong00000 • 3w ago
We will release a Hugging Face model cache soon, and that should help in this case.
nerdylive • 3w ago
Ooh ya nice