Stuck vLLM startup with 100% GPU utilization
Twice today I've deployed a new vLLM endpoint using the "Quick Deploy" "Serverless vLLM" option at https://www.runpod.io/console/serverless, only to have the worker get stuck after launching the vLLM process but before it starts downloading the weights. It never reaches the point of actually downloading the HF model and loading it into vLLM.
* The model I've used is Qwen/Qwen2.5-72B-Instruct
* The problematic machines have all been A6000s.
* The template was configured with only a single worker using 4 x 48GB GPUs, to make the problem easier to track down (a single pod on a single machine); a sketch of the equivalent vLLM engine configuration is below.
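For reference, here's a minimal sketch of what I believe the template's settings boil down to as vLLM engine arguments. Only the model name and tensor-parallel size come from my template; the dtype and the commented-out download directory are assumptions on my part, not values I've confirmed in the worker:

```python
# Minimal local repro of (what I assume is) the same engine configuration.
# If the model loads fine with these settings outside the serverless wrapper,
# that would point at the worker harness rather than vLLM itself.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",  # same HF model as in the template
    tensor_parallel_size=4,             # 4 x 48GB A6000s
    dtype="bfloat16",                   # assumption: template default dtype
    # download_dir="/runpod-volume",    # assumption: network volume mount path
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```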
I currently have a worker stuck in this state; its id is wxug1x04v59mxu
I'm going to terminate it since it just costs me money without providing any value, but if RunPod can check logs after the fact (e.g. via some ELK stack or the like), I hope you can pinpoint the issue using that ID. If not, let me know, and next time this happens I'll ping you so you can troubleshoot it live. Just let me know who to ping in that case.
Attached is the complete log from the worker.