jojje
RunPod
Created by jojje on 1/26/2025 in #⚡|serverless
Stuck vLLM startup with 100% GPU utilization
Twice today I've deployed a new vLLM endpoint using the "Quick Deploy" "Serverless vLLM" option at https://www.runpod.io/console/serverless, only to have the worker get stuck after launching the vLLM process and before downloading the weights. It never reaches the point of actually downloading the HF model and loading it into vLLM.

* The model I've used is Qwen/Qwen2.5-72B-Instruct.
* The problematic machines have all been A6000s.
* The template was configured with only a single worker with 4 x 48GB GPUs, to make the problem easier to track down (a single pod and a single machine).

I currently have a worker stuck in this state; its ID is wxug1x04v59mxu. I'm going to terminate it, since it just costs me money without providing any value, but if RunPod is able to check logs after the fact (e.g. via an ELK stack or the like), I hope they can pinpoint the issue using that ID. If not, let me know, and next time this happens I'll ping you so you can troubleshoot live. Just let me know who to ping in that case.

Attached is the complete log from the worker.
2 replies
RunPod
Created by jojje on 10/4/2024 in #⛅|pods
Runpod API documentation
Is the RunPod API documented somewhere? I've failed to find anything about it, and have had to resort to reverse engineering the web UI's backend interaction (mostly GraphQL) and inferring what the API might be by looking at how runpodctl makes API calls. It would be great to have the complete API documented, as it would allow creating much better tooling, saving users time and increasing the overall value of the RunPod platform (win-win).
4 replies
RunPod
Created by jojje on 10/4/2024 in #⛅|pods
Is runpodctl abandonware?
I notice there are a lot of useful PRs for runpodctl, but no comments or activity on them from RunPod's side. So I'm wondering whether there is at least someone at RunPod keeping an eye on the runpodctl GitHub project, or if it's been left to rot. It would be useful to know, so the community can decide whether it's time to fork it or create a replacement.
6 replies