RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


Stuck vLLM startup with 100% GPU utilization

Twice now today I've deployed a new vLLM endpoint using the "Quick Deploy" "Serverless vLLM" option at https://www.runpod.io/console/serverless, only to have the worker get stuck after launching the vLLM process and before downloading the weights. It never reaches the point of actually downloading the HF model and loading it into vLLM.
* The model I've used is Qwen/Qwen2.5-72B-Instruct.
* The problematic machines have all been A6000s.
* Only a single worker configured with 4 x 48GB GPUs was set in the template configuration, to make the problem easier to track down (a single pod and a single machine)....

How to respond to the requests at https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1

The OpenAI input is in the job input; I extracted it and processed the request. But when I send the response with yield or return, it isn't received as expected. Could you take a look at this? [https://github.com/mohamednaji7/runpod-workers-scripts/blob/main/empty_test/test%20copy%203.py] ...
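A minimal sketch of a generator handler that streams chunks back to RunPod with yield, assuming the OpenAI-style body arrives nested under the job input (the "openai_input" key and the chunk shape used here are assumptions, not a confirmed contract of worker-vllm):

```python
import runpod

def handler(job):
    # Assumed key: the OpenAI-style request body is nested inside the job
    # input under "openai_input" -- log job["input"] to confirm what your
    # worker actually receives.
    body = job["input"].get("openai_input", job["input"])
    messages = body.get("messages", [])
    last_user_msg = messages[-1]["content"] if messages else ""

    # Yielding makes this a streaming handler: each yielded item becomes a
    # chunk that clients can read from the endpoint's /stream route.
    for piece in ["echo: ", last_user_msg]:
        yield {"choices": [{"delta": {"content": piece}}]}

runpod.serverless.start({
    "handler": handler,
    # Aggregate yielded chunks so /run and /runsync also return the full output.
    "return_aggregate_stream": True,
})
```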

worker-vllm not working with beam search

Hi, I found another bug in your worker-vllm. Beam search is not supported even though your README says it is. This time around it's length_penalty not being accepted. Can you please work on a fix for beam search? Thanks!

All GPU unavailable

I just started using RunPod. Yesterday, I created my first serverless endpoint and submitted a job, but I didn't receive a response. When I investigated the issue, I found that all GPUs were unavailable. The situation hasn't changed since then. Could you tell me what I should do?

/runsync returns "Pending" response

Hi, I've sent a request to my /runsync endpoint and it returned a {job... status:"pending"} response. Can someone clarify when this happens? Is it when the request is taking too long to complete?
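For context on the flow: when /runsync reaches its wait limit before the job finishes, it returns the job id with a queued/in-progress status instead of the output. A common pattern is then to fall back to polling the /status route, roughly like this sketch (the endpoint id, API key, and exact status strings are placeholders/assumptions):

```python
import time
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def wait_for_job(job_id: str, poll_seconds: float = 2.0) -> dict:
    """Poll /status until the job leaves the queued / in-progress states."""
    url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}"
    while True:
        status = requests.get(url, headers=HEADERS, timeout=30).json()
        if status.get("status") not in ("IN_QUEUE", "IN_PROGRESS"):
            return status  # e.g. COMPLETED or FAILED -- inspect status["output"]
        time.sleep(poll_seconds)
```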

Kicked Worker

Is there a webhook for the event that a worker is kicked? Or is there only the /health call, where we need to track the change in requests since the last /health call (i.e., tracking the change in failed requests)?
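A rough sketch of the /health-polling approach described in the question, assuming the health response exposes a jobs.failed counter (the exact response shape is an assumption; inspect your endpoint's /health output):

```python
import time
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = "your-runpod-api-key"   # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def watch_failed_jobs(poll_seconds: float = 30.0) -> None:
    """Poll /health and report whenever the failed-job counter grows,
    as a rough proxy for a worker having been kicked mid-job."""
    url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"
    last_failed = None
    while True:
        health = requests.get(url, headers=HEADERS, timeout=30).json()
        # Assumed shape: {"jobs": {"failed": ..., ...}, "workers": {...}}
        failed = health.get("jobs", {}).get("failed", 0)
        if last_failed is not None and failed > last_failed:
            print(f"failed jobs increased: {last_failed} -> {failed}")
        last_failed = failed
        time.sleep(poll_seconds)
```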

Possible to access ComfyUI interface in serverless to fix custom nodes requirements?

Hi RunPod addicts! I have a functional ComfyUI install running in a Pod that I want to replicate on serverless. My ComfyUI install is built for a specific workflow requiring 18 custom nodes. ...

How to truly see the status of an endpoint worker?

I'm trying out the vLLM serverless endpoints and am running into a lot of trouble. I was able to get responses from a running worker for a little while, then the worker went idle (as expected) and I tried sending a fresh request. That request has been stuck for minutes now and there's no sign that the worker is even starting up. The RunPod UI says the worker is "running" but there's nothing in the logs for the past 9 minutes (the last log line was from the previous worker exiting). My latest requests have been stuck for about 7 minutes each. How do I see the status of an endpoint worker if there's nothing in the logs and nothing in the telemetry? What does "running" mean if there are no logs or telemetry?...

How do I calculate the cost of my last execution on a serverless GPU?

For example, if one GPU type is priced at $0.00016 and the others at $0.00019, how do I know which GPU the serverless worker actually ran on after the request has completed? Also, is there an easy way to just get the cost of the last runsync request instead of calculating it manually?
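A back-of-the-envelope sketch, assuming the /status (or /runsync) response reports executionTime in milliseconds and that billing is approximately execution time multiplied by the per-second price of the GPU tier that ran the job (an estimate, not an invoice):

```python
def estimate_job_cost(status: dict, price_per_second: float) -> float:
    """Rough cost estimate from a /status or /runsync response.

    Assumes executionTime is reported in milliseconds; cold-start/delay time
    may also be billed depending on your configuration.
    """
    execution_ms = status.get("executionTime", 0)
    return (execution_ms / 1000.0) * price_per_second

# Example: a 12,500 ms run on a $0.00019/s worker:
# estimate_job_cost({"executionTime": 12500}, 0.00019) -> ~0.0024 dollars
```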

Serverless deepseek-ai/DeepSeek-R1 setup?

How can I configure a serverless endpoint for deepseek-ai/DeepSeek-R1?

What is the best way to access more GPUs (A100 and H100)?

FLUX is about 25 GB. If I download the model to a network volume, then I can only access GPUs in that region, and every time I check, A100 and H100 GPUs show as LOW availability in all regions. If I instead download FLUX into the container itself while building the Docker image, rather than using a network volume, then every new pod has to pull the ~25 GB Docker image. Could anyone please help me with this...

Guidance on Mitigating Cold Start Delays in Serverless Inference

We are experiencing delays during cold starts of our serverless endpoint used for inference with a machine learning model (Whisper). The main suspected cause is the download of the model weights (a custom model trained by us), which are fetched via the Hugging Face package within the Python code. We are exploring possible solutions and need guidance on feasibility and best practices. Additional context:
- The inference server currently fetches model weights dynamically from Hugging Face during initialization, leading to delays.
- The serverless platform is being used for inference as part of a production system requiring low latency....
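One commonly suggested mitigation is to keep the weights on disk (a network volume, or a directory baked into the image at build time) so cold starts skip the Hugging Face download. A minimal sketch, assuming a network volume mounted at /runpod-volume and a placeholder repo id:

```python
import os
from huggingface_hub import snapshot_download

# Assumed mount point for a RunPod network volume on serverless workers;
# the repo id below is a placeholder for your custom Whisper model.
LOCAL_MODEL_DIR = "/runpod-volume/models/whisper-custom"

def ensure_weights() -> str:
    """Download the weights once; later cold starts reuse the on-disk copy."""
    if not os.path.isdir(LOCAL_MODEL_DIR):
        snapshot_download(
            repo_id="your-org/your-whisper-model",  # placeholder
            local_dir=LOCAL_MODEL_DIR,
        )
    return LOCAL_MODEL_DIR

MODEL_PATH = ensure_weights()
# model = load_whisper(MODEL_PATH)  # load from disk inside your handler init
```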

A40 Throttled very regularly!

I have a serverless endpoint with 3 GPUs that is being fully throttled very regularly. It is completely unusable for minutes at a time (see screenshot); requests are queued forever. This has been the case yesterday and today; it's far too unreliable...

SSH info via CLI

SSH access info is absent from the CLI (it's only available when the server has an exposed TCP port). 'runpodctl get pod' doesn't include the SSH access URL.

Cannot get a single endpoint to start

New to RunPod, but not new to LLMs and running our own inference. So far, every single vLLM template or vLLM worker that I have set up has failed. I use only the most basic settings, and have tried across a wide range of GPU types with a variety of models (including the 'Quickstart' templates). Not a single worker has created an endpoint that works or serves the OpenAI API endpoint. I get 'Initializing' and 'Running', but then no response at all to any request. The logs don't seem to have any information that helps me diagnose the issue. It might well be that I am missing something silly, or that something is amiss; I'm just not sure. I could do with some assistance (and some better documentation) if there is someone from RunPod who can help?...

All 16GB VRAM workers are throttled in EU-RO-1

I have a problem in EU-RO-1: all workers are constantly in a throttled state (xz94qta313qvxe, gu1belntnqrflq, and so on)...

worker-vllm: Always stops after 60 seconds of streaming

Serverless is giving me this weird issue where the OpenAI stream stops after 60 seconds, but the request keeps running in the deployed vLLM worker. This results in not getting all the outputs and wasting compute resources. The reason I want it to run longer than 60 seconds is that I have a use case for generating very long outputs. I have had to resort to directly querying api.runpod.ai/v2. That has the benefit of exposing the job_id and letting me do more, but I would like to do this with the OpenAI API....
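A sketch of the /v2 workaround mentioned above: submit with /run, then poll /stream/<job_id> so no single HTTP connection has to stay open for the whole generation. The response shape assumed here (a "stream" list of {"output": ...} items) may differ from what your worker returns:

```python
import time
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = "your-runpod-api-key"   # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"

def run_and_stream(payload: dict):
    """Submit with /run, then poll /stream/<job_id> until the job finishes."""
    job = requests.post(f"{BASE}/run", json={"input": payload},
                        headers=HEADERS, timeout=30).json()
    job_id = job["id"]
    while True:
        chunk = requests.get(f"{BASE}/stream/{job_id}",
                             headers=HEADERS, timeout=90).json()
        # Assumed shape: {"status": ..., "stream": [{"output": ...}, ...]}
        for item in chunk.get("stream", []):
            yield item.get("output")
        if chunk.get("status") in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
            break
        time.sleep(0.5)
```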

I want to deploy a serverless endpoint using Unsloth

Unsloth does bnb quantization, and I think it's better to load their model. I did training using Unsloth on a Pod; now I want to deploy it on a serverless endpoint and get the OpenAI client API.

--trust-remote-code

I tried to install DeepSeek V3 on serverless vLLM and it's showing this: "Uncaught exception | <class 'RuntimeError'>; Failed to load the model config. If the model is a custom model not yet available in the HuggingFace transformers library, consider setting trust_remote_code=True in LLM or using the --trust-remote-code flag in the CLI.; <traceback object at 0x7fecd5a12700>;"...
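For reference, this is what the error message is asking for when a model's config relies on custom code from the Hub; a minimal vLLM sketch follows. With RunPod's worker-vllm image, the equivalent is usually an environment variable on the endpoint rather than code you run yourself (commonly TRUST_REMOTE_CODE, but the variable name is an assumption; check the worker's README):

```python
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # placeholder; use the model you deploy
    trust_remote_code=True,           # let Transformers run the repo's custom modeling code
)
```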