RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


⚡|serverless


Not using cached worker

I've been running into this problem for several days now. I have an endpoint that runs a Forge WebUI worker with a network volume attached. As you know, Forge takes some time to start before it can generate images, so when I send a request to a worker there is normally some startup delay and then the image is generated. Recently, though, I've run into an issue where a worker is already running, with WebUI Forge started and ready to accept requests, yet when I submit a new request a completely new worker is started instead, which results in huge delays. My question is: why isn't it using the already available worker that has Forge loaded? And no, the requests weren't submitted one right after the other, so there is no reason to start a new worker...

What TTFT times should we be able to reach?

Of course this depends on token inputs, hardware selection, etc. But for the life of me, I cannot get a TTFT under 2000 ms on serverless. I'm using Llama 3.1 7B / Gemma / Mistral on 48 GB GPU workers. For performance evaluation I use guidellm, which tests different throughput scenarios (continuous, small, large). Even with 50 input tokens and 100 output tokens I see 2000-2500 ms TTFT. ...
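In case it helps to cross-check guidellm's numbers, here is a minimal sketch for timing TTFT directly, assuming a vLLM quick-deploy endpoint that exposes the OpenAI-compatible route; the endpoint ID, API key, and model name below are placeholders:

```python
# Rough TTFT measurement against a RunPod vLLM endpoint's OpenAI-compatible
# route (sketch; <ENDPOINT_ID>, <RUNPOD_API_KEY>, and the model are placeholders).
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    # The first chunk carrying content marks the time to first token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```

Measured this way, TTFT includes any cold-start or queue delay on the endpoint, not just model latency, so it is worth noting whether the worker was already warm when the request arrived.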

80GB GPUs totally unavailable

My app is totally down because there isn't even 1 GPU available. This has never happened before. Is it me?...

Not able to connect to the local test API server

I am running the container on an EC2 instance. I keep getting errors like `Error handling request` ...
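For context, here is a bare-bones worker I would use to exercise the local test API server on an EC2 box, as a sketch assuming the runpod Python SDK's built-in test server; the flags and port are my assumptions, and the key EC2 detail is binding to 0.0.0.0 and opening the port in the security group:

```python
# handler.py - minimal worker for exercising the local test API server (sketch).
import runpod

def handler(job):
    # Echo the input back so a request/response round trip is easy to verify.
    return {"echo": job["input"]}

# Launch so the test API is reachable from outside the instance, e.g.:
#   python handler.py --rp_serve_api --rp_api_host 0.0.0.0 --rp_api_port 8000
# (flags as I understand the runpod SDK). Also open the chosen port in the
# EC2 security group; a server bound only to localhost is unreachable remotely.
runpod.serverless.start({"handler": handler})
```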

What methods can I use to reduce cold start times and decrease latency for serverless functions?

I understand that adding active workers can reduce cold start issues, but it tends to be costly. I’m looking for a solution that strikes a balance between minimizing cold start times and managing costs. Since users only use our product during limited periods, keeping workers awake all the time isn’t necessary. I’d like to know about possible methods to achieve this balance.
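One pattern that usually helps, shown as a minimal sketch below (not RunPod's reference worker; load_model and the paths are placeholders): do all heavy initialization at import time so each worker pays the load cost once, then lean on FlashBoot and a modest idle timeout so recently used workers can be reused without keeping active workers around the clock.

```python
# Sketch: load the model once per worker at import time, not once per request.
import runpod

def load_model(path: str):
    """Placeholder for your framework's real loading call (torch, diffusers, ...)."""
    return object()

# Runs when the worker container starts; warm workers reuse MODEL across jobs.
MODEL = load_model("/runpod-volume/models/my-model")  # hypothetical path

def handler(job):
    prompt = job["input"]["prompt"]
    # ... run inference with MODEL here ...
    return {"echo": prompt}

runpod.serverless.start({"handler": handler})
```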

Network volume vs. baking the model into the Docker image

I want to run a serverless worker that can get called anywhere from once per hour to 300-400 times per hour. I want to optimize for cold starts when the occasional request comes in. It runs SDXL, a checkpoint, a few ControlNets, etc., about 15-20 GB in total. ...
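A middle-ground sketch, with made-up paths: bake the weights into the image for the fastest local reads, but fall back to the network volume when the baked copy is absent, so the same worker code runs in either setup while you compare cold-start behaviour.

```python
import os

# Prefer weights baked into the image (local disk), fall back to the network
# volume that serverless mounts at /runpod-volume. Both paths are hypothetical.
BAKED_PATH = "/models/sdxl"
VOLUME_PATH = "/runpod-volume/models/sdxl"

MODEL_PATH = BAKED_PATH if os.path.isdir(BAKED_PATH) else VOLUME_PATH
print(f"Loading SDXL weights from {MODEL_PATH}")
```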

Job stays in IN_PROGRESS forever

Sometimes I never get a response when I make a request. It stays in progress and doesn't even show an execution time.
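Until the underlying cause is clear, a client-side guard like the sketch below can at least keep jobs from sitting in IN_PROGRESS indefinitely; it assumes the standard /status and /cancel endpoint routes, and the endpoint ID, API key, and timeout are placeholders.

```python
# Poll /status with a deadline and cancel the job if it never finishes (sketch).
import time
import requests

BASE = "https://api.runpod.ai/v2/<ENDPOINT_ID>"
HEADERS = {"Authorization": "Bearer <RUNPOD_API_KEY>"}

def wait_or_cancel(job_id: str, timeout_s: int = 300, poll_s: int = 5) -> dict:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
        if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
            return status
        time.sleep(poll_s)
    # Give up and cancel so the job doesn't occupy the queue forever.
    requests.post(f"{BASE}/cancel/{job_id}", headers=HEADERS)
    return {"id": job_id, "status": "CLIENT_TIMEOUT"}
```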

How to get the progress of a processing job in serverless?

When I use status/id, it only returns something like {delayTime: 873, id: 3e9eb0e4-c11d-4778-8c94-4d045baa99c1-e1, status: IN_PROGRESS, workerId: eluw70apx442ph}, with no progress data. I want progress data like in the screenshot of the serverless console log. Please tell me how to get it in the app client....
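If the worker code is yours, the runpod Python SDK has a progress_update helper that is meant to surface interim status to clients polling /status; a minimal sketch (the step messages are placeholders):

```python
# Worker-side progress reporting (sketch). The messages passed to
# progress_update should show up when the client polls /status.
import runpod

def handler(job):
    runpod.serverless.progress_update(job, "loading model")
    # ... load model ...
    runpod.serverless.progress_update(job, "generating image 1/4")
    # ... do the actual work ...
    return {"output": "done"}

runpod.serverless.start({"handler": handler})
```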

RunPod serverless ComfyUI template

I couldn't find any ComfyUI template on RunPod serverless.

Why is runsync returning a status response instead of just waiting for the image response?

My runsync requests are getting messed up by RunPod returning a status response (with 'IN_PROGRESS' status and an id showing), as if I had used the async endpoint. I need it to just return the image, or a failure, not the status; if I wanted the status I would just use 'run'. Any idea why this is happening and how to prevent it? For reference, these are requests that generally run 5-18 seconds to completion. delayTime: 196...
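As a stopgap, a client can treat an IN_PROGRESS envelope from /runsync as a signal to fall back to polling /status until the real output arrives; a sketch with placeholder endpoint ID and API key:

```python
# Call /runsync, but keep polling /status if RunPod hands back a status
# envelope instead of the finished output (sketch).
import time
import requests

BASE = "https://api.runpod.ai/v2/<ENDPOINT_ID>"
HEADERS = {"Authorization": "Bearer <RUNPOD_API_KEY>"}

def run_sync(payload: dict) -> dict:
    resp = requests.post(f"{BASE}/runsync", json={"input": payload}, headers=HEADERS).json()
    while resp.get("status") in ("IN_QUEUE", "IN_PROGRESS"):
        time.sleep(2)
        resp = requests.get(f"{BASE}/status/{resp['id']}", headers=HEADERS).json()
    return resp  # COMPLETED carries the image output; FAILED carries the error
```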

Worker keeps running after idle timeout

Hi! I have observed that my worker keeps running even when there are no requests and the idle timeout (60 s) has been reached. Also, when I make a new request at such a moment, the request fails....

Can I deploy the 'ComfyUI with Flux.1 dev one-click' template to serverless?

When I click deploy, I only see 'Deploy GPU Pod', no serverless option.

What is the real Serverless price?

In Serverless I have 2 GPUs per worker and 1 active worker. The price shown on the main page is $0.00046/s, but the endpoint edit page shows $0.00152/s. What is the actual price?

Can't find Juggernaut in the list of models to download in ComfyUI Manager

My workflow is deployed on RunPod but I can't find my ckpt in ComfyUI Manager to download. Error: Prompt outputs failed validation, Efficient Loader:...

comfy

Getting the message 'throttled waiting for GPU to become available' even though I have 4 endpoints selected with high and medium availability.

Incredibly long startup times when running 70B models via vLLM

I have been trying to deploy 70B models as a serverless endpoint and observe startup times of almost an hour, if the endpoint becomes available at all. The attached screenshot shows an example of an endpoint that deploys cognitivecomputations/dolphin-2.9.1-llama-3-70b. I find it even weirder that the request ultimately succeeds. Logs and a screenshot of the endpoint and template config are attached; if anyone can spot an issue or knows how to deploy 70B models so that they reliably work, I would greatly appreciate it. Some other observations:
- In support, someone told me that I need to manually set the env var BASE_PATH=/workspace, which I am now always doing.
- I sometimes, but not always, see AsyncEngineArgs(model='facebook/opt-125m', served_model_name=None, tokenizer='facebook/opt-125m'... in the logs, even though I am deploying a completely different model...
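Given the facebook/opt-125m lines in the logs, one cheap sanity check is to ask the running endpoint which model it actually loaded; a sketch assuming the vLLM worker's OpenAI-compatible route, with placeholder endpoint ID and API key:

```python
# List the models the running vLLM worker reports, to catch the case where it
# silently fell back to its default instead of the configured 70B model (sketch).
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)
for model in client.models.list():
    print(model.id)  # should print the dolphin 70B model, not facebook/opt-125m
```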

Mounting network storage at runtime - serverless

I am running my own Docker container and, at the moment, I'm using the RunPod interface to select network storage, which is then presented at /runpod-volume. This is OK; however, what I am hoping to do instead is mount the volume at runtime programmatically. Is this in any way possible through libraries or an API? Basically, I want to list the available volumes and, where a volume exists in the same region as the container/worker, mount it....

Serverless fails when workers aren't manually set to active

As the title says, my requests to my serverless endpoint are retrying/failing at a much higher frequency when my workers aren't set to active. Has anyone experienced something like this before?

Chat completion (template) not working with vLLM 0.6.3 + Serverless

I deployed the https://huggingface.co/xingyaoww/Qwen2.5-Coder-32B-Instruct-AWQ-128k model through the Serverless UI, setting the max model context window to 129024 and quantization to awq. I deployed it using the latest version of vLLM (0.6.3) provided by RunPod and ran into the following errors. Client-side...

Qwen2.5 + vLLM + Open WebUI

I have deployed qwen2.5-7b-instruct using the vLLM quick deploy template (0.6.2), but when Open WebUI connects through the OpenAI API, the RunPod workers log these errors: "code": 400, "message": "1 validation error for ChatCompletionRequest\nmax_completion_tokens\n Extra inputs are not permitted [type=extra_forbidden, input_value=50, input_type=int]\n For further information visit https://errors.pydantic.dev/2.9/v/extra_forbidden", "object": "error", "param": null,...
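The Pydantic error says this vLLM build rejects max_completion_tokens as an extra field, so one workaround is to send max_tokens instead; a sketch of a direct request (endpoint ID, API key, and model name are placeholders):

```python
# Direct chat-completion request that avoids the rejected field (sketch).
import requests

resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1/chat/completions",
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 50,
        # "max_completion_tokens": 50  <- this is the field the worker rejects
    },
)
print(resp.json())
```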