RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

RunPod requests fail with 500

Also, when I try to open my endpoint in the UI, it redirects to a 404. I didn't change anything...

Upgrade faster-whisper version for quick deploy

Hey guys, can you please upgrade the faster-whisper pip dependency version for the quick deploy? The current one (0.0.10) does not support the Turbo model. Thanks!
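
For context, a minimal sketch of what loading the Turbo model looks like with a newer faster-whisper release; the "large-v3-turbo" identifier comes from upstream faster-whisper and is an assumption here, not something the quick-deploy template is confirmed to expose:

```
# Hedged sketch, assuming a faster-whisper release that ships Turbo support.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
segments, info = model.transcribe("sample.wav", beam_size=5)

for segment in segments:
    # Print each transcribed segment with its timestamps.
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```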

LoRA path in vLLM serverless template

I want to attach a custom LoRA adapter to the Llama-3.1-70B model. Usually, when using vLLM, alongside --enable-lora we also specify --lora-modules name=lora_adapter_path, something like that. But the template only gives an option to enable LoRA; where do I add the path of the LoRA adapter?
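
For reference, this is roughly the equivalent of those flags in vLLM's offline Python API rather than the serverless template; the adapter name, adapter path, and model name below are placeholders:

```
# Hedged sketch of plain vLLM (not the RunPod template): --enable-lora maps to
# enable_lora=True, and the adapter that --lora-modules would register is
# passed per request as a LoRARequest. Names and paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", enable_lora=True)

outputs = llm.generate(
    ["Explain LoRA adapters in one sentence."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("my_adapter", 1, "/runpod-volume/loras/my_adapter"),
)
print(outputs[0].outputs[0].text)
```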

Want to split model files out of the Docker image, but it slows down significantly when using storage

I want to split the model files out of my Docker image, since they are bloating it. I tried storing the model files on storage, but found that inference time grows a lot. Is there a way around this?...
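
One commonly suggested workaround, sketched here under assumptions (the /runpod-volume mount path and the destination directory are placeholders), is to keep the weights on the network volume but copy them to the worker's local disk once at startup, so inference reads from local storage:

```
# Hedged sketch: copy weights from the network volume to container-local disk
# once per worker start, then load the model from the local copy.
import shutil
from pathlib import Path

VOLUME_MODEL_DIR = Path("/runpod-volume/models/my-model")  # assumed mount path
LOCAL_MODEL_DIR = Path("/tmp/models/my-model")             # container-local disk

def ensure_local_copy() -> Path:
    """Copy model files to local disk on first use; reuse them afterwards."""
    if not LOCAL_MODEL_DIR.exists():
        LOCAL_MODEL_DIR.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(VOLUME_MODEL_DIR, LOCAL_MODEL_DIR)
    return LOCAL_MODEL_DIR

MODEL_PATH = ensure_local_copy()  # pass this path to whatever loads the model
```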

Intermittent timeouts on requests

I have a serverless endpoint with a custom Docker image. I am sending a payload with the Python package's endpoint.run_sync(payload, timeout=60). I currently have 0 active workers. I can typically send the first request fine. After it completes, if I send a following request before that worker times out, it will often time out without ever logging that the main function started. Basically, it seems like the message never gets into the RunPod queue. What could be causing this behavior? How can I avoid it (or debug it)?
Logs are attached. In this case there are two successful requests, then a third request that just times out; it seems like the request never gets to the queue (no logs)...
No description
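
One way to narrow this down, sketched with the runpod Python SDK and placeholder IDs, is to submit asynchronously with endpoint.run() and poll the job status, which shows whether the request ever reaches IN_QUEUE before it times out:

```
# Hedged debugging sketch with the runpod Python SDK: submit asynchronously and
# poll the status to see whether the job ever enters the queue.
# The API key, endpoint ID, and input payload are placeholders.
import time
import runpod

runpod.api_key = "YOUR_API_KEY"            # placeholder
endpoint = runpod.Endpoint("ENDPOINT_ID")  # placeholder

job = endpoint.run({"input": {"prompt": "ping"}})

for _ in range(30):
    status = job.status()  # e.g. IN_QUEUE, IN_PROGRESS, COMPLETED, FAILED
    print("status:", status)
    if status in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(2)

print("output:", job.output())
```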

"Failed to return job results. | Connection timeout to host https://api.runpod.ai/v2/91gr..."

I keep having these errors on my endpoints. It happens most of the time for "high-res" (4K) images, even though they're JPEGs of at most 2 MB. RunPod serverless has significantly deteriorated for me these last few days...
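
A common mitigation for large results, sketched here with plain boto3 rather than anything RunPod-specific (bucket name and credentials are placeholders), is to upload the image to object storage from the handler and return only a short-lived URL, so the job result payload stays small:

```
# Hedged sketch: upload a large output image to S3-compatible storage and
# return a presigned URL instead of the raw bytes. Bucket and key are
# placeholders; credentials are expected to come from the environment.
import boto3

s3 = boto3.client("s3")

def return_as_url(image_path: str, bucket: str = "my-output-bucket") -> dict:
    key = image_path.rsplit("/", 1)[-1]
    s3.upload_file(image_path, bucket, key)
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=3600,  # link valid for 1 hour
    )
    return {"image_url": url}
```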

Why does my post already get tagged as Solved when I try to post?


HF Cache

Hey, I got this email from you guys:
Popular Hugging Face models have super fast cold-start times now
We know lots of our developers love working with Hugging Face models. So we decided to cache them on our GPU servers and network volumes.
...

GPU Availability Issue on RunPod – Need Assistance

Hi everyone, I’m currently facing an issue with GPU availability for my ComfyUI endpoint (id: kw9mnv7sw8wecj) on RunPod. When trying to configure the worker, all GPU options show as “Unavailable”, including 16GB, 24GB, 48GB, and 80GB configurations (as shown in the attached screenshot). This is significantly impacting my workflow and the ability to deliver results to my clients since I rely on timely image generation....
No description

job timed out after 1 retries

I've been seeing this a ton on my endpoint today, resulting in being unable to return images. response_text: "{"delayTime":33917,"error":"job timed out after 1 retries","executionTime":31381,"id":"sync-80dbbd6d-309c-491f-a5d0-2bd79df9c386-e1","retries":1,"status":"FAILED","workerId":"a42ftdfxrn1zhx"} ...

Unable to fetch docker images

During worker initialization I am seeing errors such as:
error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded
2024-11-18T18:10:47Z error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
...

Failed to get job. - 404 Not Found

The endpoint is receiving the jobs but errors out (worker logs below):
```
2024-11-18T13:50:42.510726100Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientResponseError | Error Message: 404, message='Not Found', url='https://api.runpod.ai/v2/ihv956xmtmq9t3/job-take/etbm9mpkgsl6hd?gpu=NVIDIA+GeForce+RTX+3090&job_in_progress=0'", "level": "ERROR"}
2024-11-18T13:50:42.848129909Z {"requestId": null, "message": "Failed to get job. | Error Type: ClientResponseError | Error Message: 404, message='Not Found', url='https://api.runpod.ai/v2/ihv956xmtmq9t3/job-take/etbm9mpkgsl6hd?gpu=NVIDIA+GeForce+RTX+3090&job_in_progress=0'", "level": "ERROR"}
```
...

vLLM override open ai served model name

Overriding the served model name on the vLLM serverless pod doesn't seem to take effect. Configuring a new endpoint through the Explore page in RunPod's interface creates a worker with the env variable OPENAI_SERVED_MODEL_NAME_OVERRIDE, but the name of the model on the OpenAI endpoint is still the hf_repo/model name. The logs show: engine.py: AsyncEngineArgs(model='hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4', served_model_name=None... and the endpoint returns: Error with model object='error' message='The model 'model_name' does not exist.' type='NotFoundError' param=None code=404 ...
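
A quick way to see which name the worker actually registered, sketched here assuming RunPod's documented /openai/v1 base URL pattern for the vLLM worker and placeholder endpoint ID and API key, is to list the models through the OpenAI client:

```
# Hedged sketch: query the vLLM worker's OpenAI-compatible API to see which
# model name it actually serves. Endpoint ID and API key are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
    api_key="YOUR_RUNPOD_API_KEY",
)

for model in client.models.list():
    print(model.id)  # compare this against OPENAI_SERVED_MODEL_NAME_OVERRIDE
```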

Not using cached worker

I've been running into this problem for several days now. I have an endpoint that runs a Forge WebUI worker with a network volume attached. As you know, Forge takes some time to start and only then generates the image, so generally, when I send a request to a worker, there is some delay for the start-up process and then it generates images. But recently I've run into an issue where a worker is already running with Forge started and ready to accept requests, yet when I submit a new request it starts a completely new worker, which results in huge delay times. My question is: why isn't it using the already available worker which has Forge loaded? And no, the requests weren't submitted one right after the other, so there is no reason to start a new worker...
No description

What TTFT times should we be able to reach?

Of course this depends on token inputs, hardware selection, etc. But for the life of me, I cannot get a TTFT of under 2000 ms on serverless. I'm using Llama 3.1 7B / Gemma / Mistral on 48 GB GPU workers. For performance evaluation I use guidellm, which tests different throughput scenarios (continuous, small, large). Even with 50 input tokens and 100 output tokens I see 2000-2500 ms TTFT. ...
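
For what it's worth, a minimal way to measure TTFT directly, sketched against an OpenAI-compatible endpoint with placeholder base URL, API key, and model name, is to stream the response and time the first content chunk:

```
# Hedged sketch: measure time-to-first-token by streaming from an
# OpenAI-compatible endpoint and timing the first content chunk.
# The base URL, API key, and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",  # placeholder
    api_key="YOUR_RUNPOD_API_KEY",
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="MODEL_NAME",  # placeholder
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=100,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        ttft_ms = (time.perf_counter() - start) * 1000
        print(f"TTFT: {ttft_ms:.0f} ms")
        break
```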

80GB GPUs totally unavailable

My app is totally down because there isn't even 1 GPU available. This has never happened before. Is it me?...

Not able to connect to the local test API server

I am running the container on an EC2 instance. I keep getting errors like:
```
Error handling request
```
...

What methods can I use to reduce cold start times and decrease latency for serverless functions

I understand that adding active workers can reduce cold start issues, but it tends to be costly. I’m looking for a solution that strikes a balance between minimizing cold start times and managing costs. Since users only use our product during limited periods, keeping workers awake all the time isn’t necessary. I’d like to know about possible methods to achieve this balance.
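
One pattern often suggested alongside tuning idle timeout, sketched here under assumptions rather than as an official recommendation, is to load the model at module import time so warm workers skip it entirely and only the per-request work stays in the handler:

```
# Hedged sketch of a typical RunPod serverless handler layout: anything created
# at module scope is reused by a warm worker, so only cold starts pay for it.
# load_model() and its return value are placeholders for your own setup code.
import runpod

def load_model():
    """Placeholder for whatever expensive setup the worker needs."""
    ...

MODEL = load_model()  # runs once per worker start, not once per request

def handler(job):
    prompt = job["input"].get("prompt", "")
    # Use the already-loaded MODEL here; keep per-request work minimal.
    return {"echo": prompt}

runpod.serverless.start({"handler": handler})
```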

Network volume vs baking in model into docker

I want to run a serverless worker that can get called anywhere from once per hour to 300-400 times per hour. I want to optimize for cold starts when the occasional request comes in. It runs SDXL, a checkpoint, a few ControlNets, etc., about 15-20 GB in total. ...