RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


⚡|serverless

⛅|pods

Feb 20 - Serverless Issues Mega-Thread

Many people seem to be running into the following issue: workers are "running" but they're not working on any requests, and requests just sit queued for 10m+ without anything happening. I think there is an issue with how requests are getting assigned to workers: there are a number of idle workers and a number of queued requests, and both stay in that state for many minutes without any requests getting picked up by workers! ...

Default execution timeout

The docs say that all serverless endpoints have a 10-minute default execution timeout. We have had a few instances where a job was stuck in processing for hours. Are the docs incorrect, and do we need to set the execution timeout manually?
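For anyone relying on the timeout, it can also be set per request. A minimal sketch of a `/run` request body, assuming the `policy.executionTimeout` field (in milliseconds) described in the RunPod docs; the input payload is a placeholder:

```python
import json

# Per-request execution policy for a RunPod /run call.
# "executionTimeout" caps how long one job may run, in milliseconds.
payload = {
    "input": {"prompt": "hello"},             # whatever your handler expects
    "policy": {"executionTimeout": 300_000},  # fail the job after 5 minutes
}

body = json.dumps(payload)  # send as the POST body to the endpoint's /run URL
```

Setting it explicitly per request at least removes any ambiguity about whether the documented default is being applied.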

Gpu hosting with API

Hi there. I need 24/7 GPU hosting that I can scale by bringing on more instances as needed, then take offline when not in use, via an API. I'm looking at Novita, but I don't like their serverless pricing for the 4090. Is this something you offer? Thx

Job stuck in queue even though worker is ready

I am using a serverless endpoint with an H100, but I am experiencing high queue times. If you send a single request to a RunPod endpoint you may get a 2-second delay time, and on the next request you will get a queue time of 7 seconds, which should not happen. I think they should optimize their queue and worker communication code. 1st run: 3 seconds, 2nd run: 15.84 seconds...

us-tx3 region cannot spin up new worker

My endpoint is deployed in the us-tx3 region. When I submit a new request to the endpoint, it spins up a new worker in the console. However, there are no logs in the console and no response over SSH. Any thoughts?

Builds are slower than ever & not showing logs at all

After the announced 2-3x improvement in build times, we are getting 2-3x slower builds and have received 0 logs since then. Please raise attention here.

Workers stuck at initializing

For the past couple of days I've been unable to get any workers. I followed the AI advice (https://discord.com/channels/912829806415085598/1341152205549338804/1341152314190204940), and even created a new endpoint in the hope that selecting (almost) all options for everything would help. No workers at all have shown up on the new endpoint, and the old one has been stuck on initializing, as shown in the image, for over 6 hours now. What's the problem here?

Avoiding hallucinations/repetitions when using the faster-whisper worker?

worker: https://github.com/runpod-workers/worker-faster_whisper Hi everyone, as the title suggests, I'm encountering an issue where the transcription occasionally repeats the same word/sentence. When this occurs, it ruins the entire transcription from the point where it happens....
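A few `transcribe()` options commonly help with looping output in faster-whisper. A sketch, with parameter names from `faster_whisper.WhisperModel.transcribe` and values that are illustrative starting points, not tuned recommendations:

```python
# Options that commonly reduce repeated words/sentences with faster-whisper.
ANTI_REPETITION_KWARGS = {
    "condition_on_previous_text": False,   # don't feed prior text back in; the usual fix for loops
    "temperature": [0.0, 0.2, 0.4],        # fallback temperatures when greedy decoding degenerates
    "compression_ratio_threshold": 2.4,    # re-decode segments whose text compresses too well (i.e. repeats)
    "no_speech_threshold": 0.6,            # skip segments that are probably silence
}

# Usage (requires the faster-whisper package and a model download):
# from faster_whisper import WhisperModel
# model = WhisperModel("large-v3")
# segments, info = model.transcribe("audio.mp3", **ANTI_REPETITION_KWARGS)
```

`condition_on_previous_text=False` is usually the biggest lever, since repetition tends to snowball once the previous (bad) text is used as a prompt for the next window.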

Serverless Docker tutorial or sample

Hi, where can I get a Dockerfile deployment tutorial? I'm interested in deploying a custom Docker image on serverless...
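Short of a full tutorial, the minimal shape of a serverless worker is a handler function plus `runpod.serverless.start`. A sketch assuming the `runpod` Python SDK; the Dockerfile then just installs the `runpod` package and runs this script as CMD:

```python
# handler.py — minimal sketch of a RunPod serverless worker.
# The handler receives a job dict with an "input" key and returns
# any JSON-serializable result.

def handler(job):
    name = job["input"].get("name", "world")
    return {"greeting": f"Hello, {name}!"}

# With the runpod package installed, this starts the worker loop:
# import runpod
# runpod.serverless.start({"handler": handler})
```

The corresponding Dockerfile is just a Python base image, `pip install runpod` (plus your own dependencies), a COPY of this file, and `CMD ["python", "-u", "handler.py"]`.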

Baking a model into the Docker image

Hello, I'm trying to bake, or rather download, the model via vLLM directly while building, so that the image contains the model. Sadly, I haven't found any kind of simple "vllm download" command. The only way seems to be either running vLLM and afterwards adding the files to the image, which would be too big to host on my registry, or letting RunPod serverless build the image for me and download the model during the build.
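One common workaround is to skip vLLM entirely at build time and pull the weights with `huggingface_hub`, then point vLLM at the local directory when serving. A sketch, assuming the `huggingface_hub` package; the model id and target path are placeholders for your own:

```python
# download_model.py — run during `docker build`
# (e.g. RUN python download_model.py) so the weights are baked into the image.
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model id
TARGET_DIR = "/models/mistral"                   # placeholder path inside the image

if __name__ == "__main__":
    from huggingface_hub import snapshot_download  # requires huggingface_hub installed
    snapshot_download(repo_id=MODEL_ID, local_dir=TARGET_DIR)
    # At serve time, give vLLM the local path instead of the hub id,
    # so it never tries to download at startup.
```

This avoids the two-step "run vLLM, then copy the cache in" dance, since vLLM accepts a local model directory wherever it accepts a Hugging Face model id.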

Facing Read timeout error in faster whisper

When I pass the link, it sometimes fails with a read timed out error:
8od67s8p9ijjao[error]Captured Handler Exception
8od67s8p9ijjao[info]Failed to download https://gen7.icreatelabs.com/generate/download?mp3=azhoM2gzaTljN2gxZzFnMWYyaDN5N24yeDdvNGIxejB5N3owZTF4N3A2ejB0MXg3ajl5N2cxdDFsMHYydjJ6MGIxZzF4OWwweTdqOWEzZzFxMGsxdTN5NnczaDNzM2w4YTN5N2Ex: HTTPSConnectionPool(host='ytdl.vreden.web.id', port=443): Read timed out. (read timeout=5)
8od67s8p9ijjao[info]_common.py :120 2025-02-15 12:10:08,546 Giving up download_file(...) after 3 tries (requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='ytdl.vreden.web.id', port=443): Read timed out. (read timeout=5))
8od67s8p9ijjao[info]_common.py :105 2025-02-15 12:10:01,305 Backing off download_file(...) for 0.1s (requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='ytdl.vreden.web.id', port=443): Read timed out. (read timeout=5))
...
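The log shows a 5-second read timeout with only short backoffs, so a slow upstream host will fail every attempt. A generic way to soften this is a retry helper with exponential backoff around a download call that uses a longer read timeout; the `requests` usage in the comment is an assumption about how your download code looks:

```python
import time

def retry(fn, tries=3, base_delay=0.5):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(tries):
        try:
            return fn()
        except Exception:
            if attempt == tries - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))

# Usage with requests (assumed), raising the read timeout well past 5 s:
# import requests
# resp = retry(lambda: requests.get(url, timeout=(5, 60)))  # (connect, read)
```

Raising only the read timeout keeps fast failure on unreachable hosts while giving slow-but-working hosts time to respond.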

Seems like my serverless instance is running with no requests being processed

Seems like my serverless instance is running with no requests being processed. There are no active workers or anything keeping the instance active based on the logs or requests tabs, but it's been running.

FlashBoot not working after a while

Hello, I wanted to ask why FlashBoot sometimes works when I have a worker idle and sometimes doesn't. It seems that once a certain amount of time has passed, it simply does a cold start again. Is this normal? Is there anything to prevent this?...

Why isn't RunPod reliable?

I have 3 workers set up. When I submit a request, it sometimes sits in the queue for 5+ minutes before processing begins. I can see a single worker running while the rest idle, but the work isn't getting done. This isn't suitable for production if it takes 5+ minutes to kick off a job. Am I doing something wrong, or does this service just not work well?

serverless - lora from network storage

Hi, I have a Flux generation serverless setup that's working pretty well. I bake all models into the Docker image, so even though the image size is pretty large, cold start is pretty reasonable and generations are fast enough. Now an issue arises with a new workflow where I will train more LoRAs and need to make them available to the serverless workflow....

Stuck in queue

I can't seem to get 3.2 11B Vision to work in serverless settings. I am using the H100 SXM GPU with 2 active workers and 5 total workers, but it seems like most of them become unhealthy all the time. When I try to send a request, it just gets stuck in the queue and never finishes the job; I've had it running for 5 minutes before. What can I do?

Costs

I ran two serverless jobs. Each took about 30 seconds of compute on a 16GB machine. There was a delay and a cold start time for both. The billing for each was about 2 cents, 4 cents in total. ...

Hey, we have serverless endpoints but no workers for more than 12 hours now!

We have our serverless endpoints running, but 0 workers are joining. What is happening? Is this normal?

[Solved] EU-CZ Datacenter not visible in UI

I know the data centre is currently down. But the news about it being updated made me realize that I haven't seen this region in either pods or serverless for many months already, which is quite unfortunate since I am based in Czechia. I thought RunPod had removed it completely, but now I see it hasn't. So why don't I see it at all?

Do RunPod serverless GPUs support NVIDIA MIG?

Hello! I was wondering if anyone has any experience with setting up NVIDIA MIG (GPU partitioning) on RunPod serverless? I'm currently trying to deploy a ~370 million parameter model for serverless inference, and we were trying to see if it would be possible to set up GPU partitioning on 1 worker to work around the serverless worker limitations. If anyone has experience, or even knows whether RunPod supports this, it would be much appreciated! Thank you!