RunPod•2w ago
mambo no. 5

Why the hell are my delay times so high, and why am I bearing all the costs??

Yesterday everything was working fine and delay times were a couple of seconds. Now the delay times are getting ridiculous, and I'M being charged for the delay on top of the execution??
Dj
Dj•2w ago
We're currently resolving an incident that affects serverless job times. I will update you when I have news about extra spend as a result of the outage
mambo no. 5
mambo no. 5OP•2w ago
Will we be reimbursed for the unnecessary extra spending on delay times? Why are we even being charged for delay time? I was always under the impression I'm paying for execution time.
Dj
Dj•2w ago
I don't know for certain yet, but once the engineering discussion about the resolution wraps up in the thread I'll work on it. Worst case, I'm capable of issuing refunds myself, but I'd prefer an automated solution too 😅
mambo no. 5
mambo no. 5OP•2w ago
Okay, please provide an update here when you have more details.
mambo no. 5
mambo no. 5OP•2w ago
If I load in $100 in credits and spin up 10 workers, what happens when my balance falls below $100? @Dj And what if I want to deploy a new endpoint, will I still be able to deploy 10 workers?
Dj
Dj•2w ago
Yes, it's just a soft check at the time of registration; once you press upgrade you're fine.
AdamOH
AdamOH•2w ago
We have a Stable Diffusion endpoint that has failed to boot since the outage last night. Even though the GPUs keep trying to boot and start serving requests, we're being charged for that GPU time despite it being broken by the outage, which we're still trying to debug. This has been a major hit to our business!
Jason
Jason•2w ago
Hey, RunPod support can look into that. Have you made a support request?
mambo no. 5
mambo no. 5OP•2w ago
Exactly! We're having the same issue over here as well. I believe we deserve some form of reimbursement! It doesn't make sense for us to bear the costs of RunPod's failure :( @Dj
Dj
Dj•2w ago
Hey, can you share your endpoint ID with me so I can look into this? Same for you, can you also share an endpoint ID? I want to get people reimbursed, but I can't until I know whether I need to do it manually or whether we're issuing refunds automatically. I'm following up with our engineering team now, but they're going to want to see affected user IDs as well.
mambo no. 5
mambo no. 5OP•2w ago
Here's the ID; you can take a look at yesterday's earlier requests to find the problematic ones. Some of them took 12+ minutes when it's usually a couple of seconds. proposed_emerald_fly
Jason
Jason•2w ago
No, what you sent is the name. The ID looks like random characters; it's in your /run URL.
mambo no. 5
mambo no. 5OP•2w ago
oh this should be the one: u7hn1oucmnkkc5
Jason
Jason•2w ago
Yep, that seems to be it. Now let Dj check it.
Dj
Dj•2w ago
Everything I see seems to be normal behavior for your workload, but I can only see the lifecycle of each worker (incoming request, pod started, job finished, pod stopped). You should be able to email support for help with receiving reimbursement.
mambo no. 5
mambo no. 5OP•2w ago
This is from yesterday when I made this thread. Are you telling me the 5-12 minute delay times are normal? If so, I think we're going to have to reconsider hosting on RunPod.
Jason
Jason•2w ago
Can you check your endpoint logs? You might be able to see what's wrong with those workers.
riverfog7
riverfog7•2w ago
It depends on your model though, and on cold start / fast boot.
mambo no. 5
mambo no. 5OP•2w ago
Mate, look at the more recent requests in the pic. Usual delay time is 4-5s. Never over a minute and nothing close to 12 minutes
riverfog7
riverfog7•2w ago
what's ur model
mambo no. 5
mambo no. 5OP•2w ago
It's just running a ComfyUI workflow for Wan video gen. I'm telling you it's not about the model. I've run the exact same workflow over the past week and never seen anything remotely close to 12 minutes. I'm still handling requests today and the delay time isn't anywhere near even a minute.
riverfog7
riverfog7•2w ago
Maybe the logs would help with debugging?
Jason
Jason•2w ago
That's why I'm telling you to check the logs, if possible.
riverfog7
riverfog7•2w ago
yeah
Jason
Jason•2w ago
Especially for that time window, and for that specific worker.
riverfog7
riverfog7•2w ago
From a dev's perspective, the only info they (I mean the people here) have is: 1. pods are sometimes taking longer to load. You can't debug with just that.
Jason
Jason•2w ago
But as Dj said, you can create a support ticket or email support with a reimbursement request.
riverfog7
riverfog7•2w ago
Yeah, but if you want to debug this together, we need the logs.
mambo no. 5
mambo no. 5OP•2w ago
How do I get the logs for those ones? They've disappeared from the requests tab.
Jason
Jason•2w ago
Is there a logs tab? Not in the requests tab, a separate one.
mambo no. 5
mambo no. 5OP•2w ago
I can't find them anymore; they've been buried under multiple other requests :( Anyway, the bottom line is: will we all be getting reimbursed or not?
Jason
Jason•2w ago
I think the best way to get that answer is to ask in a support request / ticket. I'm just trying to see what the problem is from the logs, if possible; it's fine if you can't find them anymore.
riverfog7
riverfog7•2w ago
My thoughts about the delay times:
1. The 4-5 second delay time you had before was a result of RunPod's fast boot feature, which essentially keeps the model loaded in VRAM.
2. The 5-minute delay time was probably caused by a cold start.

Having the following would help with debugging:
1. The idle timeout in your serverless settings
2. The image you're using
3. The model
4. The interval at which you send requests
5. Hopefully the logs, if possible

Possible causes of the high delay:
1. The idle timeout is too low and workers do a cold boot every time (or you send requests one at a time)
2. If you're using a non-official image, it may not be cached on the host, which causes a high boot time
3. RunPod's network volume has speed issues
4. A CUDA memory leak (the worker could die after processing one request)
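If you can still hit the API, something like the sketch below can at least separate queue/boot delay from execution per job. Treat the delayTime / executionTime field names as assumptions from memory of the /status response, and the job ID here is just a placeholder:
```python
# Rough sketch: compare delay vs execution time for one serverless job.
# Assumes RUNPOD_API_KEY is set; field names may differ, check your own response.
import os
import requests

ENDPOINT_ID = "u7hn1oucmnkkc5"        # the endpoint from this thread
JOB_ID = "REPLACE_WITH_A_JOB_ID"      # hypothetical placeholder

resp = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}",
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
job = resp.json()

# delayTime / executionTime are reported in milliseconds, as far as I know.
delay_s = job.get("delayTime", 0) / 1000
exec_s = job.get("executionTime", 0) / 1000
print(f"status={job.get('status')}  delay={delay_s:.1f}s  execution={exec_s:.1f}s")
```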
Jason
Jason•2w ago
Images won't be re-downloaded as long as your worker stays idle, and if your worker is initializing and that counts as delay time, it means your endpoint is new. So feel free to eliminate that one.
mambo no. 5
mambo no. 5OP•2w ago
Yeah, I have an email ticket open with them, but they haven't been very vocal with their responses.
Jason
Jason•2w ago
What does vocal mean?
mambo no. 5
mambo no. 5OP•2w ago
They didn't provide any meaningful information other than saying they have fixed the outage. @Dj can you confirm whether we're even supposed to pay for delay times or just execution times? There is no information on this at all. If I have a request with a delay of 2 minutes and an execution of 2 minutes, do I pay for 2 or 4 minutes?
Dj
Dj•2w ago
You're not paying for delay time; delay time is stuff like how long it takes the image to download and start, and that's on us. Execution time is how long it takes the model to load and actually do the thing. For that example, 2 minutes.
Jason
Jason•2w ago
Delay time can be charged too; the thing is, you're charged whenever the worker is running.
mambo no. 5
mambo no. 5OP•2w ago
?? Who is right here? Do we need to bring the CEO in?
Jason
Jason•2w ago
What's charged is only the time your worker is running. That can be delay time or execution time.
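To put rough numbers on that (the per-second rate below is a made-up placeholder, not an actual RunPod price), the math is just active worker time multiplied by the rate:
```python
# Sketch of billing under that description: you pay for every second a worker
# is up, whether the dashboard labels that time delay or execution.
PRICE_PER_SECOND = 0.0004  # hypothetical $/s for one GPU worker, not a real price


def estimate_cost(delay_seconds: float, execution_seconds: float) -> float:
    """Estimated cost of one request for the time a worker was running."""
    active_seconds = delay_seconds + execution_seconds
    return active_seconds * PRICE_PER_SECOND


# The example from above: 2 minutes of delay + 2 minutes of execution.
print(f"${estimate_cost(120, 120):.4f}")  # billed for 4 minutes of worker time
```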
mambo no. 5
mambo no. 5OP•2w ago
@Dj ^ is that true or not? If it's true, then how do I even quantify how much I'm paying?
Jason
Jason•2w ago
Wait, so now model loading counts as execution time?
Dj
Dj•2w ago
Candidly, I'm not the best source of information on this, but my understanding is that it depends on how your worker is set up. Loading a model should be delay time, but I'm pretty sure you can do it "wrong" and load your model on request. Technically nothing stops you from shooting yourself in the foot; any code inside the handler function, which is literally responsible for responding to your request, is run time.
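A rough sketch of the foot-gun I mean; load_model here is a hypothetical stand-in for whatever heavy setup your worker actually does, not your real workflow:
```python
# Anti-pattern sketch: anything inside the handler runs on every request,
# so a model load here gets repeated and billed as execution time.
import runpod


def load_model():
    # Hypothetical stand-in: pretend this pulls weights and moves them to the GPU.
    return object()


def handler(event):
    model = load_model()      # re-runs on EVERY request -> counted as execution time
    _ = model                 # placeholder: your actual inference would use `model` here
    return {"output": "done"}


runpod.serverless.start({"handler": handler})
```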
Jason
Jason•2w ago
Hmm did you ask for reimbursement?
Dj
Dj•2w ago
If you want me to take a look at your template and help you understand your delay time, I can skim it over now, but it's 2 AM on a weekend, so providing full support is slightly out of my scope at this time. I'm happy to answer questions, etc., but fixing a template for you is something I'd rather do on Monday 😛
Jason
Jason•2w ago
Yeah, usually it's delay time; and per the docs, they recommend putting model loading outside the handler.
mambo no. 5
mambo no. 5OP•2w ago
Yeah, I get where you're at right now haha, it's 3 AM for me. Yep, we'll see what they respond with on Monday.
Jason
Jason•2w ago
Ohh okay
Dj
Dj•2w ago
Support was directed to provide reimbursement for the length of the outage, IIRC 27 minutes, and it was confirmed that Pods were unaffected; only serverless users (like you!) were hit.
Jason
Jason•2w ago
Basically, everything that happens from your Dockerfile's ENTRYPOINT or CMD up until you call runpod.serverless.start() is delay time, and it's charged because the worker is already running at that point.
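So the layout the docs recommend looks roughly like the sketch below; again, load_model is just a stand-in for your own setup, not anything RunPod ships:
```python
# Recommended-layout sketch: do the heavy setup at import time, before
# runpod.serverless.start(), so it happens once per worker boot (delay time)
# instead of being repeated inside the handler on every request.
import runpod


def load_model():
    # Hypothetical stand-in: load weights once when the worker boots.
    return object()


MODEL = load_model()  # runs once per worker boot, before start() -> delay time


def handler(event):
    _ = MODEL                 # placeholder: run inference with the preloaded model
    return {"output": "done"}


runpod.serverless.start({"handler": handler})
```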
