RunPod•14mo ago

Unacceptably high failed jobs suddenly

Suddenly almost 20% of my serverless jobs failed. I have never had this issue until yesterday. It is completely UNACCEPTABLE that I am being charged for this immense fuck up and that my customers are being impacted. This needs to be resolved IMMEDIATELY and I demand a refund for this!

26 Replies

ashleykOP•14mo ago

@flash-singh @Zeen @JM This is completely UNACCEPTABLE and needs to be RESOLVED IMMEDIATELY and I demand a refund. Must be some infrastructure issue because I don't even have any error logs for any of my failed jobs. Also extremely suspicious that the increase in failed jobs coincides with fewer workers being throttled.

ashleykOP•14mo ago

Baran•14mo ago

Same here. Not 20%, but still way more than before.

ashleykOP•14mo ago

Jobs should only fail if there is an error in the serverless handler code, which never happened in my case. I also don't know how this issue is supposed to be debugged when there are no error logs for any of the failed jobs. Looks like most of them failed due to executionTimeout exceeded. My jobs shouldn't take more than 5 minutes to execute, so there is something wrong with the workers. My jobs take 3 minutes max to execute, so something is seriously wrong here, and I have been running this endpoint in various different regions for several months and never had this issue until now. I would also expect to see these executionTimeout errors in the logs for my endpoint, but they aren't in the logs.

ashleykOP•14mo ago

Also I don't know why my IP was rate limited on Saturday, this has never happened before and I wasn't even sending that many requests.

ashleykOP•14mo ago

This is making serverless more and more unusable by the day. Each of those 2 terminal windows has a different public IP as well, so there is really no reason why I should have been rate limited.

flash-singh•14mo ago

is that 429 error in your handler code or somewhere else?

ashleykOP•14mo ago

Checking the status of my jobs mostly. Also trying to create a new job.
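For anyone hitting the same 429s while polling job status, one common mitigation is to back off and retry instead of polling at a fixed interval. The following is only a minimal sketch: `fetch` is a hypothetical callable (not part of any RunPod SDK) that performs one status request and returns the HTTP status code together with the parsed JSON body, and the delay values are assumptions rather than documented rate limits.

```python
import random
import time
from typing import Callable, Tuple


def poll_with_backoff(
    fetch: Callable[[], Tuple[int, dict]],
    base_delay: float = 1.0,    # assumed starting delay in seconds
    max_delay: float = 30.0,    # assumed upper bound on the delay
    max_attempts: int = 8,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until it returns a non-429 response.

    fetch is a hypothetical callable returning (http_status, json_body).
    On HTTP 429 the delay doubles each time, with random jitter added.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status != 429:
            return body
        if attempt == max_attempts:
            break
        # Rate limited: wait with jitter, then double the delay up to max_delay.
        sleep(delay + random.uniform(0, delay / 2))
        delay = min(delay * 2, max_delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

The `sleep` parameter is injectable so the retry logic can be exercised without actually waiting.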

flash-singh•14mo ago

why not use webhooks?

ashleykOP•14mo ago

Webhooks can be unreliable

flash-singh•14mo ago

does this endpoint get a lot of volume?

ashleykOP•14mo ago

It fluctuates, weekends are busier than during the week and evenings are busier than the day time (my day time, because most of the customers are in the US). But there were a lot more requests than usual over the weekend. Not anything massive though, it's on average < 1000 requests per day.

ashleykOP•14mo ago

And we haven't even done 300 jobs today yet, but it has already had 30 failed jobs, which is not normal.

ashleykOP•14mo ago

C = Completed
F = Failed
R = Retried


And the graph above shows the spike in failed jobs. The table is from the metrics API, and the graph is from the health API. Also, I can't use a webhook because it's all running in an internal VPC on AWS which is not publicly accessible. And if I use a webhook, I can't check whether my jobs are stuck on IN_PROGRESS for too long and automatically cancel them.
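The "cancel jobs stuck on IN_PROGRESS" check described here could look roughly like the sketch below. It is an illustration under assumptions, not the actual client used in this thread: `get_status` and `cancel` are hypothetical callables that would wrap the endpoint's status and cancel requests, and the 300 second limit simply mirrors the 5 minute expectation mentioned earlier.

```python
import time
from typing import Callable, Dict, List


def cancel_stuck_jobs(
    started_at: Dict[str, float],
    get_status: Callable[[str], str],
    cancel: Callable[[str], None],
    max_in_progress_seconds: float = 300.0,  # assumption: 5 minute ceiling
    now: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Cancel tracked jobs that have been IN_PROGRESS for too long.

    started_at maps a job id to the time the client first saw it running.
    get_status and cancel are hypothetical wrappers around API requests.
    Returns the ids that were cancelled and stops tracking them.
    """
    cancelled = []
    current = now()
    # Iterate over a copy because entries are deleted during the loop.
    for job_id, first_seen in list(started_at.items()):
        if get_status(job_id) != "IN_PROGRESS":
            continue
        if current - first_seen > max_in_progress_seconds:
            cancel(job_id)
            cancelled.append(job_id)
            del started_at[job_id]
    return cancelled
```

Running this on a timer alongside the status polling keeps the whole workflow outbound-only, which is what a private VPC without a public webhook endpoint needs.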

flash-singh•14mo ago

may have to change failed to something else when it times out, can't tell if it's that or a fail at job level

ashleykOP•14mo ago

There is already TIMED_OUT, can't it just use that?

flash-singh•14mo ago

yes thats the plan, in my fix i made it failed, should change it back

ashleykOP•14mo ago

Oh yeah, I think it's better to change it back 👍
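Keeping TIMED_OUT separate from FAILED matters on the client side too, because it lets a tally tell worker timeouts apart from handler errors. A small sketch of such a tally follows; the function name and the sample numbers in the usage note are made up for illustration and are not taken from this endpoint's metrics.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_jobs(statuses: Iterable[str]) -> Dict[str, float]:
    """Return the total job count plus separate FAILED and TIMED_OUT rates."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return {"total": 0, "failed_rate": 0.0, "timed_out_rate": 0.0}
    return {
        "total": total,
        # Counter returns 0 for statuses that never occurred.
        "failed_rate": counts["FAILED"] / total,
        "timed_out_rate": counts["TIMED_OUT"] / total,
    }
```

For example, feeding it a hypothetical day of 270 COMPLETED, 10 FAILED and 20 TIMED_OUT statuses would report the timeouts as twice as frequent as the handler failures, instead of one lumped failure count.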

JM•14mo ago

Hey @ashleyk Hit me up with your endpoint ID; will provide you credits 👍

octopus•14mo ago

Gotta give @ashleyk a job at this point, he helps everyone

ashleykOP•14mo ago

Hi @JM, endpoint ID is sdj01thu7r2mxx. There were issues on 24th, 25th and 26th Jan, where my billing escalated more than usual, and the execution time spiked slightly on 24th and 25th but massively on 26th when there were so many failed jobs. Things seem to have stabilised from 26th Feb onwards.

ashleykOP•14mo ago

By the way, there is a large gap because I had to switch my endpoint to a different region because all my workers were throttled.

JM•14mo ago

Uh, that's no good, thanks for explaining. Btw, I was literally buried in work; I found more hardware for everyone. Apologies for the delay in responding @ashleyk. Credited the account! Thanks for helping everyone.

ashleykOP•14mo ago

That's awesome news that you found new hardware @JM, it will make us all very happy, thank you! 🙏 No worries about the delay in responding and thanks very much for the credits, it's cool that you can do it directly now and don't need to issue credit codes anymore. Helping people is only a pleasure. 🫶

JM•14mo ago

Yep, engineering has been working very hard to help me and Justin lately; new admin features like this one always help so much! Take care sir, let me know if you need anything. Need to go to bed now.

ashleykOP•14mo ago

Awesome news, thanks very much, you take care too, and have a good night's rest 🙏
