Unacceptably high failed jobs suddenly
Suddenly almost 20% of my serverless jobs failed. I have never had this issue until yesterday. It is completely UNACCEPTABLE that I am being charged for this immense fuck up and that my customers are being impacted. This needs to be resolved IMMEDIATELY and I demand a refund for this!
@flash-singh @Zeen @JM This is completely UNACCEPTABLE and needs to be RESOLVED IMMEDIATELY and I demand a refund.
Must be some infrastructure issue because I don't even have any error logs for any of my failed jobs.
Also extremely suspicious that the increase in failed jobs coincides with fewer workers being throttled.
Same here. Not 20%, but still way more than before.
Jobs should only fail if there is an error in the serverless handler code, which never happened in my case.
I also don't know how this issue is supposed to be debugged when there are no error logs for any of the failed jobs.
Looks like most of them failed due to "executionTimeout exceeded". My jobs shouldn't take more than 5 minutes to execute, there is something wrong with the workers.
My jobs take 3 minutes max to execute so something is seriously wrong here, and I have been running this endpoint in various different regions for several months and never had this issue until now.
I would also expect to see these executionTimeout errors in the logs for my endpoint, but they aren't in the logs.
Also I don't know why my IP was rate limited on Saturday, this has never happened before and I wasn't even sending that many requests.
This is making serverless more and more unusable by the day.
Each of those 2 terminal windows has a different public IP as well, so there is really no reason why I should have been rate limited.
is that 429 error in your handler code or somewhere else?
Checking the status of my jobs mostly.
Also trying to create a new job.
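For anyone hitting 429s while polling job status, here is a minimal sketch of retrying with exponential backoff and jitter. It does not call any real API: `request` is a hypothetical zero-argument callable that returns a `(status_code, body)` tuple, so you would wrap your own HTTP call in it. The delays and attempt count are assumptions, not documented limits.

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call request() until it returns a non-429 status or attempts run out.

    `request` is a hypothetical callable returning (status_code, body).
    `sleep` is injectable so the function can be tested without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        status_code, body = request()
        if status_code != 429:
            return status_code, body
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter: wait a random time
            # between 0 and min(max_delay, base_delay * 2**attempt).
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))

    # Still rate limited after the last attempt; hand the 429 back.
    return status_code, body
```

The jitter is there so that several pollers (for example two terminal windows) do not retry in lockstep.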
why not use webhooks?
Webhooks can be unreliable
does this endpoint get a lot of volume?
It fluctuates, weekends are busier than during the week and evenings are busier than the day time (my day time, because most of the customers are in the US).
But there were a lot more requests than usual over the weekend.
Not anything massive though, it's on average < 1000 requests per day.
And we haven't even done 300 jobs today yet, but it already had 30 failed jobs which is not normal.
And the graph showing the spike in failed jobs is above.
The table is from the metrics API, and the graph above is from the health API.
Also can't use a webhook because it's all running on an internal VPC on AWS which is not publicly accessible.
And if I use a webhook I can't check whether my jobs are stuck on IN_PROGRESS for too long and automatically cancel them.
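A rough sketch of that kind of watchdog, in case it helps others who poll instead of using webhooks. `get_status` and `cancel_job` are hypothetical callables you would back with your own API calls; only the IN_PROGRESS status name comes from this thread, and the 300 second limit is just the 5 minute figure mentioned above.

```python
import time


def watchdog_pass(job_ids, get_status, cancel_job, first_seen,
                  now=None, max_in_progress_seconds=300):
    """Run one polling pass and cancel jobs stuck IN_PROGRESS too long.

    first_seen maps job ID -> time the job was first observed IN_PROGRESS;
    it is updated in place and should be reused between passes.
    get_status(job_id) and cancel_job(job_id) are hypothetical callables.
    Returns the list of job IDs cancelled in this pass.
    """
    now = time.time() if now is None else now
    cancelled = []

    for job_id in job_ids:
        if get_status(job_id) != "IN_PROGRESS":
            # Job is queued or finished; stop tracking it.
            first_seen.pop(job_id, None)
            continue

        started = first_seen.setdefault(job_id, now)
        if now - started > max_in_progress_seconds:
            cancel_job(job_id)
            first_seen.pop(job_id, None)
            cancelled.append(job_id)

    return cancelled
```

Note that the timer starts when the poller first sees the job IN_PROGRESS, not when the worker actually picked it up, so the real running time can be slightly longer than what is measured here.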
may have to change failed to something else when it times out, can't tell if it's that or a fail at job level
There is already TIMED_OUT, can't it just use that?
yes that's the plan, in my fix i made it failed, should change it back
Oh yeah, I think it's better to change it back 👍
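Once timeouts are reported separately, it becomes easy to tell them apart from handler failures on the client side. A small sketch, assuming you have collected the final status string of each job yourself: FAILED and TIMED_OUT are the names used in this thread, while COMPLETED in the usage example is an assumed name for a successful job.

```python
from collections import Counter


def failure_breakdown(statuses):
    """Summarise a list of final job status strings.

    Counts FAILED and TIMED_OUT separately and reports the combined
    share of unsuccessful jobs as failure_rate (0.0 for an empty list).
    """
    counts = Counter(statuses)
    total = len(statuses)
    failed = counts["FAILED"]
    timed_out = counts["TIMED_OUT"]
    return {
        "total": total,
        "failed": failed,
        "timed_out": timed_out,
        "failure_rate": (failed + timed_out) / total if total else 0.0,
    }


# Usage with made-up numbers shaped like the day described above
# (30 unsuccessful jobs out of 300); the split is illustrative only.
example = ["COMPLETED"] * 270 + ["TIMED_OUT"] * 25 + ["FAILED"] * 5
summary = failure_breakdown(example)
```

If most of the unsuccessful jobs land in `timed_out` rather than `failed`, that points at the workers or the timeout setting rather than the handler code.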
Hey @ashleyk
Hit me up with your endpoint ID; will provide you credits 👍
Gotta give @ashleyk a job at this point, he helps everyone
Hi @JM, endpoint id is sdj01thu7r2mxx. There were issues on 24th, 25th and 26th Jan, when my billing escalated more than usual, and the execution time spiked slightly on 24th and 25th but massively on 26th when there were so many failed jobs. Things seem to have stabilised from 26th Feb onwards.
By the way, there is a large gap because I had to switch my endpoint to a different region because all my workers were throttled.
Uh
That's no good, thanks for explaining
Btw, I was literally buried in work, I found more hardware for everyone
Apologies for delay in responding
@ashleyk Credited the account! Thanks for helping everyone
That's awesome news that you found new hardware @JM, it will make us all very happy, thank you! 🙏 No worries about the delay in responding and thanks very much for the credits, it's cool that you can do it directly now and don't need to issue credit codes anymore. Helping people is only a pleasure. 🫶
Yep, engineering has been helping me and Justin very hard lately; new admin features like this one always help so much!
Take care sir, let me know if you need anything. Need to go to bed now
Awesome news, thanks very much, you take care too, and have a good night's rest 🙏