RunPod•13mo ago

How to know when a request has failed

Hello everyone. I am using a webhook to be notified when a job completes. Is this webhook also called when a request fails, or is there another way to find out that a request failed? For example, when there are many requests, some of them will wait in the queue, and after a time limit those requests terminate automatically. How can I tell that those requests failed? Is the webhook called with a "FAILED" status in that case? Thanks in advance.

17 Replies

ashleyk•13mo ago

Yes, a webhook is fired for failed jobs. Requests in the queue don't terminate automatically based on a time limit. You can set executionTimeout for your jobs, but that has nothing to do with how long a request spends in the queue: the job is failed if its execution time exceeds the specified executionTimeout. The maximum number of jobs in the queue is max workers * 100. I don't know what happens when you reach that threshold, though; maybe @flash-singh can confirm.

topOP•13mo ago

What if I set "ttl (time-to-live)"?

flash-singh•13mo ago

When the queue limit is reached, /run or /runsync throws a 4xx error. ttl does not impact the max jobs in queue. If jobs fail due to ttl, you do not get a failed webhook; at that point I would increase ttl so that never happens. The max timeframe is 1 week.
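
A minimal sketch of a /run request body that sets both limits mentioned so far. It assumes the settings live in a `policy` object as `executionTimeout` and `ttl`, both in milliseconds, and that the webhook URL is a top-level `webhook` field; verify the field names against the current RunPod docs. `build_run_payload` and the example URL are made up for illustration.

```python
import json

# "max timeframe is 1 week", per the reply above
ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000


def build_run_payload(job_input, webhook_url, execution_timeout_ms, ttl_ms):
    """Build a JSON-serializable body for a /run request (hypothetical helper)."""
    if ttl_ms > ONE_WEEK_MS:
        raise ValueError("ttl cannot exceed one week")
    if ttl_ms < execution_timeout_ms:
        # A ttl shorter than the allowed run time risks the job being
        # purged while it is still in progress.
        raise ValueError("ttl should be at least as long as executionTimeout")
    return {
        "input": job_input,
        "webhook": webhook_url,  # assumed field name
        "policy": {              # assumed field names, values in milliseconds
            "executionTimeout": execution_timeout_ms,
            "ttl": ttl_ms,
        },
    }


payload = build_run_payload(
    {"prompt": "hello"},
    "https://example.com/runpod-webhook",
    execution_timeout_ms=10 * 60 * 1000,  # 10 minutes of execution
    ttl_ms=6 * 60 * 60 * 1000,            # 6 hours covering queue + run time
)
print(json.dumps(payload, indent=2))
```

The ttl check against executionTimeout is a local sanity rule, not something the API is known to enforce; the idea is that ttl has to cover queue time plus execution time.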

ashleyk•13mo ago

Why would they fail due to ttl? I thought ttl was the time to keep the output.

flash-singh•13mo ago

We use Redis. Every job goes into Redis with a ttl, and after that it's garbage collected. Once a job is done, the ttl of its output is changed to 30m. We do not have a way of detecting when Redis purges a job based on ttl.

ashleyk•13mo ago

Oh so if you set your ttl too short and the job is still in progress?

flash-singh•13mo ago

Yes, Redis will delete it. We will detect that a job is in progress but there is no trace of it in our db, and we will stop the worker.

ashleyk•13mo ago

Ah gotcha, thanks

flash-singh•13mo ago

ttl comes more into play when there are no workers running or jobs have piled up so much that they will never get completed in time, hence why we also have a max jobs allowed in queue based on max workers

ashleyk•13mo ago

Pretty complex stuff 😅

flash-singh•13mo ago

Covering all these edge cases gets complex. Even making sure jobs don't disappear requires reliable queuing; Redis helps, but managing it has been challenging.

topOP•13mo ago

So is the failed webhook called only when a job fails while in progress, and not when it is automatically terminated due to ttl?

flash-singh•13mo ago

yes, ttl shouldn't be an issue and you can increase it if you think the default is too low
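
A rough sketch of what the webhook side could look like, given the answer above. It assumes the webhook body is JSON with an `id` and a `status` field and that terminal statuses include "COMPLETED" and "FAILED" (only "FAILED" is named in this thread). The in-memory `db` dict and `apply_webhook` are stand-ins for your own storage and handler.

```python
# Assumed mapping from webhook status to the status stored locally.
TERMINAL_STATUS_MAP = {
    "COMPLETED": "Completed",
    "FAILED": "Failed",
}


def apply_webhook(db, payload):
    """Update `db` (job id -> status) from a parsed webhook payload.

    Returns the new local status, or None if the payload was ignored.
    """
    job_id = payload.get("id")
    new_status = TERMINAL_STATUS_MAP.get(payload.get("status"))
    if job_id not in db or new_status is None:
        return None  # unknown job or non-terminal status
    db[job_id] = new_status
    return new_status


db = {"job-1": "Generating", "job-2": "Generating"}
apply_webhook(db, {"id": "job-1", "status": "FAILED"})
print(db)
```

A job purged by ttl never produces a call like this, so its entry stays at "Generating" until something else changes it.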

topOP•13mo ago

Currently, I save the status of each request in a Firebase DB and change the status to "Generating" after the "/run" request succeeds, and I use the webhook to be notified of job completion. I need to set the status to "Failed" if a job has not completed within 4 hrs of being requested. Is there any way to implement this? Thanks in advance.

flash-singh•13mo ago

Run a Lambda or some cron job that iterates over the non-completed jobs and sets them as failed in your db, and also cancel those jobs on the RunPod side if they're still in the queue.
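
The suggestion above could be sketched roughly like this. The job store is a plain dict standing in for the Firebase DB, and `cancel_job` is a callback you would supply yourself; both, along with `sweep_stale_jobs`, are hypothetical names. The 4-hour limit and the "Generating" / "Failed" statuses come from the question.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=4)


def sweep_stale_jobs(jobs, now, cancel_job, max_age=MAX_AGE):
    """Mark jobs still 'Generating' after `max_age` as 'Failed'.

    jobs: dict of job_id -> {"status": str, "requested_at": aware datetime}
    cancel_job: callable taking a job_id; its errors are ignored because
        the job may already be finished or gone on the remote side.
    Returns the list of job ids that were marked as failed.
    """
    failed = []
    for job_id, record in jobs.items():
        if record["status"] != "Generating":
            continue
        if now - record["requested_at"] < max_age:
            continue
        try:
            cancel_job(job_id)
        except Exception:
            pass  # cancelling is best effort
        record["status"] = "Failed"
        failed.append(job_id)
    return failed


now = datetime.now(timezone.utc)
jobs = {
    "old": {"status": "Generating", "requested_at": now - timedelta(hours=5)},
    "new": {"status": "Generating", "requested_at": now - timedelta(minutes=10)},
}
print(sweep_stale_jobs(jobs, now, cancel_job=lambda job_id: None))
```

Run on a schedule (cron, Lambda, or similar), this covers the jobs that never trigger a webhook.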

topOP•13mo ago

Thanks. What happens if I try to cancel a job that is not in the queue?

ashleyk•13mo ago

Looks like you get an HTTP 401 error.
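
Since a cancel call can come back with an HTTP error when the job is no longer in the queue, a cleanup task probably wants to treat that as non-fatal. A sketch using only the standard library; the base URL, the `/cancel/{job_id}` path, and the Bearer header are assumptions to check against the RunPod docs, and both function names are made up.

```python
import urllib.error
import urllib.request

API_BASE = "https://api.runpod.ai/v2"  # assumed base URL


def build_cancel_request(endpoint_id, job_id, api_key):
    """Build (but do not send) a POST request that cancels one job."""
    url = f"{API_BASE}/{endpoint_id}/cancel/{job_id}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def cancel_job(endpoint_id, job_id, api_key):
    """Try to cancel a job; return True on a 2xx response, False on an HTTP error."""
    request = build_cancel_request(endpoint_id, job_id, api_key)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError:
        # e.g. the 401 mentioned above when the job is not in the queue
        return False
```

Network failures other than HTTP errors (timeouts, DNS problems) still raise here, so the caller can decide whether to retry.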
