RunPod•5mo ago
top

How to know when a request has failed

Hello everyone, I am using a webhook to be notified of job completion. I am wondering whether this webhook is also called when a request fails, or whether there is another way to know that a request has failed. What I mean is that some requests will sit in the queue when there are many requests, and after a time limit those requests will terminate automatically. In that case, how do I know that those requests have failed? Is the webhook called with a "FAILED" status or not? Thanks in advance.
17 Replies
ashleyk•5mo ago
Yes, a webhook is fired for failed jobs. Requests in the queue don't terminate automatically based on a time limit. You can set executionTimeout for your jobs, but that has nothing to do with how long a request sits in the queue: the job is marked as failed only if its execution time exceeds the specified executionTimeout. The maximum number of jobs in the queue is max workers * 100. I don't know what happens when you reach that threshold though; maybe @flash-singh can confirm.
top•5mo ago
What if I set "ttl (time-to-live)"?
flash-singh•5mo ago
Reaching that threshold throws a 4xx error on /run or /runsync. ttl does not impact max jobs in queue. If jobs fail due to ttl, you do not get a failed webhook; at that point I would increase ttl so that never happens. The max timeframe is 1 week.
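A minimal sketch of how the two timeouts discussed above might be set on a request. It assumes the `/run` body accepts a top-level `webhook` URL and a `policy` object with `executionTimeout` and `ttl` in milliseconds; those field names and units are assumptions to verify against the current RunPod docs, and the helper names and checks are hypothetical.

```python
WEEK_MS = 7 * 24 * 60 * 60 * 1000  # "max timeframe is 1 week", in milliseconds


def build_run_request(job_input, webhook_url, execution_timeout_ms, ttl_ms):
    """Build a JSON-serializable body for POST /run (field names are assumed)."""
    if ttl_ms > WEEK_MS:
        raise ValueError("ttl is above the 1 week maximum mentioned in the thread")
    if ttl_ms <= execution_timeout_ms:
        # A ttl shorter than the time a job may run risks the job being purged
        # mid-flight, which (per the thread) produces no FAILED webhook.
        raise ValueError("ttl should comfortably exceed executionTimeout")
    return {
        "input": job_input,
        "webhook": webhook_url,
        "policy": {"executionTimeout": execution_timeout_ms, "ttl": ttl_ms},
    }


def max_queued_jobs(max_workers):
    """Queue cap quoted in the thread: max workers * 100."""
    return max_workers * 100
```

Since the ttl clock also covers time spent waiting in the queue, it probably makes sense to size it for the expected queue wait plus the execution time, not for the execution time alone.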
ashleyk•5mo ago
Why would they fail due to ttl? I thought ttl was the time to keep the output.
flash-singh•5mo ago
We use Redis. Every job goes into Redis with a ttl, and after that it's garbage collected. Once the job is done, the ttl of the output is changed to 30m. We do not have a way of detecting when Redis purges a job based on ttl.
ashleyk•5mo ago
Oh so if you set your ttl too short and the job is still in progress?
flash-singh•5mo ago
Yes, Redis will delete it. We will detect that the job is in progress but that there is no trace of it in our db, and we will stop the worker.
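To illustrate why no webhook fires in this case, here is a toy in-memory stand-in for a ttl-based store. It is not RunPod's implementation and not a Redis client; every name in it is hypothetical. Once an entry expires, a lookup simply returns nothing, so there is no record left from which a notification could be sent.

```python
class TtlStore:
    """Toy key-value store whose entries vanish after their ttl."""

    def __init__(self):
        self._items = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_s, now):
        self._items[key] = (value, now + ttl_s)

    def get(self, key, now):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            # Expired entries are dropped silently; nobody is notified.
            del self._items[key]
            return None
        return value


store = TtlStore()
store.set("job-1", {"status": "IN_PROGRESS"}, ttl_s=60, now=0)
still_there = store.get("job-1", now=30)   # record is still present
after_expiry = store.get("job-1", now=61)  # record has been purged
```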
ashleyk•5mo ago
Ah gotcha, thanks
flash-singh•5mo ago
ttl comes more into play when there are no workers running, or when jobs have piled up so much that they will never get completed in time, which is why we also have a max jobs allowed in queue based on max workers.
ashleyk•5mo ago
Pretty complex stuff 😅
flash-singh•5mo ago
Covering all these edge cases gets complex. Even making sure jobs don't disappear requires reliable queuing; Redis helps, but managing it has been challenging.
top•5mo ago
So is the failed webhook called only when a job fails while in progress, and not when it is automatically terminated due to ttl?
flash-singh•5mo ago
Yes. ttl shouldn't be an issue, and you can increase it if you think the default is too low.
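On the receiving side, a webhook handler along these lines could record both outcomes. This is only a sketch: it assumes the webhook body is JSON with `id`, `status`, `output` and `error` fields and that a successful job reports `"COMPLETED"` (only `"FAILED"` is confirmed in this thread). The `db` argument is a plain dict standing in for whatever database is used.

```python
def handle_webhook(payload, db):
    """Update the local job record from a webhook body (field names assumed).

    Returns True if a record was updated, False otherwise.
    """
    job_id = payload.get("id")
    status = payload.get("status")
    if job_id is None or job_id not in db:
        return False  # unknown job: ignore it instead of creating a record
    if status == "COMPLETED":
        db[job_id]["status"] = "Completed"
        db[job_id]["output"] = payload.get("output")
    elif status == "FAILED":
        db[job_id]["status"] = "Failed"
        db[job_id]["error"] = payload.get("error")
    else:
        return False  # status this sketch does not handle
    return True
```

Because a job purged by ttl never reaches this handler, it cannot be the only mechanism for detecting failures.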
top•5mo ago
Currently, I save the status of each request in a Firebase DB and change the status to "Generating" after the "/run" request succeeds, and I use a webhook to be notified of job completion. I need to set the status to "Failed" if the job has not completed within 4 hrs of being requested. Is there any way to implement this? Thanks in advance.
flash-singh•5mo ago
Run a lambda or some cron job that iterates over non-completed jobs and sets them as failed in your db, and also cancel those jobs on the RunPod side if they're still in the queue.
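The periodic sweep suggested above could look roughly like this. It is a sketch under assumptions: `db` is a plain dict standing in for the Firebase records, each record carries a hypothetical `requested_at` timestamp, and `cancel_job` is any callable you supply that cancels a job on the RunPod side.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=4)  # the 4 hr limit from the question


def is_stale(requested_at: datetime, now: datetime) -> bool:
    """True once a job has been pending for MAX_AGE or longer."""
    return now - requested_at >= MAX_AGE


def sweep_stale_jobs(db, now, cancel_job):
    """Mark jobs still 'Generating' after MAX_AGE as 'Failed'; return their ids."""
    expired = []
    for job_id, record in db.items():
        if record["status"] != "Generating":
            continue
        if not is_stale(record["requested_at"], now):
            continue
        record["status"] = "Failed"
        try:
            cancel_job(job_id)  # best effort: the job may already be gone
        except Exception:
            pass
        expired.append(job_id)
    return expired
```

Run from a scheduler every few minutes, this catches jobs that were purged by ttl as well as jobs whose webhook never arrived.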
top•5mo ago
Thanks. What happens if I try to cancel a job that is not in the queue?
ashleyk•5mo ago
Looks like you get an HTTP 401 error.
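Given that behaviour, a cancel helper may want to treat a 4xx response as "nothing to cancel" and not as a crash. The sketch below uses only the standard library. The URL pattern `https://api.runpod.ai/v2/{endpoint_id}/cancel/{job_id}` and the bearer-token header are assumptions to check against the current RunPod docs, and the injectable `urlopen` parameter is a hypothetical convenience for testing.

```python
import urllib.error
import urllib.request


def cancel_job(endpoint_id, job_id, api_key, urlopen=urllib.request.urlopen):
    """Try to cancel a job; return True on a 2xx response, False on a 4xx."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/cancel/{job_id}"
    request = urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urlopen(request) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if 400 <= err.code < 500:
            # e.g. the 401 reported above for a job that is not in the queue
            return False
        raise  # 5xx and other errors are left for the caller to handle
```

A 401 can also indicate a bad API key, so it is worth logging the `False` case and not ignoring it silently.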