How to know when a request has failed
Hello, everyone
I am using a webhook to be notified when a job completes.
I am wondering whether this webhook is also called when a request fails.
Or is there another way to know whether a request has failed?
What I mean is, some requests will sit in the queue when there are many requests.
After the time limit, those requests will terminate automatically.
In that case, how can I know that those requests have failed?
Is the webhook called with a "FAILED" status or not?
Thanks in advance.
17 Replies
Yes, a webhook is fired for failed jobs.
Requests in the queue don't terminate automatically based on a time limit.
You can set executionTimeout for your jobs, but that has nothing to do with the amount of time a request spends in the queue. The job is marked as failed if its execution time is higher than the specified executionTimeout.
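Since the webhook fires for failed jobs as well as completed ones, a receiver can branch on the reported status. The following is only a rough sketch: the payload field names `id` and `status`, the `"COMPLETED"` string, and the helper name `db_status_from_webhook` are assumptions for illustration (only `"FAILED"` is mentioned in this thread), so check them against a real webhook payload.

```python
from typing import Optional, Tuple

# Assumed mapping from the webhook's status string to the status stored
# in your own database. "FAILED" comes from this thread; "COMPLETED" and
# the database labels are assumptions.
STATUS_MAP = {
    "COMPLETED": "Completed",
    "FAILED": "Failed",
}


def db_status_from_webhook(payload: dict) -> Optional[Tuple[str, str]]:
    """Return (job_id, new_db_status), or None if nothing should be updated."""
    job_id = payload.get("id")  # assumed field name
    status = str(payload.get("status", "")).upper()  # assumed field name
    if not job_id or status not in STATUS_MAP:
        # Unknown or non-terminal status: leave the stored status alone.
        return None
    return job_id, STATUS_MAP[status]
```

The function is kept free of any web framework or database calls, so it can be called from whatever HTTP handler receives the webhook.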
Max jobs in queue is max workers * 100. I don't know what happens when you reach that threshold though, maybe @flash-singh can confirm.
What if I set "ttl (time-to-live)"?
reaching the max jobs in queue threshold throws a 4xx error on /run or /runsync
ttl does not impact max jobs in queue
if jobs fail due to ttl, you do not get a failed webhook. at that point I would increase ttl so that never happens, max timeframe is 1 week
Why would they fail due to ttl? I thought ttl was the time to keep the output.
we use redis, every job goes into redis with a ttl, after that it's garbage collected. once the job is done, the ttl of the output is changed to 30m
we do not have a way of detecting when redis purges a job based on ttl
Oh so if you set your ttl too short and the job is still in progress?
yes redis will delete it, we will detect that job is in progress but there is no trace of it in our db, we will stop the worker
Ah gotcha, thanks
ttl comes more into play when there are no workers running or jobs have piled up so much that they will never get completed in time, hence why we also have a max jobs allowed in queue based on max workers
Pretty complex stuff 😅
covering all these edge cases gets complex, even trying to handle it so jobs don't disappear requires reliable queuing, redis helps but managing it has been challenging
so is the failed webhook called only when a job fails while in progress?
and not when it is automatically terminated due to ttl?
yes, ttl shouldn't be an issue and you can increase it if you think default is too low
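To make the two settings discussed above concrete, here is a minimal sketch of building a request body that carries both. The nesting under a `policy` key, the `input` and `webhook` keys, the millisecond units, and the helper name `build_run_payload` are all assumptions; only the names `executionTimeout` and `ttl` and the one-week maximum come from this thread, so verify the exact shape against the API documentation before relying on it.

```python
ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000  # the thread states the max ttl is 1 week


def build_run_payload(job_input: dict, webhook_url: str,
                      execution_timeout_ms: int, ttl_ms: int) -> dict:
    """Build a /run request body with executionTimeout and ttl (assumed layout)."""
    if ttl_ms > ONE_WEEK_MS:
        raise ValueError("ttl is above the one-week maximum")
    if ttl_ms <= execution_timeout_ms:
        # ttl also has to cover time spent waiting in the queue, so a ttl
        # shorter than the execution timeout risks the job being purged
        # while it is still in progress.
        raise ValueError("ttl should be longer than executionTimeout")
    return {
        "input": job_input,
        "webhook": webhook_url,
        "policy": {  # assumed key; check the API docs
            "executionTimeout": execution_timeout_ms,
            "ttl": ttl_ms,
        },
    }
```

The second check reflects the point made earlier in the thread: a job that is purged by ttl never triggers a failed webhook, so it is safer to err on the side of a long ttl.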
Currently, I save the status of each request in a Firebase DB and change the status to "Generating" after the "/run" request succeeds.
I use a webhook to be notified when the job completes.
I need to set the status to "Failed" if the job has not completed within 4 hrs of being requested.
Is there any way to implement this?
Thanks in advance.
run lambda or some cron job that iterates non completed jobs and sets them as failed in your db, and also cancel jobs on runpod side if they're still in queue
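The sweep suggested above could look roughly like this. It is a sketch under stated assumptions: the record layout (`status`, `requested_at`), the in-memory `jobs` dict standing in for the Firebase DB, and the `cancel` callable standing in for whatever call cancels a job on the RunPod side are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List

MAX_AGE = timedelta(hours=4)  # the cutoff asked about in this thread


def sweep_stale_jobs(jobs: Dict[str, dict], now: datetime,
                     cancel: Callable[[str], None]) -> List[str]:
    """Mark jobs still 'Generating' after MAX_AGE as 'Failed'; return their ids."""
    failed = []
    for job_id, record in jobs.items():
        if record["status"] != "Generating":
            continue  # already completed or failed via the webhook
        if now - record["requested_at"] < MAX_AGE:
            continue  # still within the allowed window
        try:
            cancel(job_id)  # hypothetical: cancel on the RunPod side
        except Exception:
            # Cancelling a job that is no longer in the queue may return
            # an error; the local status is still updated.
            pass
        record["status"] = "Failed"
        failed.append(job_id)
    return failed
```

Run it on a schedule (a cron job or a scheduled lambda, as suggested), passing the current time and a function that issues the cancel request.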
Thanks
what happens if I try to cancel a job that is not in the queue?
Looks like you get an HTTP 401 error.