How do I retry a worker task in RunPod serverless?
Good day,
I am moving a worker from pods to serverless. Previously I used Azure Service Bus to send tasks to my pod, and the Service Bus messages had a retry count of 5. After migrating to the serverless endpoint, I couldn't find a way to integrate a message queue system to deliver tasks to the worker container, so when a task fails the endpoint only responds with an error. How can I make it retry the same request?
I didn't find anything about this in the documentation.
I looked at these documents: https://docs.runpod.io/serverless/workers/handlers/handler-error-handling There is something about refreshing the worker at https://docs.runpod.io/serverless/workers/handlers/handler-additional-controls#refresh-worker but I don't see how to build a retry mechanism by refreshing the worker.
Handling Errors | RunPod Documentation
Learn how to handle exceptions and implement custom error responses in your RunPod SDK handler function, including how to validate input and return customized error messages.
Additional controls | RunPod Documentation
Send progress updates during job execution using the runpod.serverless.progress_update function, and refresh workers for long-running or complex jobs by returning a dictionary with a 'refresh_worker' flag in your handler.
4 Replies
There is no retry mechanism. Typically if there is an error, you don't want to retry the request and keep wasting credits. If you want a retry mechanism, you will have to build it yourself.
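If you do build it yourself, the simplest place is the client that submits the job. Below is a minimal sketch, not an official RunPod feature. `run_with_retry` and `submit_job` are hypothetical names: `submit_job` stands for whatever function you already use to send a request to your endpoint and get back the job result as a dictionary. The sketch assumes that dictionary has a `status` field equal to `"COMPLETED"` on success, and it uses 5 attempts only to mirror the Service Bus retry count from the question.

```python
import time
from typing import Any, Callable, Dict


def run_with_retry(
    submit_job: Callable[[Dict[str, Any]], Dict[str, Any]],
    payload: Dict[str, Any],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Resubmit the same payload until it completes or attempts run out.

    submit_job is a hypothetical callable supplied by you; it is assumed
    to return a dict with a "status" key ("COMPLETED" on success).
    """
    last_result: Dict[str, Any] = {}
    for attempt in range(1, max_attempts + 1):
        try:
            last_result = submit_job(payload)
        except Exception as exc:
            # Treat transport errors (timeouts, connection resets) as failures.
            last_result = {"status": "FAILED", "error": str(exc)}

        if last_result.get("status") == "COMPLETED":
            return last_result

        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2x, 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))

    # Every attempt failed; hand back the last failure for inspection.
    return last_result
```

Keep in mind that every retry is a new billed request, so it is worth retrying only errors you expect to be transient.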
RunPod uses Redis for its queues, not a normal message queue system.
Implementing a retry mechanism is crucial for handling failures when interacting with external systems. Directly coding retries can be complex and messy. Instead, a queue-based retry system offers a cleaner and more maintainable solution. Message bus systems like Kafka, RabbitMQ, and AWS SQS manage retries by re-queuing failed messages, allowing for efficient and modular handling of retries. This approach improves system reliability and scalability while keeping the core application code clean.
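To make the re-queuing idea concrete, here is a rough in-process sketch using only Python's standard `queue` module. It is not how Kafka, RabbitMQ, SQS, or RunPod work internally; it only illustrates the pattern of counting deliveries and moving a message to a dead-letter list once a limit is reached. `drain_with_requeue`, `process`, and the `deliveries` field are hypothetical names chosen for this example.

```python
import queue
from typing import Any, Callable, Dict, List, Tuple


def drain_with_requeue(
    tasks: "queue.Queue[Dict[str, Any]]",
    process: Callable[[Any], Any],
    max_deliveries: int = 5,
) -> Tuple[List[Any], List[Dict[str, Any]]]:
    """Process every message, re-queuing failures up to max_deliveries."""
    completed: List[Any] = []
    dead_letters: List[Dict[str, Any]] = []

    while True:
        try:
            message = tasks.get_nowait()
        except queue.Empty:
            break  # Nothing left to process.

        # Count this delivery attempt on the message itself.
        message["deliveries"] = message.get("deliveries", 0) + 1
        try:
            completed.append(process(message["body"]))
        except Exception as exc:
            if message["deliveries"] < max_deliveries:
                tasks.put(message)  # Put it back for another attempt.
            else:
                message["error"] = str(exc)
                dead_letters.append(message)  # Give up on this message.

    return completed, dead_letters
```

With a real broker, the delivery count and dead-letter queue are handled by the broker, and `process` would be the call that forwards the task to the serverless endpoint.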
I think exposing the worker as an HTTP endpoint is what created the ambiguity around not offering a retry mechanism. As a customer, I would highly appreciate having one.
RunPod does retry the request if no exception is thrown, but if an exception is thrown it makes no sense to retry.
RunPod aint gonna implement some stupid nonsensical retry mechanism just for you 🤣
What is the max retry count in that case? Is there any documentation covering this?
Let them make their own decisions and form their own opinions. Thank you for your time.