•Created by Samuel on 3/12/2024 in #⚡|serverless
Failed Serverless Jobs Drain the Entire Balance
Hi,
just like in this GitHub issue (https://github.com/runpod-workers/worker-vllm/issues/29), I had my balance drained completely multiple times by serverless jobs that got stuck and automatically restarted. Jobs can fail for many different reasons, so testing them thoroughly without significant load is very hard.
A month ago it was announced that a feature to solve this issue would be introduced (see image). However, I could not find any configuration in the UI to limit the number of retries for a failed serverless inference, only a setting to enable the Execution Timeout. Hence, two questions:
1. Is the feature to automatically kill jobs after n failed execution attempts already introduced but not configurable by the user? If so, what is the limit?
2. Is the total execution timeout (configurable per endpoint, or per request via the API) counted per job execution or per job? E.g. would a limit of 100 seconds only be reached if the job ran for 100 seconds without interruption, or also if the job first ran and failed after 60 seconds, and then ran a second time without failure for (more than) 40 seconds?
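For reference, this is how I currently set the timeout per request. My reading of the docs is that it goes into a `policy` field of the `/run` payload in milliseconds; please correct me if the field name or unit differs:

```python
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = "your-api-key"          # placeholder

# Per-request timeout via the "policy" field of /run. My understanding is
# that executionTimeout is specified in milliseconds; correct me if the
# field name or unit is different.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {"prompt": "hello"},
        "policy": {"executionTimeout": 100_000},  # 100 seconds
    },
)
print(resp.json())  # should contain the job id and initial status
```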
Right now I am postponing the launch of my side project because of this issue; I am afraid 500€ will be gone overnight due to some bug. If this is not solved yet, I would be glad to hear a timeline. Worst case, I would appreciate some guidance on using the API to monitor active jobs, so I can build a monitoring service that kills a job after n failed attempts. A rough sketch of what I have in mind is below.
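This is a minimal sketch of the watchdog I am thinking of, assuming the documented `/status` and `/cancel` endpoints. Since the retry count does not seem to be exposed via `/status` as far as I can tell, it enforces a total wall-clock budget per job instead; the budget and polling interval are placeholders:

```python
import time
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = "your-api-key"          # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

MAX_WALL_CLOCK = 300  # total seconds a job may live, across all retries (placeholder)
POLL_INTERVAL = 10    # seconds between status checks (placeholder)

def watch_job(job_id: str) -> None:
    """Poll a job's status and cancel it once it exceeds the wall-clock budget."""
    started = time.monotonic()
    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
        # Terminal states as I understand them; the exact set may differ.
        if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
            print(job_id, "finished with status", status.get("status"))
            return
        if time.monotonic() - started > MAX_WALL_CLOCK:
            # Hard stop: cancel the job so a restart loop cannot keep billing.
            requests.post(f"{BASE}/cancel/{job_id}", headers=HEADERS)
            print(job_id, "cancelled after exceeding the wall-clock budget")
            return
        time.sleep(POLL_INTERVAL)
```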
Thank you very much!