RunPod · 10mo ago
Samuel

Failed Serverless Jobs drain Complete Balance

Hi, just like in this GitHub issue (https://github.com/runpod-workers/worker-vllm/issues/29), my balance was completely drained multiple times by serverless jobs that got stuck and automatically restarted. Jobs can fail for many different reasons, so testing them thoroughly without higher load is very hard. A month ago it was announced that a feature to solve this would be introduced (see image). However, I could not find any UI configuration to limit the number of retries for a failed serverless inference, only a setting to enable the Execution Timeout. Therefore, two questions:

1. Is the feature that automatically kills jobs after n failed execution attempts already introduced but not configurable by the user? If so, what is the limit?
2. Is the total execution timeout (configurable per endpoint, or per request via the API) counted per job execution or per job? E.g., would a limit of 100 seconds only be reached if the job ran for 100 seconds without interruption, or also if the job ran a first time, failed after 60 seconds, and then ran a second time without failure for (more than) 40 seconds?

Right now I am postponing the launch of my side project because of this issue; I am afraid €500 will be gone overnight due to some bug. If this is not solved yet, I would be glad to hear a timeline. Worst case, some guidance on using the API to monitor active jobs and build a monitoring service that kills a job after n failed attempts would be appreciated. Thank you very much!
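(Editor's note: a minimal sketch of the monitoring service described above. The `status/{job_id}` and `cancel/{job_id}` routes and the job status strings are taken from RunPod's public serverless REST API docs, not from this thread; `ENDPOINT_ID`, the API key, and the failure-counting rule are placeholders/assumptions — in practice a retried job may cycle back through `IN_QUEUE`, so verify the status transitions against current RunPod documentation before relying on this.)

```python
"""Watchdog sketch: poll a RunPod serverless job and cancel it
after more than `max_retries` observed failures.

Assumed (verify against RunPod docs):
  GET  https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}
  POST https://api.runpod.ai/v2/{endpoint_id}/cancel/{job_id}
ENDPOINT_ID and RUNPOD_API_KEY are placeholders.
"""
import json
import os
import time
import urllib.request

API_BASE = "https://api.runpod.ai/v2"
ENDPOINT_ID = os.environ.get("RUNPOD_ENDPOINT_ID", "YOUR_ENDPOINT_ID")
API_KEY = os.environ.get("RUNPOD_API_KEY", "YOUR_API_KEY")


def should_cancel(failed_attempts: int, max_retries: int) -> bool:
    """Pure decision rule: cancel once failures exceed the retry budget."""
    return failed_attempts > max_retries


def _call(method: str, path: str) -> dict:
    """Authenticated request against the (assumed) serverless REST API."""
    req = urllib.request.Request(
        f"{API_BASE}/{ENDPOINT_ID}/{path}",
        method=method,
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def watch(job_id: str, max_retries: int = 2, poll_seconds: float = 5.0) -> str:
    """Poll the job status; cancel after more than max_retries failures."""
    failures = 0
    while True:
        status = _call("GET", f"status/{job_id}").get("status", "")
        if status == "COMPLETED":
            return status
        if status in ("FAILED", "TIMED_OUT"):
            failures += 1
            if should_cancel(failures, max_retries):
                _call("POST", f"cancel/{job_id}")
                return "CANCELLED"
        time.sleep(poll_seconds)
```

The polling/cancel calls need a live endpoint, but the cancellation rule itself is a pure function you can unit-test in isolation.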
4 Replies
Samuel (OP) · 10mo ago
@Alpay Ariyak maybe you could clarify this? As far as I can see the issue is still open.
marshall · 9mo ago
@Samuel your best bet is contacting their support team via the website chat... I'm afraid the https://discord.com/channels/912829806415085598/1209942179527663667 issue might be happening again.
marshall · 9mo ago
I'm doing the same for an unexplained spike in credit usage on March 29–30:
[attached screenshot]