RunPod · 10mo ago
Samuel

Failed Serverless Jobs drain Complete Balance

Hi, just like in this GitHub issue (https://github.com/runpod-workers/worker-vllm/issues/29), my balance was completely drained multiple times by serverless jobs that got stuck and automatically restarted. Jobs can fail for many different reasons, so testing them thoroughly without higher load is very hard. A month ago it was announced that a feature to solve this would be introduced (see image). However, I could not find any UI configuration to limit the number of retries for a failed serverless inference, only a setting to enable the Execution Timeout. Therefore, two questions:

1. Is the feature that automatically kills jobs after n failed execution attempts already introduced but not configurable by the user? If so, what is the limit?
2. Is the total execution timeout (configurable per endpoint, or per request via the API) counted per job execution or per job? E.g., would a limit of 100 seconds only be reached if the job ran for 100 seconds without interruption, or also if the job ran a first time, failed after 60 seconds, and then ran a second time without failure for (more than) 40 seconds?

Right now I am postponing the launch of my side project because of this issue; I am afraid €500 will be gone overnight due to some bug. If this is not solved yet, I would be glad to hear a timeline. Worst case, some guidance on using the API to monitor active jobs and build a monitoring service that kills a job after n failed attempts would be appreciated. Thank you very much!
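(Editor's note: a minimal sketch of the monitoring service described above. The `status/{job_id}` and `cancel/{job_id}` routes and the job status strings are taken from RunPod's public serverless REST API docs, not from this thread; `ENDPOINT_ID`, the API key, and the failure-counting rule are placeholders/assumptions — in practice a retried job may cycle back through `IN_QUEUE`, so verify the status transitions against current RunPod documentation before relying on this.)

```python
"""Watchdog sketch: poll a RunPod serverless job and cancel it
after more than `max_retries` observed failures.

Assumed (verify against RunPod docs):
  GET  https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}
  POST https://api.runpod.ai/v2/{endpoint_id}/cancel/{job_id}
ENDPOINT_ID and RUNPOD_API_KEY are placeholders.
"""
import json
import os
import time
import urllib.request

API_BASE = "https://api.runpod.ai/v2"
ENDPOINT_ID = os.environ.get("RUNPOD_ENDPOINT_ID", "YOUR_ENDPOINT_ID")
API_KEY = os.environ.get("RUNPOD_API_KEY", "YOUR_API_KEY")


def should_cancel(failed_attempts: int, max_retries: int) -> bool:
    """Pure decision rule: cancel once failures exceed the retry budget."""
    return failed_attempts > max_retries


def _call(method: str, path: str) -> dict:
    """Authenticated request against the (assumed) serverless REST API."""
    req = urllib.request.Request(
        f"{API_BASE}/{ENDPOINT_ID}/{path}",
        method=method,
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def watch(job_id: str, max_retries: int = 2, poll_seconds: float = 5.0) -> str:
    """Poll the job status; cancel after more than max_retries failures."""
    failures = 0
    while True:
        status = _call("GET", f"status/{job_id}").get("status", "")
        if status == "COMPLETED":
            return status
        if status in ("FAILED", "TIMED_OUT"):
            failures += 1
            if should_cancel(failures, max_retries):
                _call("POST", f"cancel/{job_id}")
                return "CANCELLED"
        time.sleep(poll_seconds)
```

The polling/cancel calls need a live endpoint, but the cancellation rule itself is a pure function you can unit-test in isolation.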
4 Replies
Samuel (OP) · 10mo ago
@Alpay Ariyak maybe you could clarify this? As far as I can see the issue is still open.
marshall · 9mo ago
@Samuel your best bet is contacting their support team via the website chat... I'm afraid the https://discord.com/channels/912829806415085598/1209942179527663667 issue might be happening again.
marshall · 9mo ago
I'm doing the same for an unexplained spike in credit usage on March 29–30:
[attached screenshot]