Jobs timing out constantly (bug?)
I'm constantly getting the job timeout error on every worker, after a seemingly random amount of time. I've checked the logs: there is no error, the pod is simply killed for no reason, even though no timeout is set on the serverless endpoint (I watched it happen live). It seems completely bugged.
The software is the same, nothing has changed, and I'm hitting this issue all the time, whether I use 16 GB or 48 GB.
Also, GPU memory isn't reaching the limit; it just stops for no reason.
That means the job is getting lost: the worker picks up the job but then stops reporting status on the job it's working on. Can you make sure you're using the latest SDK?
Yes, I'm using the latest version.
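For reference, a quick way to confirm which version a worker image actually has installed (assuming a recent runpod-python build, which exposes a __version__ attribute) is a short check like:

import runpod
print(runpod.__version__)  # prints the installed runpod-python version

If the printed version doesn't match what the image is supposed to install, the build is probably picking up an older cached wheel.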
2024-10-03T11:16:25.150260367Z 30%
2024-10-03T11:16:25.168653046Z 30%
2024-10-03T11:16:25.186904329Z 30%
2024-10-03T11:16:25.205032555Z 30%
2024-10-03T11:16:25.224321673Z 30%
2024-10-03T11:16:27.099648500Z {"logger": "cog.server.http", "timestamp": "2024-10-03T11:16:27.098688Z", "severity": "INFO", "message": "stopping server"}
2024-10-03T11:16:27.109831864Z {"logger": "uvicorn.error", "timestamp": "2024-10-03T11:16:27.109335Z", "severity": "INFO", "message": "Shutting down"}
2024-10-03T11:16:27.211000754Z {"logger": "uvicorn.error", "timestamp": "2024-10-03T11:16:27.210347Z", "severity": "INFO", "message": "Waiting for application shutdown."}
I tested an image from 5 months ago and it seems to fix the issue, so it looks like a library problem.
I'm still using the same libraries, so it must come from an update. I'm trying a downgrade of the cog SDK and the runpod SDK.
I fixed the issue by downgrading the runpod SDK to version 1.6.0; now it's working fine.
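For anyone else hitting this before a fix ships, holding the SDK at that version is just a matter of pinning it in the image's Python requirements (assuming a pip-based build), e.g. a requirements.txt line like:

runpod==1.6.0

and then rebuilding the worker image so the pinned version actually ends up inside it.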
It seems the latest version, or the latest few versions, have a timeout bug.
we found the bug, fix in progress
That's great to hear! Thanks for checking it out.
FYI: v1.7.2 is in pre-release while I do some final tests: https://github.com/runpod/runpod-python/releases/tag/1.7.2
[GitHub embed] Release 1.7.2 · runpod/runpod-python. What's Changed: Corrected job_take_url by @deanq in #359; Update cryptography requirement from <43.0.0 to <44.0.0 by @dependabot in #353; fix: pings were missing requestIds since the last b...
1.7.2 is officially the latest release as of today
I'm encountering the same issue with version 1.7.3 @deanQ
Same issue with 1.7.3
Hi. Please file a support ticket and mention this thread so that you can share more info that would help us determine what's going on and how to fix it. Feel free to mention me on your tickets. Thank you.
I just discovered that if the idle timeout setting is set too long and your job also takes a long time to finish, it might cause the job to retry. I’m still testing this and will share more info soon. For now, try setting the idle timeout to less than 20 seconds and see if that helps.
I reported the issue but quickly resolved it by downgrading the SDK. However, I recall experiencing other strange behaviors, such as two workers starting up simultaneously for a single request. There were also instances where a worker would start even though the Docker image was still downloading, which incurred costs even after canceling the request; I had to manually terminate the worker in those cases. Maybe these issues are related.
I had my idle timeout set to 5-10 seconds
By the way, this only happened with longer-running requests, where the timeout occurred after around 100-200 seconds I think. Shorter-running jobs completed without any issues
The UI has some delays and doesn't always display real-time status, so what you described (one request waking up two workers) shouldn't be possible. Similarly, if an image hasn't finished downloading, the worker isn't considered running, and we don't charge for that. Our billing only starts once the worker is up and running.
This is what I meant. I'm not entirely sure, but it really does seem to be only a UI issue.
Are you downloading the model every time your worker receives a request? That would be very inefficient. You could save your model to our network volume to avoid this, or bake the model into your Docker image.
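As a rough sketch of that suggestion (not an official template): load the model once at module import time, either from a file baked into the image or from the network volume, which on serverless workers is typically mounted at /runpod-volume, and keep the handler limited to inference. The handler/start pattern below is the standard runpod-python serverless entrypoint; MODEL_PATH, load_model, and the echo response are placeholders for your own setup.

import os
import runpod

# Assumption: the network volume is mounted at /runpod-volume; point this
# at wherever your model actually lives, or at a path baked into the image.
MODEL_PATH = os.environ.get("MODEL_PATH", "/runpod-volume/models/my-model.bin")

def load_model(path):
    # Placeholder loader: swap in your framework's real load call
    # (torch.load, from_pretrained, etc.).
    print(f"Loading model from {path} once at worker startup")
    return object()

# Loaded once per worker process, outside the handler, so every request
# reuses the in-memory model instead of re-downloading or re-loading it.
MODEL = load_model(MODEL_PATH)

def handler(job):
    # job["input"] carries the payload sent to the serverless endpoint.
    prompt = job["input"].get("prompt", "")
    # Placeholder inference: replace with a real call that uses MODEL.
    return {"echo": prompt}

# Standard runpod-python serverless entrypoint.
runpod.serverless.start({"handler": handler})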