Jobs time out constantly (bug?)

I'm constantly getting the job timeout error on every worker, each time after a seemingly random interval. I've checked the logs: there is no error, the pod is simply killed for no apparent reason, even though no timeout is configured on the serverless endpoint (I watched it happen live), so it seems completely bugged. The software is unchanged, nothing has been modified, and I'm hitting this issue all the time, whether I use 16 GB or 48 GB.
Keffisor21 · 3w ago
Also, the GPU memory isn't reaching the limit; it just stops for no reason.
flash-singh · 3w ago
That means the job is getting lost: the worker picks up the job but then stops reporting status on the job it's working on. Can you make sure you're using the latest SDK?
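For reference, a minimal sketch of how to confirm which SDK version a worker is actually running, assuming the standard runpod-python serverless entry point (`runpod.serverless.start`); the handler body here is a hypothetical placeholder:

```python
# Minimal sketch: log the installed runpod SDK version from inside a worker.
# Assumes the standard runpod-python serverless handler layout.
from importlib.metadata import version

import runpod


def handler(job):
    # Print once per job so the SDK version shows up next to the job logs.
    print(f"runpod SDK version: {version('runpod')}")
    return {"echo": job["input"]}  # hypothetical placeholder output


runpod.serverless.start({"handler": handler})
```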
Keffisor21 · 3w ago
Yes, I'm using the latest version.

```
2024-10-03T11:16:25.150260367Z 30%
2024-10-03T11:16:25.168653046Z 30%
2024-10-03T11:16:25.186904329Z 30%
2024-10-03T11:16:25.205032555Z 30%
2024-10-03T11:16:25.224321673Z 30%
2024-10-03T11:16:27.099648500Z {"logger": "cog.server.http", "timestamp": "2024-10-03T11:16:27.098688Z", "severity": "INFO", "message": "stopping server"}
2024-10-03T11:16:27.109831864Z {"logger": "uvicorn.error", "timestamp": "2024-10-03T11:16:27.109335Z", "severity": "INFO", "message": "Shutting down"}
2024-10-03T11:16:27.211000754Z {"logger": "uvicorn.error", "timestamp": "2024-10-03T11:16:27.210347Z", "severity": "INFO", "message": "Waiting for application shutdown."}
```

I tested an image from 5 months ago and it seems to fix the issue, so it looks like a library problem. Since I'm still using the same libraries, it must come from an update; I'm trying a downgrade of the cog SDK and the runpod SDK.

I fixed the issue by downgrading the runpod SDK to version 1.6.0, and now it's working fine. It seems the latest version (or latest versions) have a timeout bug.
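For anyone hitting the same thing, the downgrade workaround described above amounts to pinning the SDK in the image's dependencies before rebuilding; the file name and exact pin below are just an example, adjust to your own setup:

```
# requirements.txt (example pin; use whichever version works for you)
runpod==1.6.0
```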
flash-singh · 3w ago
We found the bug, fix in progress.
Keffisor21 · 3w ago
That's great to hear! Thanks for checking it out.
deanQ · 3w ago
FYI: v1.7.2 is in pre-release while I do some final tests: https://github.com/runpod/runpod-python/releases/tag/1.7.2
GitHub · Release 1.7.2 · runpod/runpod-python
What's Changed: Corrected job_take_url by @deanq in #359; Update cryptography requirement from <43.0.0 to <44.0.0 by @dependabot in #353; fix: pings were missing requestIds since the last b...
deanQ · 3w ago
1.7.2 is officially the latest release as of today
luca · 6d ago
I'm encountering the same issue with version 1.7.3 @deanQ
DannyB · 6d ago
Same issue with 1.7.3
deanQ · 5d ago
Hi. Please file a support ticket and mention this thread so that you can share more info that would help us determine what's going on and how to fix it. Feel free to mention me on your tickets. Thank you.
yhlong00000 · 4d ago
I just discovered that if the idle timeout setting is set too long and your job also takes a long time to finish, it might cause the job to retry. I’m still testing this and will share more info soon. For now, try setting the idle timeout to less than 20 seconds and see if that helps.
luca · 3d ago
I reported the issue but quickly resolved it by downgrading the SDK. However, I recall other strange behaviors, such as two workers starting up simultaneously for a single request. There were also instances where a worker would start even though the Docker image was still downloading, which incurred costs even after canceling the request; I had to manually terminate the worker in those cases. Maybe these issues are related.

My idle timeout was set to 5-10 seconds.

By the way, this only happened with longer-running requests, where the timeout occurred after around 100-200 seconds, I think. Shorter-running jobs completed without any issues.
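As a rough way to observe this from the client side, here is a sketch assuming the documented `runpod.Endpoint` client in runpod-python; the API key, endpoint ID, and input payload are placeholders. It polls job status explicitly so a timeout or retry on a long-running job shows up as a status change rather than a silent hang:

```python
import time

import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"          # placeholder
endpoint = runpod.Endpoint("YOUR_ENDPOINT_ID")  # placeholder

# Submit a job; the payload shape depends on your handler.
job = endpoint.run({"input": {"prompt": "long-running task"}})

# Poll until the job reaches a terminal state or a client-side budget expires.
terminal = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}
deadline = time.time() + 600  # 10-minute client-side budget
status = job.status()
while status not in terminal and time.time() < deadline:
    time.sleep(5)
    status = job.status()

print("final status:", status)
if status == "COMPLETED":
    print(job.output())
```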
yhlong00000 · 3d ago
The UI has some delay and doesn't always display real-time status, so what you described (like one request waking up two workers) shouldn't be possible. Similarly, if an image hasn't finished downloading, the worker isn't considered running and we don't charge for that. Our billing only starts once the worker is up and running.