Job timeout constantly (bug?)

I'm getting the job timeout error constantly on every worker, after a random amount of time. I've checked the logs and there is no error; the pod is just killed for no reason, even though no timeout is set on the serverless endpoint (I watched it happen live). It seems totally bugged. The software is the same, nothing has been changed, and I'm getting this issue all the time, whether I use 16 GB or 48 GB.
15 Replies
Keffisor21 (OP) · 4mo ago
Also, the GPU memory isn't reaching the limit; it just stops for no reason.
flash-singh · 4mo ago
That means the job is getting lost: the worker picks up the job but then stops reporting status on the job it's working on. Can you make sure you're using the latest SDK?
Keffisor21 (OP) · 4mo ago
Yes, I'm using the latest version.

2024-10-03T11:16:25.150260367Z 30%
2024-10-03T11:16:25.168653046Z 30%
2024-10-03T11:16:25.186904329Z 30%
2024-10-03T11:16:25.205032555Z 30%
2024-10-03T11:16:25.224321673Z 30%
2024-10-03T11:16:27.099648500Z {"logger": "cog.server.http", "timestamp": "2024-10-03T11:16:27.098688Z", "severity": "INFO", "message": "stopping server"}
2024-10-03T11:16:27.109831864Z {"logger": "uvicorn.error", "timestamp": "2024-10-03T11:16:27.109335Z", "severity": "INFO", "message": "Shutting down"}
2024-10-03T11:16:27.211000754Z {"logger": "uvicorn.error", "timestamp": "2024-10-03T11:16:27.210347Z", "severity": "INFO", "message": "Waiting for application shutdown."}

I tested an image from 5 months ago and it seems to fix the issue, so it looks like a library issue.
I'm still using the same libraries, so it must come from an update. I'm trying a downgrade of the cog SDK and the runpod SDK.
I fixed the issue by downgrading the runpod SDK to version 1.6.0; now it's working fine.
It seems that the latest version, or the latest few versions, have a timeout bug.
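
The downgrade described above can be pinned and then verified when the container starts. This is only a minimal sketch, assuming the pin was installed with `pip install runpod==1.6.0`; the helper names `installed_version` and `matches_pin` are hypothetical and are not part of the runpod SDK:

```python
from importlib.metadata import PackageNotFoundError, version

PINNED_RUNPOD = "1.6.0"  # the version reported as working in this thread


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(found, pinned=PINNED_RUNPOD):
    """True only when a version was found and it equals the pinned one."""
    return found is not None and found == pinned


if __name__ == "__main__":
    found = installed_version("runpod")
    if matches_pin(found):
        print(f"runpod {found} matches the pin")
    else:
        print(f"runpod is {found}, expected {PINNED_RUNPOD}")
```

Running a check like this at the top of the worker entry point makes it obvious in the logs when a rebuilt image has silently picked up a newer SDK.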
flash-singh · 4mo ago
we found the bug, fix in progress
Keffisor21 (OP) · 4mo ago
That's great to hear! Thanks for checking it out.
deanQ · 4mo ago
FYI: v1.7.2 is in pre-release while I do some final tests: https://github.com/runpod/runpod-python/releases/tag/1.7.2
GitHub
Release 1.7.2 · runpod/runpod-python
What's Changed Corrected job_take_url by @deanq in #359 Update cryptography requirement from <43.0.0 to <44.0.0 by @dependabot in #353 fix: pings were missing requestIds since the last b...
deanQ · 4mo ago
1.7.2 is officially the latest release as of today
luca · 3mo ago
I'm encountering the same issue with version 1.7.3 @deanQ
DannyB · 3mo ago
Same issue with 1.7.3
deanQ · 3mo ago
Hi. Please file a support ticket and mention this thread so that you can share more info that would help us determine what's going on and how to fix it. Feel free to mention me on your tickets. Thank you.
yhlong00000 · 3mo ago
I just discovered that if the idle timeout setting is set too long and your job also takes a long time to finish, it might cause the job to retry. I’m still testing this and will share more info soon. For now, try setting the idle timeout to less than 20 seconds and see if that helps.
luca · 3mo ago
I reported the issue but quickly resolved it by downgrading the SDK. However, I recall experiencing other strange behaviors, such as two workers starting up simultaneously for a single request. Additionally, there were instances where a worker would start even though the Docker image was still downloading, which incurred costs, even after canceling the request. I had to manually terminate the worker in those cases. Maybe these issues are related.
I had my idle timeout set to 5-10 seconds.
By the way, this only happened with longer-running requests, where the timeout occurred after around 100-200 seconds, I think. Shorter-running jobs completed without any issues.
yhlong00000 · 3mo ago
The UI has some delays and doesn't always display real-time status, so what you described, like one request waking up two workers, shouldn't be possible. Similarly, if an image hasn't finished downloading, the worker isn't considered running, and we don't charge for that. Our billing only starts once the worker is up and running.
luca · 3mo ago
This is what I meant. I'm not entirely sure, but it really seems to be only a UI issue.
yhlong00000 · 3mo ago
Are you downloading the model every time your worker receives a request? That would be very inefficient. You could save your model to our network volume to avoid this, or bake the model into your Docker image.