Constant job timeouts (bug?)

I'm getting the job timeout error constantly on every worker, after a seemingly random amount of time. I've checked the logs and there is no error; the pod is simply killed for no reason, even though no timeout is set on the serverless endpoint (I watched it happen live). It seems completely bugged. The software is unchanged, nothing has been modified, and I'm hitting this issue all the time, whether I use 16 GB or 48 GB.
Keffisor21 (OP) · 2mo ago
Also, the GPU memory isn't reaching its limit; it just stops for no reason.
flash-singh · 2mo ago
That means the job is getting lost: the worker picks up the job but then stops reporting status on the job it's working on. Can you make sure you're using the latest SDK?
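(For anyone checking this on their own workers: a minimal sketch for confirming which runpod SDK version is actually installed inside the worker image, using only the Python standard library. The script name is illustrative.)

```python
# check_sdk_version.py - print the runpod SDK version installed in the worker image
from importlib.metadata import PackageNotFoundError, version

try:
    # "runpod" is the PyPI distribution name of the runpod-python SDK
    print("runpod SDK version:", version("runpod"))
except PackageNotFoundError:
    print("runpod SDK is not installed in this environment")
```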
Keffisor21 (OP) · 2mo ago
Yes, I'm using the latest version.

2024-10-03T11:16:25.150260367Z 30%
2024-10-03T11:16:25.168653046Z 30%
2024-10-03T11:16:25.186904329Z 30%
2024-10-03T11:16:25.205032555Z 30%
2024-10-03T11:16:25.224321673Z 30%
2024-10-03T11:16:27.099648500Z {"logger": "cog.server.http", "timestamp": "2024-10-03T11:16:27.098688Z", "severity": "INFO", "message": "stopping server"}
2024-10-03T11:16:27.109831864Z {"logger": "uvicorn.error", "timestamp": "2024-10-03T11:16:27.109335Z", "severity": "INFO", "message": "Shutting down"}
2024-10-03T11:16:27.211000754Z {"logger": "uvicorn.error", "timestamp": "2024-10-03T11:16:27.210347Z", "severity": "INFO", "message": "Waiting for application shutdown."}

I tested an image from 5 months ago and it seems to fix the issue, so it looks like a library problem. I'm still using the same libraries, so it must come from an update; I'm trying a downgrade of the cog SDK and the runpod SDK.
I fixed the issue by downgrading the runpod SDK to version 1.6.0, and now it's working fine. It seems the latest version (or versions) has a timeout bug.
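(For reference, pinning the SDK in the image is one way to stay on a known-good version until a fix ships; a minimal sketch assuming a pip-based Docker build, with 1.6.0 being the version reported to work above.)

```
# requirements.txt - pin the runpod SDK to the last known-good version
runpod==1.6.0
```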
flash-singh · 2mo ago
we found the bug, fix in progress
Keffisor21 (OP) · 2mo ago
That's great to hear! Thanks for checking it out.
deanQ · 2mo ago
FYI: v1.7.2 is in pre-release while I do some final tests: https://github.com/runpod/runpod-python/releases/tag/1.7.2
GitHub
Release 1.7.2 · runpod/runpod-python
What's Changed: Corrected job_take_url by @deanq in #359 · Update cryptography requirement from <43.0.0 to <44.0.0 by @dependabot in #353 · fix: pings were missing requestIds since the last b...
deanQ · 2mo ago
1.7.2 is officially the latest release as of today
luca · 2mo ago
I'm encountering the same issue with version 1.7.3 @deanQ
DannyB · 2mo ago
Same issue with 1.7.3
deanQ · 2mo ago
Hi. Please file a support ticket and mention this thread so that you can share more info that would help us determine what's going on and how to fix it. Feel free to mention me on your tickets. Thank you.
yhlong00000 · 2mo ago
I just discovered that if the idle timeout is set too long and your job also takes a long time to finish, it might cause the job to retry. I'm still testing this and will share more info soon. For now, try setting the idle timeout to less than 20 seconds and see if that helps.
luca · 2mo ago
I reported the issue but quickly resolved it by downgrading the SDK. However, I recall other strange behaviors, such as two workers starting up simultaneously for a single request. There were also instances where a worker would start even though the Docker image was still downloading, which incurred costs even after canceling the request; I had to terminate the worker manually in those cases. Maybe these issues are related.
My idle timeout was set to 5-10 seconds.
By the way, this only happened with longer-running requests, where the timeout occurred after around 100-200 seconds I think. Shorter-running jobs completed without any issues.
yhlong00000 · 2mo ago
The UI has some delays and doesn't always display real-time status, so what you described (one request waking up two workers) shouldn't be possible. Similarly, if an image hasn't finished downloading, the worker isn't considered running and we don't charge for that; billing only starts once the worker is up and running.
luca · 4w ago
This is what I meant. I'm not entirely sure, but it really does seem to be only a UI issue.
(screenshot attached)
yhlong00000 · 4w ago
Are you downloading the model every time your worker receives a request? That would be very inefficient; you could save the model to a network volume to avoid this, or bake the model into your Docker image.
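(For reference, a minimal sketch of the pattern being suggested: load the model once when the worker starts, from a path on an attached network volume, instead of downloading it inside the handler. The volume path, load_model stub, and input fields are illustrative assumptions, not the setup from this thread.)

```python
import runpod  # runpod-python SDK

# Assumed path on an attached network volume (serverless volumes typically mount at /runpod-volume).
MODEL_PATH = "/runpod-volume/models/my-model"

def load_model(path):
    # Stand-in for framework-specific loading (torch.load, from_pretrained, ...).
    # The key point: this runs once per worker start, not once per request.
    return {"path": path}

MODEL = load_model(MODEL_PATH)  # loaded once, reused by every request this worker handles

def handler(job):
    # Each request reuses the already-loaded model instead of re-downloading it.
    prompt = job["input"].get("prompt", "")
    return {"model": MODEL["path"], "echo": prompt}

runpod.serverless.start({"handler": handler})
```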