RunPod · 4mo ago
vitalik

Job retry after successful run

My endpoint has started retrying every request, even though the first run succeeds without any errors. I don't understand why this is happening. This is what I see in the logs when the first run finishes and the retry starts:
2024-10-10T11:51:52.937738320Z {"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
2024-10-10T11:51:52.972812780Z {"requestId": "e5746a57-2af3-4849-84d1-b58d24480627-e1", "message": "Finished.", "level": "INFO"}
2024-10-10T11:51:52.972908181Z {"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
2024-10-10T11:51:52.973024343Z {"requestId": "e5746a57-2af3-4849-84d1-b58d24480627-e1", "message": "Started.", "level": "INFO"}
19 Replies
vitalik (OP) · 4mo ago
Turning off FlashBoot seems to have solved the problem, but I'm not sure; it may just be a coincidence.
Mihály · 4mo ago
For me, upgrading the SDK from 1.7.1 to 1.7.2 got rid of the retries.
vitalik (OP) · 4mo ago
Thanks, I'll try.
xuanyu · 4mo ago
Same issue. How do I resolve this? I am using 1.7.2 and have turned off FlashBoot.
furkan.huudle · 3mo ago
Are you referring to the RunPod CLI/SDK? I'm not using any CLI; I just deploy to serverless from the dashboard, and I don't see any SDK selection (my container image: nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04).
Brandon · 3mo ago
Has anyone found a fix to the issue? I also get successful runs, but immediately after, the job retries and the worker subsequently fails:
2024-10-20T18:33:35.201959405Z {"requestId": "1b732766-f006-4825-8d71-ba4908d01a78-e1", "message": "Finished.", "level": "INFO"}
2024-10-20T18:33:35.792234482Z {"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
2024-10-20T18:33:35.792291243Z {"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
2024-10-20T18:33:35.806365607Z {"requestId": "1b732766-f006-4825-8d71-ba4908d01a78-e1", "message": "Started.", "level": "INFO"}
2024-10-20 19:24:33.267 [cc1g0dj5wo63pu] [error] Failed to return job results. | 400, message='Bad Request'
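The retry pattern in the logs above (a request ID logging "Started." again after it has already logged "Finished.") can be checked for mechanically. The following is a minimal sketch, assuming each worker log line is a timestamp followed by a JSON object, as in the excerpts in this thread. The function name `find_restarted_jobs` is hypothetical and not part of the RunPod SDK.

```python
import json


def find_restarted_jobs(log_lines):
    """Return request IDs that log "Started." after already logging "Finished."."""
    finished = set()
    restarted = []
    for line in log_lines:
        # Assumed format: "<timestamp> <json payload>"; other lines are skipped.
        _, _, payload = line.partition(" ")
        try:
            entry = json.loads(payload)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        request_id = entry.get("requestId")
        if request_id is None:
            continue  # queue/progress counters carry no request ID
        message = entry.get("message")
        if message == "Finished.":
            finished.add(request_id)
        elif (
            message == "Started."
            and request_id in finished
            and request_id not in restarted
        ):
            restarted.append(request_id)
    return restarted
```

Running this over a saved copy of the worker log gives a quick list of jobs that were picked up again after finishing, which may be useful when reporting the problem to support.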
spooky · 3mo ago
Same issue. The job returns successfully, but its status always says IN_PROGRESS; it eventually retries, and then I get "job timed out after 1 retries".
I'm trying the upgrade from SDK 1.7.1 to 1.7.2 and will see.
That seems to have resolved it.
inc3pt.io · 3mo ago
Same issue; it started happening with 1.7.3.
Blitzkriek · 3mo ago
Same here. Would downgrading to 1.7.2 be a good idea? I see other people have had better luck with 1.7.2.
inc3pt.io · 3mo ago
1.7.2 had other issues that were worse IMO, like freezing requests that would just fill up the request queue.
I am thinking of going back to 1.6.2; that was the last good working one for me.
I feel like the runpod-python SDK is not being actively developed; issues are persisting for too long.
@yhlong00000 can you please step up in here?
yhlong00000 · 3mo ago
Hey, sorry about this. We recently made a few changes to the SDK, and unfortunately, each version has its own issues. If you want to revert back to 1.6.2, that should work fine for now. We have an internal ticket tracking these problems, and our team is actively working on it. I’ll keep you updated once we have a more stable version.
inc3pt.io · 3mo ago
Okay, I'll revert to 1.6.2 then, thanks for the info.
FYI: we have reverted to 1.7.0 instead, as we noticed that 1.6.2 gives us a lower FPS (we are processing frames in real time).
Vllen · 3mo ago
I ran into similar issues too. Containers just randomly get removed, with no errors in the log. It seems to be fixed after switching to 1.6.2.
deanQ · 3mo ago
SDK 1.7.4 has been released. Thank you for your patience.
Vllen · 3mo ago
Just tried 1.7.4. It's not fixed. The worker didn't crash, but it did seem that as soon as CPU usage hit 100%, the container got removed by the worker immediately.
yhlong00000 · 3mo ago
Can you provide me with a pod ID? I can take a look at the logs.
inc3pt.io · 3mo ago
@yhlong00000 I am still having issues with 1.7.4:
- Only one request moves to IN_PROGRESS; all others stay IN_QUEUE, even though it can accept multiple requests and there are idle workers available. Tasks are long running. Worker ID for you to debug: ddywfiz37lbsaz
- Possibly a web UI bug: I also had jobs that still appeared as IN_PROGRESS in the web UI when the corresponding workers were no longer active (workers: jwesqcl6bb0194 and 1t9jehqvp73esy)
- I also had one instance of a 400 Bad Request with another endpoint:
2024-10-27T12:04:05.295467963Z {"requestId": "f277cfe0-45e1-4187-90f0-15abb69348c3-u1", "message": "Failed to return job results. | 400, message='Bad Request', url=URL('https://api.runpod.ai/v2/c2b******f1mf/job-done/jwesqcl6bb0194/f277cfe0-45e1-4187-90f0-15abb69348c3-u1?gpu=$RUNPOD_GPU_TYPE_ID&isStream=false')", "level": "ERROR"}
Reverted to v1.7.0; I find this version more efficient than 1.6.2.
yhlong00000 · 3mo ago
For the xdqp27q6z2yjxf endpoint, I noticed it hasn’t been able to scale up for a while. It seems like you made an update that triggered the system to replace the old worker. Could you try setting max workers to 3 and see if you still experience the issue? As for request f277cfe0, it occurred right after you triggered the update on that endpoint. The worker was busy handling one request and couldn’t process this new request within a minute, so it was sent back to the queue.
inc3pt.io · 3mo ago
I will retry it later today and let you know how it goes, thanks.
@yhlong00000 I updated RunPod to 1.7.4 again today to see whether the issue of only one job going from IN_QUEUE to IN_PROGRESS still persists. It does: only one job can be processed at a time. My jobs are long running. Endpoint: j2j9d6odh3qsi3, 1 worker, with Queue Delay as the scaling strategy. Note: I was using version 1.7.0 and had no issue at all with it.
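For anyone pinning a version based on this thread, the reports above can be collected into a small lookup. This is a rough sketch only: the two sets below reflect what participants here said about their own workloads, not an official compatibility list, and `classify_sdk_version` and `installed_runpod_status` are hypothetical helper names. It assumes the SDK is installed from the `runpod` package.

```python
from importlib import metadata

# Anecdotal reports from this thread only; not an official compatibility list.
REPORTED_PROBLEMATIC = {"1.7.1", "1.7.2", "1.7.3", "1.7.4"}
REPORTED_WORKING = {"1.6.2", "1.7.0"}


def classify_sdk_version(version):
    """Map an SDK version string to how it was described in this thread."""
    if version in REPORTED_PROBLEMATIC:
        return "reported problematic"
    if version in REPORTED_WORKING:
        return "reported working"
    return "not mentioned"


def installed_runpod_status():
    """Classify the locally installed runpod package, if there is one."""
    try:
        return classify_sdk_version(metadata.version("runpod"))
    except metadata.PackageNotFoundError:
        return "not installed"
```

Calling `installed_runpod_status()` inside the worker image shows which of these versions was actually built in, which is worth confirming before and after changing the pin in your requirements file.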
