Job retry after successful run
My endpoint has started retrying every request, even though the first run completes successfully without any errors. I don't understand why this is happening.
This is what I see in the logs when the first run finishes and the retry starts:
2024-10-10T11:51:52.937738320Z {"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
2024-10-10T11:51:52.972812780Z {"requestId": "e5746a57-2af3-4849-84d1-b58d24480627-e1", "message": "Finished.", "level": "INFO"}
2024-10-10T11:51:52.972908181Z {"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
2024-10-10T11:51:52.973024343Z {"requestId": "e5746a57-2af3-4849-84d1-b58d24480627-e1", "message": "Started.", "level": "INFO"}
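For reference, the handler follows the standard runpod-python pattern and simply returns its result. A minimal sketch of that shape (illustrative only, with a made-up input field, not the actual handler code):

```python
import runpod  # runpod-python SDK installed in the container image

def handler(job):
    # "prompt" is a hypothetical input field, used here for illustration.
    prompt = job["input"].get("prompt", "")
    # Returning a dict marks the job as completed; the SDK logs "Finished."
    return {"echo": prompt}

# Standard serverless entrypoint for the runpod-python SDK.
runpod.serverless.start({"handler": handler})
```

So the retry kicks in after the handler has already returned successfully.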
19 Replies
Turning off FlashBoot seems to have solved the problem, but I'm not sure; it might just be a coincidence.
For me, upgrading the SDK from 1.7.1 to 1.7.2 got rid of the retries.
Thanks, I'll try that.
Same issue.
How do I resolve this? I am using 1.7.2 and have turned off FlashBoot.
You mentioned the RunPod CLI/SDK? I'm not using any CLI, I just deploy to serverless from the dashboard, and I don't see any SDK selection there (my container image: nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04).
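A note on what "SDK" means here, in case it helps: it is the runpod-python package that your handler code imports inside the container image, not something you pick in the dashboard. The version a worker runs is whatever was pip-installed when the image was built (e.g. `runpod==1.7.2` pinned in requirements.txt). A quick, illustrative way to confirm which version a worker is actually running:

```python
# Illustrative check: log which runpod-python version this worker is running.
# Something like this can sit at the top of your handler file.
from importlib.metadata import version

print(f"runpod-python version: {version('runpod')}")
```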
Has anyone found a fix for this issue? I also get successful runs, but immediately afterwards the job retries and the worker subsequently fails:
2024-10-20 19:24:33.267 [cc1g0dj5wo63pu] [error] Failed to return job results. | 400, message='Bad Request'
Same issue. The job returns successfully, but the status always says IN_PROGRESS; eventually it retries and then the job times out after 1 retry.
Trying the SDK upgrade from 1.7.1 to 1.7.2, will see.
That seems to have resolved it.
Same issue, it started happening with 1.7.3.
Same here. Would downgrading to 1.7.2 be a good idea? I see other people have had better luck with 1.7.2.
1.7.2 had other issues that were worse IMO, like frozen requests that would just fill up the request queue.
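If you hit that state, one stopgap (assuming the SDK version you have installed exposes it) is to clear the stuck queue from the client side with the endpoint's purge-queue call, roughly like this:

```python
import runpod

# Placeholders: set your own API key and endpoint ID.
runpod.api_key = "YOUR_API_KEY"
endpoint = runpod.Endpoint("YOUR_ENDPOINT_ID")

# Drops jobs still waiting in the queue (does not touch jobs already in
# progress). Method name per runpod-python's Endpoint client; verify it
# exists in the SDK version you are running.
print(endpoint.purge_queue())
```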
I am thinking of going back to 1.6.2; that was the last version that worked well for me.
I feel like the runpod-python SDK is not being actively developed; issues persist for too long.
@yhlong00000 can you please step in here?
Hey, sorry about this. We recently made a few changes to the SDK, and unfortunately, each version has its own issues. If you want to revert back to 1.6.2, that should work fine for now. We have an internal ticket tracking these problems, and our team is actively working on it. I’ll keep you updated once we have a more stable version.
Okay I'll revert to 1.6.2 then, thanks for the info
FYI - we have reverted to 1.7.0 as we have noticed that 1.6.2 has a lower FPS (we are processing frames in real time).
I ran into similar issues too. Containers just randomly get removed. No errors in the logs.
It seems to be fixed after switching to 1.6.2.
SDK 1.7.4 has been released. Thank you for your patience.
Just tried 1.7.4; it's not fixed. The worker didn't crash, but it did seem like as soon as CPU usage hit 100%, the container was removed by the worker immediately.
Can you provide me a pod ID? I can take a look at the logs.
@yhlong00000 I am still having issues with 1.7.4:
- Only one request converts to IN_PROGRESS; all the others stay in IN_QUEUE, even though the endpoint is set to accept multiple requests and there are available workers sitting idle. The tasks are long running. Worker ID for you to debug: ddywfiz37lbsaz
- Also, maybe a webUI bug, but I also had instances of jobs still showing as IN_PROGRESS in the webUI while the corresponding workers were no longer active (workers: jwesqcl6bb0194 and 1t9jehqvp73esy)
- I also had one instance of a 400 Bad Request with another endpoint: 2024-10-27T12:04:05.295467963Z {"requestId": "f277cfe0-45e1-4187-90f0-15abb69348c3-u1", "message": "Failed to return job results. | 400, message='Bad Request', url=URL('https://api.runpod.ai/v2/c2b******f1mf/job-done/jwesqcl6bb0194/f277cfe0-45e1-4187-90f0-15abb69348c3-u1?gpu=$RUNPOD_GPU_TYPE_ID&isStream=false')", "level": "ERROR"}
Reverted to v1.7.0, I find this version more efficient compared to 1.6.2.

For the xdqp27q6z2yjxf endpoint, I noticed it hasn't been able to scale up for a while. It seems like you made an update that triggered the system to replace the old worker. Could you try setting the max workers to 3 and see if you still experience the issue?
As for request f277cfe0, it occurred right after you triggered the update on that endpoint. The worker was busy handling one request, but it couldn’t process this new request within a minute, so it was sent back to the queue.
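For anyone trying to watch this happen, here is a small diagnostic sketch that polls the endpoint's health from the runpod-python client while requests are being submitted (the API key is a placeholder, the endpoint ID is the one from this thread, and the exact shape of the returned dict may vary by SDK version):

```python
import time
import runpod

runpod.api_key = "YOUR_API_KEY"               # placeholder
endpoint = runpod.Endpoint("xdqp27q6z2yjxf")  # endpoint ID mentioned above

# Poll the endpoint health to see queued/in-progress job counts and
# worker states while reproducing the scaling issue.
for _ in range(30):
    print(endpoint.health())
    time.sleep(10)
```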
I will retry it later today and let you know how it goes, thanks.
@yhlong00000 I have updated the RunPod SDK to 1.7.4 again today to see whether the issue of only one request going from IN_QUEUE to IN_PROGRESS still persists.
It still produces the same issue: only one request is processed at a time. My jobs are long running. Endpoint: j2j9d6odh3qsi3, 1 worker, with Queue Delay as the scaling strategy.
Note: I was using version 1.7.0 before and had no issue at all with it.
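One thing that might be relevant here, offered as a possibility rather than a diagnosis: with a single worker, the SDK processes one job at a time unless the handler is set up for in-worker concurrency. The runpod-python docs describe a concurrency_modifier option for that; a rough sketch (an async handler is required, and the option name should be double-checked against the SDK version in use):

```python
import asyncio
import runpod

async def handler(job):
    # Long-running work would go here; asyncio.sleep stands in for it.
    await asyncio.sleep(60)
    return {"status": "done"}

def concurrency_modifier(current_concurrency: int) -> int:
    # Allow up to 4 jobs to run concurrently in this worker.
    return 4

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```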