Job retry after successful run
My endpoint has started retrying every request, even though the first run completes successfully without any errors. I don't understand why this is happening.
This is what I see in the logs when the first run finishes and the retry starts:
2024-10-10T11:51:52.937738320Z {"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
2024-10-10T11:51:52.972812780Z {"requestId": "e5746a57-2af3-4849-84d1-b58d24480627-e1", "message": "Finished.", "level": "INFO"}
2024-10-10T11:51:52.972908181Z {"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
2024-10-10T11:51:52.973024343Z {"requestId": "e5746a57-2af3-4849-84d1-b58d24480627-e1", "message": "Started.", "level": "INFO"}
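For reference, the handler follows the standard runpod-python pattern and simply returns its result. A minimal sketch of that shape (illustrative only, with a made-up input field, not the actual handler code):

```python
import runpod  # runpod-python SDK installed in the container image

def handler(job):
    # "prompt" is a hypothetical input field, used here for illustration.
    prompt = job["input"].get("prompt", "")
    # Returning a dict marks the job as completed; the SDK logs "Finished."
    return {"echo": prompt}

# Standard serverless entrypoint for the runpod-python SDK.
runpod.serverless.start({"handler": handler})
```

So the retry kicks in after the handler has already returned successfully.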
19 Replies
Turning off FlashBoot seems to have solved the problem, but I'm not sure; it might just be a coincidence.
For me, upgrading the SDK from 1.7.1 to 1.7.2 got rid of the retries.
Thanks, I'll try that.
Same issue.
How do I resolve this? I am using 1.7.2 and have turned off FlashBoot.
You mentioned the RunPod CLI/SDK? I'm not using any CLI, I just deploy to serverless from the dashboard, and I don't see any SDK selection there (my container image: nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04).
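A note on what "SDK" means here, in case it helps: it is the runpod-python package that your handler code imports inside the container image, not something you pick in the dashboard. The version a worker runs is whatever was pip-installed when the image was built (e.g. `runpod==1.7.2` pinned in requirements.txt). A quick, illustrative way to confirm which version a worker is actually running:

```python
# Illustrative check: log which runpod-python version this worker is running.
# Something like this can sit at the top of your handler file.
from importlib.metadata import version

print(f"runpod-python version: {version('runpod')}")
```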
Has anyone found a fix for this issue? I also get successful runs, but immediately afterwards the job retries and the worker subsequently fails:
2024-10-20 19:24:33.267 [cc1g0dj5wo63pu] [error] Failed to return job results. | 400, message='Bad Request'
Same issue. The job returns successfully, but the status always says IN_PROGRESS; eventually it retries and then the job times out after 1 retry.
Trying the SDK upgrade from 1.7.1 to 1.7.2, will see.
That seems to have resolved it.
Same issue, it started happening with 1.7.3.
Same here. Would downgrading to 1.7.2 be a good idea? I see other people have had better luck with 1.7.2.
1.7.2 had other issues that were worse IMO, like frozen requests that would just fill up the request queue.
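If you hit that state, one stopgap (assuming the SDK version you have installed exposes it) is to clear the stuck queue from the client side with the endpoint's purge-queue call, roughly like this:

```python
import runpod

# Placeholders: set your own API key and endpoint ID.
runpod.api_key = "YOUR_API_KEY"
endpoint = runpod.Endpoint("YOUR_ENDPOINT_ID")

# Drops jobs still waiting in the queue (does not touch jobs already in
# progress). Method name per runpod-python's Endpoint client; verify it
# exists in the SDK version you are running.
print(endpoint.purge_queue())
```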
I am thinking of going back to 1.6.2; that was the last version that worked well for me.
I feel like the runpod-python SDK is not being actively developed; issues persist for too long.
@yhlong00000 can you please step in here?
Hey, sorry about this. We recently made a few changes to the SDK, and unfortunately, each version has its own issues. If you want to revert back to 1.6.2, that should work fine for now. We have an internal ticket tracking these problems, and our team is actively working on it. I’ll keep you updated once we have a more stable version.
Okay I'll revert to 1.6.2 then, thanks for the info
FYI - we have reverted to 1.7.0 as we have noticed that 1.6.2 has a lower FPS (we are processing frames in real time).
I ran into similar issues too. Containers just randomly get removed. No errors in the logs.
It seems to be fixed after switching to 1.6.2.
SDK 1.7.4 has been released. Thank you for your patience.
Just tried 1.7.4; it's not fixed. The worker didn't crash, but it did seem like as soon as CPU usage hit 100%, the container was removed by the worker immediately.
Can you provide me a pod ID? I can take a look at the logs.
@yhlong00000 I am still having issues with 1.7.4:
- Only one request converts to IN_PROGRESS; all the others stay in IN_QUEUE, even though the endpoint is set to accept multiple requests and there are available workers sitting idle. The tasks are long running. Worker ID for you to debug: ddywfiz37lbsaz
- Also, maybe a webUI bug, but I also had instances of jobs still showing as IN_PROGRESS in the webUI while the corresponding workers were no longer active (workers: jwesqcl6bb0194 and 1t9jehqvp73esy)
- I also had one instance of a 400 Bad Request with another endpoint: 2024-10-27T12:04:05.295467963Z {"requestId": "f277cfe0-45e1-4187-90f0-15abb69348c3-u1", "message": "Failed to return job results. | 400, message='Bad Request', url=URL('https://api.runpod.ai/v2/c2b******f1mf/job-done/jwesqcl6bb0194/f277cfe0-45e1-4187-90f0-15abb69348c3-u1?gpu=$RUNPOD_GPU_TYPE_ID&isStream=false')", "level": "ERROR"}
Reverted to v1.7.0, I find this version more efficient compared to 1.6.2.

For the xdqp27q6z2yjxf endpoint, I noticed it hasn't been able to scale up for a while. It seems like you made an update that triggered the system to replace the old worker. Could you try setting the max workers to 3 and see if you still experience the issue?
As for request f277cfe0, it occurred right after you triggered the update on that endpoint. The worker was busy handling one request, but it couldn’t process this new request within a minute, so it was sent back to the queue.
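For anyone trying to watch this happen, here is a small diagnostic sketch that polls the endpoint's health from the runpod-python client while requests are being submitted (the API key is a placeholder, the endpoint ID is the one from this thread, and the exact shape of the returned dict may vary by SDK version):

```python
import time
import runpod

runpod.api_key = "YOUR_API_KEY"               # placeholder
endpoint = runpod.Endpoint("xdqp27q6z2yjxf")  # endpoint ID mentioned above

# Poll the endpoint health to see queued/in-progress job counts and
# worker states while reproducing the scaling issue.
for _ in range(30):
    print(endpoint.health())
    time.sleep(10)
```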
I will retry it later today and let you know how it goes, thanks.
@yhlong00000 I have updated the RunPod SDK to 1.7.4 again today to see whether the issue of only one request going from IN_QUEUE to IN_PROGRESS still persists.
It still produces the same issue: only one request is processed at a time. My jobs are long running. Endpoint: j2j9d6odh3qsi3, 1 worker, with Queue Delay as the scaling strategy.
Note: I was using version 1.7.0 before and had no issue at all with it.
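One thing that might be relevant here, offered as a possibility rather than a diagnosis: with a single worker, the SDK processes one job at a time unless the handler is set up for in-worker concurrency. The runpod-python docs describe a concurrency_modifier option for that; a rough sketch (an async handler is required, and the option name should be double-checked against the SDK version in use):

```python
import asyncio
import runpod

async def handler(job):
    # Long-running work would go here; asyncio.sleep stands in for it.
    await asyncio.sleep(60)
    return {"status": "done"}

def concurrency_modifier(current_concurrency: int) -> int:
    # Allow up to 4 jobs to run concurrently in this worker.
    return 4

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```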