Job retry after successful run
My endpoint has started retrying every request, even though the first run completes successfully without any errors. I don't understand why this is happening.
This is what I see in the logs when the first run finishes and the retry starts:
2024-10-10T11:51:52.937738320Z {"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
2024-10-10T11:51:52.972812780Z {"requestId": "e5746a57-2af3-4849-84d1-b58d24480627-e1", "message": "Finished.", "level": "INFO"}
2024-10-10T11:51:52.972908181Z {"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
2024-10-10T11:51:52.973024343Z {"requestId": "e5746a57-2af3-4849-84d1-b58d24480627-e1", "message": "Started.", "level": "INFO"}
6 Replies
Turning off FlashBoot seems to have solved the problem, but I'm not sure; it may just be a coincidence.
For me, upgrading the SDK from 1.7.1 to 1.7.2 got rid of the retries.
Thanks, I'll try that.
I have the same issue.
How do I resolve this? I am using 1.7.2 and have turned off FlashBoot.
You mentioned the RunPod CLI/SDK? I'm not using any CLI; I just deploy to serverless through the dashboard, and I don't see any SDK selection there (my container image: nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04).
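For anyone in the same spot: with a plain CUDA base image there is no SDK selector in the dashboard because the SDK version is simply whatever your image installs with pip, so the fix mentioned above amounts to pinning the `runpod` package at build time. A minimal sketch, assuming a pip-based image and a handler file named handler.py (both the file name and the apt/pip steps here are illustrative, not from this thread):

```dockerfile
FROM nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu20.04

# Pin the RunPod SDK; 1.7.2 is the version a reply above reported
# as fixing the retries (adjust as needed).
RUN apt-get update && apt-get install -y python3 python3-pip && \
    pip3 install --no-cache-dir runpod==1.7.2

# Hypothetical handler entrypoint for illustration
COPY handler.py /handler.py
CMD ["python3", "-u", "/handler.py"]
```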
Has anyone found a fix for this issue? I also get successful runs, but immediately afterwards the job retries and the worker subsequently fails:
2024-10-20 19:24:33.267 [cc1g0dj5wo63pu] [error] Failed to return job results. | 400, message='Bad Request'
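For what it's worth, that error suggests the worker finished but the platform rejected the result it tried to upload, which then surfaces as a retry. One thing worth ruling out is a return value the platform can't serialize. A minimal sketch of a handler that fails fast on that case; the echo logic is made up for illustration, while `runpod.serverless.start` is the standard SDK entrypoint:

```python
import json

def handler(job):
    # job["input"] holds the payload sent to the endpoint.
    result = {"echo": job["input"]}
    # The return value must be JSON-serializable; if uploading the
    # result fails, the job can be retried even though the run itself
    # succeeded. Serializing here raises TypeError early instead.
    json.dumps(result)
    return result

if __name__ == "__main__":
    import runpod  # RunPod Python SDK; pin with: pip install runpod==1.7.2
    runpod.serverless.start({"handler": handler})
```

Calling the handler directly (outside the SDK loop) is an easy way to sanity-check the return payload locally before deploying.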