Stuck IN_PROGRESS but job completed and worker exited

{
  "delayTime": 11461,
  "executionTime": 35548,
  "id": "8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1",
  "output": [
    {
      "blob_name": "09-24/8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1/8b33f799.png",
      "seed": 3878033049
    },
    {
      "blob_name": "09-24/8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1/2372e60a.png",
      "seed": 3878033050
    },
    {
      "blob_name": "09-24/8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1/2bb5e600.png",
      "seed": 3878033051
    },
    {
      "blob_name": "09-24/8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1/2c4b435b.png",
      "seed": 3878033052
    }
  ],
  "status": "IN_PROGRESS",
  "workerId": "2n0sjjtsxyclx8"
}
15 Replies
rougsig
rougsig4w ago
And the webhook is not sent. I'm using a generator handler (not async).
nerdylive
nerdylive4w ago
That's weird. What does your code look like?
Mihály
Mihály4w ago
I'm having the same issue. SDK 1.6.2, 1.7.0 and 1.7.1 all produce this, although very rarely: sometimes 1 out of 6 jobs, sometimes 1 out of 30. Re-submitting the same payload can run without issues the second time.
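A client could flag this symptom before deciding to re-submit. This is only a minimal sketch, assuming the status response has the shape posted at the top of the thread. `looks_stuck`, `seconds_unchanged` and `grace_s` are hypothetical names, not part of the RunPod SDK, and the check may give false positives for handlers that legitimately attach partial output while still running.

```python
def looks_stuck(status: dict, seconds_unchanged: float, grace_s: float = 60.0) -> bool:
    """Heuristic: the job still reports IN_PROGRESS although output is attached
    and the response has not changed for at least `grace_s` seconds."""
    in_progress = status.get("status") == "IN_PROGRESS"
    has_output = bool(status.get("output"))
    return in_progress and has_output and seconds_unchanged >= grace_s


# Trimmed copy of the status response from the original post.
sample = {
    "id": "8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1",
    "executionTime": 35548,
    "output": [{"seed": 3878033049}],
    "status": "IN_PROGRESS",
}

stuck_after_two_minutes = looks_stuck(sample, seconds_unchanged=120.0)
stuck_after_five_seconds = looks_stuck(sample, seconds_unchanged=5.0)
```

How long to wait before treating a job as stuck is a judgment call; the 60-second default here is an arbitrary placeholder.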
Mihály
Mihály4w ago
[Two attachments, no description]
Mihály
Mihály4w ago
I don't know if this helps, but 1.7.1 is also more verbose with the errors:
"{"trace_id": "4453ebf4-f262-4e44-a422-d6f3691ac250", "request_id": "7b2b6fc3-2968-407c-bb29-97a14b07f238-e1", "user_agent": "RunPod-Python-SDK/1.7.1 (Linux 6.2.0-34-generic; x86_64) Language/Python 3.10.12", "start_time": "2024-09-29T15:54:28.322649+00:00", "method": "GET", "url": "https://api.runpod.ai/v2/74jm2u3liu0pcy/job-take/li0g2epzy6h0eu?gpu=NVIDIA GeForce RTX 4090&job_in_progress=0", "mode": "async", "connect": 0.2, "payload_size_bytes": 0, "exception": "", "transfer": 812777.8, "end_time": "2024-09-29T16:08:01.100656+00:00", "total": 812778.0}"
"payload_size_bytes": 0 <-- seems sus?
nerdylive
nerdylive4w ago
Guys, try opening a support ticket for this issue and include the example code too. Do you all use .progress_update() by any chance?
Mihály
Mihály4w ago
On my side, it was added to the code after the above issue started happening. It didn't affect the outcome.
deanQ
deanQ2w ago
We recently fixed a bug and released the fix in 1.7.2. The bug caused our platform to disregard workers that were currently working on a job. So if a job took longer than an endpoint's idle time (for example), the platform would put that worker to sleep. By the time the job was finished, it would have no worker to report back to.
Mihály
Mihály2w ago
Hey @deanQ, I upgraded to 1.7.2 (5-6 hours ago) but I'm still getting stuck jobs the same way. 😦
deanQ
deanQ2w ago
I have looked at the logs of your endpoint 74jm2u3liu0pcy. It still says it's using 1.6.2 all week. Could it be a different endpoint ID?
Mihály
Mihály2w ago
Yeah, it's noxhy2en39n3y3, my dev endpoint.
deanQ
deanQ2w ago
Not the best of logs. That field actually refers to the size of the body payload on POST or PUT requests. GET requests have none.
Mihály
Mihály2w ago
I'm not sure I follow 😄
deanQ
deanQ2w ago
I was referring to this: "payload_size_bytes": 0 <-- seems sus? It's always going to be zero for GET requests. A payload only exists for POST or PUT requests.
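To read such a log record, a short sketch like the following can help. It uses only the Python standard library and a trimmed copy of the record quoted earlier in the thread; `summarize` is a hypothetical helper name. It shows that the request is a GET with no body, and compares the `total` field with the gap between the two timestamps, which suggests (as an inference from this one record, not a documented fact) that `total` is in milliseconds.

```python
import json
from datetime import datetime

# Trimmed copy of the log record quoted earlier in the thread.
LOG_LINE = (
    '{"method": "GET", "payload_size_bytes": 0, '
    '"start_time": "2024-09-29T15:54:28.322649+00:00", '
    '"end_time": "2024-09-29T16:08:01.100656+00:00", '
    '"total": 812778.0}'
)


def summarize(line: str) -> dict:
    """Extract the method, whether a request body was sent, and the elapsed time."""
    record = json.loads(line)
    start = datetime.fromisoformat(record["start_time"])
    end = datetime.fromisoformat(record["end_time"])
    elapsed_ms = (end - start).total_seconds() * 1000
    return {
        "method": record["method"],
        "has_body": record["payload_size_bytes"] > 0,
        "elapsed_ms": round(elapsed_ms, 1),
        "reported_total_ms": record["total"],
    }


summary = summarize(LOG_LINE)
print(summary)
```

On this record, the elapsed time computed from the timestamps agrees with `total` to within a millisecond, so the job-take request was open for roughly 13.5 minutes.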
Mihály
Mihály2w ago
Ah, makes sense!