RunPod · 4mo ago
rougsig

Stuck IN_PROGRESS but job completed and worker exited

{
  "delayTime": 11461,
  "executionTime": 35548,
  "id": "8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1",
  "output": [
    {
      "blob_name": "09-24/8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1/8b33f799.png",
      "seed": 3878033049
    },
    {
      "blob_name": "09-24/8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1/2372e60a.png",
      "seed": 3878033050
    },
    {
      "blob_name": "09-24/8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1/2bb5e600.png",
      "seed": 3878033051
    },
    {
      "blob_name": "09-24/8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1/2c4b435b.png",
      "seed": 3878033052
    }
  ],
  "status": "IN_PROGRESS",
  "workerId": "2n0sjjtsxyclx8"
}
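A client that polls the job status could flag this state with a small check. This is only a sketch: `looks_stuck` is a hypothetical helper, not part of any SDK, and it assumes that a job which already carries `output` should no longer report `IN_PROGRESS`. That assumption may not hold while a streaming job is still legitimately producing partial output.

```python
import json


def looks_stuck(status_payload: dict) -> bool:
    """Return True when a job reports output but is still marked IN_PROGRESS."""
    has_output = bool(status_payload.get("output"))
    return has_output and status_payload.get("status") == "IN_PROGRESS"


# Trimmed copy of the response above (one output item kept for brevity).
response = json.loads("""
{
  "id": "8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1",
  "output": [
    {"blob_name": "09-24/8612f7b4-df33-4be9-8ce6-1a82b7283b24-e1/8b33f799.png",
     "seed": 3878033049}
  ],
  "status": "IN_PROGRESS",
  "workerId": "2n0sjjtsxyclx8"
}
""")

print(looks_stuck(response))
```

Such a check is most useful once the worker is known to have exited, since only then is `IN_PROGRESS` with output clearly wrong.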
15 Replies
rougsig (OP) · 4mo ago
And the webhook is not sent. I'm using a generator handler (not async).
nerdylive · 4mo ago
That's weird. What does your code look like?
Mihály · 4mo ago
I'm having the same issue. SDK 1.6.2, 1.7.0 and 1.7.1 all produce this, though very rarely: sometimes 1 out of 6 jobs, sometimes 1 out of 30. Re-submitting the same payload can run without issues the second time.
Mihály · 4mo ago
I don't know if this helps, but 1.7.1 is also more verbose with the errors:
"{"trace_id": "4453ebf4-f262-4e44-a422-d6f3691ac250", "request_id": "7b2b6fc3-2968-407c-bb29-97a14b07f238-e1", "user_agent": "RunPod-Python-SDK/1.7.1 (Linux 6.2.0-34-generic; x86_64) Language/Python 3.10.12", "start_time": "2024-09-29T15:54:28.322649+00:00", "method": "GET", "url": "https://api.runpod.ai/v2/74jm2u3liu0pcy/job-take/li0g2epzy6h0eu?gpu=NVIDIA GeForce RTX 4090&job_in_progress=0", "mode": "async", "connect": 0.2, "payload_size_bytes": 0, "exception": "", "transfer": 812777.8, "end_time": "2024-09-29T16:08:01.100656+00:00", "total": 812778.0}"
"payload_size_bytes": 0 <-- seems sus?
nerdylive · 4mo ago
Guys, try opening a support ticket for this issue and include the example code too. Do you all use .progress_update() by any chance?
Mihály · 4mo ago
On my side, it was added to the code after the above issue started happening. It didn't affect the outcome.
deanQ · 4mo ago
We recently fixed a bug and released the fix in 1.7.2. The bug caused our platform to disregard workers that were currently working on a job. So if a job took longer than the endpoint's idle timeout (for example), the platform would put that worker to sleep. By the time the job finished, it would have no worker to report back to.
Mihály · 4mo ago
Hey @deanQ I've upgraded to 1.7.2 (5-6 hours ago) but still getting stuck jobs the same way. 😦
deanQ · 4mo ago
I have looked at the logs of your endpoint 74jm2u3liu0pcy. It still says it's using 1.6.2 all week. Could it be a different endpoint ID?
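One way to confirm which SDK version a worker actually reports is to read it from the `user_agent` field of its trace logs. The sketch below assumes the user-agent format seen in the log line quoted earlier in this thread; `sdk_version` and `FIXED_IN` are hypothetical names for illustration.

```python
import re

FIXED_IN = (1, 7, 2)  # release that deanQ says contains the fix


def sdk_version(user_agent: str):
    """Extract (major, minor, patch) from a RunPod SDK user-agent string, or None."""
    match = re.search(r"RunPod-Python-SDK/(\d+)\.(\d+)\.(\d+)", user_agent)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


# User-agent value copied from the trace log posted earlier in the thread.
ua = "RunPod-Python-SDK/1.7.1 (Linux 6.2.0-34-generic; x86_64) Language/Python 3.10.12"
version = sdk_version(ua)
has_fix = version is not None and version >= FIXED_IN

print(version, has_fix)
```

Comparing tuples of integers avoids the pitfalls of comparing version strings directly (for example, "1.10.0" sorting before "1.7.2").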
Mihály · 4mo ago
Yeah, it's noxhy2en39n3y3, my dev endpoint.
deanQ · 4mo ago
Not the best of logs. That field actually refers to the size of the body payload on POST or PUT requests. GET requests have none.
Mihály · 4mo ago
I'm not sure I follow 😄
deanQ · 4mo ago
I was referring to this:
"payload_size_bytes": 0 <-- seems sus?
It's always going to be zero for GET requests. A payload only exists for POST or PUT requests.
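The same point can be illustrated with Python's standard library. This is a sketch, not the SDK's logging code: `payload_size_bytes` here is a hypothetical helper that mirrors the log field's name, the URLs are placeholders, and no request is actually sent.

```python
from urllib.request import Request


def payload_size_bytes(request: Request) -> int:
    """Size of the request body in bytes; 0 when the request has no body."""
    return len(request.data) if request.data is not None else 0


# Placeholder URLs; the requests are only constructed, never sent.
get_req = Request("https://example.invalid/poll")
post_req = Request("https://example.invalid/result", data=b'{"output": "ok"}')

print(get_req.get_method(), payload_size_bytes(get_req))
print(post_req.get_method(), payload_size_bytes(post_req))
```

A request built without a body defaults to GET and reports a size of zero, which matches what the trace log shows for the `job-take` call.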
Mihály · 4mo ago
Ah, makes sense!
