Job suddenly restarts and fails after one retry.
I am desperately trying to get our custom LoRA training (using koha_ss) running on your serverless workers. After training for a few epochs, it suddenly stops/restarts.
I already tried adjusting the timeout value, both via the UI and in the request. Here is some basic info about the request and response. I can provide further details and logs via DM if you need more insight.
Request:
{
  "input": {
    "task": "train_lora",
    "job_id": "dev-test-12",
    "animal_type": "dog"
  },
  "policy": {
    "executionTimeout": 3600000,
    "ttl": 86400000
  },
  "webhook": "https://webhook.site/xxx"
}
Responses:
{
  "delayTime": 4179,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "IN_PROGRESS",
  "workerId": "s8gg7p09azjtqr"
}
{
  "delayTime": 394386,
  "error": "job timed out after 1 retries",
  "executionTime": 61170,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "FAILED",
  "workerId": "s8gg7p09azjtqr"
}
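For context, this is roughly how the job gets submitted and polled (a minimal sketch – ENDPOINT_ID and the API key are placeholders, and it assumes the standard /run and /status routes of the serverless API):

import os
import time
import requests

# Placeholders – substitute your own endpoint ID and API key.
ENDPOINT_ID = "xxxxxxxx"
API_KEY = os.environ["RUNPOD_API_KEY"]
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

payload = {
    "input": {"task": "train_lora", "job_id": "dev-test-12", "animal_type": "dog"},
    "policy": {"executionTimeout": 3600000, "ttl": 86400000},  # milliseconds
    "webhook": "https://webhook.site/xxx",
}

# Submit the job asynchronously.
job = requests.post(f"{BASE_URL}/run", headers=HEADERS, json=payload).json()
job_id = job["id"]

# Poll the status every 30 s and print each snapshot, so restarts/retries are visible.
while True:
    status = requests.get(f"{BASE_URL}/status/{job_id}", headers=HEADERS).json()
    print(status)
    if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(30)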
6 Replies
Is there any error? Logs?
I see there's a retry, which hints that something stopped the previous worker and retried the job.
Hey @nerdylive!
I cannot spot any errors in the attached logs – you can see how it suddenly restarts at epoch 4.
The same request with fewer training images (provided for testing) runs just fine.
Testing it locally with curl (via /runsync – because /run doesn't work due to a known bug) works fine.
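The local test is basically this (a sketch – it assumes the handler is started with the SDK's --rp_serve_api flag and the default port 8000):

import requests

# Local test API served by `python handler.py --rp_serve_api` (port 8000 assumed).
payload = {
    "input": {"task": "train_lora", "job_id": "dev-test-12", "animal_type": "dog"}
}

# /runsync blocks until the handler returns, which is why I use it instead of /run.
resp = requests.post("http://localhost:8000/runsync", json=payload)
print(resp.json())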
Is that abnormal? I don't know why it does that, sorry. What do you think might be causing that retry, though?
Have you tried GPUs with more VRAM? (I'm guessing it's OOM, but the logs aren't showing anything.)
I think I can rule out OOM because it runs on my local machine with 12GB of VRAM (with exactly the same settings), while the worker GPU has 24GB. Also, the RunPod UI shows that only a fraction of the worker's resources is being used (something I should definitely optimize after fixing the current issue, lol).
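To double-check memory anyway, I could log VRAM usage per epoch with something like this (a sketch assuming a PyTorch-based training loop; the per-epoch hook is hypothetical):

import torch

def log_vram(tag: str) -> None:
    """Print currently allocated / peak / total VRAM on the first CUDA device."""
    if not torch.cuda.is_available():
        print(f"[{tag}] no CUDA device available", flush=True)
        return
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated() / gib
    peak = torch.cuda.max_memory_allocated() / gib
    total = torch.cuda.get_device_properties(0).total_memory / gib
    print(f"[{tag}] VRAM allocated={allocated:.2f} GiB, peak={peak:.2f} GiB, total={total:.2f} GiB", flush=True)

# Hypothetical usage inside the training loop:
# for epoch in range(num_epochs):
#     train_one_epoch(...)
#     log_vram(f"epoch {epoch}")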
Hmm, okay, nice. But to debug the problem you'll need to take some steps to get context about what's making the worker stop.
Try monitoring and checking the logs.
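For example, a heartbeat from inside the handler – one progress update per epoch – so the last line before the restart tells you exactly where it died (a rough sketch; I'm assuming the runpod Python SDK's progress_update here, and run_one_epoch is just a placeholder for your actual training step):

import runpod

def handler(job):
    job_input = job["input"]
    num_epochs = int(job_input.get("epochs", 10))

    for epoch in range(num_epochs):
        # Placeholder for the actual LoRA training step:
        # run_one_epoch(job_input, epoch)

        # Heartbeat: shows up in the worker logs / job status, so the last
        # update before a restart pins down which epoch it died in.
        runpod.serverless.progress_update(job, f"epoch {epoch + 1}/{num_epochs} done")

    return {"status": "trained"}

runpod.serverless.start({"handler": handler})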