Job suddenly restarts and fails after one retry.

I am trying desperately to get our custom LoRA training using kohya_ss running on your serverless workers. After training for a few epochs it suddenly stops and restarts. I already tried adjusting the timeout value via the UI and in the request. Here is some basic info about the request and responses. I can provide further details and logs via DM if you need more insight.

Request:
{
  "input": {
    "task": "train_lora",
    "job_id": "dev-test-12",
    "animal_type": "dog"
  },
  "policy": {
    "executionTimeout": 3600000,
    "ttl": 86400000
  },
  "webhook": "https://webhook.site/xxx"
}

Responses:
{
  "delayTime": 4179,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "IN_PROGRESS",
  "workerId": "s8gg7p09azjtqr"
}
{
  "delayTime": 394386,
  "error": "job timed out after 1 retries",
  "executionTime": 61170,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "FAILED",
  "workerId": "s8gg7p09azjtqr"
}
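As a quick sanity check on the numbers above, here is a minimal sketch that converts the time fields to readable units and compares the failed attempt's run time against the configured timeout. It assumes `executionTimeout`, `ttl`, `delayTime` and `executionTime` are all in milliseconds; `summarize` is a hypothetical helper name, not part of any SDK.

```python
import json

# Policy from the request and the final status response, copied from the post above.
policy = json.loads('{"executionTimeout": 3600000, "ttl": 86400000}')
failed = json.loads("""
{
  "delayTime": 394386,
  "error": "job timed out after 1 retries",
  "executionTime": 61170,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "FAILED",
  "workerId": "s8gg7p09azjtqr"
}
""")


def summarize(policy: dict, status: dict) -> dict:
    """Hypothetical helper: convert the time fields and compare them.

    Assumption: every time field is expressed in milliseconds.
    """
    timeout_s = policy["executionTimeout"] / 1000
    executed_s = status["executionTime"] / 1000
    return {
        "timeout_min": timeout_s / 60,
        "ttl_h": policy["ttl"] / 1000 / 3600,
        "delay_s": status["delayTime"] / 1000,
        "executed_s": executed_s,
        # True only if the reported run time reached the configured timeout.
        "hit_execution_timeout": executed_s >= timeout_s,
    }


summary = summarize(policy, failed)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under that assumption the reported execution time (about 61 s) is nowhere near the 60 min `executionTimeout`, so the "timed out" error does not look like it comes from the timeout set in the policy. Whether `executionTime` covers only the last attempt or the whole job is not something the response makes clear.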
6 Replies
nerdylive (3mo ago)
Is there any error? Any logs? I see there's a retry, which hints that something stopped the previous worker and the job was retried.
landingpagelover24 (OP, 3mo ago)
Hey @nerdylive! I cannot spot any errors in the attached logs; you can see how it suddenly restarts at epoch 4. The same request with fewer training images, provided for testing, runs fine. Testing it locally with curl (via /runsync, because /run doesn't work due to a known bug) also works fine.
nerdylive (3mo ago)
Is it abnormal? I don't know why it does that, sorry. What do you think might cause that retry, though? Have you tried GPUs with more VRAM? (I'm guessing it's OOM, but the logs aren't showing anything.)
landingpagelover24 (OP, 3mo ago)
I think I can rule out OOM because it runs on my local machine with 12GB of VRAM (with exactly the same settings) while the worker GPU has 24GB. Also, RunPod UI shows that only a fraction of worker resources are being used (this is something I should definitely optimize after fixing the current issue lol).
landingpagelover24 (OP, 3mo ago)
[Image attachment, no description]
nerdylive (3mo ago)
Hmm, okay, nice. But to debug the problem you need to get some context about what's making the worker stop. Try monitoring it and checking the logs.
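One way to get that context is to make the worker log more about itself. Below is a minimal sketch, assuming the handler is a Python process and you can call these functions before and during training. It logs any termination signal the process receives and prints a heartbeat line per step, so a gap in the heartbeats shows roughly when the worker went away. `install_signal_logging` and `heartbeat` are hypothetical helper names, and a hard kill (SIGKILL) cannot be caught this way.

```python
import logging
import os
import signal
import sys
import time

logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("train_debug")


def install_signal_logging(signums=(signal.SIGTERM, signal.SIGINT)):
    """Log termination signals before the process goes down.

    Must be called from the main thread.
    """
    def handler(signum, frame):
        log.warning("pid %d received %s", os.getpid(), signal.Signals(signum).name)
        # Restore the default action and re-send the signal so shutdown proceeds.
        signal.signal(signum, signal.SIG_DFL)
        os.kill(os.getpid(), signum)

    for signum in signums:
        signal.signal(signum, handler)


def heartbeat(epoch: int, step: int, started: float) -> str:
    """Log one progress line; call it from the training loop."""
    uptime = time.monotonic() - started
    line = f"heartbeat epoch={epoch} step={step} uptime_s={uptime:.1f}"
    log.info(line)
    return line
```

Usage would be something like calling `install_signal_logging()` once at startup, storing `started = time.monotonic()`, and then calling `heartbeat(epoch, step, started)` inside the loop. If the training runs as a subprocess, the signal handler only covers the parent process.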