Job suddenly restarts and fails after one retry.
I am desperately trying to get our custom LoRA training (using koha_ss) running on your serverless workers. After training for a few epochs, it suddenly stops/restarts.
I already tried adjusting the timeout value, both via the UI and in the request. Here is some basic info about the request and response. I can provide further details and logs via DM if you need more insight.
Request:
{
  "input": {
    "task": "train_lora",
    "job_id": "dev-test-12",
    "animal_type": "dog"
  },
  "policy": {
    "executionTimeout": 3600000,
    "ttl": 86400000
  },
  "webhook": "https://webhook.site/xxx"
}
Responses:
{
  "delayTime": 4179,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "IN_PROGRESS",
  "workerId": "s8gg7p09azjtqr"
}
{
  "delayTime": 394386,
  "error": "job timed out after 1 retries",
  "executionTime": 61170,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "FAILED",
  "workerId": "s8gg7p09azjtqr"
}
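For context, this is roughly how the job gets submitted and polled (a minimal sketch – ENDPOINT_ID and the API key are placeholders, and it assumes the standard /run and /status routes of the serverless API):

import os
import time
import requests

# Placeholders – substitute your own endpoint ID and API key.
ENDPOINT_ID = "xxxxxxxx"
API_KEY = os.environ["RUNPOD_API_KEY"]
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

payload = {
    "input": {"task": "train_lora", "job_id": "dev-test-12", "animal_type": "dog"},
    "policy": {"executionTimeout": 3600000, "ttl": 86400000},  # milliseconds
    "webhook": "https://webhook.site/xxx",
}

# Submit the job asynchronously.
job = requests.post(f"{BASE_URL}/run", headers=HEADERS, json=payload).json()
job_id = job["id"]

# Poll the status every 30 s and print each snapshot, so restarts/retries are visible.
while True:
    status = requests.get(f"{BASE_URL}/status/{job_id}", headers=HEADERS).json()
    print(status)
    if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(30)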
6 Replies
Is there any error? Logs?
I see there's a retry, which hints that something stopped the previous worker and retried the job.
Hey @nerdylive!
I cannot spot any errors in the attached logs – you can see how it suddenly restarts at epoch 4.
The same request with fewer training images (provided for testing) runs just fine.
Testing it locally with curl (via /runsync – because /run doesn't work due to a known bug) works fine.
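The local test is basically this (a sketch – it assumes the handler is started with the SDK's --rp_serve_api flag and the default port 8000):

import requests

# Local test API served by `python handler.py --rp_serve_api` (port 8000 assumed).
payload = {
    "input": {"task": "train_lora", "job_id": "dev-test-12", "animal_type": "dog"}
}

# /runsync blocks until the handler returns, which is why I use it instead of /run.
resp = requests.post("http://localhost:8000/runsync", json=payload)
print(resp.json())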
Is that abnormal? I don't know why it does that, sorry. What do you think might be causing that retry, though?
Have you tried GPUs with more VRAM? (I'm guessing it's OOM, but the logs aren't showing anything.)
I think I can rule out OOM because it runs on my local machine with 12GB of VRAM (with exactly the same settings), while the worker GPU has 24GB. Also, the RunPod UI shows that only a fraction of the worker's resources is being used (something I should definitely optimize after fixing the current issue, lol).
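To double-check memory anyway, I could log VRAM usage per epoch with something like this (a sketch assuming a PyTorch-based training loop; the per-epoch hook is hypothetical):

import torch

def log_vram(tag: str) -> None:
    """Print currently allocated / peak / total VRAM on the first CUDA device."""
    if not torch.cuda.is_available():
        print(f"[{tag}] no CUDA device available", flush=True)
        return
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated() / gib
    peak = torch.cuda.max_memory_allocated() / gib
    total = torch.cuda.get_device_properties(0).total_memory / gib
    print(f"[{tag}] VRAM allocated={allocated:.2f} GiB, peak={peak:.2f} GiB, total={total:.2f} GiB", flush=True)

# Hypothetical usage inside the training loop:
# for epoch in range(num_epochs):
#     train_one_epoch(...)
#     log_vram(f"epoch {epoch}")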
Hmm, okay, nice. But to debug the problem you'll need to take some steps to get context about what's making the worker stop.
Try monitoring and checking the logs.
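For example, a heartbeat from inside the handler – one progress update per epoch – so the last line before the restart tells you exactly where it died (a rough sketch; I'm assuming the runpod Python SDK's progress_update here, and run_one_epoch is just a placeholder for your actual training step):

import runpod

def handler(job):
    job_input = job["input"]
    num_epochs = int(job_input.get("epochs", 10))

    for epoch in range(num_epochs):
        # Placeholder for the actual LoRA training step:
        # run_one_epoch(job_input, epoch)

        # Heartbeat: shows up in the worker logs / job status, so the last
        # update before a restart pins down which epoch it died in.
        runpod.serverless.progress_update(job, f"epoch {epoch + 1}/{num_epochs} done")

    return {"status": "trained"}

runpod.serverless.start({"handler": handler})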