•Created by landingpagelover24 on 10/2/2024 in #⚡|serverless
Job suddenly restarts and fails after one retry.
I am trying desperately to get our custom LoRA training with kohya_ss running on your serverless workers. After training for a few epochs, the job suddenly stops and restarts.
I have already tried adjusting the timeout value both in the UI and in the request. Below is some basic information about the request and the responses. I can provide further details and logs via DM if you need more insight.
Request:
{
  "input": {
    "task": "train_lora",
    "job_id": "dev-test-12",
    "animal_type": "dog"
  },
  "policy": {
    "executionTimeout": 3600000,
    "ttl": 86400000
  },
  "webhook": "https://webhook.site/xxx"
}
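For reference, here is a minimal sketch of how the policy values above can be derived, assuming both executionTimeout and ttl are interpreted in milliseconds. The helper name build_policy is hypothetical and only illustrates the unit conversion; it is not part of any SDK.

```python
import json

MS_PER_SECOND = 1000


def build_policy(execution_timeout_s: int, ttl_s: int) -> dict:
    """Return a policy dict with both limits converted to milliseconds.

    Assumption: the endpoint reads executionTimeout and ttl as milliseconds.
    """
    return {
        "executionTimeout": execution_timeout_s * MS_PER_SECOND,
        "ttl": ttl_s * MS_PER_SECOND,
    }


# Same payload as in the request above: 1 hour execution timeout, 24 hour TTL.
payload = {
    "input": {
        "task": "train_lora",
        "job_id": "dev-test-12",
        "animal_type": "dog",
    },
    "policy": build_policy(execution_timeout_s=60 * 60, ttl_s=24 * 60 * 60),
    "webhook": "https://webhook.site/xxx",
}

print(json.dumps(payload, indent=2))
```

Under that assumption, the request asks for up to 60 minutes of execution and keeps the job alive for up to 24 hours.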
Responses:
{
  "delayTime": 4179,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "IN_PROGRESS",
  "workerId": "s8gg7p09azjtqr"
}
{
  "delayTime": 394386,
  "error": "job timed out after 1 retries",
  "executionTime": 61170,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "FAILED",
  "workerId": "s8gg7p09azjtqr"
}
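To put the numbers side by side, here is a small sketch that compares the reported executionTime of the failed response with the executionTimeout from the request. It assumes all time fields are in milliseconds; the function name summarize is hypothetical and only does arithmetic on the values already shown above.

```python
# Timeout requested in the policy above (assumed to be milliseconds).
requested_timeout_ms = 3_600_000

# Relevant fields copied from the FAILED response above.
failed_response = {
    "delayTime": 394386,
    "error": "job timed out after 1 retries",
    "executionTime": 61170,
    "retries": 1,
    "status": "FAILED",
}


def summarize(response: dict, timeout_ms: int) -> dict:
    """Compare the reported execution time against the requested timeout."""
    execution_ms = response["executionTime"]
    return {
        "execution_s": execution_ms / 1000,
        "delay_s": response["delayTime"] / 1000,
        "fraction_of_timeout": execution_ms / timeout_ms,
        "hit_requested_timeout": execution_ms >= timeout_ms,
    }


summary = summarize(failed_response, requested_timeout_ms)
for key, value in summary.items():
    print(f"{key}: {value}")
```

If the units are as assumed, the reported execution time is about a minute, far below the one hour requested, so the "timed out" error does not seem to come from the executionTimeout value in the request.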