landingpagelover24
RunPod
Created by landingpagelover24 on 10/2/2024 in #⚡|serverless
Job suddenly restarts and fails after one retry.
I am trying desperately to get our custom LoRA training with kohya_ss running on your serverless workers. After training for a few epochs, the job suddenly stops and restarts. I have already tried adjusting the timeout value, both via the UI and in the request. Here is some basic info about the request and the responses. I can provide further details and logs via DM if you need more insight.

Request:

{
  "input": {
    "task": "train_lora",
    "job_id": "dev-test-12",
    "animal_type": "dog"
  },
  "policy": {
    "executionTimeout": 3600000,
    "ttl": 86400000
  },
  "webhook": "https://webhook.site/xxx"
}

Responses:

{
  "delayTime": 4179,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "IN_PROGRESS",
  "workerId": "s8gg7p09azjtqr"
}

{
  "delayTime": 394386,
  "error": "job timed out after 1 retries",
  "executionTime": 61170,
  "id": "9fd09c57-dea4-4ea6-b30b-2c77ed4bd35b-e1",
  "retries": 1,
  "status": "FAILED",
  "workerId": "s8gg7p09azjtqr"
}
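For anyone comparing the numbers in this post, here is a rough sketch that checks whether the reported execution time ever got close to the requested timeout. It assumes that `executionTimeout`, `executionTime`, and `delayTime` are all in milliseconds; the `summarize` helper is hypothetical and not part of any RunPod SDK.

```python
# Values copied from the request and the FAILED response in this post.
REQUEST_POLICY = {"executionTimeout": 3600000, "ttl": 86400000}
FAILED_RESPONSE = {
    "delayTime": 394386,
    "error": "job timed out after 1 retries",
    "executionTime": 61170,
    "retries": 1,
    "status": "FAILED",
}


def summarize(policy: dict, response: dict) -> dict:
    """Convert the raw fields to readable units (assumes milliseconds)."""
    timeout_ms = policy["executionTimeout"]
    executed_ms = response["executionTime"]
    return {
        "timeout_min": timeout_ms / 60_000,
        "executed_s": executed_ms / 1_000,
        "delay_s": response["delayTime"] / 1_000,
        # True only if the job ran at least as long as the requested timeout.
        "hit_timeout": executed_ms >= timeout_ms,
    }


summary = summarize(REQUEST_POLICY, FAILED_RESPONSE)
print(summary)
```

Under that assumption, the job reports roughly 61 seconds of execution against a 60-minute timeout, so the "timed out" error does not look like the requested `executionTimeout` being reached.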