worker keeps dying while training a lora model
Even after setting the worker to be active, it keeps dying after about 2 minutes. Is there a way to prevent this?
Hmm, yeah, I wonder if this is normal. The idle timeout doesn't seem to work even though the worker is set to active like it's supposed to be.
removing execution timeout fixed it
@Tim aka NERDDISCO this may be a bug in RunPod
@shawtyisaten would you mind providing the endpoint ID or some more info about the Docker image you used?
I'm not sure if it's a bug. I think it worked as intended, since I had set the execution timeout. The endpoint ID is z398ywur6g1041. The Docker image is a custom one I made for training a Flux LoRA model.
I just thought it was unexpected because I don't remember checking that box. I think it's checked by default when you create a worker.
This behavior is intentional. The execution timeout is designed to prevent a worker from running indefinitely, which could happen if there’s a bug in the code or a long-running process that could potentially drain all your credits.
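For anyone who finds this thread later: instead of unchecking the execution timeout entirely, you can keep the safety net and just give long training jobs more headroom. Below is a minimal sketch of submitting a job with a longer per-request execution policy. This assumes RunPod's `/run` endpoint accepts a `policy.executionTimeout` field in milliseconds and a `Bearer` auth header (check the current RunPod docs to confirm); the endpoint ID, input fields, and timeout value are placeholders for your own setup.

```python
import os
import requests

# Placeholder endpoint ID; replace with your own serverless endpoint.
ENDPOINT_ID = "z398ywur6g1041"
RUN_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run"

payload = {
    "input": {
        # Whatever your custom LoRA-training image expects (hypothetical fields).
        "dataset_url": "https://example.com/dataset.zip",
        "steps": 2000,
    },
    # Assumption: per-request execution policy with executionTimeout in milliseconds.
    # Raising the cap keeps the worker from being killed mid-training while still
    # stopping runaway jobs from draining credits.
    "policy": {
        "executionTimeout": 2 * 60 * 60 * 1000,  # 2 hours
    },
}

resp = requests.post(
    RUN_URL,
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # returns the job ID you can poll via /status/{id}
```

This way you keep the protection yhlong00000 described but the timeout is sized to the actual training run rather than the default.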
ohh
@yhlong00000 thanks for the clarification!