Cancelling a job resets FlashBoot
For some reason, whenever we cancel a job, the next time the serverless worker cold boots it doesn't use FlashBoot and instead reloads the LLM model weights into the GPU from scratch. Any idea why cancelling jobs might be causing this? Is there a more graceful way to stop jobs early than the `/cancel/{job_id}` endpoint?
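For context, we're cancelling jobs roughly like this (the endpoint ID, job ID, and API key below are placeholders):

```python
import requests  # assumes the requests library is installed

ENDPOINT_ID = "your-endpoint-id"   # placeholder
JOB_ID = "the-job-to-cancel"       # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

# POST to the serverless endpoint's cancel route for a specific job.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/cancel/{JOB_ID}",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
print(resp.status_code, resp.json())
```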
3 Replies
I'm not sure; maybe cancelling refreshes the worker and causes the model to be unloaded from VRAM.
You can't stop jobs other than through the /cancel API endpoint. I'm also not sure whether /cancel would cause the worker to be refreshed. My understanding is that the worker is only refreshed if you specifically set `refresh_worker` to true in the handler response (see the sketch below), and as far as I'm aware cancelling a job doesn't do that, but someone from RunPod would need to confirm.

I can even observe this when cancelling a job through the web UI. While the worker is still active it will take jobs from the queue without refreshing, but as soon as it stops, the next boot is refreshed.
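For reference, here's roughly what explicitly opting into a refresh looks like with the Python SDK; the handler body and output shape are placeholders, and the `refresh_worker` flag is the only relevant part:

```python
import runpod  # RunPod serverless SDK

def handler(job):
    # ... actual inference on job["input"] would go here (placeholder) ...

    # Returning refresh_worker=True in the response dict is what opts the
    # worker into a refresh after this job. A normal response that should
    # keep the worker warm simply omits the flag.
    return {
        "refresh_worker": True,
        "output": "job results go here",  # placeholder output shape
    }

runpod.serverless.start({"handler": handler})
```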