Support for terminating pods via SkyPilot
Hi, I want to let my training runs go overnight and terminate the pod once they have finished. To do this, I am currently using SkyPilot. Whenever I try to stop a pod via SkyPilot, I get an error similar to
Stopping is currently not supported for RunPod
. Can RunPod please support this feature?
It would also be useful to be able to set
image_id
so I can use the template runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
instead of the default, which has an old version of CUDA
CC @Luke
If you’re using a network volume, there’s no need for a “stop” option since all your data is stored in the network volume. You can safely terminate the pod without losing data.
Regarding your second question, I didn’t quite follow. When modifying the template, you can specify any docker image you prefer.
I am trying to terminate the pod via the CLI using the SkyPilot integration, but I get an error that it's not supported.
Same for the template: I want to set it via the CLI using SkyPilot, but I get an error that it's not supported.
I am trying to build off of this tutorial, using the features in SkyPilot:
https://docs.runpod.io/tutorials/integrations/skypilot
is this solved yet?
it is not
I wonder what your SkyPilot YAML file looks like.
I can post it later today. The main difference is that I specify ‘image_id’ to try to get a torch 2.4 template, but it says it's not supported with RunPod.
Working backwards, are there any docs on specifying a template in SkyPilot with RunPod? Is there any way to auto-terminate a pod when it's idle (i.e. when the training run ends)?
I think in the SkyPilot docs? (Not sure, I haven't checked.)
That's what I used :p
I just don't think RunPod supports these features in the integration
Yeah, maybe it hasn't been added to SkyPilot yet
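To make the two requests in this thread concrete, here is a rough sketch of how they are expressed in SkyPilot terms: an `image_id` entry under `resources` in the task YAML, and `sky launch --down`, which asks SkyPilot to tear the cluster down once the job finishes. This is not a confirmed working RunPod setup; the replies above suggest the RunPod backend rejected `image_id` at the time. The accelerator `A100:1`, the file name `task.yaml`, the cluster name `train-run`, and the `python train.py` command are all placeholders.

```python
import shlex
from typing import List

# Image name taken from the thread; everything else below is a placeholder.
IMAGE = "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04"


def task_yaml(image: str, accelerators: str = "A100:1") -> str:
    """Render a minimal SkyPilot task file that requests a Docker image."""
    return "\n".join([
        "resources:",
        "  cloud: runpod",
        f"  accelerators: {accelerators}",
        f"  image_id: docker:{image}",  # SkyPilot's syntax for a Docker image
        "",
        "run: |",
        "  python train.py",  # hypothetical training entry point
        "",
    ])


def launch_command(yaml_path: str, cluster_name: str) -> List[str]:
    """Build a `sky launch` call that tears the cluster down after the job."""
    return ["sky", "launch", "-c", cluster_name, "--down", "--yes", yaml_path]


print(task_yaml(IMAGE))
print(shlex.join(launch_command("task.yaml", "train-run")))
```

If `--down` is accepted for RunPod in your SkyPilot version, it covers the "terminate when the training run ends" case without needing a separate stop command.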