RunPod•4mo ago
warhol

How to Estimate the Survival Time of Spot Instances?

I need some advice on estimating the survival time of RunPod Spot instances. I've noticed that sometimes my Spot instances run for several hours without interruption, while other times they get terminated within minutes. This variability makes it challenging to choose between SPOT and ON-DEMAND.
9 Replies
nerdylive•4mo ago
There's no estimate of survival time, sorry; demand from other customers is unpredictable. If you're using Spot, you need to handle the termination signals on Linux and then move to another pod, etc., to keep your work running smoothly. Or use something like SkyPilot, or some other orchestration library for cloud GPUs, to make that easier.
warhol•4mo ago
Yeah. Agreed that demand from other users is unpredictable. I was hoping there were some statistical algorithms that predict the survival time. SkyPilot would be a great tool if I were going to run batch jobs. Thanks for the suggestion. Meanwhile, I sometimes use a RunPod machine as my workstation as well.
nerdylive•4mo ago
Oh, I think there isn't one for now, unless you collect the data yourself, e.g. by time of day and date.
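If you do collect your own run history, one way to turn it into a survival estimate is a Kaplan-Meier curve. This is only a sketch under assumptions: the `(duration_minutes, was_preempted)` log format and the function names are hypothetical, nothing here comes from RunPod, and runs you stopped yourself are treated as censored (they tell you the pod survived at least that long). To account for time of day, you could filter the runs by start hour before passing them in.

```python
from collections import Counter


def kaplan_meier(runs):
    """Estimate a survival curve from self-collected Spot runs.

    runs: list of (duration_minutes, was_preempted) tuples (hypothetical format).
    Returns [(t, S(t))] with one entry per distinct preemption time.
    """
    preemptions = Counter(d for d, preempted in runs if preempted)
    ended = Counter(d for d, _ in runs)
    at_risk = len(runs)
    survival = 1.0
    curve = []
    for t in sorted(ended):
        d_t = preemptions.get(t, 0)
        if d_t:
            # Fraction of pods still running just before t that survive past t.
            survival *= 1 - d_t / at_risk
            curve.append((t, survival))
        # Every run that ended at t (preempted or stopped by you) leaves the risk set.
        at_risk -= ended[t]
    return curve


def survival_at(curve, minutes):
    """Estimated probability that a pod is still alive after `minutes`."""
    s = 1.0
    for t, p in curve:
        if t > minutes:
            break
        s = p
    return s


if __name__ == "__main__":
    # Made-up example data, not real RunPod measurements.
    sample = [(10, True), (30, True), (60, False), (120, True), (240, False)]
    print(survival_at(kaplan_meier(sample), 120))
```

With only a handful of runs the curve is very noisy, so treat it as a rough guide for picking a bid rather than a prediction.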
warhol•4mo ago
Hah, yeah, that might be a solution. I really love the network volume provided by RunPod. It makes using RunPod as a daily workstation possible. Usually I run a pod for several hours. If I can pick a feasible Spot price that lets a pod survive for about 2 hours on average, that would be perfect. Can I at least receive a signal inside the container when the Spot instance is being killed, allowing me one second to log the necessary state to the volume?
yhlong00000•4mo ago
Spot Pods use spare compute capacity, allowing you to bid for those compute resources. Resources are dedicated to your Pod, but someone else can bid higher or start an On-Demand Pod that will stop your Pod. When this happens, your Pod is sent SIGTERM as a signal to stop, followed by the kill signal SIGKILL 5 seconds later. You can use volumes to save any data to disk in that 5-second window, or push data to the cloud periodically. https://docs.runpod.io/references/faq/#on-demand-vs-spot-pod
FAQ | RunPod Documentation
RunPod offers two cloud computing services: Secure Cloud and Community Cloud. Secure Cloud provides high-reliability, while Community Cloud offers peer-to-peer GPU computing. On-Demand Pods run continuously, while Spot Pods use spare compute capacity.
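The SIGTERM-then-SIGKILL behaviour described above can be handled from Python roughly as follows. This is a minimal sketch, not RunPod's recommended method: `run`, `do_step`, and `save_state` are hypothetical names, and it assumes the signal actually reaches your Python process (in a container it is typically delivered to PID 1, so a shell wrapper may need to forward it).

```python
import signal
import threading

# Set when SIGTERM arrives; the work loop polls it between steps.
stop_requested = threading.Event()


def _on_sigterm(signum, frame):
    # Keep the handler tiny: only record that shutdown was requested.
    stop_requested.set()


# Must be called from the main thread.
signal.signal(signal.SIGTERM, _on_sigterm)


def run(total_steps, do_step, save_state, start=0):
    """Run steps until finished or until SIGTERM is seen, then save progress.

    do_step(i) and save_state(next_step) are caller-supplied (hypothetical) callbacks.
    Returns the index of the next step to run on resume.
    """
    next_step = start
    for i in range(start, total_steps):
        if stop_requested.is_set():
            break
        do_step(i)
        next_step = i + 1
    save_state(next_step)
    return next_step
```

The flag is only checked between steps, so each step has to be short compared with the 5-second window; a step that takes longer will be cut off by SIGKILL before `save_state` runs.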
nerdylive•4mo ago
Try On-Demand if you don't want terminations like with Spot, haha. 5 seconds isn't really enough to save files; it most likely takes longer than that.
yhlong00000•4mo ago
Time is money 😆
warhol•4mo ago
😉 And I am planning to use the 5 seconds to log the status so that I can resume on another machine without pain.
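Logging a small status file within the 5 seconds and resuming from it elsewhere could look like the sketch below. It assumes the state fits in a small JSON document and that the file lives on the network volume; the `/workspace/state.json` path in the usage note and the function names are assumptions for illustration. Writing to a temporary file and renaming it means a SIGKILL in the middle of the write leaves the previous state file intact instead of a half-written one.

```python
import json
import os
import tempfile


def save_state(state, path):
    """Atomically write `state` (JSON-serializable) to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def load_state(path, default=None):
    """Return the last saved state, or `default` if nothing was saved yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

On a new pod with the same volume attached, something like `load_state("/workspace/state.json", default={"step": 0})` would pick up where the previous pod left off. Larger artifacts such as model checkpoints are better saved periodically than in the shutdown window.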
nerdylive•4mo ago
Yeah, sure, that might work.