0 GPU pod makes no sense
I have network storage attached to my pods. I don't care if a GPU gets taken from me, but it's very inconvenient that I have to spin up a completely new pod when it does. I am automating RunPod via the CLI, and at the moment I don't see any way to deploy a fresh instance and get the SSH endpoint. I think it would make much more sense to just show a warning that you have to start fresh when a GPU gets taken, and then find the next available one, especially when using network storage.
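For anyone else automating this, here's a minimal sketch of pulling a pod's SSH endpoint out of RunPod's GraphQL API. It assumes the documented api.runpod.io/graphql endpoint and the published pods query; treat the exact field names as assumptions that may change:

```python
# Sketch: fetch a pod's public SSH endpoint from RunPod's GraphQL API.
# Assumes the documented api.runpod.io/graphql endpoint and an API key;
# field names follow RunPod's published "get pods" query and may change.
import os
import requests

API_URL = "https://api.runpod.io/graphql"
API_KEY = os.environ["RUNPOD_API_KEY"]

QUERY = """
query Pods {
  myself {
    pods {
      id
      name
      runtime {
        ports { ip isIpPublic privatePort publicPort type }
      }
    }
  }
}
"""

def ssh_endpoint(pod_id: str) -> str | None:
    """Return 'ip:port' for the pod's public SSH port mapping, if any."""
    resp = requests.post(API_URL, params={"api_key": API_KEY}, json={"query": QUERY})
    resp.raise_for_status()
    for pod in resp.json()["data"]["myself"]["pods"]:
        # runtime is null while the pod is starting or stopped
        if pod["id"] != pod_id or not pod.get("runtime"):
            continue
        for port in pod["runtime"]["ports"] or []:
            if port["privatePort"] == 22 and port["isIpPublic"]:
                return f"{port['ip']}:{port['publicPort']}"
    return None
```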
0 GPU is not a thing when you use network volumes; that only happens when you don't use a network volume
It's happened several times to me, with a network volume attached
I can assure you of this because the data is persistent
when you use network storage, you can't stop instances, so the GPU goes back to the pool and you just select an available one in the same datacenter when you want to start another instance.
Wait what do you actually need again?
essentially I want to be able to start an exited pod, and I don't care if the GPU returns to the pool, I'm happy to use the next available one. The issue is that every so often the instance can only have 0 GPUs, so I have to redeploy a completely new pod on the network storage. I cannot do this via the CLI, so I have to do it manually, which defeats the purpose of the automation.
Here's a sample response when trying to start the exited pod via the CLI:
Next time it happens I can send a screenshot of the runpod UI
You can't stop pods with network storage, can you?
how do you stop pods?
runpodctl stop pod <id>
Yeah just terminate it instead?
and create a new one
can be done via CLI?
yes
and create new one via CLI?
yes
yes I see now. I will try, thanks
np
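If it helps, the terminate half can be scripted too. A minimal sketch against the GraphQL API, assuming RunPod's documented podTerminate mutation (the exact input shape is an assumption):

```python
# Sketch: terminate a pod through the GraphQL API instead of the CLI,
# so one script can delete and recreate pods. The mutation name follows
# RunPod's documented podTerminate; treat the exact shape as an assumption.
import os
import requests

API_URL = "https://api.runpod.io/graphql"
API_KEY = os.environ["RUNPOD_API_KEY"]

def terminate_pod(pod_id: str) -> None:
    mutation = """
    mutation TerminatePod($podId: String!) {
      podTerminate(input: {podId: $podId})
    }
    """
    resp = requests.post(
        API_URL,
        params={"api_key": API_KEY},
        json={"query": mutation, "variables": {"podId": pod_id}},
    )
    resp.raise_for_status()
    # GraphQL reports failures in an "errors" array, not via HTTP status
    if "errors" in resp.json():
        raise RuntimeError(resp.json()["errors"])
```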
what am I missing here?
Error: There are no longer any instances available with the requested specifications. Please refresh and try again.
There's no instance with your filters then
change the cost or GPU type
tried removing cost and setting GPU type to "L40"
I copied the IDs directly
I will double check
yeah, nothing. I'm unsure why the imageName has to be specified if you can specify the template? I'm not sure hahah
is it actually required?
yes
Error: required flag(s) "imageName" not set
oooh
maybe if you want to ask about that, try contacting support from the website's contact page
or post in the #feedback channel if you wish to request removing it
imageName is your Docker image name
I mean there are templates (which contain the image tag too)
?
runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04
^^^^^
This doesn't happen though. The GPU I use is never unavailable; it's just that it gets taken from me and there's no option to fetch another from the pool. I raised an issue about the imageName
and they are handling it internally, but that isn't even necessary for me, as long as I can pull a new GPU from the pool rather than having 0 GPUs on my pod, even with network storage attached.
What do you mean?
did you try to stop pods, and did it work?
and you can't create another pod?
this would say 0 x L40
even though I have a network volume attached to it
stopping is fine, it's starting it that fails
hey
Tried that
then?
Do you have an exact command that works for you (CLI) to create a pod?
I tried; the issue occurred with the imageName. They raised it internally.
Sample:
Output:
yeah, just try creating it from the UI or GraphQL for now. They're fixing that
GraphQL would be fantastic, is there documentation?
found it
GraphQL didn't fix the problem :/
For this input:
Output:
For this input:
Output:
Confirmed an internal problem then
It's not an internal problem, it's a problem with your request. You can't specify a networkVolumeId without a data center ID.
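For anyone who hits this later, a minimal sketch of the create call with both fields set. The mutation and most field names follow RunPod's documented podFindAndDeployOnDemand; the concrete values (gpuTypeId string, volume ID, datacenter ID) are placeholders:

```python
# Sketch: create the replacement pod with networkVolumeId AND dataCenterId
# set, per the fix above. Mutation follows RunPod's documented
# podFindAndDeployOnDemand; all concrete values below are placeholders.
import os
import requests

API_URL = "https://api.runpod.io/graphql"
API_KEY = os.environ["RUNPOD_API_KEY"]

MUTATION = """
mutation Deploy($input: PodFindAndDeployOnDemandInput!) {
  podFindAndDeployOnDemand(input: $input) {
    id
    imageName
    machineId
  }
}
"""

pod_input = {
    "cloudType": "SECURE",
    "gpuCount": 1,
    "gpuTypeId": "NVIDIA L40",            # placeholder GPU type ID
    "name": "auto-redeploy",
    "imageName": "runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04",
    "containerDiskInGb": 20,
    "volumeMountPath": "/workspace",
    "networkVolumeId": "YOUR_VOLUME_ID",  # placeholder
    "dataCenterId": "YOUR_DC_ID",         # must match the volume's datacenter
    "ports": "22/tcp",
}

resp = requests.post(
    API_URL,
    params={"api_key": API_KEY},
    json={"query": MUTATION, "variables": {"input": pod_input}},
)
resp.raise_for_status()
print(resp.json())
```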
The UX generally leaves something to be desired when it comes to provisioning and terminating resources, for whatever reason
The fact that they don't leave the pod in a stopped state when they shut it down is really frustrating, as it leaves you no recourse when they decide to kill an instance you are using because your balance is nearing $0.
No warning, no idle state, no buffer at all
Not sure what you're referring to but sounds like a different issue to this thread
It's a very similar issue to this
"They don't the pod" what does it means
They? What about me?
I've had my pods vanish in the blink of an eye, while I was working on them.
Did you have 0 balance?
there is an auto top-up feature in billing
You're missing the point, but I'm too far away to hand it to you
There's almost always a way to work around UX issues
But those are just workarounds, not solutions.
They have "buffer" on the pods, its called signals on Linux if I'm not wrong
I'm not sure what your point is, mind explaining?
This is similar but also completely different
I would open a separate thread and describe your issue more in depth
The way to implement what you desire would be some sort of network-storage caching or frozen-system-state snapshotting or something
There's probably a reason why they haven't done it yet: too complicated or not possible with the infrastructure
The original issue I raised in this thread can only be avoided by creating and deleting pods on demand via GraphQL
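To close the loop, a sketch of that workaround. terminate_pod, deploy_pod, and ssh_endpoint stand for wrappers around the GraphQL calls sketched earlier in the thread; the names and flow are illustrative only:

```python
# Sketch of the workaround: when a pod comes back with 0 GPUs, terminate
# it and deploy a fresh one on the same network volume. terminate_pod,
# deploy_pod, and ssh_endpoint are hypothetical wrappers around the
# GraphQL calls sketched above.
import time

def recycle_pod(old_pod_id: str, pod_input: dict) -> str:
    terminate_pod(old_pod_id)        # release the 0-GPU pod back to the pool
    new_pod = deploy_pod(pod_input)  # grab the next available GPU in the same DC
    # Poll until the runtime is up and an SSH port is mapped.
    endpoint = None
    while endpoint is None:
        time.sleep(10)
        endpoint = ssh_endpoint(new_pod["id"])
    print(f"ssh root@{endpoint}")
    return new_pod["id"]
```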