RunPod4w ago
MD300

Pod / GPU stopped working

I've had a pod running continuously for over a week. Today it just stopped working. Everything on the surface looks ok, but Ollama (the tool I'm using) won't run any model now. The error says there are not enough resources. It looks like the GPU is not working. I've tried reinstalling Ollama, restarting the pod and terminating all processes. Nothing works. Is it possible the GPU might not be available even though it says it is and I'm paying for it?
24 Replies
MD300
MD300OP4w ago
It seems to be back now. I should have posted this earlier (it seemed to have issues for about half the day).
nerdylive
nerdylive4w ago
If you have any error logs, copy and paste them here, and search them on Google to see what's going on.
MD300
MD300OP4w ago
Ok, thanks. Now the pod won't restart. So it seems something has been wrong with the pod since yesterday afternoon (when I started to have issues with Ollama, and everything was much slower than normal). Purely out of principle, do you know if there's any method to claim a refund on time paid when a pod was functioning correctly? It may or may not be related, but I also had this:
curl https://ollama.ai/install.sh | s
bash: s: command not found
So the Ollama update was failing. It had worked previously. Do you think the pod is fixable, and if so, who would I ask? (I've never managed to get a response from support.)
nerdylive
nerdylive4w ago
You mean when a pod isn't functioning correctly? That would be a support ticket.
Why can't it restart? What happens?
It's sh, not s.
You'll need to wait for a response in your email. Maybe they're busy.
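For reference, a corrected version of that one-liner (assuming the same install URL as in the message above) would be:
```bash
curl https://ollama.ai/install.sh | sh
```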
MD300
MD300OP4w ago
Thanks for noticing the missing 'h'. It actually failed with the correct command too, but not with that error (I didn't copy it at the time). I'll come back on that one.
So on restart, for about 20 minutes I couldn't open the web terminal. That seems to work now. The main issue started sometime late yesterday morning (my timezone): suddenly models got very slow (ollama run <modelname> was timing out, which I had not seen before). That's when I first restarted and reinstalled Ollama (which worked then), but it was still very slow.
Re support, no reply yet after two attempts (that was last week). Tried a third time just before writing here. In fairness, although you and a couple of others have been a great help, you can see RunPod have pushed support to the community, with no real intent to do much more (it's not an unusual approach, but very odd for a paid service that amounts to $100s a month). It's ONLY this channel and a couple of you that meant I stayed. But I actively look for alternatives every day. Wearing a different hat of mine, RunPod are missing a huge opportunity to be top of the pile, given the plus sides of what they have.
nerdylive
nerdylive4w ago
Woah, really... They must've been very busy. I hope you can bear with this, sorry. I have escalated it once from your message.
Couldn't open the web terminal, what was that like? No connect button?
MD300
MD300OP4w ago
Re. the issue: with the correct install command it worked, and I have a line in the output that I've not seen before: 'NVIDIA GPU installed.' Is it possible that somehow yesterday the GPU software / driver stopped or got corrupted (I wouldn't know how to uninstall it)? Or is this not an unusual message?
Re. the web terminal, I clicked Start, there was an animation, but it wasn't showing the connect option. It is now. I've been using it with no browser restrictions or blockers (one thing I do is SaaS consultancy, so I always turn that stuff off for dashboards, as it can be problematic).
So it seems I cannot do anything now. Just tried to pull a model (same one I pulled before, from Hugging Face), and now get this:
Error: max retries exceeded: write /workspace/blobs/sha256-a2a70f064765a8161ab76dc053b07acae14bc14056c0852550c1608724d60815-partial: input/output error
The pod is empty (nothing else installed). Do you think I have any hope of continuing with this pod?
nerdylive
nerdylive4w ago
I'm not sure, I don't use Ollama haha. Ah, perhaps it was loading.
Try to delete that blobs folder, or just the file that errored.
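A rough sketch of what that cleanup could look like (the blob path is copied from the error above; whether your models actually live under /workspace depends on how OLLAMA_MODELS is set on this pod):
```bash
# Check that the GPU and driver are still visible inside the pod
nvidia-smi

# Remove the half-downloaded blob that triggered the input/output error,
# then let Ollama re-pull the model from scratch
rm -f /workspace/blobs/sha256-a2a70f064765a8161ab76dc053b07acae14bc14056c0852550c1608724d60815-partial
ollama pull <modelname>
```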
MD300
MD300OP4w ago
I tried that. Each thing I do or try seems to give me another error. Ollama won't stay running. Restarting now starts without Ollama running (which it didn't do before). I guess the pod is lost. Very annoying, as I'm now behind on a deadline, and setting this all up again is a bit crazy. Trying TensorDock and Vast to see if I can get something there around a similar price (if I have to buy a new pod and set it up again, with no support here to look into the pod, it makes no sense to stay if there's an alternative). I've spent $100s on this pod as it's been on 24/7.
Not your fault!!! Just very frustrating.
nerdylive
nerdylive4w ago
What error is it now?
MD300
MD300OP4w ago
Ollama won't stay running. I can run it with ollama serve, open a new terminal and use it, but that doesn't persist. Plus the pod has really struggled pulling a model. Very slow, kept stopping with input/output errors. I don't believe the pod is ok. But there's not much I can do.
nerdylive
nerdylive4w ago
It's because the terminal closes when you disconnect. So either you run it from your Dockerfile (custom image), or, if you don't want to make a new custom image and a new template in RunPod, you use tmux to keep the terminal session running. tmux is basically a terminal session that keeps running even though you disconnect, as in the sketch below.
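A minimal sketch, assuming tmux isn't already in the image:
```bash
# Install tmux if the image doesn't ship with it
apt-get update && apt-get install -y tmux

# Start a detached session named "ollama" that runs the server
tmux new-session -d -s ollama 'ollama serve'

# The session keeps running after you disconnect; reattach to check on it:
tmux attach -t ollama
# (detach again with Ctrl-b then d)
```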
MD300
MD300OP4w ago
Yes, but Ollama should just be running.
It just feels wrong to have to buy a new pod and set it all up again, after this one was on 24/7 for the last week, for the whole reason of saving time.
nerdylive
nerdylive4w ago
which datacenter are you using?
MD300
MD300OP4w ago
How can I tell? It's a secure one, EUR SE
nerdylive
nerdylive4w ago
eur se 1?
nerdylive
nerdylive4w ago
when you create a pod
(screenshot of the data center selector on the pod creation page)
nerdylive
nerdylive4w ago
What a coincidence: https://discord.com/channels/912829806415085598/1185333521124962345/1328343808806617149
There are other people experiencing the same issue, slow network storage: https://discord.com/channels/912829806415085598/1328344920678858916
Or maybe slow internet.
MD300
MD300OP4w ago
That would explain a lot, and I noticed it from yesterday. That is frustrating. Thanks for highlighting this, good to know. RunPod should be refunding some of the charges when billing is per hour and people are clearly paying for something that doesn't work. I'm going to look for some emails of head honchos.....
nerdylive
nerdylive4w ago
Yeah, if it's not a user error, feel free to create a support ticket to ask for a refund.
What is honchos?
MD300
MD300OP4w ago
... bosses, I think.
It's just sad to see a company with an in-demand service, which isn't cheap, not putting more effort into directly solving paying clients' issues. And for a datacentre to have issues for a day or more, that's crazy. It can't be a real data centre. I came from the web dev world, where issues like this were resolved in minutes, or at least hours.
I'm updating this for balance and to share: I had a back and forth with support, and it went well. The result was good and fair, and I'm good with the situation now.
nerdylive
nerdylive4w ago
Nicee
Vincent
Vincent2w ago
Hello guys, I just set up a new pod yesterday and installed a couple of packages in the new instance. I shut it down (not restarted) at the end of the day, and when turning it on this morning the pod was empty (disk usage 0%). Please reassure me: stopping (not resetting) your pod should keep the work done, right?
nerdylive
nerdylive2w ago
Only what's in /workspace is saved when you stop the pod, as written when you stop it.. it shows a notice for that, right?
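A minimal sketch of keeping installed Python packages across a stop, assuming a standard pod with the persistent volume mounted at /workspace (paths and package names are just examples):
```bash
# Anything outside /workspace lives on the container disk and is wiped on stop.
# Put a virtualenv on the persistent volume instead:
python -m venv /workspace/venv
source /workspace/venv/bin/activate
pip install <your packages>   # replace with the packages you actually need

# After every restart, re-activate it (e.g. from your start command):
# source /workspace/venv/bin/activate
```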
