Nafi
RunPod
Created by Nafi on 10/30/2024 in #⛅|pods
Network Volume Integrity
Ever since last night, every pod I deploy on my network volume fpomddpaq0 has certain files that I cannot open (I believe they have been corrupted). I get a 'launcher error 524' (timeout) when I try to open these specific files (.ipynb). I tried switching to the latest PyTorch image, but that did not help. I cross-checked with a fresh volume in the same region, and the error does not occur there. I have now confirmed the issue with the file command in the web terminal: it times out when reading those files, but not any others. I am writing this post because those files contained a lot of code that I will now have to rewrite from bits and pieces, which is a big waste of time. I am quite annoyed and am reporting this to help prevent future incidents.

For some additional context: I was running a CPU-intensive training job when the pod suddenly stopped responding (there was a yellow exclamation warning on it on the pod deployments page). After waiting about an hour, I terminated the pod, but when I tried to redeploy it got stuck at 'waiting for logs'. I slept on it, and when I woke up the corruption was there.
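The check described above, reading each file and seeing which ones hang, can be scripted so a whole volume is scanned at once. A minimal sketch, not RunPod-specific; the `/workspace` mount path and the 5-second timeout are assumptions you should adjust:

```shell
# Diagnostic sketch: flag files that hang or fail when read.
# Assumes the network volume is mounted at /workspace; adjust path/timeout.
for f in /workspace/*.ipynb; do
  [ -e "$f" ] || continue              # skip if the glob matched nothing
  if ! timeout 5 md5sum "$f" > /dev/null 2>&1; then
    echo "suspect: $f"                 # read timed out or failed
  fi
done
```

Any file listed as `suspect` is a candidate for the same corruption; healthy files hash quickly and print nothing.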
3 replies
RunPod
Created by Nafi on 6/29/2024 in #⚡|serverless
What is meant by a runner?
I have created my worker template and I am configuring GitHub Actions. I am just unsure of what RUNNER_24GB is supposed to be: creating a serverless endpoint requires a container image, but building and testing that image is the whole point of the CI/CD pipeline?
19 replies
RunPod
Created by Nafi on 6/23/2024 in #⛅|pods
0 GPU pod makes no sense
I have network storage attached to my pods. I don't mind if a GPU gets taken from me, but it's very inconvenient that I have to spin up a completely new pod when it does. I am automating RunPod via the CLI, and at the moment I don't see any way to deploy a fresh instance and get its SSH endpoint. I think just showing a warning that you have to start fresh when a GPU gets taken, and finding the next available one automatically, makes much more sense, especially when using network storage.
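The automation gap described here, deploy a pod and then wait until its SSH endpoint is actually available, boils down to a poll-until-ready loop around whatever deploy call you use. A minimal sketch in Python; `deploy_pod` and `get_ssh_endpoint` in the usage comment are hypothetical placeholders for your actual RunPod CLI or API calls, not real functions:

```python
import time

def wait_until(probe, timeout_s=300.0, interval_s=5.0):
    """Call probe() until it returns a truthy value or timeout_s elapses.

    probe is expected to return None/False while the resource is not
    ready yet. Returns the truthy value, or None on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = probe()
        if result:
            return result
        time.sleep(interval_s)
    return None

# Hypothetical usage -- deploy_pod/get_ssh_endpoint stand in for real calls:
#   pod_id = deploy_pod(volume_id="...", gpu_type="...")
#   endpoint = wait_until(lambda: get_ssh_endpoint(pod_id))
#   if endpoint is None:
#       raise TimeoutError("pod never exposed an SSH endpoint")
```

A generic helper like this also covers the restart-after-GPU-reclaim case: rerun the deploy, poll until the new SSH endpoint appears, then resume work on the attached network volume.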
87 replies