RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⛅|pods

Can I still access the data of my GPU pod once my account runs out of funds?

I have a Telegram bot running in a GPU pod. It has a Postgres database container that stores all of its data. I previously had it set up in a CPU pod, but when I ran out of funds, the pod was deleted along with all the data in my database. I have now switched to a GPU pod so that the data persists, but I was wondering what would happen if I run out of funds. Will I still be able to SSH into my machine and get the data from my database, or can I not do that....

Can I sync Contabo storage?

Contabo has an S3-compatible Object Storage, the same as Amazon S3. https://docs.contabo.com/docs/products/Object-Storage/s3-connection-settings I tried to sync using the Amazon option, but it didn't work since there is no place to change the endpoint. ...

Save docker session

Hey, so if I start a Docker container as a GPU pod and then install something, can I save the pod state as a new template?
Solution:
No. Create a Docker image.

Frequent GPU problem with H100

Hello, I've seen that 9 times out of 10 I get an H100 (PCIe) machine on which CUDA won't work with torch. For instance, this machine runs CUDA 12.2, but the torch-CUDA integration is broken. @JM or someone from the RunPod team, can you please take a look, since it's happening extremely frequently now? ID: b6d3hcqct79d7o, runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04...
Solution:
@Dhruv Mullick H100 PCIe machines have caused us lots of headaches lately. We are soon releasing a very powerful detection tool for all RunPod servers, which will help us fix these non-trivial issues. The problem always seems to involve some specific kernel version that might not be compatible even though it's supposed to be. That being said, expect a strong resolution in the near term!...
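When reporting this kind of torch/CUDA mismatch, it can help to attach a quick summary of what PyTorch itself sees. The following is only a minimal sketch, not a RunPod tool: the helper name `cuda_report` is hypothetical, and it assumes nothing beyond the public `torch` API. It does not diagnose kernel or driver problems; it only records whether PyTorch can reach the GPU.

```python
def cuda_report() -> dict:
    """Collect basic facts about the PyTorch/CUDA pairing on this machine."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    if report["cuda_available"]:
        # Only query the device name when CUDA is actually usable.
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


print(cuda_report())
```

If `cuda_available` is false on a pod that was rented with a GPU, including this output together with the pod ID and image name gives support something concrete to look at.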

OSError: [Errno 5] Input/output error

The model training stopped in the middle of the night with an I/O error. Apparently it is due to a physical disk problem, and my testing shows it occurs randomly. As a consequence, my pod sat idle for at least 6 hours, and I paid for it. 1. How can I stop it from happening again? 2. Can I claim a refund for those idle hours? ...

Error while running ComfyUI

When I run the python main.py --listen command, I get this error: Error handling request Traceback (most recent call last): File "/workspace/venv/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 350, in data_received...

GPU cloud storage GONE + billed for entire month

I'm seriously starting to think you guys have something against me. This is the second time you've lost my pod since I started casually using your service, but this time it's borderline illegal. I've been billed for the entire month of February with no indication of when, or if, you deleted my pod again. I specifically chose a different setup so it WOULD NOT get deleted like last time. ...

Trying to create a Spot GPU instance leads to a 400 response error

Greetings, I've recently noticed that whenever I try to create a GPU spot instance, I get a 400 response error. I'm trying to spin up a spot instance using the RunPod PyTorch 2.2.10 image, with 2x A6000, a network drive, in the Sweden datacenter. ...
Solution:
Yeah we pushed a fix just a little bit ago

Where are all the U.S. network volume data centers?

It's been like this for at least a week. I'm in the U.S. There used to be Kansas options, but now I can only select Canada or Europe. Is there some kind of data center outage?...

Managing multiple pod discovery

Hi, if I want to put a load balancer/queue system in front of multiple pods, is there some premade app I can use for that? I was thinking of something like Kubernetes, but it's not compatible with RunPod. Or is this not the use case of RunPod?

How to withdraw money?

How can I withdraw the money left on RunPod? I am done with my task and want to get the remaining money back.

inconsistent speeds--community pod, any tips

Doing some upscales in SD Forge. They are a bit intensive (15 MB image output), but the ETA/UI seems to keep hanging. Any troubleshooting tips? Are community pods less stable?

H100 PCIe and SXM stability issues

I have been working on 8xH100 PCIe. While initially working well, after some time they throw CUDA errors. Overall they seem unstable. I always install Transformer Engine to enable FP8, so maybe some incompatibility has come up. Then I got the chance to test an SXM system, but strangely with this one (a 6x) the whole process halted just before training. I'm using axolotl for everything. ...

2024-03-01T16:08:54.761577365Z [FATAL tini (6)] exec docker failed: No such file or directory Error

Hey folks, I'm having trouble running my image on RunPod. My image works properly in a normal root-access Docker environment but doesn't work with the RunPod template structure. It's my first time on RunPod, can anyone help out?

I want to install docker in a GPU pod.

I want to install docker in a GPU pod. Yes, I am aware that the pod itself is a docker container but I want to run 2 different docker containers in one pod.
Solution:
You can't run docker in docker on GPU pods.

OpenBLAS error

Hi, all. I got this error "OpenBLAS blas_thread_init: pthread_create failed for thread 3 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max" when running trials. I searched and found this (https://github.com/HumanSignal/label-studio/issues/3070), which says I need to add options to the docker command. I would like to ask what the default docker command is, or how I can solve this problem in general. Thanks!...
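One workaround that is often tried when OpenBLAS cannot spawn its worker threads is to cap the thread count through environment variables before NumPy (or anything else that links OpenBLAS) is imported. This is a minimal sketch under that assumption; the helper name `limit_blas_threads` and the value 4 are illustrative, and it does not answer what docker command the pod is started with.

```python
import os

# Environment variables read by OpenBLAS and OpenMP when the library loads.
_THREAD_VARS = ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS")


def limit_blas_threads(max_threads: int = 4) -> dict:
    """Cap BLAS worker threads; call this before importing NumPy/SciPy."""
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    for name in _THREAD_VARS:
        os.environ[name] = str(max_threads)
    # Return what was set so the caller can log it.
    return {name: os.environ[name] for name in _THREAD_VARS}


limit_blas_threads(4)  # hypothetical value; tune it to the pod's limits
# import numpy as np   # import numerical libraries only after this point
```

The variables must be set before the first import of the numerical library, because the thread pool is created at load time.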

We have detected a critical error on this machine which may affect some pods.

Hey all. We're renting a number of H100s as a trial run of RunPod as we are looking for another compute provider. We paid for 24 hours of compute in order to transfer terabytes of data onto the machine, alongside paying for bandwidth and additional storage. We additionally paid our cloud provider egress costs, which are more than we paid for the H100 machine, and rented a disk & network optimized machine in order to transfer the data quickly to the RunPod machine. After 24 hours, we are getting this error on the RunPod GUI: "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime." ...

Is it possible to restart the pod using the manage Pod GraphQL API?

Is it possible to restart the pod using the manage Pod GraphQL API?

Training for days

I want to train my model for days using a single GPU. How do I make my Jupyter Notebook session persist after I close my laptop so that training continues?
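One general approach, independent of RunPod, is to run the training as a detached process on the pod instead of inside a notebook cell, and write its output to a log file. The sketch below assumes a POSIX system and that the training code lives in a script; the helper name `launch_detached` and the file names `train.py` and `train.log` are hypothetical.

```python
import subprocess
import sys


def launch_detached(command, log_path):
    """Start `command` in its own session so it outlives the notebook or SSH shell."""
    log_file = open(log_path, "ab")
    process = subprocess.Popen(
        command,
        stdout=log_file,            # send all output to the log file
        stderr=subprocess.STDOUT,
        stdin=subprocess.DEVNULL,
        start_new_session=True,     # POSIX only: detach from the controlling terminal
    )
    log_file.close()  # the child keeps its own copy of the file descriptor
    return process.pid


# Example call (train.py is a placeholder for your own script):
# pid = launch_detached([sys.executable, "train.py"], "train.log")
```

After launching, the browser tab or laptop can be closed; progress can be checked later by reading the log file from a terminal on the pod. Tools such as tmux or nohup serve the same purpose from the shell.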