RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

GPU Host Registration

I run an LLM infra startup funded by a few top-tier VCs. We built our own dedicated cluster for research but have spare capacity that we would like to register as a host on RunPod (16 H100s and 16 L40S). Could someone DM me about the process for hosting? I read on the website that I should ask in Discord. Thanks!

Help with constantly crashing GPU pods

Hello, I’ve been struggling for the past few days with trying to get a docker image up and running on a GPU pod. I had success with a template I made (docker image mcgillrobotics/mujoco:cuda118) and managed to connect and get things running, but since then I have not been able to successfully connect to a pod. The docker image pulls, but when I click “Connect to web terminal” nothing happens. When I try to SSH it says the container is not running and kicks me out instantly. I’ve tried different...
Solution:
Yes, you need to add sleep infinity.
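One way to read this solution: a container stops as soon as its main process exits, so a start command that finishes immediately leaves nothing to connect to. The sketch below builds a start command that ends in sleep infinity so the main process never exits. The helper name keep_alive_command and the use of bash -c are assumptions for illustration, not something stated in the thread, and the image is assumed to ship bash and GNU sleep.

```python
from typing import List


def keep_alive_command(setup_cmd: str = "") -> List[str]:
    """Build an argv list whose main process never exits.

    Hypothetical helper: the thread only says to add `sleep infinity`.
    `;` (not `&&`) is used so the container stays up even when the
    setup command fails, which leaves it available for debugging.
    """
    if setup_cmd.strip():
        script = f"{setup_cmd}; sleep infinity"
    else:
        script = "sleep infinity"
    return ["bash", "-c", script]


if __name__ == "__main__":
    # Example: run a (hypothetical) setup script, then block forever.
    print(keep_alive_command("python setup_env.py"))
```

The resulting string (for example "python setup_env.py; sleep infinity") is what would go wherever the template or image defines its start command.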

[Urgent] failed : Software caused connection abort

Can someone help with this error, please? It's causing a huge problem for our next release. We are trying to connect two different computers with PyTorch and Lightning via TCP ports. I have followed the directions that RunPod advises for opening these ports (>70000): https://docs.runpod.io/pods/configuration/expose-ports PyTorch and NCCL appear to start opening the connection just fine, and then we get an exception: ...

how to distribute usage of GPU

I purchased 2 RTX A5000 runpods, but my server is using only 1 GPU at 99%; the other GPU is at 0%. Is this wrong?...

Converting to Team Account

Hi, will my usage history and billing be visible to my team if I convert my personal account into a team account?

terminal

Can't connect to web terminal or SSH to pod

Compatibility of RTX A6000 for Multi-GPU Training

I would like to inquire about the types of GPUs that support multi-GPU training. For instance, is it possible to engage in multi-GPU training using 10 RTX A6000 cards from a previous generation? I understand that the H100 PCIe does not support multi-GPU training, while the H100 SXM5 does. Among the GPUs offered by RunPod, what other types of GPUs are capable of multi-GPU training?...

H100 multi-GPU settings

When I tried to load weights from checkpoints into my custom model using multiple GPUs, the weights were not loaded and the progress bar stalled. I am using H100 x 7 on RunPod; when I ran the same trial on my local server (A6000 x 6), it worked well. Do you have any idea?...

Container fails to start randomly

Container fails to start randomly. Error on pod IDs 840b98harmlsgb, wqaz2xufma32pt, eqyabu82t6l3y9...

s3 slow upload

I'm currently working on uploading a dataset from S3 to my Pod using cloud sync. The dataset I'm uploading from S3 is about 1TB in size, so I set the volume size to 2TB. However, when I check the progress bar, it shows as follows: 314.897 GIB / 316.270 GIB 100% 13.045 MIB/S ETA 1M47S The upload seems to be progressing extremely slowly, and I can't estimate how long it will take. Could I be doing something wrong? I'm using 7 H100 GPUs, and the billing is adding up even though I haven't started working on the project yet. I would appreciate any quick help....
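The figures in the progress line can be sanity-checked with a little arithmetic. The sketch below recomputes the ETA from the remaining data and the reported rate, then estimates how long a full dataset would take at the same rate. Treating "about 1TB" as 1024 GiB is an assumption, as is the function name eta_seconds.

```python
def eta_seconds(remaining_gib: float, rate_mib_per_s: float) -> float:
    """Seconds needed to move `remaining_gib` at a steady `rate_mib_per_s`."""
    return remaining_gib * 1024 / rate_mib_per_s


# Values quoted in the post's progress line.
done_gib, total_gib, rate = 314.897, 316.270, 13.045

# Remaining 1.373 GiB at 13.045 MiB/s: about 108 s, in line with "ETA 1M47S".
batch_eta = eta_seconds(total_gib - done_gib, rate)

# Assumption: the whole dataset is 1024 GiB and the rate stays constant.
full_hours = eta_seconds(1024, rate) / 3600

if __name__ == "__main__":
    print(f"current batch: {batch_eta:.0f} s")
    print(f"full 1 TiB at this rate: {full_hours:.1f} h")
```

Under these assumptions the displayed ETA is consistent with the reported rate, and a full transfer at roughly 13 MiB/s would take on the order of 22 hours, so the transfer rate itself is the thing to investigate.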

About the cost of container initialization phase

Hi, I have a quick question: if it takes 5 to 10 minutes for my pod to start pulling the image, do I need to pay for this time?🤔 My custom image is so large that it takes a long time to pull every time I create a pod...😔...

Broken CUDA / PyTorch on H100

/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
...

Cannot connect to pod, web UI stating "Network Issues", https://uptime.runpod.io/ showing all green

I am unable to continue work since it's not possible to connect to my pod over SSH, even though https://uptime.runpod.io/ is showing all green. I'm "just" using it for development right now, but every hour of lost work time costs a lot. I topped up my account with $1000 recently and now I'm not so sure that was a good idea. Will RunPod be a viable option for production?...

Cannot connect to CPU pods

As titled. The HTTP service doesn't work, and I can't start a web terminal either...

My pods are missing, but I am still charged every day

The management page is empty and I cannot find my pods. Is it a bug?

Network issue?

[Error -3] Temporary failure in name resolution
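This message is the kind of error a resolver lookup raises when DNS is unavailable. A minimal sketch for checking name resolution from inside a pod, using only Python's standard library, is shown below. The function name can_resolve and the example hostname are assumptions for illustration, and the check says nothing about why resolution fails.

```python
import socket
from typing import Tuple


def can_resolve(host: str) -> Tuple[bool, str]:
    """Try to resolve `host`; return (success, detail)."""
    try:
        socket.getaddrinfo(host, None)
        return True, "ok"
    except socket.gaierror as exc:
        # str(exc) has the form "[Errno <code>] <message>".
        return False, str(exc)


if __name__ == "__main__":
    # A numeric address needs no DNS; a hostname does.
    for name in ("127.0.0.1", "docs.runpod.io"):
        print(name, can_resolve(name))
```

If the numeric address succeeds while hostnames fail, the problem is with DNS resolution rather than with the socket layer itself.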

Pod running but inaccessible

I have 2 separate pods connected to a network drive. Pod#1 is accessible and logs to a log1 file; Pod#2 logs to a log2 file. Pod#2 is not accessible via SSH and is stuck at the screenshot. However, I have confirmed by reading the Pod#2 log from Pod#1 that it is actually still running.

instances available A100 80GB

Trying to deploy an A100 80GB. I keep getting "there are no longer any instances available with enough disk space" no matter what container/volume disk sizes I set... Have any of you run into this? If so, how did you get past it?

https://www.runpod.io/console/pods keeps reordering servers

This is EXTREMELY infuriating. I keep accidentally deleting the wrong server because the page reorders them, either for no reason at all or when you start/stop one, etc.

A1111 won't find my files

I'm trying to use the batch function in img2img. I've prepared all the necessary folders and files, but when I click "run", nothing happens. It acts as if the path is wrong. I've tried all sorts of formatting and put the folders in many different places, and nothing worked....
Solution:
You used the correct directories in Colab but not on RunPod, which is why it works in Colab.