RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

suzakukururugi000

2/29/2024

Training for days

I want to train my model for days using a single GPU. How do I keep my Jupyter Notebook session running after I close my laptop so that training continues?
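One common approach, sketched here rather than taken from the thread, is to move the training loop out of the notebook into a script and start it as a detached process that logs to a file, so it does not depend on the browser tab staying open. The script name `train.py` and the log path are hypothetical, and `start_new_session=True` is POSIX-only.

```python
import subprocess
import sys


def launch_detached(command, log_path):
    """Start `command` in its own session, sending stdout and stderr to `log_path`."""
    log_file = open(log_path, "ab")
    proc = subprocess.Popen(
        command,
        stdout=log_file,
        stderr=subprocess.STDOUT,
        stdin=subprocess.DEVNULL,
        start_new_session=True,  # detach from the launching terminal or kernel (POSIX only)
    )
    log_file.close()  # the child keeps its own copy of the file descriptor
    return proc


if __name__ == "__main__":
    # Hypothetical usage: train.py is a placeholder for your own training script.
    # proc = launch_detached([sys.executable, "train.py"], "/workspace/train.log")
    # print("started training with PID", proc.pid)
    pass
```

Progress can then be followed by reading the log file from any later session. Whether the process survives a pod stop or restart is a separate question that this sketch does not address.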

panos.firbas

2/29/2024

Disk reading unacceptably and mind-bogglingly slow

I thought I had figured out where to put my data: pay for extra disk space on /, move the data from /workspace (which is a network drive) to /, and from there it can be read fast enough. But today, I tried the same thing in this pod: ...
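To compare the two locations with numbers, a minimal sketch like the following times a sequential read of one file and reports MiB/s. The function names are hypothetical, and the result is only a rough indicator: a file that was just written or recently read may be served from the OS page cache, which inflates the figure.

```python
import os
import tempfile
import time


def read_throughput_mib_s(path, chunk_size=16 * 1024 * 1024):
    """Read the file at `path` sequentially and return the throughput in MiB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / (1024 * 1024) / max(elapsed, 1e-9)


def make_test_file(directory, size_mib=64):
    """Write a throwaway file of random bytes into `directory` and return its path."""
    fd, path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        for _ in range(size_mib):
            f.write(os.urandom(1024 * 1024))
    return path
```

Calling `read_throughput_mib_s` on an existing dataset file under /workspace and on a copy of it under / gives a like-for-like comparison; `make_test_file` is only there for cases where no suitable file exists yet.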

Ryan

2/29/2024

"Pricing error for savings plan"

The website says "Pricing error for savings plan" when I try to create a savings plan for my A6000 server. Both the 3-month and 6-month plans just give an error message, and I'm unable to create a savings plan.

ifelif_

2/29/2024

/workspace not writable

When I turned off the pod (ID: n2srovqha2mlj5), everything was working. I turned it on and I can no longer write to /workspace

```
$ echo test > /workspace/file
$ cat /workspace/file
$...
```
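For reports like this, it helps to capture the exact error the write produces. The sketch below (the function name is hypothetical) tries to create and delete a temporary file in a directory and returns the exception text if that fails.

```python
import tempfile


def check_writable(directory):
    """Return (ok, detail) after trying to create and delete a file in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as f:
            f.write(b"test")
            f.flush()
        return True, "write succeeded"
    except OSError as exc:
        # Covers a missing directory, permission errors, read-only mounts and I/O errors.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    print(check_writable("/workspace"))
```

The detail string shows, for example, whether the failure is a permission error or a read-only file system, which is useful to include when asking for support.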

disintegral

2/29/2024

Tokenizer error

OSError: Can't load tokenizer for 'openai/clip-vit-large-patch14'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'openai/clip-vit-large-patch14' is the correct path to a directory containing all relevant files for a CLIPTokenizer tokenizer. Can someone help me? This suddenly appeared. Running the normal Stable Diffusion template. Images fail to generate....

Abhi

2/28/2024

How to use the ComfyUI API when running it inside RunPod GPU pods

I can use the UI running on port 3000 using the template runpod/stable-diffusion:comfy-ui-5.0.0, but I am not able to call the API. Is there any documentation or are there examples for this scenario? I am using this example code to call the API: https://github.com/comfyanonymous/ComfyUI/blob/master/script_examples/basic_api_example.py Please help....
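As a rough sketch of how the linked example could be pointed at a pod instead of 127.0.0.1: the example script posts a JSON body of the form {"prompt": workflow} to the /prompt endpoint, so the main change is the base URL. The base URL below is a placeholder, and it is an assumption (not confirmed in the thread) that the port exposed by this template forwards API requests as well as the UI.

```python
import json
import urllib.request


def build_prompt_request(base_url, workflow):
    """Build a POST request for ComfyUI's /prompt endpoint without sending it."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )


def queue_prompt(base_url, workflow, timeout=30):
    """Send the workflow to the server and return the decoded JSON response."""
    request = build_prompt_request(base_url, workflow)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # Placeholder URL: substitute the address your pod exposes for port 3000.
    # queue_prompt("https://<pod-id>-3000.proxy.runpod.net", workflow_dict)
    pass
```

Here `workflow_dict` stands for a workflow exported in ComfyUI's API format, as in the linked script.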

c23p

2/28/2024

GPU Host Registration

I run an LLM infra startup funded by a few top-tier VCs. We built our own dedicated cluster for research, but have spare capacity that we would like to register as a host on RunPod (16 H100s and 16 L40S). Curious if someone could DM me about the process for hosting? I read on the website that I should be pinging in Discord. Thanks!

Antoine Dangeard

2/28/2024

Help with constantly crashing GPU pods

Hello, I’ve been struggling for the past few days with trying to get a docker image up and running on a GPU pod. I had success with a template I made (docker image mcgillrobotics/mujoco:cuda118) and managed to connect and get things running, but since then I have not been able to successfully connect to a pod. The docker image pulls, but when I click “Connect to web terminal” nothing happens. When I try to SSH it says the container is not running and kicks me out instantly. I’ve tried different...

Solution:

Yes, you need to add sleep infinity

bitcurrent

2/27/2024

[Urgent] failed : Software caused connection abort

Can someone help with this error please? It's causing us a huge problem with our next release. We are trying to connect two different computers with PyTorch and Lightning via TCP ports. I have followed the directions that RunPod advises for opening these ports (>70000): https://docs.runpod.io/pods/configuration/expose-ports. PyTorch and NCCL appear to start opening the connection just fine, and then we get an exception: ...

Robbie

2/27/2024

How to distribute usage across GPUs

I purchased 2 RTX A5000 RunPod pods, but my server is using only 1 GPU, at 99%. The other GPU is at 0%. Is this wrong?...

PikaZ

2/27/2024

Converting to Team Account

Hi, will my usage history and billing be visible to my team if I convert my personal account into a team account?

moco

2/27/2024

Terminal

Can't connect to the web terminal or SSH into the pod

AutoK

2/27/2024

Compatibility of RTX A6000 for Multi-GPU Training

I would like to inquire about the types of GPUs that support multi-GPU training. For instance, is it possible to engage in multi-GPU training using 10 RTX A6000 cards from a previous generation? I understand that the H100 PCIe does not support multi-GPU training, while the H100 SXM5 does. Among the GPUs offered by RunPod, what other types of GPUs are capable of multi-GPU training?...

AutoK

2/27/2024

H100 multi-GPU settings

When I tried to load weights from checkpoints on my custom model using multiple GPUs, the weights were not loaded and the progress bar stopped. I am using H100 x 7 on RunPod, and when I ran the same test on my local server (A6000 x 6), it worked well. Do you have any idea?...

otakuhero

2/27/2024

Container fails to start randomly

Container fails to start randomly with an error. Pod IDs: 840b98harmlsgb, wqaz2xufma32pt, eqyabu82t6l3y9...

AutoK

2/26/2024

S3 slow upload

I'm currently working on uploading a dataset from S3 to my Pod using cloud sync. The dataset I'm uploading from S3 is about 1TB in size, so I set the volume size to 2TB. However, when I check the progress bar, it shows as follows: 314.897 GIB / 316.270 GIB 100% 13.045 MIB/S ETA 1M47S The upload seems to be progressing extremely slowly, and I can't estimate how long it will take. Could I be doing something wrong? I'm using 7 H100 GPUs, and the billing is adding up even though I haven't started working on the project yet. I would appreciate any quick help....
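The numbers in that progress line can be checked with simple arithmetic: remaining data divided by transfer rate. The sketch below (function names are hypothetical) reproduces the displayed ETA and then estimates a full transfer at the same rate, assuming "about 1TB" means roughly 1024 GiB and the rate stays constant.

```python
def eta_seconds(remaining_gib, rate_mib_s):
    """Seconds needed to move `remaining_gib` GiB at `rate_mib_s` MiB/s."""
    return remaining_gib * 1024 / rate_mib_s


def format_eta(seconds):
    """Format a duration in seconds as e.g. '1h02m03s'."""
    seconds = int(seconds)
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


# Remaining data in the quoted progress line: 316.270 GiB - 314.897 GiB.
print(format_eta(eta_seconds(316.270 - 314.897, 13.045)))  # 0h01m47s, matching the bar

# A full 1024 GiB at the same 13.045 MiB/s (assumed constant).
print(format_eta(eta_seconds(1024, 13.045)))  # 22h19m41s
```

So the 1m47s ETA is consistent with the 316.270 GiB total shown at that moment, while a full 1 TiB at that rate would take most of a day.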

otakuhero

2/25/2024

About the cost of container initialization phase

Hi, I have a quick question: if it takes 5 to 10 minutes for my pod to start pulling the image, do I need to pay for this time?🤔 My custom image is so large that it takes a long time to pull it every time I create a pod...😔...

DreamGen

2/25/2024

Broken CUDA / PyTorch on H100

/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

...

martinkallstrom

2/25/2024

Cannot connect to pod, web UI stating "Network Issues", https://uptime.runpod.io/ showing all green

I am unable to continue work since it's not possible to connect to my pod over SSH, even though https://uptime.runpod.io/ is showing all green. I'm "just" using it for development right now, but every hour of lost work time costs a lot. I topped up my account with $1000 recently and now I'm not so sure that was a good idea. Will RunPod be a viable option for production?...

semantic_search

2/25/2024

Cannot connect to CPU pods

As titled. The HTTP service doesn't work, and I can't start a web terminal either...
