Training for days
I want to train my model for days using a single GPU. How do I keep my Jupyter Notebook session running after I close my laptop so that training continues?
Disk reading unacceptably and mind-bogglingly slow
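One common way to approach this is to run the training as a separate process on the pod instead of inside a notebook cell, so it does not depend on the browser session. The sketch below is only an illustration under assumptions: `launch_detached` is a hypothetical helper, the `train.py` script and log path in the usage comment are placeholders, and `start_new_session=True` assumes a POSIX (Linux) environment.

```python
import subprocess


def launch_detached(command, log_path):
    """Start `command` in its own session and append its output to `log_path`.

    Hypothetical helper: the child process gets a new session (POSIX only),
    so it is not tied to the notebook kernel's process group.
    """
    log_file = open(log_path, "ab")
    process = subprocess.Popen(
        command,
        stdout=log_file,
        stderr=subprocess.STDOUT,  # merge stderr into the same log
        stdin=subprocess.DEVNULL,  # the job must not wait for keyboard input
        start_new_session=True,
    )
    log_file.close()  # the child keeps its own copy of the file descriptor
    return process


# Example (placeholder script name and log path, not from the original post):
# job = launch_detached(["python", "train.py"], "/workspace/train.log")
# print("training started with PID", job.pid)
```

Progress can then be followed by reading the log file from any later session. Saving checkpoints regularly is still advisable, since a detached process does not survive a pod restart.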
I thought I had figured out where to put my data:
1. Pay for extra disk space in /.
2. Move the data from /workspace (which is a network drive) to /, from where it can be read fast enough.
But today, I tried the same thing in this pod: ...
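To put a number on "fast enough", sequential read throughput can be measured directly on both locations. This is a minimal sketch: `read_throughput_mib_s` is a hypothetical helper, and the file paths in the usage comment are placeholders. Note that a file read recently may be served from the OS page cache, which inflates the result, so a file larger than RAM or a freshly copied one gives a fairer comparison.

```python
import time


def read_throughput_mib_s(path, chunk_size=16 * 1024 * 1024):
    """Read `path` sequentially and return the observed throughput in MiB/s."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small files.
    return total_bytes / (1024 * 1024) / max(elapsed, 1e-9)


# Example (placeholder file names, not from the original post):
# print(read_throughput_mib_s("/workspace/dataset.bin"))  # network drive
# print(read_throughput_mib_s("/dataset.bin"))            # container disk
```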
"Pricing error for savings plan"
The website says "Pricing error for savings plan" when I try to create a savings plan for my A6000 server. Both the 3-month and 6-month plans just give an error message, and I'm unable to create a savings plan.
/workspace not writable
When I turned off the pod (ID: n2srovqha2mlj5), everything was working. I turned it back on and I can no longer write to /workspace:
```
$ echo test > /workspace/file
$ cat /workspace/file
$...
```
Tokenizer error
OSError: Can't load tokenizer for 'openai/clip-vit-large-patch14'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'openai/clip-vit-large-patch14' is the correct path to a directory containing all relevant files for a CLIPTokenizer tokenizer.
Can someone help me? This suddenly appeared.
Running the normal Stable Diffusion template. Images fail to generate....
How to use the ComfyUI API when running it inside Runpod GPU pods
I can use the UI running on port 3000 using the template runpod/stable-diffusion:comfy-ui-5.0.0, but I am not able to call the API. Is there any documentation or are there examples for this scenario? I am using this example code to call the API: https://github.com/comfyanonymous/ComfyUI/blob/master/script_examples/basic_api_example.py
Please help....
GPU Host Registration
I run an LLM infra startup funded by a few top-tier VCs. We built our own dedicated cluster for research, but have spare capacity that we would like to register as a host on runpod (16 H100s and 16 L40S). Curious if someone could DM me about the process for hosting? I read on the website that I should be pinging in Discord. Thanks!
Help with constantly crashing GPU pods
Hello, I’ve been struggling for the past few days with trying to get a docker image up and running on a GPU pod. I had success with a template I made (docker image mcgillrobotics/mujoco:cuda118) and managed to connect and get things running, but since then I have not been able to successfully connect to a pod. The docker image pulls, but when I click “Connect to web terminal” nothing happens. When I try to SSH it says the container is not running and kicks me out instantly. I’ve tried different...
Solution:
Yes, you need to add
sleep infinity
[Urgent] failed : Software caused connection abort
Can someone help with this error please? It's causing us a huge problem with our next release.
We are trying to connect two different computers with PyTorch and Lightning via TCP ports. I have followed the directions that runpod advises for opening these ports (>70000):
https://docs.runpod.io/pods/configuration/expose-ports
PyTorch and NCCL appear to start opening the connection just fine, and then we get an exception: ...
How to distribute usage across GPUs
I purchased 2 RTX A5000 GPUs on runpod,
but my server is using only 1 GPU, at 99%.
The other GPU is at 0%.
Is this wrong?...
Converting to Team Account
Hi, will my usage history and billing be visible to my team if I convert my personal account into a team account?
Compatibility of RTX A6000 for Multi-GPU Training
I would like to inquire about the types of GPUs that support multi-GPU training. For instance, is it possible to engage in multi-GPU training using 10 RTX A6000 cards from a previous generation?
I understand that the H100 PCIe does not support multi-GPU training, while the H100 SXM5 does. Among the GPUs offered by RunPod, what other types of GPUs are capable of multi-GPU training?...
H100 multi-GPU settings
When I tried to load weights from checkpoints into my custom model using multiple GPUs, the weights were not loaded and the progress bar stopped.
I am using H100 x 7 on runpod, and when I ran the same trial on my local server (A6000 x 6), it worked well.
Do you have any idea?...
Container fails to start randomly
IDs of the pods with errors:
840b98harmlsgb
wqaz2xufma32pt
eqyabu82t6l3y9...
s3 slow upload
I'm currently working on uploading a dataset from S3 to my Pod using cloud sync. The dataset I'm uploading from S3 is about 1TB in size, so I set the volume size to 2TB. However, when I check the progress bar, it shows as follows:
314.897 GIB / 316.270 GIB 100% 13.045 MIB/S ETA 1M47S
The upload seems to be progressing extremely slowly, and I can't estimate how long it will take. Could I be doing something wrong? I'm using 7 H100 GPUs, and the billing is adding up even though I haven't started working on the project yet. I would appreciate any quick help....
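The numbers in that progress bar can be sanity-checked with simple arithmetic. The sketch below assumes a constant transfer rate and treats "about 1TB" as 1024 GiB (both are assumptions, not stated in the post); `eta_seconds` and `format_duration` are hypothetical helper names.

```python
def eta_seconds(done_gib, total_gib, rate_mib_per_s):
    """Seconds needed to transfer the remaining data at a constant rate."""
    remaining_mib = (total_gib - done_gib) * 1024  # 1 GiB = 1024 MiB
    return remaining_mib / rate_mib_per_s


def format_duration(seconds):
    """Format a duration in seconds as e.g. '1h02m03s' (fractions dropped)."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


# Remaining part of the 316.270 GiB shown in the progress bar:
print(format_duration(eta_seconds(314.897, 316.270, 13.045)))  # 0h01m47s

# A full 1024 GiB at the same rate (assumed dataset size):
print(format_duration(eta_seconds(0, 1024, 13.045)))  # 22h19m41s
```

The first result agrees with the displayed ETA of 1M47S, which suggests the bar is tracking a 316.270 GiB total rather than the whole dataset. At the displayed rate, the full transfer would take roughly 22 hours under these assumptions.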
About the cost of container initialization phase
Hi, I have a quick question: if it takes 5 to 10 minutes for my pod to start pulling the image, do I need to pay for this time?🤔
My custom image is so large that it takes a long time to pull it every time I create a pod...😔...
Broken CUDA / PyTorch on H100
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
Cannot connect to pod, web UI stating "Network Issues", https://uptime.runpod.io/ showing all green
I am unable to continue work since it's not possible to connect to my pod over SSH, even though https://uptime.runpod.io/ is showing all green. I'm "just" using it for development right now, but every hour of lost work time costs a lot.
I topped up my account with $1000 recently and now I'm not so sure if that was a good idea. Will Runpod be a viable option for production?...
Cannot connect to CPU pods
As titled.
The HTTP service doesn't work, and I can't start a web terminal either...