RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

Error while running ComfyUI

When I use the python main.py --listen command, I get this error: Error handling request Traceback (most recent call last): File "/workspace/venv/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 350, in data_received...

GPU cloud storage GONE + billed for entire month

I'm seriously starting to think you guys have something against me. This is the second time you've lost my pod since I started casually using your service, but this time it's borderline illegal. I've been billed for the entire month of February with no indication of when, or if, you deleted my pod again. I specifically chose a different setup so it WOULD NOT get deleted like last time. ...

Trying to create a Spot GPU instance leads to 400 response error

Greetings, I've recently noticed that whenever I try to create a GPU spot instance, I get a 400 response error. I'm trying to spin up a spot instance using the RunPod PyTorch 2.2.10 image, with 2x A6000, a network drive, in the Sweden datacenter. ...
Solution:
Yeah we pushed a fix just a little bit ago

Where are all the U.S. network volume data centers?

It's been like this for at least a week. I'm in the U.S. There used to be Kansas options, but now I can only select Canada or Europe. Is there some kind of data center outage?...

Managing multiple pod discovery

Hi, if I want to put a load balancer/queue system in front of multiple pods, is there some premade app I can use for that? I was thinking of something like Kubernetes, but it's not compatible with RunPod. Or is this not the use case for RunPod pods?
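If the managed serverless queue isn't what you want and you need to stay on plain pods, one simple option is a small round-robin proxy that spreads requests across the pods yourself. A minimal sketch, assuming each pod exposes an HTTP API on port 8000 through the RunPod proxy; the pod IDs, port, and listening address below are placeholders, not a recommended production setup:

```python
# Hypothetical sketch: a tiny round-robin reverse proxy that spreads POST
# requests across several pods. The pod URLs below are placeholders.
import itertools
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

POD_URLS = [
    "https://POD_ID_1-8000.proxy.runpod.net",  # placeholder pod endpoints
    "https://POD_ID_2-8000.proxy.runpod.net",
]
backends = itertools.cycle(POD_URLS)

class Proxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        target = next(backends) + self.path          # pick the next pod in the rotation
        req = urllib.request.Request(target, data=body, method="POST",
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:    # forward the request, relay the body back
            payload = resp.read()
            self.send_response(resp.status)
            self.end_headers()
            self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9000), Proxy).serve_forever()
```

This only balances requests; if you need real queuing, retries, or autoscaling, that is closer to what the serverless product or a dedicated queue (e.g. a job broker in front of the pods) is for.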

How to withdraw money?

How can I withdraw the money left on RunPod? I am done with my task and want to get the remaining money back.

Inconsistent speeds on community pod, any tips?

Doing some upscales in SD Forge, a bit intensive with 15 MB image output, but the ETA/UI seems to keep hanging. Any troubleshooting tips? Are community pods less stable?

H100 PCIe and SXM stability issues

I have been working on 8x H100 PCIe. While initially working well, after some time they throw CUDA errors; overall the setup seems unstable. I always install Transformer Engine to enable FP8, so maybe some incompatibility has come up. Then I got the chance to test an SXM system, but strangely with this one (a 6x) the whole process halted just before training. I'm using axolotl for everything. ...

2024-03-01T16:08:54.761577365Z [FATAL tini (6)] exec docker failed: No such file or directory Error

Hey folks, I'm having trouble running my image on RunPod. My image works properly in a normal root-access Docker environment but doesn't work with the RunPod template structure. It's my first time on RunPod, can anyone help out?

I want to install docker in a GPU pod.

I want to install Docker in a GPU pod. Yes, I am aware that the pod itself is a Docker container, but I want to run 2 different Docker containers in one pod.
Solution:
You can't run docker in docker on GPU pods.

OpenBLAS error

Hi, all. I got this error when running trials: "OpenBLAS blas_thread_init: pthread_create failed for thread 3 of 64: Resource temporarily unavailable OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max". I searched and found this (https://github.com/HumanSignal/label-studio/issues/3070), which says I need to add options to the docker command. I would like to ask what the default docker command is, or how I can solve this problem in general. Thanks!...
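Since the docker command on a pod isn't something you control, a common workaround is to cap the thread count from inside your own process instead. A minimal sketch, assuming the error comes from a library that links OpenBLAS (NumPy here is just an example) and that a small thread pool is acceptable:

```python
# Hypothetical workaround: limit OpenBLAS / OpenMP threads before any library
# that links OpenBLAS (NumPy, SciPy, ...) is imported, so blas_thread_init
# doesn't try to spawn 64 threads and hit the process limit.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["OMP_NUM_THREADS"] = "4"

import numpy as np  # imported after the env vars, so OpenBLAS sees the limit

a = np.random.rand(2048, 2048)
b = np.random.rand(2048, 2048)
print((a @ b).shape)  # this BLAS call now runs with at most 4 threads
```

Setting the same variables in the pod template's environment variables section should have the same effect without touching the code, assuming the failing library respects them.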

We have detected a critical error on this machine which may affect some pods.

Hey all. We're renting a number of H100s as a trial run of RunPod, as we are looking for another compute provider. We paid for 24 hours of compute in order to transfer terabytes of data onto the machine, alongside paying for bandwidth and additional storage. We additionally paid our cloud provider's egress costs, which came to more than we paid for the H100 machine, and rented a disk- and network-optimized machine in order to transfer the data quickly to the RunPod machine. After 24 hours, we are getting this error in the RunPod GUI: "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime." ...

Is it possible to restart the pod using manage Pod GraphQL API?

Is it possible to restart a pod using the manage pod GraphQL API?
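For context, a "restart" over the GraphQL API would look roughly like a stop followed by a resume. The sketch below is an assumption-heavy illustration: the endpoint, the podStop/podResume mutation names, and their input fields are taken from RunPod's GraphQL documentation as I understand it, so verify them against the current API reference before relying on this.

```python
# Hypothetical sketch: "restart" a pod by stopping it and then resuming it via
# the GraphQL API. Mutation names and fields are assumptions -- check them
# against RunPod's current GraphQL docs.
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"  # placeholder
POD_ID = "YOUR_POD_ID"           # placeholder
URL = f"https://api.runpod.io/graphql?api_key={API_KEY}"

def gql(query: str) -> dict:
    resp = requests.post(URL, json={"query": query})
    resp.raise_for_status()
    return resp.json()

# Stop the pod, then resume it (stop + resume is the closest thing to a restart).
stop = gql(f'mutation {{ podStop(input: {{podId: "{POD_ID}"}}) {{ id desiredStatus }} }}')
resume = gql(f'mutation {{ podResume(input: {{podId: "{POD_ID}", gpuCount: 1}}) {{ id desiredStatus }} }}')
print(stop, resume)
```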

Training for days

I want to train my model for days using a single GPU. How do I keep my Jupyter Notebook session persistent even after I close my laptop, so that training continues?
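The browser session will drop when the laptop sleeps even though the kernel on the pod keeps running, so the safer pattern is to detach the training from the notebook session entirely. A minimal sketch, assuming the notebook is named train.ipynb (a placeholder) and can be run top-to-bottom as a script:

```python
# Hypothetical sketch: export the notebook to a plain script and launch it
# detached from the terminal, so training survives closing the laptop.
# "train.ipynb" is a placeholder for the actual notebook name.
import subprocess

# 1) Convert the notebook to train.py
subprocess.run(["jupyter", "nbconvert", "--to", "script", "train.ipynb"], check=True)

# 2) Start it in its own session with output going to a log file; the process
#    keeps running after the SSH/Jupyter connection drops.
with open("train.log", "w") as log:
    subprocess.Popen(
        ["python", "train.py"],
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # detaches from the controlling terminal
    )
print("Training started; follow progress with: tail -f train.log")
```

Running the script inside tmux or under nohup achieves the same thing; the key point is that the browser tab must not be the thing that owns the training process.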

Disk reading unacceptably and mind-bogglingly slow

I thought I had figured out where to put my data: pay for extra disk space on /, move the data from /workspace (which is a network drive) to /, and from there it can be read fast enough. But today, I tried the same thing in this pod: ...

"Pricing error for savings plan"

The website says "Pricing error for savings plan" when I try to create a savings plan for my A6000 server. Both the 3-month and 6-month plans just give an error message and I'm unable to create a savings plan.

/workspace not writable

When I turned off the pod (ID: n2srovqha2mlj5), everything was working. I turned it on and I can no longer write to /workspace:
```
$ echo test > /workspace/file
$ cat /workspace/file
$
```
...

Tokenizer error

OSError: Can't load tokenizer for 'openai/clip-vit-large-patch14'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'openai/clip-vit-large-patch14' is the correct path to a directory containing all relevant files for a CLIPTokenizer tokenizer. Can someone help me? This suddenly appeared. Running the normal Stable Diffusion template. Images fail to generate....

How to use the comfyui API when running it inside Runpod GPU pods

I can use the UI running on port 3000 with the template runpod/stable-diffusion:comfy-ui-5.0.0, but I am not able to call the API. Is there any documentation or are there examples for this scenario? I am using this example code to call the API: https://github.com/comfyanonymous/ComfyUI/blob/master/script_examples/basic_api_example.py Please help....
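The linked basic_api_example.py posts a workflow to ComfyUI's /prompt endpoint on 127.0.0.1; from outside the pod the same request has to go through the pod's HTTP proxy URL instead. A minimal sketch, assuming the API is served on the same port as the UI (3000) and using a placeholder pod ID:

```python
# Hypothetical sketch: call the ComfyUI /prompt endpoint through the RunPod
# HTTP proxy instead of 127.0.0.1. POD_ID is a placeholder; the workflow dict
# is the API-format workflow exported from the ComfyUI UI.
import json
import urllib.request

POD_ID = "YOUR_POD_ID"  # placeholder
BASE_URL = f"https://{POD_ID}-3000.proxy.runpod.net"  # assumes the API shares port 3000 with the UI

def queue_prompt(workflow: dict) -> dict:
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(f"{BASE_URL}/prompt", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# workflow = json.load(open("workflow_api.json"))  # exported via "Save (API Format)" in the UI
# print(queue_prompt(workflow))
```

The rest of basic_api_example.py should carry over unchanged as long as the server address is swapped for the proxy URL.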

GPU Host Registration

I run an LLM infra startup funded by a few top-tier VCs. We built our own dedicated cluster for research, but have spare capacity that we would like to register as a host on RunPod (16 H100s and 16 L40Ss). Curious if someone could DM me about the process for hosting? I read on the website that I should ping here in Discord. Thanks!