RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⛅|pods

Training runs 2-5x slower on pods than on home system.

Home system: 4090, 7950X, 64GB RAM, M.2 SSD. In comparison: 1x 4090: 2.5-3x slower on ALL ops. L40: 5x slower...

Completely lost connection to volume at CA-MTL from a different account just now

About ten minutes ago the serverless and pod services all went dead, with different errors related to file I/O. Error: [Errno 6] No such device or address...

how do I fork a community template

I want to increase the volume of a community template, how can I do this? I asked the AI and it said I can copy templates from the Explore Templates section, but I don't see any button/UI to do this unfortunately...

Authenticate to AWS ECR private repository

Hello, I have my private image on AWS ECR. I have created a new IAM role that can pull that image and I have filled in those credentials inside the template.
However, when I run a pod I'm getting some errors: error pulling image: Error response from daemon: unauthorized: Not Authorized. What's the correct way to pull images from AWS ECR?...
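
As far as I know, ECR does not accept IAM access keys directly as a registry username and password. A Docker login to ECR uses the username `AWS` and a short-lived password (valid for 12 hours) obtained from `aws ecr get-login-password` or the `GetAuthorizationToken` API, which returns a base64-encoded `AWS:<password>` string. The sketch below only shows how such a token splits into the two fields; `split_ecr_token` is a hypothetical helper name, the token in the example is fake, and how template registry credentials cope with the token expiring is not covered here.

```python
import base64


def split_ecr_token(authorization_token: str) -> tuple[str, str]:
    """Decode an ECR authorization token into (username, password)."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    # Split on the first colon only, in case the password contains colons.
    username, password = decoded.split(":", 1)
    return username, password


if __name__ == "__main__":
    # Fake token for illustration only. A real one would come from
    # boto3.client("ecr").get_authorization_token() (needs boto3 and AWS access).
    fake = base64.b64encode(b"AWS:example-password").decode("ascii")
    print(split_ecr_token(fake))
```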

Run commands on restart

I want to run some commands on restart: they should not run on the initial start, but on every restart after that. How can I do this? I don't want to use the Container Start Command, since it would run on the initial start as well and would require me to reconfigure and start from scratch on my currently running machines.
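
One possible approach, sketched here as an assumption rather than a built-in RunPod feature: keep a marker file on storage that survives restarts, and have a startup script run the extra commands only when the marker already exists. The `run_on_restart` helper and the marker path are hypothetical names chosen for illustration, and the script still has to be invoked from whatever runs at container start.

```python
import subprocess
import tempfile
from pathlib import Path


def run_on_restart(marker: Path, commands: list[list[str]]) -> bool:
    """Run `commands` only if `marker` exists, i.e. on every start after the first.

    Returns True if the commands ran, False on the first start.
    """
    if not marker.exists():
        # First start: record that the pod has started once and run nothing.
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.touch()
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)  # stop at the first failing command
    return True


if __name__ == "__main__":
    # Demo in a temporary directory; on a pod the marker would need to live
    # on a persistent volume (the exact path is an assumption).
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / ".first_start_done"
        print(run_on_restart(marker, []))  # first start
        print(run_on_restart(marker, []))  # simulated restart
```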

Wandb giving 403 error

When running a training job on an L40 instance with a custom template, I get the following error: wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin 2025-02-25 03:27:06,293 - ERROR - 403 response executing GraphQL. 2025-02-25 03:27:06,293 - ERROR - ...

Can't stop my pod! Only terminate

I have an H100 SXM running on koboldcpp - I can't figure out how to stop the pod. There's no stop button in the UI. I can only terminate it. How do I simply pause it, so that I can unpause it later and continue?

I am trying to send my LoRA to runpod but I keep getting 'room not ready' on the web terminal

Here is my input: 1. cd ComfyUI/models/loras 2. runpodctl receive ... And here is where the error arises: it says 'securing channel...room not ready'. I don't know if this is relevant, but the first time I tried this it worked, though it only downloaded 70% of the way, so I restarted it. I don't know if I need to do it all over again or what. I've tried doing it with different pods but it will not work. Please help....

Jupyter

Please, I need your help! I always start my pod and then connect to the ComfyUI port and the Jupyter port (8888). Now when I try to connect to Jupyter, it doesn't connect: I put the link in the browser and nothing happens. Please help. The link is "https://<my pod -id>-8888.proxy.runpod.net/lab?"

using iptables with pods whilst maintaining jupyter access

Has anyone managed to do this? I've been installing iptables and some rules in the container start command. I set a rule to drop all outgoing packets and then selectively add in exceptions. One of those exceptions is to be able to communicate on Jupyter's 8888 port. However, when I start the pod, I no longer have the option of connecting to the pod via Jupyter. Any ideas?...

Limit Memory Usage

Multiprocessing requires a lot of memory, and the server just crashes when the threshold is reached (needing a pod restart). Is this the intended behavior? Is there a way to prevent this from happening so I don't have to keep restarting the server? Perhaps a way to set a server-wide memory usage limit before the threshold is hit?

H100 pod not connecting to network drive of the same region

I have a dual H100 pod that's supposed to be connected to a network drive (both on CA-MTL-1), but when I try to move data, do a git status of a repo, or even start a Python script residing on the network drive, the terminal hangs. Seems like a network issue? I've tried spawning dual H100 pods multiple times, but I'm getting the same IP (probably the same hardware?), so nothing changes. Trying this out from a machine with an RTX A5000 works fine! Is there something I can do?...

something wrong with pytorch2.4.0 image's jupyter

Most of the pods I created today using the pytorch2.4.0 template couldn't open JupyterLab, while 2.2.0 was fine. I wonder whether there were some updates to the Docker image.

4 x A40 never ready in CA

Created a 4 x A40 pod today in CA, but the pod never reaches the ready state: no logs, no connection...

Unable to connect to pod after launching H100s

This seems to be happening consistently today, every time we launch an H100 GPU

Pod image for network storage management

Hi, I use RunPod mainly for serverless ComfyUI, using network storage to host media and models. To manage the network storage, I assumed there is no other way than to use a Pod as a file manager. Maybe there are other solutions? ...

storage full error, disk write error

I am trying to unzip 1200 zip files of 1GB each. Each zip contains around 100,000 images. When I unzip them, first of all it takes a good amount of time, and second, after some time I get this error: `3/SynthImage/test/815/a55815_11_0_275.jpg: write error (disk full?). Continue? (y/n/^C)` even though the disk is 2048 GB. If I continue, it runs for some time and then gives the same error....
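
A "disk full?" error on a volume that still has free space can sometimes mean the filesystem or a quota has run out of inodes (file slots) rather than bytes, and 1200 archives of about 100,000 images each is on the order of 120 million files. Whether that is the cause here is only a guess. A minimal sketch for checking both numbers with Python's standard `os.statvfs` (Unix only); `disk_headroom` is a hypothetical helper name, and the path to check is whatever directory the files are being extracted into.

```python
import os


def disk_headroom(path: str) -> dict:
    """Report free bytes and free inodes for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,  # space available to non-root users
        "free_inodes": st.f_favail,               # file slots available to non-root users
        "total_inodes": st.f_files,               # may be 0 on filesystems that do not report it
    }


if __name__ == "__main__":
    # "." is a placeholder; point this at the extraction directory.
    print(disk_headroom("."))
```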

Question about service rate limits, etc.

Can runpod.io meet these needs? I would like to describe our usage scenario. Specifically, we are looking to provide a public network service, with initial users estimated to be around 2,000 to 10,000 (about 2,000 to 10,000 teachers from 30,000 middle schools). If each user has about 10 uses per day, that would result in approximately 20,000 to 100,000 requests per day. In this case, is there a possibility that runpod.io's rate limiting or circuit breaker would be triggered? Is it possible to co...
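
For a sense of scale, the figures in the post can be turned into an average request rate. This is only arithmetic on the numbers quoted above, assuming requests are spread evenly over 24 hours; real traffic from teachers would likely peak well above the average during school hours, and the result says nothing about what limits the service actually applies. `average_rps` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(users: int, uses_per_user_per_day: int) -> float:
    """Average requests per second if daily usage were spread evenly."""
    return users * uses_per_user_per_day / SECONDS_PER_DAY


low = average_rps(2_000, 10)    # 20,000 requests/day
high = average_rps(10_000, 10)  # 100,000 requests/day
print(f"{low:.2f} to {high:.2f} requests/second on average")
# prints "0.23 to 1.16 requests/second on average"
```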

Why is there still a daily charge after purchasing pod A40-48G with a one-time payment?

I purchased a GPU A40 *1 48G pod in Secure Cloud mode on February 17, Volume Disk: 60G Container Disk: 30G ...

error on gpu causing damages

Could I request a refund for this GPU? CUDA is not working and the experiment I ran is now broken and unusable.