RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⛅|pods

TCP port external mapping keeps changing every time pod restarts.

I’m setting up my remote development environment using ssh sftp. Everything works fine. However, every time the pod restarts, the external mapping for port 22 keeps changing, which makes my local IDE unable to connect to the remote because now it’s on a different port. Is there a way to fix this? Or any workaround if it’s by design? Thanks in advance.
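One possible local workaround, not confirmed in this thread: let the IDE connect through an SSH host alias, and rewrite that alias's `Port` after each restart. A minimal sketch in Python, assuming a hypothetical alias `runpod-dev` in `~/.ssh/config` and that the new external port is copied by hand from the pod's connection details:

```python
def set_ssh_port(config_text: str, host_alias: str, new_port: int) -> str:
    """Return ssh config text with the Port of one Host block set to new_port.

    host_alias is a hypothetical alias chosen by the user; nothing here
    talks to RunPod, it only edits text in the ssh_config format.
    """
    out = []
    in_block = False   # True while we are inside the target Host block
    done = False       # True once the Port line has been written
    for line in config_text.splitlines():
        words = line.split()
        is_host_line = len(words) >= 2 and words[0].lower() == "host"
        if is_host_line:
            # Leaving the target block without a Port line: add one first.
            if in_block and not done:
                out.append(f"    Port {new_port}")
                done = True
            in_block = host_alias in words[1:]
        elif in_block and not done and words and words[0].lower() == "port":
            # Replace the existing Port line, keeping its indentation.
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}Port {new_port}")
            done = True
            continue
        out.append(line)
    # Target block was the last one in the file and had no Port line.
    if in_block and not done:
        out.append(f"    Port {new_port}")
    return "\n".join(out) + "\n"
```

The function only transforms text, so it can be tried on a copy of the config before writing anything back to `~/.ssh/config`.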

I get AttributeError

AttributeError: module 'gradio.layouts' has no attribute 'all'

Controlnet SDXL Models Don't Work

Hi there, I recently started using Stable Diffusion with Automatic1111 on RunPod. If I use the preinstalled ControlNet models for 1.5 (safetensors), everything seems to work perfectly, but if I install ControlNet models for SDXL they do not seem to respond to the server. Does anyone know how I can use ControlNet with SDXL models on RunPod? I use the template RunPod Stable Diffusion. Thank you for your help!...

Extremely poor performance PODs with the RTX 4090

Hi. I'm building DeepFake with DeepFaceLab and today I've run already 3 PODs with rtx 4090 and they all give different performance, and very bad. A couple of weeks ago I did the same work I'm doing now. My POD was with rtx 4090 and was giving a performance of 0.250ms per iteration. CPU utilization was 20-30% and GPU utilization was over 90% always. Today I ran the same process on three PODs with RTX 4090 and they are running extremely weird. On one the performance was 0.850ms per iteration. On the other two about 1.100ms per iteration! All three PODs have CPUs loaded at 100% and I tested the GPUs with the command (nvidia-smi) for a long time and got strange results. GPUs are not loaded most of the time and only have one-off spikes up to 5-30% from time to time. ...

Error on RunPod Pytorch 2.1

I have been running a notebook for 4 days now and checked it this morning only to find an error and the notebook not responding. Can the files that I have been creating still be salvaged? The error in the system logs reads as follows: "error starting container: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: device error: GPU-01b243b9-1ca8-6920-fe03-1e4fa929b889: unknown device: unknown"...

No CUDA GPU available after not using GPU for a while

Hi! I need some help regarding my GPU pod. My pod shows no cuda GPU available out of nowhere a lot of times and only gets fixed if I restart the pod. nvidia-smi output: Failed to initialize NVML: Unknown Error ...

Hi! Sometimes I can download models from Civitai using wget, but other times I can't. Example:

Solution:
curl -LOJH "Authorization: Bearer xxxxxx" "https://civitai.com/api/download/models/342732?type=Model&format=SafeTensor&size=pruned&fp=fp16"
You would need to generate your own API key and replace xxxxxx...
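The `&` characters in the download URL are shell metacharacters, so the URL has to be quoted when the command is pasted into a terminal. A minimal sketch in Python that assembles the same curl command with safe quoting; the helper name `civitai_curl_command` is made up for illustration, and `xxxxxx` remains a placeholder for your own API key:

```python
import shlex


def civitai_curl_command(url: str, api_key: str) -> str:
    """Build a shell-safe curl command line for an authenticated download."""
    args = [
        "curl",
        "-L",  # follow redirects
        "-O",  # write the response to a file
        "-J",  # use the file name from the Content-Disposition header
        "-H", f"Authorization: Bearer {api_key}",
        url,
    ]
    # shlex.join quotes every argument that contains shell metacharacters,
    # such as the spaces in the header and the ? and & in the URL.
    return shlex.join(args)


url = (
    "https://civitai.com/api/download/models/342732"
    "?type=Model&format=SafeTensor&size=pruned&fp=fp16"
)
print(civitai_curl_command(url, "xxxxxx"))
```

Without the quotes, the shell treats everything after the first `&` as separate commands, and the request goes out with only the `type=Model` parameter.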

Kernel version discrepancy between Pods.

I have rented several GPU pods with 4090s and I sometimes get warnings about the kernel version being too low, so the host systems are apparently not all running the same Linux kernel version. How can this be fixed? I am using this base Docker image and it works 90% of the time, but when I get the error I can start a new Pod and not get the error.
nvidia/cuda:12.1.0-devel-ubuntu22.04
...

Whatever I do, the ports do not open for the service

1. There is nothing in the logs that indicates something is wrong (attached screenshot) 2. I've tried multiple images and GPU types....

API to query Pods

Is there an API available to query our pods and the utilisation on each pod?

Exposed Port 8888

I'm not running Jupyter, but I've left 8888 exposed as I'm running another service on that port. However I cannot connect to 8888 remotely, only locally on the machine. Is there any other setting I need to configure?

Question about Pods and data

Hi! I have a quick question about how data on pods works. Let's say I use the ComfyUI Stable Diffusion template, but I want to add some models, so I go into the pod and add them in whatever way, whether through the CLI or the ComfyUI Manager. If the pod goes down or some kind of interruption happens, do I lose my custom models when the pod restarts?...

Availability of A40, A6000

Which region has the highest availability of the above GPUs? Looking to deploy an endpoint and want to ensure minimal throttling.

Slow CPU

Hi, I'm facing a big problem! On A100 and L40 graphics cards in Secure Cloud I am experiencing very low CPU performance. On 3080 and 3090 Ti in Community Cloud, CPU speed is quite good...

slow GPU across many community cloud pods

just today I am having an issue with stable diffusion speeds anywhere from 4 it/s to 5 s/it, mostly the slow side. I'm on the third pod I've tried - two 3090s and now a V100, it's made no difference and it's making it unusable. Is there something silly that I might have done to cause this, since it doesn't seem to be the fault of the pod? ashleyk...

CPU Pod with shm size larger than physical RAM

I would like to memory map a large file that is larger than the available physical RAM. IIUC this requires changing the size of /dev/shm. What is the best way to do this in RunPod?
Solution:
You can’t do it
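As a side note beyond the answer above, and only an assumption about the poster's use case: memory-mapping a regular file on disk does not go through `/dev/shm`, because the kernel pages the file in on demand. A minimal sketch in Python, assuming the file lives on an ordinary filesystem rather than tmpfs; the helper name `read_slice` is hypothetical:

```python
import mmap


def read_slice(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` from a file through a read-only mapping.

    Only the pages that are actually touched get loaded, so the mapped file
    can be larger than physical RAM on a 64-bit system.
    """
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; this raises ValueError for empty files.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]
```

This does not change the size of `/dev/shm`; files placed on `/dev/shm` itself are backed by RAM and stay subject to its limit.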

With a custom template true ssh asks for a password, proxy ssh works perfectly.

I have created my custom Docker image from one of the official Docker images of PyTorch 2.2.0. I can connect with proxy ssh, but when connecting with true ssh my computer offers the correct key and the server asks for a password anyway. I can connect through the proxy ssh and everything is fine. The public key is in the authorization file in the root/.ssh directory. The sshd config seems the same as in other pods. I am using the same start.sh script as in other instances. I see in the log that the script is...

multiple nodes

Hello, is it possible to get multiple H100 SXM5 nodes for a multi-node run?

Can't access pods after network outage

Two of my pods say that they've suffered a network outage and now I can't access them, it keeps getting stuck on startup with the message "Waiting for logs". Are these pods unreachable? How long should I wait? Is there any way I can retrieve the mounted volumes? ID: rob1e5oebdvrsa ID: g26eqo9wacdosd...
Solution:
@Papa Madiator Got access to the pods again! Thanks!

wget doesn't work on Civitai models

I tried to use wget with a download link, but it says unauthorized access and nothing happens. It was a 5 GB SD1.5 model. How do I fix this?