RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

Cloud sync fails

Syncing to Dropbox fails; it always shows: "Something went wrong! some detail:..."

Cannot start Docker container

I use a custom Docker image. Here is the system log: 2024-08-15T04:53:11Z start container. Here is the container log: 2024-08-15T04:52:55.667454224Z /usr/local/bin/docker-entrypoint.sh: line 414: exec: docker: not found. SSHing to this pod responds "Container not running."...

libcudnn.so.9: cannot open shared object file: No such file or directory

Getting this error when using the CUDAExecutionProvider with onnxruntime-gpu. I'm building the container for CUDA 12 and installing onnxruntime-gpu 1.18 directly from Microsoft's package index to fully support CUDA 12. nvidia-smi works inside the container. Not sure why I'm getting this issue.
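
The loader error means the runtime is looking for cuDNN 9, which the image doesn't provide. One possible fix, sketched here as a Dockerfile fragment, is to install NVIDIA's cuDNN 9 wheel for CUDA 12 and put its library directory on the loader path. This is a sketch under assumptions: the site-packages path shown assumes Python 3.10 installed under /usr/local, so adjust it to your base image.

```dockerfile
# Hedged sketch: provide libcudnn.so.9 via NVIDIA's pip wheel and make it
# visible to the dynamic loader. Path assumes Python 3.10 -- adjust as needed.
RUN pip install --no-cache-dir "nvidia-cudnn-cu12>=9"
ENV LD_LIBRARY_PATH=/usr/local/lib/python3.10/site-packages/nvidia/cudnn/lib:${LD_LIBRARY_PATH}
```

Installing cuDNN 9 from NVIDIA's apt repository achieves the same thing; either way, the point is that ldconfig or LD_LIBRARY_PATH can resolve libcudnn.so.9 inside the container.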

Can't access pod

It's been down for over 16 hours; it would be great if this could be dealt with ASAP. It gets stuck on "Waiting for logs" if I try to turn it on.

Multiple containers on a single GPU instance?

Are there any plans to allow multiple Docker containers on a single GPU instance? I have workloads which do not utilize the full resources of a single GPU, and I'd like to organize them using multiple containers sharing one GPU. I don't believe there is a way to do this currently; the closest is to run multiple processes inside a single Docker container, but that is a Docker anti-pattern and not great for workload organization.
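
Until multi-container pods exist, the multi-process workaround mentioned above can at least be made manageable with a process supervisor instead of ad-hoc background jobs. A minimal sketch using supervisord inside one container; the program names and commands are placeholders for your own workloads:

```ini
; supervisord.conf -- illustrative only; worker-a.py / worker-b.py stand in
; for your own workloads sharing the single GPU.
[supervisord]
nodaemon=true

[program:worker-a]
command=python /app/worker-a.py
autorestart=true

[program:worker-b]
command=python /app/worker-b.py
autorestart=true
```

Both processes see the same GPU device; supervisord handles restarts and per-program logs, which recovers some of the organization you'd otherwise get from separate containers.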

Connecting Current Pod to Network Volume

Hello, is there a way to connect a current pod to a network volume, or would I have to transfer all the data into a network volume and set up a new pod? If that is the case, what's the fastest way to do it (I have a large dataset I would have to move)?
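
If a new pod is required, one way to move a large dataset between pods is runpodctl's peer-to-peer transfer; plain rsync over SSH is an alternative. A sketch under assumptions (paths and the one-time code are placeholders; /workspace is assumed to be where the network volume mounts on the new pod):

```shell
# On the old pod: pack the dataset and send it (runpodctl prints a one-time code)
tar czf dataset.tar.gz /workspace/dataset
runpodctl send dataset.tar.gz

# On the new pod, with the network volume attached:
runpodctl receive <code-printed-by-send>
tar xzf dataset.tar.gz -C /workspace
```

Packing into one archive first tends to be much faster than transferring many small files individually.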

Weird error when deploying LoRAX inference server

Hi guys, I'm trying to deploy the LoRAX inference server on a RunPod A100 PCIe pod and got a very weird error, attached in the image. Why is it weird? Because it only happens on some pods, not all. Do you know any reason for this?

Passwordless SSH doesn’t work half the time.

I'm using pods in the Secure Cloud. Half the time I can't SSH in: it asks for a password. My key is in authorized_keys and all the SSH server settings are right, but it won't accept my key, and debug logging gives no reason why. The template is a standard PyTorch 2.2 template from RunPod. The only workaround is to set a root password, allow password SSH, and enter my password every time, which is very annoying. It happens all the time, and then every now and then it doesn't and I can SSH in fine without a password. Nothing is different on my end: same template, same scripts doing the login. ...
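
One common cause of silent key rejection is sshd's StrictModes check: if the home directory, ~/.ssh, or authorized_keys is group- or world-writable (for example, recreated by a start script with a loose umask), sshd ignores the key and falls back to password auth without telling the client why. A sketch worth running at pod start, assuming the key is already in ~/.ssh/authorized_keys:

```shell
# Tighten the permissions sshd's StrictModes checks; loose modes on any of
# these make sshd silently skip the public key.
mkdir -p ~/.ssh
touch ~/.ssh/authorized_keys
chmod go-w ~
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```

Server-side debugging (e.g. running a second sshd with -d on a spare port and connecting to it) will also print the exact reason a key is skipped, which the client-side -v output never shows.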

Flux doesn't work in Stable Diffusion WebUI Forge on RunPod, although it seems to be possible

I've seen the tutorial on your blog for running Flux on RunPod, but it doesn't work for me; I get many errors that I can't solve (I'm not a programmer, sorry 😦). I would like to install Flux in Forge. Why doesn't it work in the version running on RunPod? Is it going to be possible...

vLLM doesn't seem to use the GPU

I'm using vLLM, and on the graph only CPU usage increases when I launch some requests. If I open a terminal and run nvidia-smi, I don't see any process either. Settings line...

Updated A1111 and now I can't connect to the WebUI port

I used git checkout master and git pull in the terminal to update, and now I can't connect to the port; I'm getting a 502. I already tried deleting the venv and waiting 30 minutes; no luck. I'm using the official RunPod A1111 template.

Pod resume failed: This machine does not have the resources to deploy your pod.

Hello! I'm getting this error: "Pod resume failed: This machine does not have the resources to deploy your pod. Please try a different machine." My pod is an RTX 3090 with a 10 GB container disk and a 60 GB volume disk. How can I prevent this from happening?...

Help! My port 3000 (A1111 WebUI) isn't starting up.

I'm using the ashleykza/a1111 template. It had been working fine until today, when I uploaded some new LoRAs.

Can't update custom nodes ComfyUI

New install. I update Comfy, then try to update ComfyUI Manager, and nothing happens. What am I doing wrong?

Pod with custom template has no TCP ports exposed

Hi, I just created my custom template and set the ports to be exposed in it, but after I deploy a pod, it has no ports exposed. Did I configure something wrong?...

IS disk slow

IS (IS-1, I think) disk speed is going at 658 MBps, while others like US-OR are going at 4000+ MBps.

Community RunPod template error (ComfyUI ashleykza)

I'm trying to deploy the ashleykza ComfyUI community RunPod template, but I'm getting this error. How can I proceed?
Solution:
So runpod/comfyui? I cannot find that one. But I found aitrepreneur/comfyui:2.3.5. Testing it now....

Syncing taking too long?

Hi everyone. I'm using the ULTIMATE Stable Diffusion Kohya ComfyUI InvokeAI pods. It worked well yesterday, but when I tried to create it again today, it got stuck on the A1111 sync (image attached). I've waited a while for this to go through, but no dice. I did this in the Secure Cloud; however, when I tried the Community Cloud, the syncing went fine. Does anyone know what's happening?...

How to store Model to Network Volume

I am saving my Hugging Face model with save_pretrained. Which base path do I pass so that the model is saved to the network volume instead of the container disk?...
Solution:
It is set in the template. The default mounts to /workspace. Often the best way to store models there is to create a symbolic link into /workspace...
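
A sketch of both options from the solution: save directly under the volume mount, or symlink the default Hugging Face cache onto it so downloads persist too. This assumes the network volume is mounted at /workspace (the template default); the hf-cache subdirectory name is illustrative, and WORKSPACE is overridable so the sketch also runs outside a pod.

```shell
# Assumes the network volume mounts at /workspace on a RunPod pod.
WORKSPACE="${WORKSPACE:-$HOME/workspace}"
mkdir -p "$WORKSPACE/hf-cache" "$HOME/.cache"
# Point the default HF cache at the volume so downloaded weights persist too.
[ -e "$HOME/.cache/huggingface" ] || ln -s "$WORKSPACE/hf-cache" "$HOME/.cache/huggingface"
```

With that in place, passing any path under the mount, e.g. model.save_pretrained("/workspace/models/my-model"), writes straight to the network volume rather than the container disk.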