RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


Pods issues

Hello, I haven't been able to access my pods for around 13 hours. Some of them show this warning: "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime." But there are many more pods without any warning that I also cannot access. Please help.

Maintenance scheduled: 5 days downtime and data loss. What does this mean?

My pod is showing this message: "Maintenance Scheduled. This machine is scheduled for a kernel and driver update. Please transfer your data ahead of time, since there will be data loss. Start: 06/24/2024 15:01 Local Time"...

Ram issue

Hello guys, I am running the setup in the attached picture. The image I am trying to pull is cognitivecomputations/dolphin-2.9.2-qwen2-7b from Hugging Face. Even though I have a lot of RAM, I am getting this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 9.25 GiB. GPU ...
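For what it's worth, system RAM does not help here: torch.cuda.OutOfMemoryError is about GPU VRAM, not host memory. A rough back-of-envelope sketch (the helper name is made up) for a 7B-parameter model:

```python
def model_weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Rough size of the model weights alone, in GiB (hypothetical helper)."""
    return n_params * bytes_per_param / (1024 ** 3)

# A 7B-parameter model such as dolphin-2.9.2-qwen2-7b:
fp16 = model_weight_gib(7e9, 2)    # ~13 GiB just for the weights
int4 = model_weight_gib(7e9, 0.5)  # ~3.3 GiB with 4-bit quantization

# Activations, KV cache, and the CUDA context come on top of this,
# so an fp16 7B model leaves little headroom on a 16 GiB card.
print(f"fp16: {fp16:.1f} GiB, 4-bit: {int4:.1f} GiB")
```

If the weights plus the 9.25 GiB the error mentions exceed your card's VRAM, the fix is a bigger GPU or quantization, not more system RAM.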

Free credits

Hi, can I get one hour of free credit with a 24 GB GPU to test whether my script works? If yes, I will buy credits.

Can I run Docker Desktop on RunPod?

I have been trying for a long time, but I cannot get it to run.

Empty Root - No workspace on Exposed TCP Connection

I have just created a connection over exposed TCP for the first time and finally got to SSH into my machine. However, when I ls my actual installation, nothing is there. It is frustrating, as I am used to the "workspace" folder that is needed to save files between uses of the machine. Did I miss something in the setup, or is this how it is supposed to be?

Disk quota exceeded

I have ~19 GB of free disk space on the workspace volume in my pod, but I am still getting "disk quota exceeded". Any leads, please? Thanks in advance.
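One thing worth checking, as an educated guess: the quota may be exhausted on the container disk (`/`) rather than on the network volume (`/workspace`). A small stdlib sketch to compare both mounts:

```python
import os
import shutil

def report(path: str) -> None:
    """Print total/used/free space for the filesystem containing `path`."""
    total, used, free = shutil.disk_usage(path)
    gib = 1024 ** 3
    print(f"{path}: {used / gib:.1f} GiB used of {total / gib:.1f} GiB "
          f"({free / gib:.1f} GiB free)")

# The quota can be hit on the container disk even while the workspace
# volume still has room, so check both mounts where they exist:
for mount in ("/", "/workspace"):
    if os.path.isdir(mount):
        report(mount)
```

If `/` is full, writing anywhere outside `/workspace` (temp files, pip caches, model downloads) will trip the quota despite the free space you see on the volume.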

How to exclude servers in planned maintenance?

I'm preparing the production environment for our release this weekend. When I pick 4 x RTX 4000 ada I end up with a server that is flagged for maintenance in the coming days. Is there a way to exclude servers that are planned for maintenance? Thanks...

Run multiple finetuning on same GPU POD

I am using image runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 with GPU: 1 x A40. While running QLoRA finetuning with 4-bit quantization, the GPU uses approx. 12 GB of memory out of 48 GB. How can I run multiple finetunings simultaneously (in parallel) on the same pod's GPU?...
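One way to share the card, assuming PyTorch: cap each process's slice with `torch.cuda.set_per_process_memory_fraction` and launch each run as a separate process. The helper below is a hypothetical sketch (the 10% headroom is a guess, not a measured value), and the torch call is commented out because it needs a CUDA device:

```python
def per_job_fraction(n_jobs: int, headroom: float = 0.10) -> float:
    """Share of GPU memory for each of n_jobs concurrent runs, reserving
    `headroom` of the card for the CUDA context and fragmentation."""
    if n_jobs < 1:
        raise ValueError("need at least one job")
    return (1.0 - headroom) / n_jobs

# In each training process (needs torch and a CUDA device, so not run here):
#   import torch
#   torch.cuda.set_per_process_memory_fraction(per_job_fraction(3), device=0)
# then start the runs as separate processes, e.g. from a shell:
#   python train.py --config run1.yaml &
#   python train.py --config run2.yaml &

print(per_job_fraction(3))
```

Note the runs still share compute, so three parallel jobs will each train slower than one job alone; the cap only prevents one run from starving the others of memory.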

Can I download audit logs?

Is there a way to fetch or download audit logs?

Problem connecting to ComfyUI

I'm running the Stable Diffusion Kohya_ss ComfyUI Ultimate template on an RTXA500, pod ID: uj3551nw4ul5l9. The pod seems to start fine and allows me to connect to all the ports (including the JupyterLab port 8888) except for the ComfyUI port 3020. I've attached screenshots of every relevant detail I could think of. Thank you!...
Solution:
Your volume is full, and that can cause issues like this.

SD ComfyUI unable to POST due to 403: Forbidden

When I used ComfyUI locally there was no problem, but when I use my pod as a backend and try to POST through Flask to https://[id]-3000.proxy.runpod.net, I always receive "ERROR in app: Error during placeholder: HTTP Error 403: Forbidden". Is that even possible? Is there another way of doing that? In my Flask app.py I am trying to do this: ws = websocket.create_connection(f"wss://{server_address}/ws?clientId={client_id}") where server_address would be [id]-3000.proxy.runpod.net...
Solution:
OK, I fixed it. I just had to change the exposed port from HTTP to TCP and access it via the public IP plus the port.
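The fix above can be sketched as follows. The helper name and the IP/port values are hypothetical (copy the real public IP and external port from the pod's Connect panel), and `websocket.create_connection` requires the `websocket-client` package:

```python
def ws_url(public_ip: str, external_port: int, client_id: str) -> str:
    """WebSocket URL for a port exposed as TCP: plain ws:// rather than
    wss://, since the direct connection does not go through the HTTPS proxy."""
    return f"ws://{public_ip}:{external_port}/ws?clientId={client_id}"

# Hypothetical values; use the real ones from the pod's Connect panel:
url = ws_url("203.0.113.7", 10123, "my-client")
# ws = websocket.create_connection(url)  # needs the websocket-client package
print(url)
```

The external port is usually different from the internal one (e.g. internal 3000 mapped to some high external port), which is another reason the proxy URL and the TCP endpoint are not interchangeable.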

What is the recommended GPU_MEMORY_UTILIZATION?

All LLM frameworks, such as Aphrodite or Oobabooga, take a parameter where you can specify how much of the GPU's memory should be allocated to the LLM.

1) What is the right value? By default, most frameworks use 90% (0.9) or 95% (0.95) of the GPU memory. What is the reason for not using the full 100%?

2) Is my assumption correct that increasing the allocation to 0.99 would improve performance, but also poses a slight risk of an out-of-memory error? This seems paradoxical: if the model doesn't fit into GPU memory, it should throw an out-of-memory error at load time. Yet I have noticed that it is possible to get an out-of-memory error even after the model has been loaded at 0.99. Could it be that memory usage can sometimes exceed this allocation, necessitating a bit of buffer room?...
Solution:
0.94 works
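The likely explanation for question 2: in vLLM-style frameworks the utilization fraction only bounds what the engine pre-allocates for weights and KV cache, while the CUDA context and temporary buffers live outside it, which is why defaults stay at 0.90-0.95. A sketch of the arithmetic (the helper name and the 24 GiB card are illustrative):

```python
def allocated_gib(total_gib: float, utilization: float) -> float:
    """VRAM the framework pre-allocates for weights + KV cache (sketch)."""
    return total_gib * utilization

# On a 24 GiB card (illustrative numbers):
for util in (0.90, 0.94, 0.99):
    alloc = allocated_gib(24, util)
    print(f"util={util:.2f}: {alloc:.2f} GiB reserved, "
          f"{24 - alloc:.2f} GiB left for CUDA context and temp buffers")
```

At 0.99 on a 24 GiB card only ~0.24 GiB remains outside the allocation, which is easily exceeded by the CUDA context alone; hence OOM errors after the model has loaded.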

Install Docker on 20.04 LTS

Hello all, I'm trying to run containers with Docker on a pod with Ubuntu 20.04. After installing Docker and running the "hello world" Docker test, I get this error: docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?...
Solution:
Pods are already Docker containers; you cannot run Docker inside of Docker.
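If you want to confirm this from inside the pod, here is a quick shell check (an assumption-laden heuristic: it relies on `/.dockerenv` or cgroup naming, which varies by container runtime):

```shell
# Quick check from inside the pod: is this already a container?
if [ -f /.dockerenv ] || grep -q docker /proc/1/cgroup 2>/dev/null; then
  echo "running inside a container - no Docker daemon can start here"
else
  echo "running on a bare host"
fi
```

Inside a pod this prints the first message, which is why the daemon's socket at /var/run/docker.sock never exists.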

Pod Network Issue Stuck

Pod id: ul5cbpu7iavded

Pod GPU assign issue

Recently I started noticing that new pods sometimes get stuck at this step while initializing. Sometimes it works, sometimes it won't. Anyone else facing this? ---------stdout------ Unable to determine the device handle for GPU0000:08:10.0: Unknown Error ---------stderr------...

Pod Unable to Start Docker Container

I've tested this Docker image on my local computer and on other servers; however, on RunPod it seems to be stuck in a loop displaying "start container". Is this an issue others have encountered before?

How can I install a Docker image on RunPod?

I had a chat with the maintainer of aphrodite-engine, and he said I shouldn't use the existing RunPod image as it's very old.
He said there is a Docker image I should use instead: https://github.com/PygmalionAI/aphrodite-engine?tab=readme-ov-file#docker And here is the docker compose file:...

CPU Only Pods, Through Runpodctl

Heyo! Is there a way to create CPU-only pods through runpodctl? I don't see a flag for CPU type, only for the number of vCPUs and GPUs.
Solution:
It is not currently supported.