Pods issues
Hello, I haven't been able to access my pods for around 13 hours... Some of them show this warning: "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime." But there are many more pods without any warning that I also cannot access. Please help.
Maintenance scheduled: 5 days downtime and data loss. What does this mean?
My pod is showing this message
Maintenance Scheduled
This machine is scheduled for a kernel and driver update. Please transfer your data ahead of time, since there will be data loss.
Start: 06/24/2024 15:01 Local Time...
Ram issue
Hello guys, I am running the setup on the attached picture.
The image I am trying to pull is cognitivecomputations/dolphin-2.9.2-qwen2-7b from huggingface.
Even though I have a lot of RAM, I am getting this error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 9.25 GiB. GPU ...
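System RAM doesn't help here: `torch.cuda.OutOfMemoryError` means the model doesn't fit into the GPU's VRAM, which is a separate, much smaller pool. As a rough sanity check (a sketch with assumed round numbers, not exact figures for this model), the weights of a 7B-parameter model alone need roughly parameters times bytes-per-parameter:

```python
def weights_gb(params_billion, bytes_per_param):
    # Weight footprint only; KV cache, activations and the CUDA
    # context add several more GB on top of this at runtime.
    return params_billion * bytes_per_param

fp16 = weights_gb(7, 2)    # 14.0 GB -- tight on a 16 GB card once overhead is added
int4 = weights_gb(7, 0.5)  # 3.5 GB -- 4-bit quantization fits far more comfortably
```

So the 9.25 GiB allocation in the error is plausibly a KV cache or activation buffer pushing past whatever headroom the weights left; a larger-VRAM GPU or a quantized load would both address it, regardless of how much system RAM the pod has.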
free credits
Hi, can I get 1 hour of free credit on a 24 GB GPU to test whether my script works? If so, I will buy credits.
Empty Root - No workspace on Exposed TCP Connection
I have just created a connection over exposed TCP for the first time and finally got to SSH into my machine. However, when I ls my actual installation, nothing is there. It's frustrating, as I'm used to the "workspace" folder that is needed to save files between uses of the machine. Did I miss something in the setup, or is this how it is supposed to be?
Disk quota exceeded
I have about 19 GB of free disk space on the workspace in my pod, but I am still getting "disk quota exceeded". Any leads, please? Thanks in advance.
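One common cause (an assumption here, since the pod's mounts aren't shown): the quota being exceeded is on the container disk (e.g. under / or /root), not on the workspace volume, so free space on /workspace doesn't help. A quick sketch for finding which directories are eating space:

```python
import os

def dir_size_gb(path):
    """Sum file sizes under path, skipping anything we can't stat."""
    total = 0
    for root, _, files in os.walk(path, onerror=lambda e: None):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # broken symlinks, permission errors, etc.
    return total / 10**9

# e.g. compare the container disk against the workspace volume:
# print(dir_size_gb("/root"), dir_size_gb("/workspace"))
```

On the pod itself, `du -h --max-depth=1 /` does the same job faster; pip and Hugging Face caches under the home directory are frequent culprits.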
How to exclude servers in planned maintenance?
I'm preparing the production environment for our release this weekend. When I pick 4 x RTX 4000 Ada, I end up with a server that is flagged for maintenance in the coming days. Is there a way to exclude servers that are scheduled for maintenance?
Thanks...
Run multiple finetuning on same GPU POD
I am using
- image: runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
- GPU: 1 x A40
While running QLoRA finetuning with 4-bit quantization, the GPU uses approx. 12 GB of GPU memory out of 48 GB. How can I run multiple finetunings simultaneously (in parallel) on the same pod GPU?...
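Nothing prevents several training processes from sharing one GPU as long as their combined memory fits; each process simply gets its own CUDA context. A rough budgeting sketch for the numbers above (the headroom figure is an assumption covering CUDA contexts and fragmentation, not a measured value):

```python
total_gb = 48      # A40 VRAM
per_run_gb = 12    # observed footprint of one QLoRA run
headroom_gb = 4    # assumed buffer for CUDA contexts and fragmentation

# how many runs fit side by side without overcommitting VRAM
max_parallel_runs = (total_gb - headroom_gb) // per_run_gb
print(max_parallel_runs)  # 3
```

In practice you would start each run in its own shell session or tmux pane (e.g. `nohup python train.py ... &`). Note the runs share the GPU's compute as well as its memory, so three parallel runs will not finish three times faster than one.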
Problem connecting to ComfyUI
I'm running the Stable Diffusion Kohya_ss ComfyUI Ultimate template on an RTXA500, pod ID: uj3551nw4ul5l9
The pod seems to start fine and lets me connect to all the ports (including the JupyterLab port, 8888) except for the ComfyUI port, 3020. I've attached screenshots of every relevant detail I could think of.
Thank you!...
Solution:
Your volume is full, which might be causing the issue.
SD ComfyUI unable to POST due to 403: Forbidden
When I use ComfyUI locally there is no problem, but when I use my pod as a backend and try to POST through Flask to https://[id]-3000.proxy.runpod.net, I always receive "ERROR in app: Error during placeholder: HTTP Error 403: Forbidden".
Is that even possible? Is there another way of doing that?
In my Flask app.py, I'm trying to do this:
ws = websocket.create_connection(f"wss://{server_address}/ws?clientId={client_id}")
server_address would be [id]-3000.proxy.runpod.net...
Solution:
OK, I fixed it. I just had to change the exposed port from HTTP to TCP and access it via the pod's public IP plus the mapped port.
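That fix can be sketched against the original Flask snippet. The IP and port below are placeholders (read the real values from the pod's Connect tab); the point is that the connection goes straight to ip:port, so the proxy that returned the 403 is no longer involved:

```python
# Placeholders -- substitute the pod's public IP and the external
# TCP port that RunPod mapped to ComfyUI's internal port 3000.
pod_ip = "203.0.113.7"
mapped_port = 14288

# Plain ws:// against ip:port, bypassing the HTTPS proxy entirely.
server_address = f"{pod_ip}:{mapped_port}"
url = f"ws://{server_address}/ws?clientId=my-client"
# With the websocket-client package installed, the original call becomes:
# ws = websocket.create_connection(url)
```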
What is the recommended GPU_MEMORY_UTILIZATION?
All LLM frameworks, such as Aphrodite or OobaBooga, take a parameter where you can specify how much of the GPU's memory should be allocated to the LLM.
1) What is the right value? By default, most frameworks are set to use 90% (0.9) or 95% (0.95) of the GPU memory. What is the reason for not using the entire 100%?
2) Is my assumption correct that increasing the memory allocation to 0.99 would enhance performance, at a slight risk of an out-of-memory error? This seems paradoxical: if the model doesn't fit into GPU memory, it should throw an out-of-memory error at load time. Yet I have noticed that it is possible to get an out-of-memory error even after the model has loaded successfully at 0.99. Could it be that memory usage sometimes exceeds the allocation, so a bit of buffer room is needed?...
Solution:
0.94 works
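To make the trade-off concrete: the utilization fraction caps how much VRAM the engine may claim for weights plus KV cache, and the remainder is headroom for the CUDA context, allocator fragmentation, and transient buffers. Those allocations sit outside the engine's budget, which is why 1.0 can OOM even after the model has loaded. A quick sketch with assumed numbers:

```python
def headroom_gb(total_gb, utilization):
    # VRAM the engine leaves untouched for the CUDA context,
    # fragmentation, and transient allocations.
    return total_gb * (1 - utilization)

print(round(headroom_gb(24, 0.94), 2))  # 1.44 GB of slack on a 24 GB card
print(headroom_gb(24, 1.00))            # 0.0 -- any transient spike now OOMs
```

So 0.94 is less a magic number than "enough slack for the overhead on that particular card"; a card with more VRAM can usually tolerate a higher fraction, since the absolute overhead stays roughly constant.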
Install Docker on 20.04 LTS
hello all,
Trying to run containers with Docker on a pod running Ubuntu 20.04.
After installing Docker and running the "hello world" Docker test, I get this error:
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?....
Solution:
Pods are already Docker containers; you cannot run Docker inside Docker.
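A quick way to confirm this from inside the pod (a sketch using two common heuristics; neither is a guaranteed check):

```python
import os

def likely_in_container():
    # Docker creates /.dockerenv inside every container it starts.
    if os.path.exists("/.dockerenv"):
        return True
    # cgroup entries mentioning docker/containerd are another common hint.
    try:
        with open("/proc/1/cgroup") as f:
            return any("docker" in line or "containerd" in line for line in f)
    except OSError:
        return False

print(likely_in_container())
```

If you need container tooling in your workflow, the usual workaround is to build the image on a machine you control and then launch a pod from that image, rather than trying to nest Docker inside the pod.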
Pod GPU assign issue
Recently I started noticing that new pods sometimes get stuck at this step during initialisation. Sometimes it works, sometimes it won't. Is anyone else facing this?
---------stdout------
Unable to determine the device handle for GPU0000:08:10.0: Unknown Error
---------stderr------...
Pod Unable to Start Docker Container
I've tested this Docker image on my local computer and on other servers; however, on RunPod it seems to be stuck in a loop displaying "start container". Is this an issue others have encountered before?
How can I install a Docker image on RunPod?
I had a chat with the maintainer of aphrodite-engine and he said I shouldn't use the existing RunPod image as it's very old.
He said there is a docker that I should utilise: https://github.com/PygmalionAI/aphrodite-engine?tab=readme-ov-file#docker And here is the docker compose file:...
CPU Only Pods, Through Runpodctl
Heyo! Is there a way to create CPU-only pods through runpodctl? I don't see a flag for CPU type, only for the number of vCPUs and GPUs.
Solution:
It is not currently supported.