RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


⛅|pods

Create new pod with runpodctl

I'm trying to create a pod with runpodctl. Reading the --help output, it appears I can't create a pod that uses network storage for /workspace? I didn't find the correct option to pass. Maybe with --args? Bonus points: how can I create a pod with specific requirements? E.g.: start a pod with 48 GB of VRAM costing less than $1/hr. It could start a pod with 2x A5000 or 1x A6000 depending on available resources....
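For reference, a command along these lines is the usual shape; the flag names below are assumptions based on a recent `runpodctl create pod --help` (older builds may lack `--networkVolumeId`, which would match the question's observation), so verify against your installed version. The sketch only prints the command rather than running it:

```shell
# Sketch only: flag names are assumptions from a recent `runpodctl create pod --help`;
# YOUR_VOLUME_ID is a placeholder for your network volume's ID.
CMD="runpodctl create pod \
  --name my-pod \
  --imageName runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 \
  --gpuType 'NVIDIA RTX A6000' \
  --gpuCount 1 \
  --cost 1.0 \
  --networkVolumeId YOUR_VOLUME_ID \
  --volumePath /workspace"
echo "$CMD"
```

Note that `--cost` (max $/hr) plus `--gpuType` pins one GPU model; there is no single flag for "any combination totalling 48 GB VRAM", so the 2x A5000 vs. 1x A6000 fallback would need two attempts or a script looping over candidate types.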

Community cloud servers repeatedly fail to correctly download containers

A100 PCIe 80GB servers repeatedly fail to download parts of containers. The easiest template to reproduce this on is the Text Generation Web UI and APIs template. Simply start it, and if you get a CUDA error, it's because that part of the container silently failed to download.
Solution:
Sounds like you didn't read the README and select the correct CUDA version.

Urgent: All new gpu pods are broken

Hi, our existing pods and the new pods we are creating all have the same issue: they cannot find CUDA devices, all giving the error: Warning: caught exception 'CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.', memory monitor disabled...

CPU Pods NOT WORKING

I'm having issues with my own template, which worked for months, and also with your template runpod/base:0.5.1-cpu. This is the error: 2024-03-14T11:35:15Z error creating container: Error response from daemon: invalid --security-opt 2: "privileged-without-host-devices=true"...

GPU usage when pod initialized. Not able to clear.

Tried nvidia-smi -r, restarting, and resetting. There is still usage on one GPU in the pod.

Chat History, Memory and Messages

I have a general usage question for RunPod: when I open the web UI, under Parameters > Chat history I can upload a chat history .JSON. Now that I've uploaded it, how do I make the AI chat actually use it, so I can continue the conversation? Another question: where do I put the things I want the AI to remember at all times? And is it possible to edit messages?

Increase number of GPUs in an existing pod?

Hey folks, I have an existing pod with 2 A100 GPUs. I want to add two more to it. Is it possible? I didn't find it in the UI.
Solution:
Not possible, can only be done when you create the pod.

Keeping reverse proxy hostname between destroy/start

Hello, I'm using network storage for my pod. My use case doesn't require the container to be up 24/7. I noticed there was no stop button on the web GUI but I was able to start/stop container with the API. So I did. I think this is what could cause my pod to run without GPU attached. I found that I can only start the pod without GPU on the web GUI this morning. I was trying to stop container instead of destroying it because I want to keep the same container id, so my reverse http proxy hostname doesn't change each time....
Solution:
At this time there's no way to keep the same pod ID after you terminate a network volume pod - there's no stop state for network volume pods (it's really specifically designed for machine based storage so the storage can be kept and the GPU can be freed up) I can definitely see how this would be a feature gap though so I will bring it up to the team...
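For context on why the hostname changes: the HTTP proxy hostname is derived from the pod ID, so a new pod necessarily means a new hostname. A small sketch (the `<pod-id>-<port>.proxy.runpod.net` format and the `RUNPOD_POD_ID` environment variable are how RunPod's proxy is commonly described, but verify against the current docs; `abc123xyz` is a placeholder):

```shell
# Inside a pod, RunPod exposes the pod ID as RUNPOD_POD_ID; the proxy URL
# format is assumed to be <pod-id>-<port>.proxy.runpod.net.
POD_ID="${RUNPOD_POD_ID:-abc123xyz}"   # placeholder fallback when run outside a pod
PORT=8888
URL="https://${POD_ID}-${PORT}.proxy.runpod.net"
echo "$URL"
```

This is why pointing an external reverse proxy at a pod URL breaks on re-creation; a DNS CNAME or a small redirect you update after each pod launch is the usual workaround.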

CUDA 12.0 template is missing

Can you please let me know how to get PyTorch 2.0.1 and CUDA 12.0? I was using this template for months and now it is gone. None of my training works on the other templates. Thanks.

Not able to connect to Web Terminal after increasing the container disk size of the pod

I created a GPU pod and was able to connect to the Web Terminal fine. However, due to a disk space issue, I increased the container disk from 20G to 80G. After that, I was no longer able to connect to the pod; it always complains that the connection was closed.

Need to move credit from personal account to team account

Hello, we wanted to add credit to the team account, but it wasn't clear we were adding to personal account. Please help us move credit. This is very very very urgent.
Solution:
Beautiful

Waiting for hours

Hello, I'm waiting for my GPU to connect. However, I've been waiting hours at "waiting for the logs". What should I do? Image ID: 4smb1047x6p4qs...

error in pod

"2024-03-12T08:39:43.053682465Z /usr/bin/python3: Error while finding module specification for 'vllm.entrypoints.openai.api_server' (ModuleNotFoundError: No module named 'vllm')" I always run this on an A6000, but now it is getting this error. Why is this happening?...

Why are secure cloud pods so slow?

I'm pretty sure I just wasted a few hours trying to find a decent pod that isn't being bottlenecked by its other hardware. I only managed to find one pod a few days ago that gave me 3 it/s while training a model, and it was a community pod.

Different levels of performance from same GPU types in Community Cloud

When I use an A5000 in Community Cloud, I can get over 3 it/s training Kohya_ss in the ES and FR regions, but only a pathetic 1 s/it in the BG region. If different hosts are going to offer different levels of performance, they should not all earn the same fixed rate; a less performant host should earn less than the ones with decent performance.

No GPU, RO RTX4090 node

Hi, it seems like there is an issue with the RTX 4090 Romanian node: there appears to be no GPU attached while I pay the regular price (not CPU-only). Maybe unrelated, but something also prevents me from starting it with 2 GPUs. The nvidia-smi command returns "Failed to initialize NVML: Driver/library version mismatch" and llama.cpp says "ggml_init_cublas: no CUDA devices found, CUDA will be disabled". No issue with the RTX 4000 Romanian nodes....
Solution:
@f4242 you used an Ubuntu template that does not have CUDA preinstalled; use RunPod Pytorch instead.

Could not find CUDA drivers

I am experiencing issues with the Stable Diffusion Kohya_ss ComfyUI Ultimate template. I have set up an RTX 3090 pod, transferred the training images, and set up Kohya. I am really new to RunPod, so I apologise if I'm misunderstanding something or missed something obvious. When I begin training, the Kohya log file displays the following message:...
Solution:
Check your GPU memory and GPU utilization and you will see that the GPU is in fact being used. This is just a spurious TensorFlow error.

Ignore root start.sh and use custom persistent script.

I'm trying to avoid using start.sh since I need to experiment with some different processes. I've copied the contents of start.sh and pointed the container start command at an install_req.sh script in the workspace folder. I'm also unchecking "start jupyter notebook" and "ssh terminal access" since I don't want the container to run the original start.sh file. Maybe I'm confusing how these work or missing something. The logs show that all is good, but I can't use the RunPod UI to start Jupyter anymore, and the SSH connection no longer works. Why is that happening even though install_req.sh is the same as start.sh in the root dir?...
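One common pattern for this kind of problem is a wrapper that runs custom setup and then hands off to the template's original entrypoint, so Jupyter and sshd still come up. A hypothetical sketch (the `/start.sh` and `/workspace/install_req.sh` paths come from the question and typical RunPod templates, but check where your template actually keeps its start script; the sketch writes into a temp dir standing in for /workspace):

```shell
# Hypothetical wrapper: do your own setup, then exec the template's start.sh
# so the services the UI expects (Jupyter, sshd) still launch.
WORKDIR="$(mktemp -d)"   # stand-in for /workspace in this sketch
cat <<'EOF' > "$WORKDIR/custom_start.sh"
#!/bin/bash
bash /workspace/install_req.sh   # your extra setup from the question
exec /start.sh                   # hand off so Jupyter and sshd still come up
EOF
chmod +x "$WORKDIR/custom_start.sh"
echo "wrote $WORKDIR/custom_start.sh"
```

You would then point the container start command at the wrapper instead of replacing start.sh outright; replacing it entirely is exactly what breaks the UI's Jupyter button and SSH.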

streamlit app not loading up on CPU node

This is my Dockerfile:
```
FROM runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
WORKDIR /workspace...
```

Issues with changing file permission to 400

I have an SSH key whose permissions I'm trying to set to 400 by running chmod 400 id_rsa_git, but upon running ls -l I'm seeing the permissions as 444...
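For comparison, on a regular local Linux filesystem chmod 400 does stick; seeing 444 afterwards usually points at the file living on a mount (e.g. a network volume or certain overlay mounts) that ignores or remaps mode bits. A minimal local demonstration, assuming GNU coreutils:

```shell
# On a normal local filesystem, chmod 400 should be reflected exactly.
tmpfile="$(mktemp)"
chmod 400 "$tmpfile"
perms="$(stat -c '%a' "$tmpfile")"   # GNU stat; prints the octal mode
echo "$perms"
rm -f "$tmpfile"
```

If the same steps on the pod's /workspace yield 444, the volume's mount options are the likely culprit; a workaround is to keep the key on the container disk (e.g. under ~/.ssh) rather than on the network volume.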