RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


Very inconsistent performance

I recently started using RunPod and am a fan of the setup simplicity and pricing. I have recently noticed a huge amount of inconsistency in performance, with identical training runs taking up to 3x longer to finish. I am on the Secure Cloud. Do you know why this may be?

Can someone help me fix my TensorFlow installation so it can see the GPU?

I've been trying to fix this for over a week. I'm running the official template with PyTorch 2.1.0 and CUDA 11.8...
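
A quick way to narrow this down is to check what TensorFlow itself reports about the GPU. A minimal diagnostic sketch, assuming a plain TensorFlow pip install on the PyTorch/CUDA 11.8 template; the `tensorflow[and-cuda]` extra mentioned in the comments is one common fix, not an official RunPod recommendation:

```python
# Diagnostic: report TensorFlow's version, whether the wheel was built with
# CUDA, and which GPUs it can actually see. If the GPU list is empty, the
# installed wheel usually can't find matching CUDA/cuDNN libraries; the
# "tensorflow[and-cuda]" pip extra (TF 2.14+) bundles its own.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Built with CUDA:   ", tf.test.is_built_with_cuda())
print("Visible GPUs:      ", tf.config.list_physical_devices("GPU"))
```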

save state of pod to persistent storage?

Hi, once I'm done training with a pod, is there a way to save my storage/current state off to longer-term storage, so I don't have to go through setting everything up again via SSH when I do my next training session?...
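
One common approach (not an official workflow) is to copy whatever is expensive to rebuild onto a persistent volume before terminating the pod. A minimal sketch, assuming a network or pod volume mounted at /workspace and placeholder source paths:

```python
# Archive directories worth keeping onto the persistent volume, so the next
# pod can restore them instead of rebuilding everything from scratch.
import tarfile

SOURCES = ["/root/my-project", "/root/.cache"]   # placeholder paths to preserve
ARCHIVE = "/workspace/pod-state-backup.tar.gz"   # /workspace assumed to be the persistent volume

with tarfile.open(ARCHIVE, "w:gz") as tar:
    for path in SOURCES:
        tar.add(path, arcname=path.lstrip("/"))
print("Saved", ARCHIVE)
```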

There's inconsistency in performance (Pod)

Hello. I rent and operate 20 RTX 4090 GPUs all day long. However, there are significant differences in inference speeds. Each line in the table in the attached image represents 2 RTX 4090 GPUs. One pair processes 150 images in 3 minutes, but the rest only process 50-80 images. On my own RTX 4090 2-way server that I purchased directly, the throughput is 180 images processed in 3 minutes. I haven't been able to figure out why these speed differences are occurring. The inference task is generating one image....

Pod's connection is less stable than the Tower of Babel

I'm trying to use Ollama in a container on RunPod as a pod, and I keep running into connection errors over and over again. I've tried different pods, Secure Cloud vs. Community Cloud, and different GPUs, but I keep getting timeouts like this:
ResponseError: <!DOCTYPE html> <!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]--> <!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]--> <!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]--> <!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]--> <head> <title>3k27lkqzwstw36-11434.proxy.runpod.net | 524: A timeout occurred</title>
...
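
The 524 is the proxy giving up while waiting on one long-running response. A workaround sketch (not a guaranteed fix) is to stream the Ollama response so data keeps flowing over the proxied connection instead of blocking until generation finishes; the URL and model name below are placeholders:

```python
# Stream tokens from the Ollama REST API (/api/generate with "stream": true)
# so the proxied connection isn't idle long enough for the proxy to time out.
import json
import requests

url = "https://<pod_id>-11434.proxy.runpod.net/api/generate"  # placeholder pod URL
payload = {"model": "llama3.1:8b", "prompt": "Hello!", "stream": True}

with requests.post(url, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            chunk = json.loads(line)                      # one JSON object per line
            print(chunk.get("response", ""), end="", flush=True)
```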

Two pods disappeared from my account

After a two-week hiatus from RunPod I returned, frustrated, to find that at least two (on-demand, Secure Cloud) pods are missing from my account. These take time, effort, and money to set up, and I was happily paying their storage costs. My account is set up to autopay and roll over, so the balance is always > $100 (i.e., this is not a non-payment issue). There have been no reported storage outages AFAIK, and my audit logs show no activity whatsoever between 8/17, when I last used RunPod, and today. Billing, however, indicates one was dropped on the 23rd and another on the 24th. Can anyone shed some light on what's going on here, and ideally help me restore my missing pods? Similar issues, for reference:...

Why can't I find any ComfyUI templates on Explore?

I keep searching for templates on Explore and I can't find any ComfyUI templates.

ComfyUI API 401 Unauthorized

I installed ComfyUI on RunPod, and it works fine in the browser. However, when I try to access the same URL through axios to use the ComfyUI API, I get 401 Unauthorized. What might be the problem?...

Spot

The pricing of Spot is really tempting. As a student with little money, the price seems very cost-effective, but instances get taken away too quickly. What are you all doing with Spot? Sometimes it can only be used for 5 minutes.

Should I be able to launch a pod using nvidia/cuda Docker images?

I am trying to start a pod using nvidia/cuda:12.6.0-cudnn-runtime-ubuntu24.04 (to get both CUDA and cuDNN). I'm not a Docker expert, but should that work? The pod appears to start, but the licensing message keeps looping in the logs, and I can't SSH into the pod. Any ideas? Thanks....
Solution:
No, you need to set it up yourself and add a sleep infinity command (or some program that keeps running in the main thread) to make sure the container isn't looping between start and stop.
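
For the "program running in the main thread" option, something as trivial as the sketch below works; it is just a stand-in for sleep infinity so the container has a long-lived foreground process:

```python
# keepalive.py - trivial foreground process so the container doesn't exit
# (and therefore doesn't loop start/stop) when the image has no long-running
# default command of its own.
import time

if __name__ == "__main__":
    while True:
        time.sleep(3600)  # sleep in long intervals, forever
```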

Connecting to Pod- Web Terminal Not Starting

Good day, I am using my first pod and I have virtually no Linux skills. I created a pod and it's running fine. When I first created it, I connected via "Start Web Terminal", then clicked Connect, and everything worked fine for about 30 minutes. Then it said I was disconnected. I tried to start the web terminal again, but it doesn't start and I cannot connect. I also tried connecting via SSH from my Windows box, but it's asking me for the root password and I have no idea how to determine what that is....

Am I downloading directly from HuggingFace when I download models?

When I download a model from Hugging Face, am I using up their bandwidth, or does RunPod have some cache server that sits between my pod and Hugging Face? I feel bad downloading from Hugging Face; bandwidth isn't free for them and all that.
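
If the concern is repeated downloads, one mitigation (an assumption about your setup, not a statement about RunPod's infrastructure) is to point the Hugging Face cache at a persistent volume so weights are fetched once and reused across pods. A sketch, assuming a volume mounted at /workspace and using gpt2 as a stand-in model id:

```python
# Put the Hugging Face cache on the persistent volume so later pods reuse it
# instead of re-downloading from the Hub. HF_HOME must be set before the
# first huggingface_hub / transformers import in the process.
import os
os.environ["HF_HOME"] = "/workspace/hf-cache"

from huggingface_hub import snapshot_download
snapshot_download("gpt2")  # example model id; files land under /workspace/hf-cache
```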

Not 1:1 port mappings for multinode training

Hi, I am trying to run multinode distributed training over multiple machines, but it isn't working. I think this is because when I use the torchrun (torch.distributed) command and specify the port, if I choose the internal port, the other machines send data to the wrong external port; if I choose the external port, my master node doesn't listen on the correct internal port. Is this a problem other people have had, and is there a solution? Thanks in advance!...
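
For reference, this is roughly where the port enters the picture. A minimal sketch of a manual rendezvous (address and port are placeholders); the problem described above is exactly that the port workers can reach (external) and the port the master can bind (internal) are not the same number on RunPod pods:

```python
# Manual process-group init: MASTER_ADDR/MASTER_PORT must point at an address
# and port the *worker* nodes can reach, while the master must actually be
# listening there. RANK and WORLD_SIZE are normally set by torchrun.
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "203.0.113.10")  # placeholder master address
os.environ.setdefault("MASTER_PORT", "29500")         # placeholder rendezvous port

dist.init_process_group(
    backend="nccl",
    rank=int(os.environ.get("RANK", 0)),
    world_size=int(os.environ.get("WORLD_SIZE", 1)),
)
```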

How to override ollama/ollama image to run a model at startup

Hi, I'm trying to run pods using the Ollama template (ollama/ollama) and want to override the default template so that, during pod creation, it serves the model I want. I tried putting the command ./bin/ollama serve && ollama run llama3.1:8b into the "container start command", but it doesn't work. Any way to do this? Thanks!...
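
One reason that exact command fails is that ollama serve blocks, so the && part never runs. A sketch of a small entrypoint that works around this; the model name is from the post, and the 10-second wait is a crude assumption rather than a documented startup time:

```python
# entrypoint.py - start the Ollama server in the background, give it a moment
# to come up, warm up the desired model once, then keep the server process as
# the long-lived main process so the container stays alive.
# Assumes the ollama binary is on PATH, as in the ollama/ollama image.
import subprocess
import time

server = subprocess.Popen(["ollama", "serve"])   # serve blocks, so run it in the background
time.sleep(10)                                   # crude wait for the API to become ready
# Passing a prompt makes `ollama run` non-interactive: it loads the model,
# generates once, and exits, leaving the model available to the server.
subprocess.run(["ollama", "run", "llama3.1:8b", "hello"], check=False)
server.wait()                                    # keep the container's main process alive
```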

How to send a request to a pod?

Hello, for various reasons I have a Docker image with a Flask API that I want to run on a pod. The problem is that I can't send any requests to it: locally everything works, but as soon as I put it on a pod it breaks. First of all, I can't get a public IP address, so I thought I'd go through https://{pod_id}-{internal_port}.proxy.runpod.net/ping, but my requests still don't work. So I tried using nginx to redirect requests to my container's internal port, but I'm having a bit of trouble with nginx too; it doesn't seem to work. Is there something I've misunderstood about how to use RunPod?...
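
One frequent gotcha with this setup (not necessarily the poster's exact issue) is that Flask's development server binds to 127.0.0.1 by default, which nothing outside the container can reach. A minimal sketch, where port 5000 is a placeholder for whatever port the pod exposes over HTTP:

```python
# app.py - bind to 0.0.0.0 on the port exposed as an HTTP port for the pod,
# so https://{pod_id}-{port}.proxy.runpod.net/ping can actually reach it.
from flask import Flask

app = Flask(__name__)

@app.route("/ping")
def ping():
    return "pong"

if __name__ == "__main__":
    # 0.0.0.0 makes the server reachable from outside the container;
    # 5000 is a placeholder and must match the exposed HTTP port.
    app.run(host="0.0.0.0", port=5000)
```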

Stopped Pod Price

When we use runpod.stop_pod(pod['id']) to stop a pod, and the pod's status becomes "stopped", how is the pod billed in this state? Is the GPU resource fee still charged?

Looking for best A1111 Stable Diffusion template

Anyone know of any custom templates for Stable Diffusion A1111 that have the ADetailer and ControlNet extensions pre-installed?

No A40s available

Been checking all throughout the day, but no A40s are available. Anyone know why?

Community Cloud spot pod

A spot instance suddenly and automatically switched to an on-demand instance. Is this normal? Also, downloading Docker images often fails or becomes slow (the variance in download speed each time a pod is created is too large; it depends on luck). Is this normal? Is there a way to minimize this? I host an average of 20 RTX 4090 instances for about 12 hours a day, automatically removing or adding pods to match demand. I'm curious about situations where Docker image downloads suddenly fail, and about the behavior of spot instances...

Does the pod hardware differ a lot in the US?

Hi, we deployed several times in the US region (Secure Cloud) with the RunPod CLI, but the inference performance/speed differs a lot, and even the model loading time differs a lot. What's the reason? And how do I know which data center I'm using? It only shows 'US'. Thanks...
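
One way to tell deployments apart (a diagnostic sketch, not an official mapping of data centers) is to dump whatever RunPod-prefixed environment variables the pod sets, along with the GPU model, and compare them across deployments:

```python
# Print RunPod-related environment variables (the exact names vary by
# template/platform version, so just filter on the prefix) plus the visible
# GPU, to compare otherwise identical-looking "US" pods.
import os
import torch

for key in sorted(os.environ):
    if key.startswith("RUNPOD"):
        print(f"{key}={os.environ[key]}")

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```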