RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


Changed log output on the RunPod website

We are using FastAPI in one of our applications on your pods. For the past couple of days, the FastAPI log output has not been displayed in the website's log window; to see the log output I now have to start FastAPI from a terminal. Have there been recent changes to the way log files are displayed on the RunPod website?...

How do I find my network volume with runpodctl?

Network outage, please fix it

My pod is not working, please fix it.

Cannot see logs on my pods

I can only see the queue time but cannot see logs on my pods. Is anyone else facing this issue as well?

Storage Pricing

How is storage pricing calculated? Is it billed as one monthly charge, per minute like pods, or per day?
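If storage is prorated from a monthly rate the way pod compute is prorated per minute, the arithmetic is simple. A minimal sketch under that assumption; the rate below is a placeholder for illustration, not RunPod's actual price:

```python
RATE_PER_GB_MONTH = 0.10          # placeholder rate, NOT RunPod's real pricing
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day billing month

def storage_cost(gb: float, minutes: float) -> float:
    """Prorate a per-GB-per-month storage rate down to the minute."""
    return gb * RATE_PER_GB_MONTH * minutes / MINUTES_PER_MONTH

# 100 GB held for a full 30-day month at the placeholder rate:
full_month = storage_cost(100, MINUTES_PER_MONTH)  # 10.0
# The same 100 GB held for a single day:
one_day = storage_cost(100, 24 * 60)               # ~0.33
```

Check the pricing page for the real per-GB figures (running vs. stopped volumes are typically priced differently); only the proration logic is shown here.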

Any network issues in EU-RO-1?

My git clone is running at 32 KiB/s and I can't copy from S3 (it's very slow). apt-get is also slow (same speed as git). But downloading files seems to work as expected (I got 33 MiB/s)...

I'm seeing 93% GPU Memory Used even in a freshly restarted pod.

Not sure what to do about this. nvidia-smi shows there are no processes running, but when I try to run a job it shows "Process 1726743 has 42.25 GiB memory in use". How do I find and kill that?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB. GPU 0 has a total capacity of 44.52 GiB of which 18.44 MiB is free. Process 1726743 has 42.25 GiB memory in use. Process 3814980 has 2.23 GiB memory in use. Of the allocated memory 1.77 GiB is allocated by PyTorch, and 53.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...
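On the fragmentation hint at the end of that traceback: the allocator config has to be in the environment before PyTorch initializes CUDA, so setting it inside an already-running session has no effect. A minimal sketch (the variable name and value come straight from the error message):

```python
import os

# Set this at the very top of the script, or in the pod template's
# environment variables, BEFORE torch is imported and CUDA initializes.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# ...then: import torch
```

Note this only helps with fragmentation of memory your own process allocates; it will not reclaim the 42 GiB held by the other PID.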

Persistence of pod logs from my training

I started my pod instance, attached to a volume where my dataset is located, and cloned my repository from GitHub using the VS Code integration. I left home and my laptop went into sleep mode. When I came back, my training had stopped and the session was disconnected.

Custom template

Hi there! I'm trying to build my custom CPU Docker-based template, but something is wrong. Locally the image starts fine and I don't have any problems, but the same image won't run as a pod. I'm wondering what I'm doing wrong, because it is a really simple app ...

Help Request: ODM Container Only Using CPU

Has anyone tried to deploy an ODM processing node in a pod before? https://github.com/OpenDroneMap/NodeODM How do I add the --gpus all flag to the pod?...

GraphQL Schema

Hi there, is it possible to get RunPod's GraphQL Schema or enable introspection? I need it for an integration I'm currently working on. 🙂...
Solution:
nope

How do savings plans work?

Could someone clarify how savings plans work? The documentation is quite limited. I understand that they help reduce costs over a set period, but I'd like to know whether, when I get a savings plan for a pod, it guarantees access to the same GPU for the entire reservation. If I stop my pod for some reason, do I have to rebuild it, or can I simply restart it?...

502

Hello, we are having trouble with a 502 error. We are running ComfyUI with runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04. Our port 8188 is still running, and we can also send a GET request to port 8188...

Decommissioning on November 7th

I received this email: "We are reaching out because you currently have serverless workers or pods running in the EUR-NO-1 data center, which is scheduled for decommissioning on November 7th. This change is part of our efforts to upgrade capacity, enhance the network, and improve other infrastructure." What actions should I take if I'm currently running a pod with a savings plan? How do I restore a pod with the same savings plan?...

Lost my GPU

Hello, I stopped my pod and when I came back, I had 0 GPUs available. Should I hope that this machine gets the GPU back, or will it never get it back, meaning I should switch to a new pod?...

Where are default models mounted? I can't find them under /comfy-models

```
root@054f3147d5b1:/# ls -al /comfy-models/
total 4
drwxr-xr-x 2 root root   10 Oct 25 09:17 .
drwxr-xr-x 1 root root 4096 Nov  4 10:00 ..
root@054f3147d5b1:/workspace/ComfyUI/custom_nodes/comfyui_controlnet_aux# df -h...
```

Port forwarding understanding

Greetings, I have been a user of Vast.ai, where they have a list of ports already assigned, and each maps to exactly the same port on your machine. But on RunPod they map to a different one. I have to run a miner and I need to give it two of my ports: should I be telling it my external or internal ports, and how do they map to the internal ones? I am also attaching a picture of Vast's ports and yours as well...
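On reading the mapping programmatically: a small sketch, assuming RunPod publishes the external mapping inside the pod via RUNPOD_PUBLIC_IP and RUNPOD_TCP_PORT_<internal> environment variables (worth confirming with `env | grep RUNPOD` in your own container, since the exact names are an assumption here):

```python
import os

def external_endpoint(internal_port: int, env=None):
    """Return the public host:port for an internal pod port, or None.

    Assumes RunPod exposes the mapping via RUNPOD_PUBLIC_IP and
    RUNPOD_TCP_PORT_<internal> env vars -- verify the names in your pod.
    """
    env = os.environ if env is None else env
    host = env.get("RUNPOD_PUBLIC_IP")
    port = env.get(f"RUNPOD_TCP_PORT_{internal_port}")
    return f"{host}:{port}" if host and port else None

# The miner should be given the *external* pair; the service itself
# still binds to the internal port inside the container.
print(external_endpoint(8188))
```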

Problems starting my pod with and without GPU.

Container logs (ID: tb7bqtktnwh9gy):
```
2024-11-02T18:47:01.634671114Z [SSH] Configuring SSH to allow root login with a password...
2024-11-02T18:47:01.720536800Z  * Starting periodic command scheduler cron
2024-11-02T18:47:01.809391559Z    ...done.
2024-11-02T18:47:01.926771417Z  * Restarting OpenBSD Secure Shell server sshd...
```