RunPod

Join the community to ask questions about RunPod and get answers from other members.

Just found a security issue on RunPod

Hi guys, I've just found a security issue on RunPod. Where can I report it? Does RunPod have a bug bounty program?

Is it possible to get pod logs from the REST API or GraphQL?

Why am I unable to connect to an HTTP server?

For some reason this is now happening randomly when I try to connect. It was working fine in a previous pod. I can't even start the pod, because the start button doesn't work either: it tries to connect and then goes back to the green start button. This happens on every pod I open.

How do I install a model with KoboldAI?

I am having trouble. My pod has been stuck at this step for a while. It downloaded a 64 GB model without issues, but once it started loading model tensors it stopped at 330/664. What can I do to fix this? I am lost. The loading bar is still running....

L40 Thermal throttling

We noticed an occasional big slowdown when running our models: a calculation that normally takes 10-15 seconds takes 90-120 seconds.
Test run on pod 8hh03rby46hd8s:
- when power draw goes to ~300W and SM usage to ~100%, GPU clock drops from 2490 MHz to 1650 MHz
- as soon as power draw drops to the base of ~80-90W, GPU clock goes back to full speed
We're getting 65% of the performance of the desired GPU ...
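A minimal sketch of how readings like these could be collected and compared, assuming `nvidia-smi` is available in the pod and that the driver supports the `clocks.sm`, `power.draw`, and `utilization.gpu` query fields. The helper names (`parse_sample`, `clock_ratio`, `read_sample`) are illustrative and not part of any RunPod tooling.

```python
import subprocess
import time

# Fields passed to `nvidia-smi --query-gpu`. Assumption: the installed driver
# reports all three; some GPUs return "[N/A]" for power, which would fail to parse.
QUERY_FIELDS = "clocks.sm,power.draw,utilization.gpu"


def parse_sample(line):
    """Parse one CSV line such as '1650, 300.12, 100' into a dict of floats."""
    clock_mhz, power_w, util_pct = (float(x) for x in line.split(","))
    return {"clock_mhz": clock_mhz, "power_w": power_w, "util_pct": util_pct}


def clock_ratio(loaded_mhz, idle_mhz):
    """Fraction of the unloaded SM clock that is sustained under load."""
    return loaded_mhz / idle_mhz


def read_sample(gpu_index=0):
    """Query a single GPU once and return the parsed reading."""
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            f"--query-gpu={QUERY_FIELDS}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])


if __name__ == "__main__":
    # Print one reading per second while the workload runs in another process.
    for _ in range(10):
        print(read_sample())
        time.sleep(1)
```

With the clocks quoted above, `clock_ratio(1650, 2490)` is roughly 0.66, which is in line with the ~65% figure reported. The sketch only records the drop; it does not tell you whether the cause is a thermal limit or a power cap.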

New to RunPod. Have never even coded in my life

Hey guys, I just signed up for RunPod, and when I hit the Explore button it only shows me "official templates". No community templates pop up. I want to create portraits in Flux.

Question about the price of GPU pods

Hey, I had a question about the pricing of the pods. I tried another hosting service, but they charged me even when the server had only been created and was down, and I was wondering if RunPod will do the same?

Runpod occasionally fails to pull from ECR

Every now and again I have issues starting a pod as it fails to pull from AWS ECR. Nothing in my setup changes.
```
error pulling image: Error response from daemon: Head "https://<AWS_ACCOUNT>.dkr.ecr.<region>.amazonaws.com/v2/<repo>/manifests/<container>": no basic auth credentials
error creating container: container: create: container create: Error response from daemon: No such image: <AWS_ACCOUNT>.dkr.ecr.<region>.amazonaws.com/<repo>:<container>
create container <aws_account>.dkr.ecr.<region>.amazonaws.com/<repo>:<container>...
```
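One possible cause (an assumption, not confirmed in this thread) is an expired registry credential: ECR authorization tokens are valid for 12 hours, so a stored password that worked earlier can later fail with `no basic auth credentials`. A minimal sketch of fetching a fresh login with `boto3`, assuming AWS credentials are already configured where the script runs; `decode_ecr_token` and `fetch_ecr_login` are hypothetical helper names.

```python
import base64


def decode_ecr_token(authorization_token):
    """Split a base64 ECR authorization token into (username, password)."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password


def fetch_ecr_login(region):
    """Request a fresh ECR token and return the registry login details."""
    import boto3  # third-party; imported lazily so the decoder works without it

    client = boto3.client("ecr", region_name=region)
    data = client.get_authorization_token()["authorizationData"][0]
    username, password = decode_ecr_token(data["authorizationToken"])
    return {
        "username": username,
        "password": password,
        "registry": data["proxyEndpoint"],
        "expires_at": data["expiresAt"],  # time after which the token is rejected
    }


if __name__ == "__main__":
    login = fetch_ecr_login("us-east-1")  # placeholder region
    print(login["registry"], "token expires at", login["expires_at"])
```

The returned username and password would then need to replace whatever registry credentials the pod uses for the pull; how those are stored is outside what this sketch covers.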

Dead instance on launch

This has happened to us a few times recently. We launch an instance and it is fully dead after the image is pulled. We can't connect to it, and there are no CPU, memory, or other stats. The terminal gives no options.

GPU seems to have stopped...logs don't show any errors, but there is no activity

The pod ID is sluqyzp1j6z48n. My network volume is attached to this location, but I keep having issues with the A6000....

Migrate pod volume to Network volume

Hello, I'd like to create a network volume to avoid the unavailable-GPU issue, but I have already installed some things in my current pod. Is there a migration option somewhere?

Unable to modify owner of network volume

Hey all, I'm attempting to create a network volume and mount it at /home inside the pod so that I can create a user home directory. However, I am unable to change the owner away from root.

Can't run extensions in Stable Diffusion

I have been sitting here for 10 hours trying to use any extension in the official Stable Diffusion pod, and I can't. They don't show up in the tabs, but I can see them in the list of extensions 😦 Any help? 😩

CUDA not connecting to image provisioned for GPU

Started a community pod with 1 GPU (4090) using the RunPod PyTorch image/template (runpod/pytorch:2.4.0-py3.11-cuda12.4). Immediately after starting the pod, the GPU is unavailable even though nvidia-smi seems to see it. This happens about 20% of the time I start images with this official container. No errors are thrown in the system or container logs.
root@5c367a0d4ea2:/# python -c "import torch; print(torch.cuda.is_available())"
/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0...

Requests using RUNPOD_API_KEY fail with 403 Forbidden.

Hello, I'm experimenting with using RunPod for running a bunch of one-off jobs. According to the pods environment variables page, RUNPOD_API_KEY is an API key for making API calls scoped to the specific job. Basically, I want to terminate (or at least shut down) the pod once it is done with its task. However, when I make a call to the REST API, I get 403 Forbidden and an empty response body....
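A minimal sketch of one way to shut a pod down from inside once its job finishes, without calling the REST API directly. It assumes `runpodctl` is installed in the container and that the `RUNPOD_POD_ID` environment variable is set; `build_stop_command` and `stop_current_pod` are hypothetical helper names. This sidesteps the 403 rather than explaining it.

```python
import os
import subprocess


def build_stop_command(pod_id):
    """Return the argv for stopping a pod with the runpodctl CLI."""
    return ["runpodctl", "stop", "pod", pod_id]


def stop_current_pod():
    """Stop the pod this script is running in (assumes RUNPOD_POD_ID is set)."""
    pod_id = os.environ["RUNPOD_POD_ID"]
    subprocess.run(build_stop_command(pod_id), check=True)


if __name__ == "__main__":
    # ... run the one-off job here, then shut the pod down ...
    stop_current_pod()
```

Stopping is not the same as terminating: a stopped pod still exists and can still incur storage charges, so a full teardown would need the corresponding remove command instead.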

Run commands remotely on my pod

Hi, I've been trying for an hour to run a bash command on my pod via Python. Nothing seems to work. I tried Fabric and Paramiko. runpodctl has this command:
$ runpodctl exec python /ru.py --pod_id <redacted>
Running remote Python shell...
Waiting for Pod to come online...
But it just hangs there and does nothing...
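A minimal sketch of running a command over plain SSH from Python with only the standard library, as an alternative to Fabric or Paramiko. It assumes the pod exposes a direct SSH port and that your public key is already authorized on it; the host, port, and key path below are placeholders to be replaced with the values from the pod's connection details, and `build_ssh_command` and `run_remote` are hypothetical helper names.

```python
import os
import subprocess


def build_ssh_command(host, port, key_path, remote_command, user="root"):
    """Build the argv for a non-interactive ssh call."""
    return [
        "ssh",
        "-i", os.path.expanduser(key_path),
        "-p", str(port),
        "-o", "BatchMode=yes",  # fail instead of prompting for a password
        f"{user}@{host}",
        remote_command,
    ]


def run_remote(host, port, key_path, remote_command, timeout=60):
    """Run a command on the pod and return (exit_code, stdout, stderr)."""
    result = subprocess.run(
        build_ssh_command(host, port, key_path, remote_command),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Placeholder connection details, not a real pod.
    code, out, err = run_remote("203.0.113.5", 22022, "~/.ssh/id_ed25519", "nvidia-smi")
    print(code, out or err)
```

The `timeout` makes a stuck connection raise `subprocess.TimeoutExpired` instead of hanging indefinitely, which makes it easier to tell a connectivity problem from a slow command.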

Flux Gym

Hi, I'm running the FLUXGym template and all seems to be working fine, but I realized there is no way to access the directory structure, just the web interface. How do I get to the files to do some maintenance on the volume?

HTTP bad gateway error

I'm getting this error when I click the HTTP service. Any clues as to why I have this error?...

LLM training process killed/SSH terminal disconnected, seemingly at random, no CUDA/OOM error in log

I have been trying to keep my LLM finetuning process alive, unsuccessfully. I am using 4 V200 GPUs w/ PyTorch FSDP. The process tends to crash when saving checkpoints, BUT not always. I removed the checkpoints and now it's crashing in the middle of the training loop, somewhat randomly. This is what's in my nohup.out: {'loss': 0.1151, 'grad_norm': 2.4503021240234375, 'learning_rate': 5.616492701703402e-07, 'mean_token_accuracy': 0.9721812009811401, 'epoch': 3.42}...

2 GPUs but only one works

I have 2 x RTX A5000. I run a notebook and get a "not enough memory" error. When I go to the dashboard, only one GPU is being used. How can I use both in the same notebook?