RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⛅|pods

GPU requires reset

Restarted and re-created the pod a couple of times, getting the same error on container start. I assume it keeps grabbing the same bad node. I was able to start the container by switching to a different instance type. The error: 2024-08-26T21:15:45Z error creating container: nvidia-smi: parsing output of line 5: failed to parse ([GPU requires reset]) into int: strconv.Atoi: parsing "": invalid syntax. Pod ID: 2hvpqmtrowunjp...

Problems updating admin passwords on kasm image

I'm trying to change the default admin and kasm passwords on a Kasm instance using the image runpod/kasm-docker:cuda11. Once the pod is running, I log in via SSH and successfully use passwd to change the admin password. Then I successfully change the kasm password using vncpasswd -u kasm_user. When I then log in using Kasm, the login succeeds, but the screen is completely gray and the cursor doesn't appear. Something's broken and I have no clue what it is. ...

No A40s available?

I have my pod on an A40 with a lot of material I've downloaded onto it, but the A40 GPUs have been taken up all night. Is there any way to quickly transfer all my downloaded material to another pod, or will the lack of availability be solved quickly?...

kernel dying issue.

Starting today, the kernel has suddenly stopped working properly, and it keeps dying or failing to run. I need to quickly check the results, but all my work has come to a halt. I need a quick response regarding this kernel dying issue.

Running out of disk space

I am trying to load a large dataset to train my model. How do I increase the available disk space of my pod?

Interested in multinode training on RunPod

Hi guys, my team is interested in using RunPod for multinode training. We are looking for 24-96 A100s for larger-scale model training. Do you guys currently support this?

Continuous Deployment for Pods

Hello, I recently transitioned from using Serverless Endpoints to Pods, but I'm encountering issues with my existing build and deployment workflow. Previously, with Serverless Endpoints, I had a setup in GitHub where I used GitHub Actions workflows to build container images, push them to my registry, and update the template image reference via the GraphQL API. When I updated the template, the endpoint would automatically restart and pull the new image. However, with Pods, this behavior doesn't seem to work the same way. Even after updating the template, the Pod continues running the "old" image and doesn't refresh automatically. Could you suggest a method to trigger a dynamic update or replacement of the Pod? Additionally, are there any other deployment strategies you recommend for my situation? I appreciate your assistance! ...

Production pod suddenly unreachable, how long can I expect this to last for? (Please provide ETA)

Hi, I have an On-Demand Secure Cloud pod that runs the backend for my app. My app is now not working, and the pod has the message in the screenshot. How long can I expect this to last for? Minutes? Hours?

Test Support Thread

Test Support Description

Maximum number of A40s that can run at one time

I'm looking to run as many A40s as possible to finish a large-scale inference/LLM generation job. How many could I run at one time? 40, 80, 100?

Cannot SSH over exposed TCP (multiple pods, tested from different local machine)

Hi @here, I cannot SSH over exposed TCP, but I am able to use basic SSH. I suspected my Docker image at first, but I have the same issue with multiple Docker images. I tested it from multiple local machines. This is the verbose error message: debug1: Reading configuration data ~/.ssh/config...

Does RunPod support repos other than Docker Hub?

Wondering if we can use AWS or GitHub as an alternative.

Persistent container disk

Is there a way to make the container disk mounted at / persistent for a pod instead of the additional drive at /workspace or whatever?

How to avoid Cloudflare timeouts on pods?

I saw a previous post mentioning using the public IP, but it doesn't seem to work for me? I'm using RunPod to host a vLLM server (the serverless endpoint doesn't work for me). I'm running batch workloads and those time out (Cloudflare)...

Environment variables in direct SSH

Is there a way to access environment variables defined in the web app in an SSH connection over an exposed TCP port?
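
One generic Linux approach that may apply here, sketched in Python. It is not RunPod-specific and not confirmed by the thread: it assumes the variables are set on the container's main process (PID 1) and that the SSH user is allowed to read /proc/1/environ. The helper names `parse_environ` and `read_process_env` are hypothetical.

```python
from pathlib import Path


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE format used by /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue  # skip the empty trailing entry and malformed items
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_process_env(pid: int = 1) -> dict:
    """Read another process's environment (assumption: PID 1 holds the pod's variables)."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())
```

Inside an SSH session, `read_process_env().get("MY_VAR")` would then return the value if the assumption holds; `MY_VAR` is a placeholder name.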

How does RunPod handle pod termination?

It is very likely that RunPod simply sends a SIGKILL to the main container process. This is really annoying when you are trying to handle termination. Could you please provide information on how your orchestration system handles pod termination and how I can get the OS signal?
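
For context on what a process can and cannot handle: SIGKILL can never be caught, while SIGTERM can. Below is a minimal Python sketch of a SIGTERM handler. It assumes, without confirmation from the thread, that the platform sends SIGTERM to the container's main process before any SIGKILL. The class name `GracefulShutdown` is hypothetical.

```python
import signal
import threading


class GracefulShutdown:
    """Sets a flag when SIGTERM or SIGINT arrives so the main loop can exit cleanly."""

    def __init__(self):
        self.stop_requested = threading.Event()
        self.received = None  # number of the last signal seen, if any

    def install(self):
        # Must be called from the main thread. SIGKILL cannot be registered here.
        for sig in (signal.SIGTERM, signal.SIGINT):
            signal.signal(sig, self.handle)

    def handle(self, signum, frame):
        self.received = signum
        self.stop_requested.set()


if __name__ == "__main__":
    shutdown = GracefulShutdown()
    shutdown.install()
    # wait() returns True once the flag is set, which ends the loop.
    while not shutdown.stop_requested.wait(timeout=1.0):
        pass  # do one unit of work per iteration
    print(f"received signal {shutdown.received}, cleaning up")
```

If the handler never fires, one thing worth checking is whether the script is really the container's main process: when the entrypoint is a shell wrapper, the signal may reach the shell rather than the Python process.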

KoboldCpp - Official Template broken

I've tried to launch the KoboldCpp template a few times, but am hitting errors. The model I want to use downloads in two parts (split with commas in the launch arguments). The downloads finish and append, but the logs show "rm: cannot remove './mmproj.gguf': No such file or directory" right before it finishes. The container then restarts and the downloads begin again from square one. These same models worked last week. I have saved the full logs if needed.

Secret not showing up in the pod `env` output

Hi, I added some secrets and set them as environment variables for my pod, but I can't see them when I run env in my pod. I'm using {{ RUNPOD_SECRET_secret_name }} as the environment variable value...

Transfer data from a stopped pod to a new one

Hey, I finished my training on a big pod and I want to share all the data with another pod using the storage (network volume). How can I do that?