RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⛅|pods

GPU Pods in EU-SE-1 unexpectedly die after approximately 30 hours

We are seeing many GPU pods (mainly A6000) stop working after about 30 hours, also losing their VRAM contents. We have reported this repeatedly, but there is still no fix and it keeps happening. We have left a pod running (ID: cxquttq3m3kqvl) for you to debug. Can you please help? Thanks...

cpu instances don't work

2024-06-05T20:19:37Z create container runpod/base:0.5.1-cpu
2024-06-05T20:19:38Z 0.5.1-cpu Pulling from runpod/base
2024-06-05T20:19:38Z Digest: sha256:7530e77d6014bd6f3e1939b8d9003d8f7d2bd35a98395c4d297ac3b7a6d05b85
2024-06-05T20:19:38Z Status: Image is up to date for runpod/base:0.5.1-cpu
2024-06-05T20:20:38Z error creating container: container: create: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.43/containers/2fc1150401eeace7c2f58423e071f9686d6faaa89c28c7f50cf249b8b3f5ada4/start": context deadline exceeded...

Networking Multiple Pods Together

I'm looking to train a distributed model on RunPod. When configuring torch.distributed or jax.distributed, you provide a coordinator_address of the form ip:port. Right now I'm unable to confirm that two pods can communicate with one another. I start one pod, expose a 70000 level port, SSH into it, run ip route to get the local IP, then start a simple Python server with python -m http.server 70000. Then I SSH into the other pod and run curl <pod_1_local_ip>:<pod_1_70000_port>. This consistently fails. My intuition is that the Docker containers don't belong to the same network. To my knowledge, we users don't have the privileges to set up such a network on the datacenter's machines; we can only modify containers on a one-off basis. Any guidance on enabling communication between pods would be greatly appreciated!...
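One detail worth checking before blaming the network: TCP port numbers are 16-bit, so the valid range stops at 65535 and a port such as 70000 cannot be bound at all. The sketch below is a minimal, standard-library-only way to separate "invalid port" from "no route between pods". The helper names (`is_valid_port`, `start_listener`, `can_connect`) are hypothetical, and nothing here says whether RunPod actually routes traffic between two pods; it only reports what the socket layer sees.

```python
import socket

MAX_TCP_PORT = 65535  # TCP ports are 16-bit, so 70000 is out of range


def is_valid_port(port: int) -> bool:
    """Return True if `port` is a usable TCP port number."""
    return 1 <= port <= MAX_TCP_PORT


def start_listener(host: str = "0.0.0.0", port: int = 0) -> socket.socket:
    """Open a listening TCP socket; port 0 lets the OS pick a free port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen()
    return server


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Try a TCP connection and report whether it succeeded."""
    if not is_valid_port(port):
        raise ValueError(f"port {port} is outside 1-{MAX_TCP_PORT}")
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

On the first pod you would call `start_listener(port=<some valid port>)`, and on the second pod `can_connect(<pod_1_ip>, <that port>)`. A `False` result with a valid port points at routing or firewalling rather than at the server process.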

Docker Image For RunPod Pytorch 2.0.1 Template

Hello, I'm trying to create a custom template which just adds a daemon to the official RunPod Pytorch 2.0.1 template. How can I find the Docker image that is deployed with this template?...

Can I use torch2.3.0 + cuda 11.8 on Runpod?

I want to upgrade my torch version to 2.3.0. Will it work on RunPod?

is cuda not working?

It gets stuck here forever... Please help
Solution:
Not sure what caused the problem. I solved it by deploying another instance on the community cloud with the template runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04...

Not able to start Nginx

I logged in via Basic SSH and installed nginx, but I'm not able to curl it. Please help me resolve this. ```...
Solution:
This is resolved. The template's default nginx.conf had been changed, so it didn't load the sites-enabled config.

Not able to ssh via "Overexposed SSH"

I am able to log in with the basic SSH, but the overexposed SSH asks me for a password. This is not working: ``` ⬢ ❯ ssh root@xxx -p 13776 -i ~/.ssh/id_ed25519...
Solution:
You can use OhMyRunPod

Can not kill processes

GPU pod - secure cloud
Solution:
I would just reset the pod to kill the processes

container start command

I have created a startup.sh script that I want to use as the start command for my container. The script needs to do two things: 1) Start a Python .py file 2) Keep the container accessible through the web terminal after starting the Python script ...
Solution:
fabulous thank you!
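For readers landing on this thread: the accepted answer is not shown in the preview, so the following is only a generic sketch of the pattern the question describes, written as a Python launcher rather than the poster's startup.sh. The idea is to start the workload as a child process and then keep the foreground process alive so the container does not exit. The names `start_background` and `keep_alive`, and the `main.py` placeholder, are hypothetical.

```python
import subprocess
import sys
import time
from typing import Optional


def start_background(argv: list[str]) -> subprocess.Popen:
    """Launch the workload as a child process without waiting for it."""
    return subprocess.Popen(argv)


def keep_alive(poll_seconds: float = 30.0, max_polls: Optional[int] = None) -> None:
    """Block so the container's main process stays up.

    With max_polls=None this loops forever; a finite value is only
    useful for testing the loop.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(poll_seconds)
        polls += 1


# Example start-up sequence ("main.py" is a placeholder script name):
# proc = start_background([sys.executable, "main.py"])
# keep_alive()
```

Because the launcher never exits, the container stays running even if the Python script finishes or crashes, which is what keeps a terminal session reachable.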

Too many Open Files Error on CPU Pod - Easy Repro

@flash-singh I think I found an easy repro for the too many open files on CPU Pod: 1) Use the following docker: (you don't necessarily need to do this, it's just what I am using for an exact repro) justinwlin/runpod_pod_and_serverless:1.0 ...
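When chasing a "too many open files" error, it can help to compare the process's file-descriptor limit with how many descriptors it actually holds. The sketch below is diagnostic only and assumes a Linux container (it reads `/proc/self/fd`); it does not reproduce or explain the CPU Pod issue reported here. The function names are hypothetical.

```python
import os
import resource


def open_file_limits() -> tuple[int, int]:
    """Return the (soft, hard) RLIMIT_NOFILE limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds() -> int:
    """Count file descriptors currently open in this process (Linux only)."""
    return len(os.listdir("/proc/self/fd"))


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}, open now: {count_open_fds()}")
```

If the open count sits close to the soft limit, the error comes from this process's own descriptors; if it is far below, the limit being hit is probably somewhere else.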

recipes

Hello, I've been trying to look up some recipes on https://docs.runpod.io/recipes . However, it seems to be down. Does anyone know anything about it? Thanks a lot!...
Solution:
Most of those have been moved here: https://docs.runpod.io/sdks/graphql/manage-pods...

how do you create a compatible docker file?

I want to run a custom Dockerfile, but I'm not sure how to make one that's compatible. For example, when I use this to create an image that's saved to my registry, the pod seems to start but I can't connect to it over SSH. I noticed that if I pick an official PyTorch pod I get checkmarks for SSH and Jupyter Lab, but not if I use my custom one. What's the minimal Dockerfile I need to run? ```dockerfile...

Strange unix and/or user perms issue with command in dockerfile/replacement command

I have a bash script in my pod which, as part of its last command, executes mpirun with some target process. When running this command using bash <script> as the dockerfile's entrypoint, or using runpod's replacement command, the following issue occurs: ```2024-06-04T00:27:41.763661289Z Per request, Open MPI attempted to set a system resource 2024-06-04T00:27:41.763672184Z limit to a given value: 2024-06-04T00:27:41.763682241Z ...

Console for kohya_ss / Stable Diffusion

Is there a way to access the console for running processes in prebuilt pods? I am running kohya_ss and Stable Diffusion and would like to see what’s going on “behind the WebUI layer”. Any help is greatly appreciated. 🙂

NVLink support for H100 NVL

When I run nvidia-smi topo -m on a 2x H100 NVL pod, I see a PIX topology between GPU0 and GPU1. Can I use an NVLink connection to interconnect the H100 NVL GPUs? How does PIX (PCIe bridge) performance differ from NVLink?

question

Hello, we have a scheduled downtime to remove a machine and reinstall the entire operating system, and I see that there is a process running on it. I'm not sure what to do: if I format the machine and reinstall the operating system, the running process will of course lose all its data.

How do I raise a support ticket?

I cannot interact with the Email Support button on the website, and I have received no response on Discord either. I submitted feedback a week ago here: https://discord.com/channels/912829806415085598/1243604870074732595 We are scheduled to go live in about a week, and the general lack of support is very concerning....

Cloud Files Updating Backblaze

After I upload my files to Backblaze, if I later add more files to the workspace, is there a way to update the Backblaze cloud with only the new files, without deleting and re-uploading everything?
Solution:
Backing them up again or uploading them manually works.
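If you go the manual route, one way to avoid re-uploading everything is to keep a manifest of file hashes from the last backup and upload only what differs. The sketch below shows just that comparison step using the standard library; it does not call Backblaze or any RunPod feature, and the function names are hypothetical. The upload itself is left to whatever tool you already use.

```python
import hashlib
from pathlib import Path


def build_manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    base = Path(root)
    manifest = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(base).as_posix()] = digest
    return manifest


def files_to_upload(current: dict[str, str], previous: dict[str, str]) -> list[str]:
    """Return paths that are new or whose contents changed since `previous`."""
    return sorted(p for p, digest in current.items() if previous.get(p) != digest)
```

Save the manifest (for example as JSON) after each backup, then pass it as `previous` the next time. Note that `read_bytes` loads each file fully into memory, so very large checkpoints would need chunked hashing instead.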

Pod GPU keeps disconnecting...

I create a pod, and when I finish my work and open it again later, the GPU is not available. I then have to reinstall Fooocus from scratch and lose all my downloaded checkpoints and other files... Is there a way to fix this by keeping my files stored somewhere safe and just connecting them to the pod? How should I do that? Please be as specific as possible, I'm a beginner.