RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

Pod disappeared after yesterday's maintenance

Hello, yesterday I wanted to start my Pod but I got the message that the system was down for maintenance until my (local) midnight. That's fine, but this morning I wanted to try again and my entire pod is gone. I hope it's possible to recover it, because quite some time went into making it. I checked whether it was stale; it was getting close but still had 2 days on that timer, and I have plenty of credits. Bit strange how it can just disappear... I guess I can try to recreate it, but a lot of work went into that one. Hope it can be restored somehow....

How to enable Jupyter Notebook and SSH support in a custom Docker container?

I built my own Docker image to deploy on a pod. After creating a Custom Template with my Docker image, there is no option to enable Jupyter Notebook or SSH for it. I tried my best to imitate the official RunPod containers by installing JupyterLab and openssh-server, but when setting up the pod there is still no option to enable Jupyter Notebook or SSH on the pod. I am also not able to find any guides on how to incorporate Jupyter Notebook support into a custom Docker image....
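
A minimal sketch of the kind of start script the official images use, in case it helps: it installs the SSH key the platform injects, starts sshd, and launches JupyterLab on a port exposed as HTTP in the template. The PUBLIC_KEY and JUPYTER_PASSWORD env var names and the 22/8888 ports are assumptions to verify against the official RunPod base images, not a confirmed recipe.

    #!/bin/bash
    # start.sh - set as the container's start command, e.g. CMD ["/start.sh"]

    # Install the SSH public key injected by the platform (assumed env var: PUBLIC_KEY)
    if [ -n "$PUBLIC_KEY" ]; then
        mkdir -p ~/.ssh && chmod 700 ~/.ssh
        echo "$PUBLIC_KEY" >> ~/.ssh/authorized_keys
        chmod 600 ~/.ssh/authorized_keys
        service ssh start              # requires openssh-server in the image; expose port 22 as TCP
    fi

    # Launch JupyterLab on the HTTP port exposed in the template (assumed: 8888)
    jupyter lab --ip=0.0.0.0 --port=8888 --allow-root --no-browser \
        --ServerApp.token="$JUPYTER_PASSWORD"    # token env var name is an assumption

    # Keep the container alive if Jupyter ever exits, so the pod doesn't restart in a loop
    sleep infinity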

Open ports

I would like to open the ports in my instance; how do I do it?
Solution:
https://docs.runpod.io/docs/expose-ports maybe this doc can help?...
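
In case a concrete example is useful alongside the doc: ports listed under the template's HTTP exposure are reached through RunPod's proxy, while TCP exposure gives a public IP and a mapped port. A rough sketch, assuming a service on port 8000 and the proxy URL format https://<pod-id>-<port>.proxy.runpod.net (verify both against the linked doc):

    # Inside the pod: start something listening on the exposed port
    python3 -m http.server 8000 --bind 0.0.0.0 &

    # From your own machine, via the HTTP proxy (URL format is an assumption - check the doc):
    curl https://<pod-id>-8000.proxy.runpod.net/

    # For TCP ports, the public IP and mapped external port are shown in the pod's Connect menu.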

[Urgent] One GPU suddenly went away

Hi, we have a prod issue right now: one of the GPUs in our pod suddenly disappeared.

Does the GPU Cloud service support lllyasviel/Fooocus AI?

My PC has low VRAM and I always get disconnections from Fooocus; I'm interested in upgrading to a RunPod GPU service. Does it support https://github.com/lllyasviel/Fooocus?
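
Fooocus is a regular Python/Gradio app, so it generally runs on a pod like any other repo. A rough sketch of getting it up, with the entry script and flags taken from the Fooocus README as I remember it; double-check them against the current repo, and expose the chosen port in the pod template:

    git clone https://github.com/lllyasviel/Fooocus.git
    cd Fooocus
    pip install -r requirements_versions.txt
    # --listen binds Gradio to 0.0.0.0 so the RunPod proxy can reach it;
    # exact flags may differ between Fooocus versions.
    python entry_with_update.py --listen --port 7865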

Pod suddenly says "0x A100 80GB" and CUDA not available

Hi, I created a pod a few days ago and worked with it, no problem. I stopped the pod after the session. Today I tried again and suddenly it says 0x A100 80GB and CUDA is not available. If I look at starting a new pod, it seems the A100 80GB is available in the same location, so why can't I start my pod with this GPU? What should I do? Is there a way to transfer the data to a new pod?...
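
A pod showing "0x GPU" after a stop usually means the GPU it was pinned to has since been allocated to another workload on that host, so the pod can start but without a GPU attached. A quick sanity check from the pod's terminal (the data-transfer part of the question is covered by the runpodctl sketch under the transfer thread further down):

    # Confirm what the pod can actually see
    nvidia-smi
    python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"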

Moving storage location

My storage drive is in region EU-CZ-1, but there are no pods available to launch there. Is there any way I can move my storage drive to another region?
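
As far as I know a network volume can't simply be relocated, so the usual workaround is to create a second volume in the target region and copy the data across with one pod attached to each. A rough sketch using runpodctl, which transfers via a one-time code so the two pods need no direct connectivity (paths are placeholders):

    # On a pod attached to the old volume (EU-CZ-1):
    cd /workspace
    tar czf backup.tar.gz my_project/
    runpodctl send backup.tar.gz       # prints a one-time receive code

    # On a pod attached to the new volume in the target region:
    cd /workspace
    runpodctl receive <code-from-send>
    tar xzf backup.tar.gz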

Is your network volume charged by actual usage or the fixed size keyed in during setup?

Is your network volume charged by actual usage or the fixed size keyed in during setup?
Solution:
charged by the quota you ask for

Error 804: forward compatibility was attempted on non supported HW

Writing to the online chat bounces my messages, even though I am clearly connected.
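
On the Error 804 itself: it is a CUDA initialization failure that typically points at a mismatch between the host NVIDIA driver and the CUDA runtime inside the container (the forward-compatibility libraries only work on supported hardware/driver combinations). A small diagnostic worth capturing before filing a ticket, so support can see the versions involved:

    # Host driver version and the highest CUDA version that driver supports
    nvidia-smi

    # CUDA runtime shipped inside the container
    nvcc --version 2>/dev/null || cat /usr/local/cuda/version.json 2>/dev/null

    # Whether the framework can initialize CUDA at all
    python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"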

"We have detected a critical error on this machine...failing pods

I've been getting a lot of these errors lately: "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime." I lost pods (H100s in the Secure Cloud) and don't know why; I had my 6th pod fail today in 2 weeks. RunPod support is not helping either. Can someone help me? I'm not going to use RunPod's service anymore until this issue is addressed, thanks. Current pod failing: ID: jfktfsgsvw19i1...
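
Since the error text itself recommends backing up, one low-effort pattern while the failing pod is still reachable is to tar the workspace and pull it down over the pod's SSH connection. A sketch, assuming key-based SSH is already working; the IP and port come from the pod's Connect menu and are placeholders here:

    # On the failing pod: bundle everything worth keeping (adjust paths to your setup)
    cd /workspace
    tar czf backup-$(date +%Y%m%d).tar.gz checkpoints/ configs/

    # From your local machine: pull it down over the pod's exposed SSH port
    scp -P <tcp-port> -i ~/.ssh/id_ed25519 \
        root@<pod-public-ip>:/workspace/backup-*.tar.gz .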

Webhook URL

How can I pass a webhook URL in the JSON body?
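
If this is about Serverless endpoints: as far as I know the /run request body accepts a top-level "webhook" field alongside "input", and the job result is POSTed to that URL when the job finishes. A hedged sketch with curl; the endpoint ID, API key variable, and field names should be verified against the serverless docs:

    curl -X POST "https://api.runpod.ai/v2/<endpoint-id>/run" \
      -H "Authorization: Bearer $RUNPOD_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
            "input": { "prompt": "hello" },
            "webhook": "https://example.com/my-callback"
          }'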

Stop pod

Hello, I am kind of confused; I haven't used RunPod in a while. I want to stop my GPU instances, but if I select the trash button on my pods, it seems to want to delete the volume. I am using a volume and running Secure Cloud GPUs. Isn't there a way to terminate a pod but keep all the data in the volume?
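
For what it's worth, the stop action (distinct from the trash/terminate button) is the one that keeps the volume; I believe a stopped pod is only billed for its storage. The same can be done from the CLI, assuming runpodctl is configured with an API key; subcommand names are worth double-checking against runpodctl's help:

    # Stop the pod but keep its volume
    runpodctl stop pod <pod-id>

    # Start it again later
    runpodctl start pod <pod-id>

    # The trash button corresponds to removing the pod, which is what deletes the volume:
    # runpodctl remove pod <pod-id>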

How to transfer between pods?

I'm running Stable Diffusion and would like to transfer my outputs to a different pod to continue working. When using runpodctl to transfer data from one pod to another, what is the command? I have tried using runpodctl send "file path name" but this isn't working for me. What file path should I be using? Can someone share an example of the file path structure, please? It was suggested I post the question here. I'm not getting an error; it's just that nothing is happening.
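
A worked example of the send/receive pair, in case it helps: runpodctl send takes a path relative to the current directory (or an absolute path) and prints a one-time code; the other pod runs runpodctl receive with that code. Both commands have to be left running until the transfer completes. Paths and the code below are made up for illustration:

    # On the source pod
    cd /workspace/stable-diffusion-webui/outputs
    runpodctl send txt2img-images
    # -> prints something like: runpodctl receive 1234-word-word-word

    # On the destination pod: run the printed command verbatim
    cd /workspace
    runpodctl receive 1234-word-word-word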

Network connection

I launched two pods using Secure Cloud, and each pod needs to communicate with the other. But when I checked, they couldn't communicate with each other. How can I connect one pod to the other? (The region is the same.)...
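
Pods don't share a private network, so pod-to-pod traffic generally has to go over each pod's exposed public TCP ports. A rough connectivity check, assuming a TCP port has been added to each pod's template and that the public IP and externally mapped port are read from the Connect menu (netcat flag syntax varies between nc flavors):

    # On pod A: listen on the internal port exposed as TCP in the template (e.g. 29400)
    nc -l -p 29400

    # On pod B: connect using pod A's PUBLIC IP and the EXTERNAL port mapped to 29400
    nc <pod-A-public-ip> <pod-A-mapped-port>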

Multi-node training with multiple pods sharing the same region.

I am trying multi-node training with multiple pods. When I launch multiple pods in the same region, they share the same public IP and only the port differs. How should I specify the proper port and IP for multi-node training? Does Secure Cloud offer multi-node training?...
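
A hedged torchrun sketch for the IP/port part, under the assumption that the master pod exposes a TCP port in its template and that RunPod maps it to a different external port: the master binds the internal port, while the worker dials the master's public IP and the externally mapped port. NCCL over the public interface may need extra tuning or may not work at all, so treat this purely as a starting point, not a statement that Secure Cloud supports multi-node jobs:

    # On the master pod (rank 0): bind the rendezvous on the internal exposed TCP port
    torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
        --master_addr=127.0.0.1 --master_port=29400 train.py

    # On the worker pod (rank 1): dial the master's PUBLIC IP and the EXTERNAL port
    # that was mapped to 29400 (shown in the master pod's Connect menu)
    torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 \
        --master_addr=<master-public-ip> --master_port=<external-mapped-port> train.py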

Dev Accounts Adding Public Key

I'm an admin on a team account. Can dev accounts add public keys to the org?

Does Runpod Support Kubernetes?

My current understanding is that RunPod only supports Docker images, in the sense that you (1) create a template, (2) reference a Docker image, and then RunPod pulls that image and runs it as needed. However, what if I want to run a kubelet and have it join my Kubernetes cluster as a node, and then have my k8s cluster place my own Docker images onto the node?...
Solution:
Hi there - can you advise how you got there? We definitely need to do something about that page, it's incredibly outdated 😅 As far as Kubernetes - no, there's no support for that, but if you are willing to rent out an entire machine of GPUs with a minimum time commitment of at least a few months, we can offer a bare-metal setup instead...

Is GPU Cloud suitable for deploying LLMs, or only for training?

I'm pretty new to RunPod. I have already built 4 endpoints on Serverless and it's pretty straightforward for me; however, I don't understand whether GPU Cloud is also suitable for pure LLM inferencing via API for chatbot purposes, or whether it's only for training models and saving weights. The main question is: can I also deploy my LLM for inference on GPU Cloud for production? Where do I find the API that I should make calls to? Because I find Serverless very unstable for production, or maybe it's my faul...
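
For the "where is the API" part: on GPU Cloud you bring your own inference server, and the API is whatever that server exposes on a port opened in the template, reachable through RunPod's HTTP proxy. A rough sketch with vLLM's OpenAI-compatible server; the model name, port, and proxy URL format are assumptions to double-check:

    # On the pod: serve an OpenAI-compatible API on a port exposed as HTTP in the template
    pip install vllm
    python -m vllm.entrypoints.openai.api_server \
        --model mistralai/Mistral-7B-Instruct-v0.2 --port 8000

    # From anywhere: call it through the pod's proxy URL (format is an assumption)
    curl https://<pod-id>-8000.proxy.runpod.net/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", "prompt": "Hello", "max_tokens": 32}'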

Issues with connecting/initializing custom docker image

I've created a custom Docker image for quick OCR training: https://hub.docker.com/repository/docker/jeffchen23/paddleocr-image/general. The problem is, everything downloads properly, but then I am unable to connect. When trying to connect, I get "Permission denied (publickey)", but the permissions are not an issue for any of my other pods. I think it is because the pod fails to initialize correctly, as it constantly spams "Start container" messages. Can anyone help me pin down this issue? It works on my local machine when I pull it from the web. My local Docker command is as follows:

    docker run -it --runtime nvidia --shm-size 2g --gpus all -v paddleocr-volume:/PaddleOCR paddleocr-image bash

It doesn't look like I have any direct control over the Docker command from RunPod (from what I can tell), so I'm a little lost....
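
The "Start container" spam usually means the container's main process exits right away (a bare bash with no TTY quits immediately), so the pod keeps restarting it, and with no sshd running the key-based login then fails. One hedged fix is to override the Docker Command in the template so the container installs the injected key, starts sshd, and stays alive; the PUBLIC_KEY env var name is an assumption, and openssh-server must already be in the image:

    # "Docker Command" override in the RunPod template (one line):
    bash -c 'mkdir -p ~/.ssh && chmod 700 ~/.ssh && echo "$PUBLIC_KEY" >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys && service ssh start && sleep infinity'

A fuller start-script version (including Jupyter) is sketched under the custom-container question further up.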

Error occurred when executing STMFNet VFI: No module named 'cupy'

Running ComfyUI on RunPod and hitting this error. Can someone provide the steps to install or update CuPy? Much appreciated!
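
A hedged sequence for the missing CuPy: install the wheel that matches the pod's CUDA major version, into the same Python environment ComfyUI runs from. The package names below are the official CuPy wheels, but check the pod's CUDA version first:

    # Check which CUDA the pod/container is built against
    nvidia-smi
    python3 -c "import torch; print(torch.version.cuda)"

    # CUDA 12.x pods:
    pip install cupy-cuda12x

    # CUDA 11.x pods:
    # pip install cupy-cuda11x

    # Then restart ComfyUI so the STMFNet VFI node can import cupy.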