RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

"We have detected a critical error on this machine which may affect some pods." Can't backup data

During a training run with 8xH100, I started seeing strange "Directory not found" errors in my Jupyter notebook which could not be dismissed (they kept popping up). Although my training run continued and completed, I wasn't able to copy the data off the volume disk because the modals blocked every operation. I looked into the deployment and saw the error "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime." Unfortunately, nothing I've tried to get my data works: reconnecting to the notebook, Web Terminal, SSH (both options), and even stopping and starting the pod all fail. ...

Operation not permitted - Sudo access missing

Hi, I am currently trying to install python3-venv on my RunPod instance. However, I am getting a bunch of sudo: setrlimit(RLIMIT_NOFILE): Operation not permitted messages, and the install ultimately finishes with ModuleNotFoundError: No module named 'apt_pkg'. However, Python was not installed. If I try sudo -v it shows:
sudo: setrlimit(RLIMIT_NOFILE): Operation not permitted
sudo: setrlimit(RLIMIT_NOFILE): Operation not permitted
...

Download Mixtral from HuggingFace

How can I download this model in my pod?
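
One common approach, sketched here under assumptions: the `huggingface_hub` package (installable with pip) provides `snapshot_download`, which fetches every file in a model repository. The helper name `download_model` and the target directory are illustrative placeholders, not details from the question, and gated repos additionally need an access token.

```python
import os


def download_model(repo_id, target_dir, token=None):
    """Download all files of a Hugging Face model repo into target_dir.

    repo_id and target_dir are supplied by the caller; nothing here is
    specific to any one model.
    """
    # Imported lazily so this file loads even before
    # `pip install huggingface_hub` has been run.
    from huggingface_hub import snapshot_download

    os.makedirs(target_dir, exist_ok=True)
    return snapshot_download(
        repo_id=repo_id,
        local_dir=target_dir,
        # Gated models need a token; fall back to the HF_TOKEN env var.
        token=token or os.environ.get("HF_TOKEN"),
    )
```

A call would look like `download_model("<org>/<model-name>", "/workspace/models/<model-name>")`, with the repo id copied from the model's Hugging Face page. Mixtral checkpoints are large, so it is worth checking free disk space on the volume first.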

Is there a way to run more than 1 image in a pod?

I would like to add a monitoring sidecar container running inside a pod, alongside the app container. Is there a way to do this?

Slow model loading on some instances

I am using ComfyUI, and some pod instances take an extremely long time to load a model. I am using A100 and H100. For testing, I tried loading a simple diffusers pipeline on the same pods, and it also loads very slowly. I have tried different torch versions and different CUDA versions too...

ulimit increase?

I have a pod that runs a binary which tries to set ulimit but fails. Is there any way I can increase it?
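
As a rough sketch of what an unprivileged process can do on Linux: the soft limit can be raised up to the hard limit without extra privileges, while raising the hard limit needs the CAP_SYS_RESOURCE capability, which a container may not grant. The snippet uses Python's standard `resource` module, and RLIMIT_NOFILE (open files) is only an example; the post does not say which limit the binary sets, so that choice and the name `raise_soft_limit` are assumptions.

```python
import resource


def raise_soft_limit(which=resource.RLIMIT_NOFILE):
    """Try to raise the soft limit to the current hard limit.

    Returns the (soft, hard) pair in effect afterwards. The hard limit
    is never changed, since that normally requires extra privileges.
    """
    soft, hard = resource.getrlimit(which)
    if soft != hard:
        try:
            resource.setrlimit(which, (hard, hard))
        except (ValueError, OSError):
            # The platform refused the new value; keep the old limits.
            pass
    return resource.getrlimit(which)


if __name__ == "__main__":
    print("open-file limits (soft, hard):", raise_soft_limit())
```

If the hard limit itself is too low for the binary, this will not help, because the ceiling is set from outside the process.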

Can we turn secure cloud instances on/off through some type of trigger function?

Hello everyone! I was wondering if I need to pay for the pod 24/7 even if I will only be using the LLM a couple of times per day, or if it can be turned on at certain times.

How can I do scheduled backups with Azure using API?

I know about Cloud Sync, but how do I call it from my app?

Failed to Import Libraries on Runpod SD ComfyUI [RTX A 4000]

Hey guys, every time I boot up my ComfyUI RunPod it fails to load a few libraries, and trying to update/fix them from the Comfy manager doesn't seem to resolve the issues. I repeatedly install the individual dependencies, but every time the same modules come back as "module not found". I've looked at a few other solutions/threads but have been struggling to get this to work. Has anyone else faced the same issue?...

How do I select a different template to the default in the new RunPod UI?

I might be missing something obvious, but: in the new RunPod Pods > Deploy UI, after selecting a GPU config, how do I pick a template other than the default? RunPod Pytorch 2.2.10 runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 ...

Can't open models/checkpoint folder in Jupyter for Comfy UI.

All the other folders open, but not the checkpoint folder. Want to install models from CivitAI. Is this normal on runpod or is it a template issue?

Hello guys! I want to buy an RTX 4090 pod, but the 46 GB of RAM is not enough. Is there any way to upgrade the RAM?

I hope to buy a pod with 64 GB of RAM and an RTX 4090. Need help.
Solution:
@rondos1701 wait, if you are still on this, try the filter thing

Am I able to host an app through reverse proxy with a custom domain name?

I have a domain name that I own and want to run my app with SSL through port 443. Is this possible to do on a pod? I am trying to run my Gradio-based app with Nginx and I can't seem to get it to work with a custom domain name.

Is it possible to change region of a network volume?

I would like to access high-VRAM GPUs, which are not available in EU-RO-1.

How do I add a cron job in a pod?

I am using a PyTorch image for my pod. I have cloned my repository and created an environment, as well as an app that is exposed on a specific port. I only need to use this app for two hours per day, so when I want to use it, I manually start the pod, and afterwards I manually stop it. I want to add something like a cron job, so that whenever I restart this pod, it automatically runs the specific commands and starts my app.
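
One possible approach, sketched under assumptions: cron is usually not running inside a container, so instead of a cron job, a small launcher script can be set as the container's start command so that it runs on every pod start. The helper name `run_startup` is hypothetical, and how the start command is configured depends on the template in use.

```python
import subprocess


def run_startup(commands, cwd=None):
    """Run shell commands in order, stopping at the first failure.

    Returns a list of (command, exit_code) pairs for the commands that ran.
    """
    results = []
    for cmd in commands:
        proc = subprocess.run(cmd, shell=True, cwd=cwd)
        results.append((cmd, proc.returncode))
        if proc.returncode != 0:
            break  # later steps usually depend on the earlier ones
    return results
```

The list passed in would hold the same commands that are currently typed by hand after each start (activating the environment, launching the app on its port), and `cwd` would point at the cloned repository.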

Can't connect to Civitai lately when doing wget commands, what am I doing wrong?

Username/Password Authentication Failed. root@8350c17f8def:/workspace/ComfyUI/models/checkpoints/sdxl#...

TensorRT-LLM setup

Has anyone been able to successfully install tensorrt_llm? I'm trying with pip, but I'm running into mpi related errors: Cannot open configuration file /build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi/share/openmpi/mpicc-wrapper-data.txt Error parsing data file mpicc: Not found...

Stable Diffusion Extension Installation Issues:

Hi! I'm new to this whole Discord and RunPod, so sorry if I've posted this in the wrong place or made any other mistakes. I've run into a problem when installing some extensions in RunPod. I've been trying to get [traintrain] (lets you create LoRAs, not Kohya) and [tagger] (which pulls tags from images) to work, but for some reason, RunPod just won't recognize them, no matter what I try. I even found a post on Reddit where someone was having a similar problem with the SD Dynamic Prompts extension not appearing on the list or working at all. They said they tried turning off all the other extensions, but that didn't do the trick either....

Is it possible to make port 443 externally accessible?

Is it possible to make port 443 externally accessible? I want to remove the port number from the DNS name (https://example.com:34567). I have a solution in Cloudflare, but I need to access Cloudflare every time the pod is rebooted. Thank you
Solution:
Nope, it's not possible. TCP ports are always random, though you should be able to use Cloudflare tunnels.

Comfy launcher issue

Comfy launcher isn't downloading models or assets anymore. I wrote to the dev on Banodoco, but it still isn't working for me.