RunPod


We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!


nvidia-glx-desktop - how to make it work

Hello there 😉 I am trying to play with the docker-nvidia-glx-desktop container (I am a newbie with containers). I also know that RunPod is not designed to stream desktops like this one and is more or less oriented toward other purposes. But I found that it works well with most RunPod instances. On most of them, not even TURN is required, and it works right from deployment. There is only one showstopper: it requires at least driver version 535.129.03. When the host driver is below that, it does not work because of a driver issue. It would be great if you could update the drivers to at least this version to make this container work 🙂 Thank you very much....

Need the SU password for the RunPod Desktop template 'runpod/kasm-docker:cuda11'

Please help, I need to install yaml. Please note, I do NOT require the VNC password...

Custom template creation with AWS ECS

Hi! Is it somehow possible to create a custom template using Amazon Elastic Container Registry? I need to somehow authenticate with AWS, but the credentials form only has username and password options (I believe for docker login). Is it possible through the CLI, maybe? Thanks, waiting for an answer...
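For what it's worth, ECR does not use a static password: `docker login` takes a short-lived token generated by the AWS CLI. A sketch of the usual flow (the account ID and region are placeholders):

```shell
AWS_ACCOUNT_ID="123456789012"   # placeholder; use your own account ID
AWS_REGION="us-east-1"          # placeholder region
ECR_REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"

# Print the login one-liner; the token it fetches expires after ~12 hours,
# so it has to be re-run (or re-pasted into a credentials form) regularly:
echo "aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${ECR_REGISTRY}"
```

Because the token rotates, a username/password credentials form would need the literal username `AWS` and a freshly generated token as the password, refreshed before each pull.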

When trying to git pull Comfy nodes into my RunPod, I'm met with a divergent branch error?

I'm looking to install multiple Comfy nodes, but I'm receiving an error. How can I fix it?
```
root@b10b7d37a80c:/workspace/ComfyUI/custom_nodes# git pull https://github.com/ltdrdata/ComfyUI-Manager
remote: Enumerating objects: 5803, done.
remote: Counting objects: 100% (568/568), done....
```
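Recent Git versions refuse to pull when the local and remote histories have diverged until you pick a reconciliation strategy. A self-contained sketch of the fix, with scratch repos standing in for the custom-node checkout:

```shell
set -e
tmp=$(mktemp -d)

# Build two repos whose histories diverge, mimicking a customized
# custom-node checkout that has fallen behind upstream:
git init -q "$tmp/origin" && cd "$tmp/origin"
git config user.email demo@example.com && git config user.name demo
echo one > f && git add f && git commit -qm base
git clone -q "$tmp/origin" "$tmp/clone"
echo two >> f && git commit -qam upstream            # upstream moves ahead

cd "$tmp/clone"
git config user.email demo@example.com && git config user.name demo
echo mine > g && git add g && git commit -qm local   # local diverges

# The fix: pick a reconciliation strategy, then pull normally.
git config pull.rebase false    # merge; use `true` to rebase instead
git pull -q --no-edit
git log --oneline               # base + upstream + local + merge commit
```

Inside the actual node's checkout, `git config pull.rebase false` followed by a plain `git pull` is usually all that's needed; if the local commits aren't worth keeping, `git fetch && git reset --hard @{u}` discards them instead.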

Running 2x H100 80GB. Does this mean my cap is now 160GB of VRAM?

I'm doing some VFX work on 8K footage. Right now, with the 2x H100s, I can really only get things to work at 2500x2500. When I feed in a 4K image, I get an error saying my VRAM is 80GB. So I'm assuming that having two H100s doesn't mean the memory combines?

GPU cloud template to manage network volume

Hi, I want to start up a GPU cloud instance for a short while to manage the files on my network volume. Is there a template that provides file management commands, e.g. list, delete, move, copy, and copy to S3?
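Any pod template with the volume attached (it typically mounts at /workspace) already has the standard coreutils for this. A sketch of the workflow, using a scratch directory as a stand-in so it can run anywhere:

```shell
set -e
VOL=$(mktemp -d)   # stand-in for /workspace on a real pod

mkdir -p "$VOL/outputs"
echo data > "$VOL/outputs/a.txt"

ls -lh "$VOL/outputs"                         # list
du -sh "$VOL"/*                               # per-entry sizes
cp -r "$VOL/outputs" "$VOL/backup"            # copy
mv "$VOL/backup/a.txt" "$VOL/backup/b.txt"    # move / rename
rm -r "$VOL/outputs"                          # delete

# Copy to S3 from the same shell (assumes awscli is installed and
# credentials are configured):
# aws s3 sync "$VOL/backup" s3://my-bucket/backup
```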

Cache a Docker image to reuse

Is it possible to cache a Docker image so I can reuse it without downloading it from the beginning when selecting a new Pod?

An RTX 3090 is available on the selection page, but my stopped pod still shows 0 GPUs

Issue installing Fooocus on RunPod

The first time I installed Fooocus I had no issue, but for a few days now I have tried many times, and I keep getting this error and can't proceed.

sh: 1: accelerate: not found

Today I'm getting this error when using Kohya to caption images or trying to train a LoRA using a Stable Diffusion Kohya_ss ComfyUI Ultimate pod. I have been using these pods for the last month without issue, but today nothing is working. My workflow has not changed; I was successfully training two days ago. Any ideas?
Thanks. `sh: 1: accelerate: not found` ...
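A hedged starting point: the Kohya scripts shell out to the `accelerate` CLI, so `sh: 1: accelerate: not found` usually means the package went missing from the Python environment the pod actually uses, or pip's script directory fell off PATH (a template update can cause either). A quick triage sketch (paths are assumptions):

```shell
command -v python3                      # which interpreter is active
python3 -m pip show accelerate \
  || echo "accelerate missing in this env"
# If missing, reinstall into that same interpreter:
# python3 -m pip install accelerate
# If installed but the CLI still isn't found, add pip's bin dir to PATH:
# export PATH="$HOME/.local/bin:$PATH"
```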

A way to connect to an AWS VPC

Our stack currently runs in AWS on EKS. We are currently using third-party providers for our GPU workloads, connected via a site-to-site VPN. I want to know whether RunPod has a solution for connecting to an AWS VPC....

8x H100 SXM5, Error 802

I'm getting an "Error 802: system not yet initialized" on an 8x H100 SXM5 community pod. Running nv-fabricmanager gives this error:
```
# /usr/bin/nv-fabricmanager -c ~/nvswitch/fabricmanager.cfg
request to query NVSwitch device information from NVSwitch driver failed with error:Failed to load the requested module [NV_ERR_MODULE_LOAD_FAILED]
```
...

Attaching a Network Volume fails when using GraphQL

I have created a Network Volume and would like to start a container with the volume attached. It works without problems in the web UI after clicking the "Deploy" button. However, when using runpod-python's create_pod method, the GraphQL endpoint returns the following error: "There are no longer any instances available with the requested specifications. Please refresh and try again." (I have tried multiple times.) Here is the minimal code:
```
pod = runpod.create_pod(...
```
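One thing worth ruling out: a network volume is pinned to a single datacenter, and the web UI filters the GPU list to that datacenter, while an API call can request a GPU type the volume's datacenter doesn't currently stock (this is an inference, not confirmed). If the request is valid and capacity is merely transient, a retry wrapper helps; the `runpod.create_pod` arguments in the docstring are illustrative, not verified:

```python
import time

def create_pod_with_retry(create_fn, attempts=5, delay=10):
    """Retry pod creation while the transient 'no instances' error persists.

    `create_fn` is any zero-argument callable, e.g. (arguments illustrative):
        lambda: runpod.create_pod(
            name="my-pod",
            image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",
            gpu_type_id="NVIDIA GeForce RTX 4090",
            network_volume_id="abc123",   # hypothetical volume ID
        )
    """
    last_err = None
    for attempt in range(attempts):
        try:
            return create_fn()
        except Exception as err:
            if "no longer any instances" not in str(err):
                raise                          # a different error: fail fast
            last_err = err
            time.sleep(delay * (attempt + 1))  # linear backoff between tries
    raise last_err
```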

Container logs disappear after stopping the container

I have this script (src/test.sh):
```
echo "Working ..."
sleep 10
runpodctl stop pod "$RUNPOD_POD_ID"
```
...
Solution:
We do not store logs once you stop the pod; they are only kept for the life of a running pod. You can push your logs to another service if needed. We may add a central logging place for this in the future.
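Building on that answer: anything written to the pod's volume outlives the stop, so the script can tee its own output there before calling runpodctl. A sketch (the /workspace path is the usual volume mount and an assumption; off-pod the sketch falls back to a temp dir so it stays runnable):

```shell
# Use the volume when on a pod (RUNPOD_POD_ID is set there),
# otherwise a temp dir so this can be tried anywhere:
LOG_DIR="${RUNPOD_POD_ID:+/workspace/logs}"
LOG_DIR="${LOG_DIR:-$(mktemp -d)}"
mkdir -p "$LOG_DIR"
LOG="$LOG_DIR/run-$(date +%s).log"

{
  echo "Working ..."
  sleep 1
} 2>&1 | tee "$LOG"          # mirror output to the persistent file

# runpodctl stop pod "$RUNPOD_POD_ID"   # stop only after the log exists
```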

CUDA 12.3 support

I created a template with a custom image (based on runpod/containers) to run CUDA 12.3, but when I use PyTorch 2.1.2 + Python 3.10, it tells me that it's not working.
```bash
python3 -c "import torch; print(torch.cuda.is_available())"
```
...
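For context, the official PyTorch 2.1.2 wheels were built against CUDA 11.8 or 12.1, not 12.3, and a wheel's bundled CUDA runtime only needs the host driver's supported CUDA version to be at least as new as the wheel's. A small helper to sanity-check the pair; on a pod the two values would come from `torch.version.cuda` and the "CUDA Version" shown by `nvidia-smi`:

```python
def cuda_version_tuple(v):
    """Parse '12.3' or '12.3.107' into a comparable tuple of ints."""
    return tuple(int(p) for p in v.split("."))

def driver_supports(wheel_cuda, driver_max_cuda):
    """True if the driver's max supported CUDA version (major.minor)
    covers the CUDA version the PyTorch wheel was built against."""
    return cuda_version_tuple(driver_max_cuda)[:2] >= cuda_version_tuple(wheel_cuda)[:2]

# torch.version.cuda -> wheel's CUDA; nvidia-smi header -> driver's max:
print(driver_supports("12.1", "12.3"))  # True: wheel older than driver max
print(driver_supports("12.3", "12.0"))  # False: wheel needs a newer driver
```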

Is there a way to get pod logs programmatically?

After creating an on-demand pod via GraphQL API I'd like to get access to the pod's logs without using the UI.

GPUs that aren't actually available look available via `runpod.api.ctl_commands.get_gpu()`.

I'm currently trying to find out which GPU types are available (in order to programmatically decide which GPU type I want). I saw that there is a runpod.api.ctl_commands.get_gpu() function which calls the GraphQL API, but the information it returns seems inconsistent with what's actually available. For example, right now, I can run...
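A caveat that may explain the inconsistency: the GPU listing reflects the catalog, not live stock, so presence in the response does not guarantee a pod of that type can be created right now. A selection sketch over data shaped like the library's output (the field names `id` and `memoryInGb` are assumptions taken from runpod-python's GraphQL query; verify them against your actual response):

```python
def pick_gpu(gpus, min_vram_gb):
    """Pick the smallest catalog GPU meeting a VRAM floor, or None.

    Callers should still be ready to fall through to the next candidate
    when the actual create_pod call is rejected for lack of capacity.
    """
    fitting = [g for g in gpus if g.get("memoryInGb", 0) >= min_vram_gb]
    return min(fitting, key=lambda g: g["memoryInGb"], default=None)

catalog = [  # toy data standing in for the API response
    {"id": "NVIDIA RTX A4000", "memoryInGb": 16},
    {"id": "NVIDIA GeForce RTX 4090", "memoryInGb": 24},
    {"id": "NVIDIA A100 80GB PCIe", "memoryInGb": 80},
]
print(pick_gpu(catalog, 24)["id"])  # NVIDIA GeForce RTX 4090
```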

Serverless endpoint long waits in "Initializing" state

Requests to a serverless endpoint at /run have an "Initializing" status in the dashboard for up to 15 minutes. Is this a normal queue time for an endpoint with no other requests?