RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

Problems SSH'ing multiple times, lost ssh keys?

Has anyone experienced issues SSH'ing into a RunPod machine multiple times? I have one terminal already SSH'd into the machine (which has a public IP), but other terminals are now asked for a password at login. I'm on macOS with zsh, and my public key has stopped working even though the container's authorized keys haven't changed. I can literally cat ~/.ssh/authorized_keys in one terminal on the remote machine and verify that the keys are present, but I'm unable to log in from other terminals....

Venv not found

So I have a network volume which I use to run pods for ComfyUI, and I had created a venv in it. It was working fine for a few months, but now it suddenly shows the error bash: venv/bin/activate: No such file or directory. Do I not have my venv anymore?...

A1111 Stable Diffusion 1.10.0 Pod filling up disk immediately

I added around 10GB of space to the pod after failing to boot once, and it immediately fills up to 100% with stuff like this showing up on container. The same exact Storage Volume worked to boot the pod OK yesterday. I would like to keep all my LORAs and settings, but this is annoying to deal with....

Unable to start pod with MI300x

I'm observing a hang when starting a pod with 8xMI300x (screenshot attached). Any ideas on how to fix this?

Exposing port not working

I'm trying to create embeddings using infinity. There is already a docker container for that: https://hub.docker.com/r/michaelf34/infinity Now I've tried to launch it and expose port 7797. However, I can't reach the container via the proxy:...

Error after restarting the containers.

Command : docker compose up Error: WARN[2024-07-30T12:12:22.042930970Z] Controller.NewNetwork mia-runpod-backend_default: error="failed to create DOCKER-USER IPV6 chain: iptables [+] Running 3/4es --wait -t filter -N DOCKER-USER: ip6tables v1.8.4 (legacy): can't initialize ip6tables table `filter': Table does not exist (do...

ULTIMATE Stable Diffusion Kohya ComfyUI InvokeAI

Doesn't start properly; it looks like it's creating the Stable Diffusion container 4 times in a row.

Anyone Getting Bad Pods with Internet Issues?

I'm in the US, and I get a lot more bad pods with internet issues than working pods, roughly 7 out of 10. I'm trying to spot a community pod with an RTX 4090 and the default template pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04. When I get a bad pod, I get error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) If the pod runs and I connect via SSH and try to set it up, I often run into problems with apt on Ubuntu and with Python pip. Sometimes I get a certificate error, extremely slow speeds of less than 10 bytes per second, etc. I have to keep launching different pods until I get a working one. Does anyone have the same problem?...

Creating a pod by extending another pod

The existing Comfy pod is very basic, so each time I need to run my huge flows I would either have to reinstall all required custom nodes from scratch, or install them once and pay for disk storage. Would it be instead possible to create a new pod after I install all my required nodes, so I just deploy my pod with all required dependencies later?

Container Logs via the API or SDK

As far as I can see, there is no way to access container logs via the API, correct?

Training jobs using script

Hey, can anyone tell me whether RunPod offers a way to write a training script that can be run from anywhere, creates a GPU instance, and loads and saves my data to external cloud storage, like AWS SageMaker's training script mode? I need to train multiple models with different architectures this way to see which one performs best.

Possible to terminate pod from Within the pod?

I know you can terminate pod from “outside” with runpodctl- but are there any options for a pod to self-terminate, triggered by its own docker image? Or, am I approaching this wrong and ‘best practice’ is to have your pods giving status updates back to and being managed by a script on your main PC w/ runpodctl?...
Solution:
One of the environment variables set inside the pod is RUNPOD_POD_ID, which you can use to remove/terminate/kill the pod.
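A rough sketch of how a pod might use that variable to shut itself down. It assumes runpodctl is installed and configured with an API key inside the container; the helper names build_terminate_command and terminate_self are hypothetical, not part of any RunPod tooling.

```python
import os
import subprocess


def build_terminate_command(pod_id: str, action: str = "remove") -> list[str]:
    """Build a runpodctl command that stops or removes the given pod."""
    if action not in ("stop", "remove"):
        raise ValueError(f"unsupported action: {action}")
    if not pod_id:
        raise ValueError("pod_id is empty; is RUNPOD_POD_ID set?")
    return ["runpodctl", action, "pod", pod_id]


def terminate_self(action: str = "remove") -> None:
    """Read this pod's ID from the environment and run runpodctl on it.

    Assumption: runpodctl is on PATH and already has an API key configured.
    """
    pod_id = os.environ.get("RUNPOD_POD_ID", "")
    subprocess.run(build_terminate_command(pod_id, action), check=True)
```

Calling terminate_self() at the end of a job would remove the pod; passing "stop" instead keeps the pod's volume around for a later restart.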

Install the dependencies issue

Can anyone tell me why I now get this error?

How is runpod secret / environment vars for credentials more secure?

I'm looking at the runpod Secret feature for handling AWS credentials. It looks like 'best practice' for handling credentials in a docker image is to set them as environment variables; and Runpod's "Secrets" feature feeds into that. Could anyone explain how using runpod's "Secrets" is more secure than just passing environment variables? If the security concern is to avoid writing your credentials directly into the image and instead pass them on launch with env vars, how do "Secrets" do anything more? Is it a feature for handling credentials within a runpod account managed by a team?...
Solution:
Yes, they are meant to keep keys secure in a team environment. With ENV variables all team members could view your keys in clear text in the template definition.
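Either way, code running in the container would see the credentials as ordinary environment variables. A minimal sketch, assuming the template maps the secrets to the standard AWS variable names; the helper load_aws_credentials is hypothetical.

```python
import os


def load_aws_credentials(env=None) -> dict[str, str]:
    """Return AWS credentials from environment variables, failing loudly if absent.

    Assumption: the pod template exposes the secrets under the standard
    AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY names.
    """
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing credential variables: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

Reading the values at runtime, rather than baking them into the image, keeps them out of the image layers in both setups; the difference the answer points to is who can see them in the template definition.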

Get SSH Login Via API

When getting a pod via the API, the response does not include any information on connecting via Basic Terminal Access. Obviously the first part of the username is the pod ID, but I haven't been able to identify the numbers that follow the dash. How might you get this username via the API or programmatically? ssh [email protected] -i ~/.ssh/id_ed25519 cbdf4581hxb1vy == pod ID...

Llama3 setup

Hi, everyone. We are planning to deploy Llama3 for our app with millions of users. How can we achieve this? And which GPU series or cloud platforms are best for achieving high speed and scalability?...

BROKEN: TheLastBen Fast Stable Diffusion

2024-07-26T18:04:17.934600984Z --2024-07-26 18:04:17-- https://huggingface.co/datasets/TheLastBen/RNPD/raw/main/Notebooks.txt 2024-07-26T18:04:17.960726061Z Resolving huggingface.co (huggingface.co)... 65.9.95.31, 65.9.95.61, 65.9.95.114, ... 2024-07-26T18:04:17.964895834Z Connecting to huggingface.co (huggingface.co)|65.9.95.31|:443... connected. 2024-07-26T18:04:18.292202440Z HTTP request sent, awaiting response... 401 Unauthorized 2024-07-26T18:04:18.292233330Z ...
Solution:
The template was pulled a while ago: RunPod cancelled its contract with TheLastBen, so he removed the files from his repo, which broke the template.

Network volume

Hi guys, I am new to Runpod. I am trying to set up a network volume, but I cannot see the "Connect to Jupyter Notebook" option after I deployed the GPU within the network volume. What did I miss?

ollama won't pull manifest - weird error.

In a RunPod pod I've tried the various ollama templates, and also installed ollama on a basic template. I can run ollama serve, but in every case when I run ollama run <model> I always get the error: Error: pull model manifest: Get "https://registry.ollama.ai/v2/library/mistral-large/manifests/latest": dial tcp: lookup registry.ollama.ai on 127.0.0.11:53: read udp 127.0.0.1:59647->127.0.0.11:53: i/o timeout ...
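The error above is a DNS lookup timing out rather than an ollama failure as such, so one quick check is whether the pod can resolve hostnames at all. A minimal sketch using only the Python standard library; can_resolve is a hypothetical helper, and the resolver parameter exists only so the function can be exercised without network access.

```python
import socket


def can_resolve(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves, False if the lookup fails."""
    try:
        resolver(hostname, 443)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return True
```

Running can_resolve("registry.ollama.ai") inside the pod and comparing it with a lookup for another host would show whether the problem is specific to that registry or affects DNS in the container generally.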