Problems SSH'ing multiple times, lost ssh keys?
Has anyone experienced issues SSH'ing into a RunPod machine multiple times? I have one terminal already SSH'd into the machine (which has a public IP), but other terminals are now being asked for a password at login. I'm on macOS with zsh, and my public key has stopped working even though nothing changed in the container's authorized keys.
I can literally cat ~/.ssh/authorized_keys in one terminal on the remote machine and verify that the keys are present, but in other terminals I'm unable to log in....
Venv not found
So I have a network volume which I use to run pods for ComfyUI, and I had created a venv in it. It was working fine for a few months, but now it suddenly shows this error:
bash: venv/bin/activate: No such file or directory
Don't I have my venv anymore?...
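A minimal sketch for checking whether the activate script is still there and recreating the environment if it is not. Note that the path in the error is relative, so the same message also appears when the command is run from a different working directory. The function name `ensure_venv` and the example path `/workspace/venv` are assumptions for illustration; a recreated venv is empty, so packages would need to be reinstalled.

```python
import venv
from pathlib import Path


def ensure_venv(venv_dir, with_pip=True):
    """Create a venv at venv_dir if its activate script is missing.

    Returns True if a new environment was created, False if one was
    already present. Assumes a POSIX layout (bin/activate).
    """
    activate = Path(venv_dir) / "bin" / "activate"
    if activate.is_file():
        return False
    # The standard-library venv module builds a fresh, empty environment.
    venv.create(venv_dir, with_pip=with_pip)
    return True


if __name__ == "__main__":
    # Hypothetical location on the network volume; adjust to your setup.
    created = ensure_venv("/workspace/venv")
    print("created new venv" if created else "venv already present")
```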
A1111 Stable Diffusion 1.10.0 Pod filling up disk immediately
I added around 10GB of space to the pod after failing to boot once, and it immediately fills up to 100% with stuff like this showing up on container.
The same exact Storage Volume worked to boot the pod OK yesterday. I would like to keep all my LORAs and settings, but this is annoying to deal with....
Unable to start pod with MI300x
Observing "hang" when starting pod with 8xMI300x, screenshot attached. Any ideas on how to fix this?
Exposing port not working
I'm trying to create embeddings using infinity. There is already a docker container for that:
https://hub.docker.com/r/michaelf34/infinity
Now I've tried to launch it and expose port 7797. However, I can't reach the container via the proxy:...
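A small sketch for building and probing the proxy address, assuming the usual RunPod HTTP proxy pattern of `https://<pod id>-<port>.proxy.runpod.net` and that port 7797 is listed as an HTTP port in the pod configuration. The pod ID below is a placeholder, and the helper names are hypothetical. The service inside the container would also generally need to listen on 0.0.0.0 rather than 127.0.0.1 to be reachable from outside.

```python
import urllib.error
import urllib.request


def proxy_url(pod_id: str, port: int) -> str:
    """Build the proxy URL for an HTTP port exposed on a pod."""
    return f"https://{pod_id}-{port}.proxy.runpod.net"


def probe(url: str, timeout: float = 10.0):
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Something replied with an error status (e.g. 404 or 502).
        return err.code
    except OSError:
        # Covers URLError, DNS failures, refused connections and timeouts.
        return None


if __name__ == "__main__":
    url = proxy_url("your-pod-id", 7797)  # placeholder pod ID
    print(url, "->", probe(url))
```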
Error after restarting the containers.
Command :
docker compose up
Error:
WARN[2024-07-30T12:12:22.042930970Z] Controller.NewNetwork mia-runpod-backend_default: error="failed to create DOCKER-USER IPV6 chain: iptables [+] Running 3/4es --wait -t filter -N DOCKER-USER: ip6tables v1.8.4 (legacy): can't initialize ip6tables table `filter': Table does not exist (do...
ULTIMATE Stable Diffusion Kohya ComfyUI InvokeAI
It doesn't start properly; it looks like it's creating the Stable Diffusion container 4 times in a row.
Anyone Getting Bad Pods with Internet Issues?
I'm in the US, and I get a lot more bad pods with internet issues than working pods, roughly 7 out of 10. I'm trying to spot a community pod with an RTX 4090 and the default template pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04.
When I get a bad pod, I get error pulling image: Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
If the pod does run and I connect via SSH to set it up, I often run into problems with apt on Ubuntu and with Python pip.
Sometimes I get certificate errors, extremely slow speeds of less than 10 bytes per second, etc.
I have to keep launching different pods until I get a working one. Does anyone else have the same problem?...
Creating a pod by extending another pod
The existing Comfy pod is very basic, so each time I need to run my huge flows I have to either reinstall all the required custom nodes from scratch, or install them once and pay for disk storage. Would it instead be possible to create a new pod template after I install all my required nodes, so that I can later deploy a pod with all the required dependencies already in place?
Container Logs via the API or SDK
As far as I can see, there is no way to access container logs via the API, correct?
Training jobs using script
Hey, can anyone tell me whether RunPod lets me write a training script that can be run from anywhere, which creates a GPU instance and loads and saves my data to external cloud storage, like script mode in AWS SageMaker training? I need to train multiple models with different architectures this way to see which one performs best.
Possible to terminate pod from Within the pod?
I know you can terminate a pod from “outside” with runpodctl, but are there any options for a pod to self-terminate, triggered by its own docker image?
Or am I approaching this wrong, and is ‘best practice’ to have your pods report status back to, and be managed by, a script on your main PC w/ runpodctl?...
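A pod-side sketch of the self-termination idea, assuming the pod ID is available in the RUNPOD_POD_ID environment variable (as the reply below notes) and that runpodctl is installed in the image with an API key configured. The helper names are hypothetical; `runpodctl remove pod` deletes the pod, while `runpodctl stop pod` only stops it.

```python
import os
import subprocess


def build_terminate_command(pod_id: str, action: str = "remove") -> list[str]:
    """Build the runpodctl command line for stopping or removing a pod."""
    if action not in ("remove", "stop"):
        raise ValueError(f"unsupported action: {action}")
    return ["runpodctl", action, "pod", pod_id]


def terminate_self(dry_run: bool = True) -> list[str]:
    """Terminate the pod this script runs in; returns the command used.

    With dry_run=True the command is only built, not executed.
    """
    pod_id = os.environ.get("RUNPOD_POD_ID")
    if not pod_id:
        raise RuntimeError("RUNPOD_POD_ID is not set; not inside a pod?")
    cmd = build_terminate_command(pod_id)
    if not dry_run:
        # Assumes runpodctl is on PATH and authorised for this pod.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Call this at the end of the job; pass dry_run=False to really remove.
    print(" ".join(terminate_self(dry_run=True)))
```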
Solution:
One of them is RUNPOD_POD_ID, which you can use to remove/terminate/kill the pod.
How is runpod secret / environment vars for credentials more secure?
I'm looking at the runpod Secret feature for handling AWS credentials. It looks like 'best practice' for handling credentials in a docker image is to set them as environment variables; and Runpod's "Secrets" feature feeds into that.
Could anyone explain how using runpod's "Secrets" is more secure than just passing environment variables? If the security concern is to avoid writing your credentials directly into the image and instead pass them on launch with env vars, how do "Secrets" do anything more? Is it a feature for handling credentials within a runpod account managed by a team?...
Solution:
Yes, they are meant to keep keys secure in a team environment. With ENV variables all team members could view your keys in clear text in the template definition.
Get SSH Login Via API
When getting a pod via the API, the response does not include any information for connecting via Basic Terminal Access. Obviously the first part of the username is the pod ID, but I haven't been able to identify the numbers following the dash. How might you get this username via the API or programmatically?
ssh [email protected] -i ~/.ssh/id_ed25519
cbdf4581hxb1vy == pod ID...
Llama3 setup
Hi, everyone.
We are planning to deploy Llama3 for our app with millions of users.
How can we achieve this?
And which GPU series or cloud platforms are best for achieving high speed and scalability?...
BROKEN: TheLastBen Fast Stable Diffusion
2024-07-26T18:04:17.934600984Z --2024-07-26 18:04:17-- https://huggingface.co/datasets/TheLastBen/RNPD/raw/main/Notebooks.txt
2024-07-26T18:04:17.960726061Z Resolving huggingface.co (huggingface.co)... 65.9.95.31, 65.9.95.61, 65.9.95.114, ...
2024-07-26T18:04:17.964895834Z Connecting to huggingface.co (huggingface.co)|65.9.95.31|:443... connected.
2024-07-26T18:04:18.292202440Z HTTP request sent, awaiting response... 401 Unauthorized
2024-07-26T18:04:18.292233330Z ...
Solution:
The template was pulled a while ago: RunPod cancelled its contract with TheLastBen, so he removed the files from his repo, which broke it.
Network volume
Hi guys, I am new to Runpod. I am trying to set up a network volume, but I cannot see the "Connect to Jupyter Notebook" option after I deployed the GPU within the network volume. What did I miss?
ollama won't pull manifest - weird error.
On a RunPod pod I've tried the various ollama templates, and also installed ollama on a basic template.
I can run ollama serve, but in every case when I run ollama run <model> I get the error:
Error: pull model manifest: Get "https://registry.ollama.ai/v2/library/mistral-large/manifests/latest": dial tcp: lookup registry.ollama.ai on 127.0.0.11:53: read udp 127.0.0.1:59647->127.0.0.11:53: i/o timeout
...
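The error above is a DNS lookup timing out inside the container, before any connection to the registry is made. A minimal sketch for confirming that from within the pod, using only the standard library; the helper name `can_resolve` is an assumption for illustration.

```python
import socket


def can_resolve(hostname: str, port: int = 443) -> bool:
    """Return True if the system resolver can look up hostname."""
    try:
        return bool(socket.getaddrinfo(hostname, port))
    except socket.gaierror:
        # Raised for lookup failures, including resolver timeouts.
        return False


if __name__ == "__main__":
    for host in ("registry.ollama.ai", "huggingface.co"):
        status = "ok" if can_resolve(host) else "DNS lookup failed"
        print(f"{host}: {status}")
```

If every hostname fails, the problem is more likely the pod's DNS or network than ollama itself.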