RunPod

We're a community of enthusiasts, engineers, and enterprises, all sharing insights on AI, Machine Learning and GPUs!

⚡|serverless

⛅|pods

Tensorflow Runpod Container

Given this container template: https://github.com/runpod/containers/blob/main/official-templates/tensorflow/Dockerfile and the command flow in https://hub.docker.com/layers/runpod/tensorflow/1.0.3/images/sha256-5b5f23a1e7e81eeb26a85c6e95a9f7f1b664936244d9d8047c23c93c3eb1d7c6?context=explore, is the template up to date? I want to make some modifications to the template, and I am curious whether I will get the same TensorFlow functionality as in container version 1.0.3. Small feedback would be appre...

What would happen when my spot is interrupted and then the spot is back?

Will it: 1. re-run my interrupted script and cost money, 2. not re-run and still cost money, or 3. just stop, do nothing, and not cost money for the GPU...
Solution:
Option 3. You have to start it again yourself.
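Since the pod has to be started again by hand, a script that records its progress can pick up where it stopped instead of starting over. A minimal sketch, assuming the checkpoint file lives on storage that survives the interruption; the names `load_step`, `save_step`, and `run` and the JSON layout are hypothetical, not part of any RunPod API:

```python
import json
import os
import tempfile


def load_step(path):
    """Return the number of completed steps, or 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def save_step(path, step):
    """Write the checkpoint to a temp file, then rename it into place.

    os.replace is atomic, so an interruption mid-write cannot leave a
    half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def run(path, total_steps, work=lambda step: None):
    """Run the remaining steps and return the steps executed in this call."""
    executed = []
    for step in range(load_step(path), total_steps):
        work(step)                 # placeholder for one unit of real work
        save_step(path, step + 1)  # record progress after each unit
        executed.append(step)
    return executed
```

After a restart, calling `run` with the same checkpoint path skips the steps that already finished.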

A6000 price change based on # GPUS?

Steps to reproduce: 1. Go to community cloud 2. Select A6000 (price 0.69/hr) 3. Change count to 2 (price 1.58/hr -- which is 0.79/hr per gpu!)...
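The arithmetic in the report can be checked in a few lines. This sketch only uses the two prices quoted above; the function name is hypothetical:

```python
def per_gpu_price(total_per_hour, gpu_count):
    """Hourly price per GPU, rounded to cents."""
    return round(total_per_hour / gpu_count, 2)


single_gpu = 0.69      # quoted price for 1x A6000
two_gpu_total = 1.58   # quoted price for 2x A6000

per_gpu = per_gpu_price(two_gpu_total, 2)
premium = round(per_gpu - single_gpu, 2)  # extra cost per GPU per hour
```

With these numbers `per_gpu` is 0.79 and `premium` is 0.1, matching the difference described in the post.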

Conditions under which invoice emails are sent

I get these emails frequently, and I'm not sure under what conditions I'm supposed to get them. Can you tell me?

Who can I contact to get a runpod invoice for more runpod credits? (5k+)

I have tried contacting [email protected]. 7 days and no reply. Are there any other contacts I can try? Thanks.

How to add Python or API bindings for a vLLM?

I want my vLLM to be able to execute some Python code or call an API. How do I get it to do this?

Bad file descriptor

I deployed several CPU pods with a network volume, and at first, they work well. But after a few hours, with some of them, I get a "Bad file descriptor" error when I try to access "/workspace"...

ModuleNotFoundError: No module named 'diskcache'

I receive this error when trying to run the Stable Diffusion cell in the Jupyter notebook for RP's Fast Stable Diffusion:
Solution:
Try this in your pod's CLI:
pip install diskcache
...
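If the error persists after installing, one possible cause is that `pip` installed into a different Python than the one the notebook kernel runs. A minimal sketch that checks from inside the notebook and installs with the kernel's own interpreter; the helper names are hypothetical:

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """True if the current interpreter can find the top-level module."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(module_name, package_name=None):
    """Install the package with this interpreter's pip if the module is missing.

    Returns True if an install was attempted, False if nothing was needed.
    """
    if is_installed(module_name):
        return False
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", package_name or module_name]
    )
    return True
```

In a notebook cell this would be called as `ensure_installed("diskcache")`. Using `sys.executable -m pip` ties the install to the interpreter that is actually running the cell.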

Blocking ICMP?

I'm trying to set up monitoring for the pod I've rented and can't seem to ping it. Looks like you're only allowing TCP connections? If so, is there any way I can get around this?
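If only TCP is reachable, a monitor can test whether a TCP port accepts connections instead of sending ICMP pings. A minimal sketch using the standard library; the function name is hypothetical, and the host and port would be whatever TCP endpoint the pod exposes:

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

This only shows that something is listening on the port, not that the service behind it is healthy, so it is a stand-in for ping rather than a full health check.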

My pod has randomly crashed several times today, and I received emails about RunPod issues.

Today, my pod has crashed a few times, to the point where I'm receiving emails from RunPod about the issues. How can I fix this?
Solution:
@rethinkstudios#001 apt-get install google-perftools...

Can't access Jupyterlab

I can't access Jupyterlab, can still use the SD webgui but can't access my data. Is there some way I can recover my workspace?

This is the third time with no support for this issue. I lost all of my credits and time.

I urge you to stay away from the RunPod system: one day you will no longer have access to all the data you have put into the secure and community cloud. I urge you not to use it because there is no solution. I'm going to write this to all our communities who use it. Thank you for not even replying to messages or issues.

Spend limit

Hi, it's my first time here. How can I raise the $30 per hour limit on my account?...

Will 2 GPUs fine-tune 2 times faster than 1 GPU on axolotl?

Will 2 GPUs fine-tune 2 times faster than 1 GPU on axolotl?
Solution:
It seems

Very slow download via JupyterLab

Hey, I need to transfer rather large files from my Pod to my local machine. I am unsure how to set up SFTP (maybe that's faster?). Restarting doesn't fix the issue. What else can I try?

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memor

Hi, I keep getting "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)." when trying to train a model on RunPod with a large batch size. I can't reproduce the error locally. I found this https://github.com/pytorch/pytorch#docker-image and this https://pytorch.org/docs/stable/multiprocessing.html#strategy-management but I'm not sure how to fix the problem....
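A first diagnostic step is to see how much shared memory the container actually has, since data loader workers pass batches to the main process through it. A minimal sketch, assuming a Linux container where shared memory is mounted at `/dev/shm`; the helper names are hypothetical and the sizing rule is a rough estimate, not PyTorch's exact accounting:

```python
import shutil


def shm_report(path="/dev/shm"):
    """Return total and free space at the shared-memory mount, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {"total_gib": usage.total / gib, "free_gib": usage.free / gib}


def workers_fit(batch_bytes, num_workers, prefetch_factor, free_bytes):
    """Rough check: every worker may hold prefetch_factor batches at once.

    This is an assumption used for estimation only; real usage also depends
    on the dataset and collate function.
    """
    return batch_bytes * num_workers * prefetch_factor <= free_bytes
```

If the estimate does not fit, the usual options are a smaller batch size, fewer data loader workers (with `num_workers=0` the loader runs in the main process and does not hand batches across processes), or a container with a larger shared-memory allocation.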

SSH Connection Refused

I'm using template runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 with 6xH100s. I added my public key to bash -c 'apt update;DEBIAN_FRONTEND=noninteractive apt-get install openssh-server -y;mkdir -p ~/.ssh;cd $_;chmod 700 ~/.ssh;echo "$PUBLIC_KEY" >> authorized_keys;chmod 700 authorized_keys;service ssh start;sleep infinity' (of course replaced $PUBLIC_KEY with mine) and logged into the machine using the web terminal and checked that the authentication_key is correct. Yet I get connection refused when trying to connect. This is not the first runpod I set up (I did A100s and A40s before and both worked fine but first time for H100s)....

Unable to connect to Jupyter lab

It seems Jupyter Lab has crashed on my pod after a job had been running for around 2 days. This is unfortunate. Is there any way I can restart Jupyter Lab so that I can resume training? Is it also possible that my process is still running despite Jupyter Lab having crashed?
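Whether the training process survived can be checked from any terminal on the pod if its process ID is known. A minimal sketch for Linux, assuming the PID was noted earlier or found with a process listing; the function name is hypothetical:

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID currently exists (Linux/Unix).

    Signal 0 performs the existence and permission checks without
    actually delivering a signal to the process.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process
    except PermissionError:
        return True    # process exists but belongs to another user
    return True
```

A process started from a notebook kernel may have been stopped along with it, so a `False` result here is expected in that case.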

Web terminal keeps closing connection for no reason

I have an on demand GPU pod deployed and I'm running a shell script that's training a model through the web terminal. Systematically, every roughly 1h40m, the web terminal dies with the message "Connection closed", for seemingly no reason. This is very frustrating as I'm paying for on-demand specifically because I want to be able to leave it training for a long period unattended. What can be done to fix this?
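One way to keep a job from depending on the web terminal is to launch it in its own session with output going to a log file, so closing the terminal does not take the job down with it. A minimal sketch using the standard library on Linux; the function name, script name, and log path are hypothetical:

```python
import subprocess
import sys


def start_detached(command, log_path):
    """Start command in a new session and append its output to log_path.

    start_new_session=True runs the child in a new session (setsid), so it
    is not tied to the terminal that launched it.
    """
    log = open(log_path, "ab")
    proc = subprocess.Popen(
        command,
        stdin=subprocess.DEVNULL,
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,
    )
    log.close()  # the child keeps its own copy of the file descriptor
    return proc
```

For example, `start_detached(["bash", "train.sh"], "/workspace/train.log")` would start a hypothetical `train.sh` and return immediately. Progress can then be followed by reading the log file from a fresh terminal.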

No module named 'axolotl.cli

I get No module named 'axolotl.cli