TensorFlow RunPod Container
What happens when my spot instance is interrupted and then becomes available again?
Does the A6000 price change based on the number of GPUs?
Conditions under which invoice emails are sent
Who can I contact to get a RunPod invoice for more RunPod credits? (5k+)
How to add Python or API bindings for vLLM?
Bad file descriptor
ModuleNotFoundError: No module named 'diskcache'
pip install diskcache
Blocking ICMP?
My pod has randomly crashed several times today, and I received emails about RunPod issues.
Can't access Jupyterlab
This is the third time and there is no support for this issue. I lost all of my credits and time.
Will 2 GPUs fine-tune 2 times faster than 1 GPU on axolotl?
Very slow download via JupyterLab
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memor
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
when trying to train a model on RunPod with a large batch size. I can't reproduce the error locally.
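One way to check whether shared memory is the limiting factor is to look at the size of /dev/shm inside the pod. The following is a minimal sketch, assuming a Linux container where /dev/shm is mounted. The helper names `shm_usage` and `looks_too_small` and the 1 GiB threshold are illustrative assumptions, not RunPod or PyTorch recommendations.

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in bytes for the shared-memory mount.

    Returns None if the path does not exist (for example, on a non-Linux host).
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


def looks_too_small(total_bytes, threshold_bytes=1 << 30):
    """Heuristic: flag shared memory smaller than the threshold.

    The 1 GiB default is an arbitrary illustrative value.
    """
    return total_bytes < threshold_bytes


if __name__ == "__main__":
    info = shm_usage()
    if info is None:
        print("/dev/shm not found on this system")
    else:
        total, used, free = info
        print(f"/dev/shm total={total / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")
        if looks_too_small(total):
            print("Shared memory looks small for multi-worker data loading")
```

If the mount turns out to be small, two commonly used workarounds are lowering `num_workers` on the PyTorch `DataLoader` (with `num_workers=0`, batches are loaded in the main process) or giving the container more shared memory, which in plain Docker is done with `--shm-size` or `--ipc=host`. Whether and how a given pod template exposes that setting is not something this sketch can confirm.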
I found this https://github.com/pytorch/pytorch#docker-image and this https://pytorch.org/docs/stable/multiprocessing.html#strategy-management but I'm not sure how to fix the problem....
SSH Connection Refused
runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
with 6xH100s. I added my public key to
bash -c 'apt update;DEBIAN_FRONTEND=noninteractive apt-get install openssh-server -y;mkdir -p ~/.ssh;cd $_;chmod 700 ~/.ssh;echo "$PUBLIC_KEY" >> authorized_keys;chmod 700 authorized_keys;service ssh start;sleep infinity'
(of course replaced $PUBLIC_KEY with mine) and logged into the machine using the web terminal and checked that the authorized_keys file is correct. Yet I get connection refused when trying to connect. This is not the first RunPod pod I have set up (I used A100s and A40s before and both worked fine, but this is my first time with H100s)....
Unable to connect to Jupyter lab
Web terminal keeps closing connection for no reason