n00b multi gpu question
Hello hello!
I created a 4-GPU pod (screenshot), then asked PyTorch what devices it saw, and it only saw one - what's the dumb thing I'm missing?
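For reference, this is roughly the check I ran (my exact notebook cell may have looked a bit different):
import torch
print(torch.cuda.is_available())   # True
print(torch.cuda.device_count())   # expected 4, but it prints 1
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))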
Thanks 🙂
Maybe it's the env variables
check "how to use multiple gpus linux" on google
export CUDA_VISIBLE_DEVICES=4
try to export that env var
Thanks!!!!
Did it work?
Solution
Alright so, I restarted the pod (with the env var you suggested) and CUDA reported zero GPUs
Then I removed the env var, restarted, and CUDA now reports four GPUs, with no change to the previous code/config
Either:
- somehow the pip install commands messed up CUDA, and restarting fixed that
- RunPod is flaky about whether the GPUs get attached or not
I'll update this thread if I see flakiness
My current money is on one of the pip installs (Hugging Face, Unsloth) having re-installed PyTorch and broken the pod's setup
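Side note in case someone else hits this: I think the zero-GPU run is actually explained by the env var itself. CUDA_VISIBLE_DEVICES takes GPU indices, not a count, so on a four-GPU pod (indices 0-3) setting it to 4 points at a device that doesn't exist and hides everything. Rough sketch of what I mean (the variable has to be set before torch initializes CUDA):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # expose all four GPUs; "4" would reference a nonexistent fifth one
import torch
print(torch.cuda.device_count())  # 4 here, 0 if the variable had been "4"
Leaving the variable unset entirely has the same effect as listing every index.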
not sure what you are trying to do
training LLMs via the Hugging Face DPO trainer
initially installing Hugging Face and Unsloth
!pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes datasets
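Since unsloth/xformers pulling in their own PyTorch build is exactly what I suspect above, a quick sanity check after the installs would be something like (illustrative, not copied from my notebook):
import torch
print(torch.__version__, torch.version.cuda)  # confirm it's still a CUDA 12.1 build, not CPU-only
print(torch.cuda.is_available(), torch.cuda.device_count())  # should say True and 4 on this pod
And to actually spread the DPO run across all four GPUs, the usual route with the Hugging Face stack is launching the training script under accelerate, e.g. accelerate launch --num_processes 4 train_dpo.py (train_dpo.py is just a placeholder name here).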
anyway, I think I'm good now, thank you 🙂