RunPod•4w ago
David Mack

n00b multi gpu question

Hello hello! I created a 4-GPU pod (screenshot), then asked PyTorch what devices it saw, and it only saw one. What's the dumb thing I'm missing? Thanks 🙂
nerdylive•4w ago
Maybe it's the environment variables. Check Google for how to use multiple GPUs on Linux, then try exporting this env var: export CUDA_VISIBLE_DEVICES=4
David Mack•4w ago
Thanks!!!!
nerdylive•4w ago
Did it work?
Madiator2011•4w ago
import torch

if torch.cuda.is_available():
    gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]

    for i, gpu in enumerate(gpus):
        device_name = torch.cuda.get_device_name(i)
        props = torch.cuda.get_device_properties(i)
        allocated = torch.cuda.memory_allocated(i)
        reserved = torch.cuda.memory_reserved(i)

        print(f"GPU {i}: {device_name}")
        print(f" Total Memory: {props.total_memory / 1024 ** 3:.2f} GB")
        print(f" Compute Capability: {props.major}.{props.minor}")
        print(f" Multiprocessor Count: {props.multi_processor_count}")
        print(f" Clock Rate: {props.clock_rate / 1e6} GHz")
        print(f" Memory Allocated: {allocated / 1024 ** 2:.2f} MB")
        print(f" Memory Reserved: {reserved / 1024 ** 2:.2f} MB")
else:
    print("CUDA is not available. Only CPU is available.")
Solution
David Mack•4w ago
Alright, so I restarted the pod (with the env var you suggested) and CUDA reported zero GPUs. Then I removed the env var, restarted, and CUDA now reports four GPUs, with no change from the previous code/config. Either:
- somehow the pip install commands messed up CUDA, and restarting fixed that
- RunPod is flaky about whether the GPUs get attached or not
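Editor's note: the zero-GPU result with the env var set matches how CUDA reads CUDA_VISIBLE_DEVICES. The value is a comma-separated list of device indices, not a count, so on a 4-GPU pod (indices 0 to 3) the value 4 names a device that does not exist. The sketch below is a simplified pure-Python model of that rule, not CUDA's own code. The helper name visible_devices is hypothetical, and it handles only integer indices (no GPU UUIDs).

```python
def visible_devices(env_value, physical_count):
    """Simplified model of CUDA_VISIBLE_DEVICES (hypothetical helper).

    env_value is the variable's string value, or None if it is unset.
    Returns the physical GPU indices a CUDA program would be able to see.
    """
    if env_value is None:
        # Variable unset: every physical GPU is visible.
        return list(range(physical_count))

    visible = []
    for token in env_value.split(","):
        token = token.strip()
        # An invalid entry hides itself and every entry after it.
        if not token.isdigit() or int(token) >= physical_count:
            break
        visible.append(int(token))
    return visible


print(visible_devices("4", 4))        # the value suggested in this thread
print(visible_devices("0,1,2,3", 4))  # all four GPUs listed by index
print(visible_devices(None, 4))       # variable not set
```

Under this model, exposing all four GPUs means either leaving the variable unset or setting it to 0,1,2,3.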
David Mack•4w ago
I'll update this thread if I see flakiness. My current money is on one of the pip installs (Hugging Face, unsloth) having re-installed PyTorch and broken the pod's setup.
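Editor's note: one way to test the "pip re-installed PyTorch" theory is to record the installed package versions before and after running the pip commands and compare them. The sketch below uses only the standard library's importlib.metadata. The helper names installed_version, snapshot, and changed are hypothetical, and the package list is just an example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def snapshot(packages):
    """Map each package name to its currently installed version."""
    return {name: installed_version(name) for name in packages}


def changed(before, after):
    """Return {name: (old, new)} for packages whose version differs."""
    return {
        name: (before[name], after.get(name))
        for name in before
        if before[name] != after.get(name)
    }


# Take a snapshot, run the pip installs, then take another and compare.
before = snapshot(["torch", "xformers"])
after = snapshot(["torch", "xformers"])
print(changed(before, after))
```

If torch shows up in the output of changed, one of the installs replaced the PyTorch build that shipped with the pod image, which would support the theory above.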
Madiator2011•4w ago
not sure what you are trying to do
David Mack•4w ago
Training LLMs via the Hugging Face DPO trainer. I started by installing Hugging Face and unsloth:
!pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes datasets
Anyway, I think I'm good now, thank you 🙂