n00b multi gpu question
Hello hello!
I created a 4-GPU pod (screenshot), then asked PyTorch what devices it saw, and it only saw one - what's the dumb thing I'm missing?
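For reference, this is roughly the check I ran (my exact notebook cell may have looked a bit different):
import torch
print(torch.cuda.is_available())   # True
print(torch.cuda.device_count())   # expected 4, but it prints 1
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))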
Thanks 🙂
Maybe it's the env variables
check "how to use multiple gpus linux" on google
export CUDA_VISIBLE_DEVICES=4
try to export that env var
Thanks!!!!
Did it work?
Solution
Alright so, I restarted the pod (with the env var you suggested) and CUDA reported zero GPUs
Then I removed the env var, restarted, and CUDA now reports four GPUs, with no change to the previous code/config
Either:
- somehow the pip install commands messed up CUDA, and restarting fixed that
- RunPod is flaky about whether the GPUs get attached or not
I'll update this thread if I see flakiness
My current money is on one of the pip installs (Hugging Face, Unsloth) having re-installed PyTorch and broken the pod's setup
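Side note in case someone else hits this: I think the zero-GPU run is actually explained by the env var itself. CUDA_VISIBLE_DEVICES takes GPU indices, not a count, so on a four-GPU pod (indices 0-3) setting it to 4 points at a device that doesn't exist and hides everything. Rough sketch of what I mean (the variable has to be set before torch initializes CUDA):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # expose all four GPUs; "4" would reference a nonexistent fifth one
import torch
print(torch.cuda.device_count())  # 4 here, 0 if the variable had been "4"
Leaving the variable unset entirely has the same effect as listing every index.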
not sure what you are trying to do
training LLMs via the Hugging Face DPO trainer
initially installing Hugging Face and Unsloth
!pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes datasets
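Since unsloth/xformers pulling in their own PyTorch build is exactly what I suspect above, a quick sanity check after the installs would be something like (illustrative, not copied from my notebook):
import torch
print(torch.__version__, torch.version.cuda)  # confirm it's still a CUDA 12.1 build, not CPU-only
print(torch.cuda.is_available(), torch.cuda.device_count())  # should say True and 4 on this pod
And to actually spread the DPO run across all four GPUs, the usual route with the Hugging Face stack is launching the training script under accelerate, e.g. accelerate launch --num_processes 4 train_dpo.py (train_dpo.py is just a placeholder name here).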
anyway, I think I'm good now, thank you 🙂