#n00b multi gpu question

sullen sigil
#

Hello hello!

I created a 4 gpu pod (screenshot), then asked pytorch what devices it saw, and it just saw one - what's the dumb thing i'm missing?

Thanks 🙂

west blaze
#

check how to use multiple gpus linux
on google

#

export CUDA_VISIBLE_DEVICES=4

#

try to export that env var
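
(Note on that env var: CUDA_VISIBLE_DEVICES takes a comma-separated list of zero-based device indices, not a GPU count. On a 4 GPU pod the valid indices are 0 to 3, so the value 4 points at a fifth device that does not exist. The sketch below is a simplified model of the documented rule that devices listed after an invalid index are hidden too. `visible_devices` is a made-up helper for illustration, not a CUDA or PyTorch function, and it ignores the UUID form of the variable.)

```python
def visible_devices(value, physical_count):
    """Model which physical GPU indices a CUDA process would see.

    Hypothetical helper: `value` is the CUDA_VISIBLE_DEVICES string
    (None if unset), `physical_count` is the number of GPUs on the machine.
    """
    if value is None:
        # Variable unset: every physical GPU is visible.
        return list(range(physical_count))
    visible = []
    for token in value.split(","):
        token = token.strip()
        # Stop at the first invalid entry; later entries are ignored as well.
        if not token.isdigit() or int(token) >= physical_count:
            break
        visible.append(int(token))
    return visible


print(visible_devices("4", 4))        # [] -> no GPUs visible
print(visible_devices("0,1,2,3", 4))  # [0, 1, 2, 3]
print(visible_devices(None, 4))       # [0, 1, 2, 3]
```

So to expose all four GPUs explicitly it would be `export CUDA_VISIBLE_DEVICES=0,1,2,3`, or just leave the variable unset.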

sullen sigil
#

Thanks!!!!

west blaze
#

Did it work

lusty spruce
#

import torch

if torch.cuda.is_available():
    # Query each visible device by index; no device objects are needed.
    for i in range(torch.cuda.device_count()):
        device_name = torch.cuda.get_device_name(i)
        props = torch.cuda.get_device_properties(i)
        allocated = torch.cuda.memory_allocated(i)
        reserved = torch.cuda.memory_reserved(i)
        # clock_rate (kHz) may be missing on some PyTorch versions.
        clock_khz = getattr(props, "clock_rate", None)

        print(f"GPU {i}: {device_name}")
        print(f"  Total Memory: {props.total_memory / 1024 ** 3:.2f} GB")
        print(f"  Compute Capability: {props.major}.{props.minor}")
        print(f"  Multiprocessor Count: {props.multi_processor_count}")
        if clock_khz is not None:
            print(f"  Clock Rate: {clock_khz / 1e6:.2f} GHz")
        print(f"  Memory Allocated: {allocated / 1024 ** 2:.2f} MB")
        print(f"  Memory Reserved: {reserved / 1024 ** 2:.2f} MB")
else:
    print("CUDA is not available. Only CPU is available.")

sullen sigil
#

Alright so, I restarted the pod (with the env var you suggested) and CUDA reported zero GPUs

Then I removed the env var, restarted, and CUDA now reports four GPUs, with no change to the previous code/config

Either:

  • somehow the pip install commands messed up CUDA, and restarting fixed that
  • RunPod is flaky about whether the GPUs get attached or not
#

I'll update this thread if i see flakiness

#

My current money is on one of the pip installs (Hugging Face, unsloth) having reinstalled PyTorch and broken the pod's setup
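
(One way to check that guess, as a rough sketch: record the installed package versions before the pip installs and compare afterwards. `snapshot` and `changed` are made-up helper names, and the package list is only an example. It relies on the standard library's `importlib.metadata` and does not touch CUDA at all.)

```python
from importlib import metadata


def snapshot(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def changed(before, after):
    """Return {name: (old, new)} for packages whose version differs."""
    return {
        name: (before.get(name), after.get(name))
        for name in after
        if before.get(name) != after.get(name)
    }


watched = ["torch", "xformers", "bitsandbytes"]  # example list, adjust as needed
before = snapshot(watched)
# ... run the pip install commands here ...
after = snapshot(watched)
print(changed(before, after))  # an entry for torch would support the guess
```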

lusty spruce
#

not sure what you are trying to do

sullen sigil
#

training LLMs via hugging face DPO trainer

#

initially installing hugging face and unsloth

#

!pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes datasets

#

anyway, i think i'm good now, thank you 🙂