#H100 PCIe pods: NVIDIA driver/CUDA init issue (Torch can’t see GPU, H100 NVL OK)

humble skiff
#

Hey Runpod team, I’m seeing what looks like a driver/CUDA setup issue only on H100 PCIe pods (not H100 NVL).

Setup

  • Pod GPU: 1× H100 PCIe

  • Official template: runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404

  • pip freeze shows: torch==2.8.0+cu128

  • nvidia-smi works and reports:

    • Driver Version: 570.195.03 CUDA Version: 12.8

Problem
Even though nvidia-smi is OK, PyTorch CUDA init fails:

python - <<'PY'
import torch
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("device_count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY

Output on H100 PCIe:

torch: 2.8.0+cu128
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
  return torch._C._cuda_getDeviceCount() > 0
cuda available: False
device_count: 1

So torch.cuda.is_available() is False even though device_count shows 1, and it hits a “CUDA unknown error”.
I tried multiple times with a different pod each time.
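
One way to narrow this down is to call the CUDA driver API directly, bypassing PyTorch. The following is a minimal sketch, not a definitive diagnostic. It assumes the container exposes the driver library under the usual Linux name libcuda.so.1, and probe_cuinit is a hypothetical helper name. cuInit and cuGetErrorString are driver API functions. If cuInit itself returns a nonzero code, the failure sits below PyTorch.

```python
import ctypes


def probe_cuinit(lib_name="libcuda.so.1"):
    """Call cuInit(0) directly.

    Returns (code, message), or None if the driver library cannot be loaded.
    lib_name is an assumption: libcuda.so.1 is the usual name on Linux.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        # Driver library is not visible inside this container.
        return None

    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    libcuda.cuGetErrorString.argtypes = [
        ctypes.c_int,
        ctypes.POINTER(ctypes.c_char_p),
    ]
    libcuda.cuGetErrorString.restype = ctypes.c_int

    # 0 is CUDA_SUCCESS; 999 is CUDA_ERROR_UNKNOWN.
    code = libcuda.cuInit(0)

    # Ask the driver for a readable description of the result code.
    msg = ctypes.c_char_p()
    rc = libcuda.cuGetErrorString(code, ctypes.byref(msg))
    text = msg.value.decode() if rc == 0 and msg.value else "unrecognized error code"
    return code, text


if __name__ == "__main__":
    result = probe_cuinit()
    if result is None:
        print("libcuda.so.1 could not be loaded")
    else:
        print("cuInit returned %d (%s)" % result)
```

Running this on both a failing and a working pod and comparing the returned codes would show whether the driver itself refuses to initialize on the PCIe hosts.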

Control / comparison (works)
Same exact template, but GPU: 1× H100 NVL

  • nvidia-smi reports a different stack:

    • Driver Version: 550.107.02 CUDA Version: 12.4
  • Same script output:

torch: 2.8.0+cu128
cuda available: True
device_count: 1
device 0: NVIDIA H100 NVL

Conclusion
This seems specific to the H100 PCIe host driver/CUDA setup (570.195.03 / CUDA 12.8) or how the container is wired to the host on those nodes. NVL nodes (550.107.02 / CUDA 12.4) work fine with the same image.
Happy to provide a pod ID / logs if you tell me what you need.

rose pebble
#

Since nvidia-smi works but torch CUDA init fails, this smells like a container/runtime mismatch rather than a driver issue. On H100 PCIe specifically, I’d double-check the CUDA compatibility matrix and make sure the host driver actually matches what the runtime inside the container expects.
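
The version check suggested here can be sketched as follows. This is a rough illustration under stated assumptions. The helper names (parse_nvidia_smi_header, as_tuple, compare) are hypothetical, the input strings are the nvidia-smi lines quoted in this thread, and the build version 12.8 is read off the torch==2.8.0+cu128 wheel (torch.version.cuda reports the same value at runtime). The "minor-compat" label reflects the general rule that CUDA 12.x builds usually run on older 12.x-capable drivers; it is not a guarantee.

```python
import re


def parse_nvidia_smi_header(text):
    """Extract (driver_version, cuda_version) strings from nvidia-smi header text."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    if not driver or not cuda:
        raise ValueError("could not find driver/CUDA versions in input")
    return driver.group(1), cuda.group(1)


def as_tuple(version):
    """Turn '12.8' into (12, 8) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def compare(build_cuda, host_cuda):
    """Classify the host's CUDA version against the version the wheel was built for."""
    build, host = as_tuple(build_cuda)[:2], as_tuple(host_cuda)[:2]
    if host >= build:
        return "ok"            # host driver advertises at least the build version
    if host[0] == build[0]:
        return "minor-compat"  # same major, older minor: relies on minor-version compatibility
    return "too-old"           # host major version is older than the build


if __name__ == "__main__":
    build = "12.8"  # from torch==2.8.0+cu128
    hosts = {
        "H100 PCIe": "Driver Version: 570.195.03 CUDA Version: 12.8",
        "H100 NVL": "Driver Version: 550.107.02 CUDA Version: 12.4",
    }
    for name, header in hosts.items():
        driver, cuda = parse_nvidia_smi_header(header)
        print(name, driver, cuda, compare(build, cuda))
```

By this check the failing PCIe host already advertises the version the wheel was built for, while the working NVL host is the older one, so a plain version mismatch alone would not explain the failure.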

mossy citrus
#

I think I have the exact same issue on the L40S.

neon spear
#

Hmm, that's weird: torch is built for 12.8. Oh, you were using 12.4 on the host.

#

Use the CUDA filter in Runpod's advanced pod filters

#

To use 12.8 and above only