H100 PCIe pods: NVIDIA driver/CUDA init issue (Torch can’t see GPU, H100 NVL OK) | Runpod | Page 1

humble skiff Feb 5, 2026, 9:15 AM

Hey Runpod team, I’m seeing what looks like a driver/CUDA setup issue only on H100 PCIe pods (not H100 NVL).

Setup

Pod GPU: 1× H100 PCIe
Official template: runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404
pip freeze shows: torch==2.8.0+cu128
nvidia-smi works and reports:
- Driver Version: 570.195.03 CUDA Version: 12.8

Problem
Even though nvidia-smi is OK, PyTorch CUDA init fails:

python - <<'PY'
import torch
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("device_count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY

Output on H100 PCIe:

torch: 2.8.0+cu128
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
  return torch._C._cuda_getDeviceCount() > 0
cuda available: False
device_count: 1

So torch.cuda.is_available() is False even though device_count shows 1, and it hits a “CUDA unknown error”.
I tried multiple times with a different pod each time.
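
A minimal sketch for narrowing this down: it bypasses PyTorch and calls the CUDA driver API's cuInit directly through ctypes, to see whether the failure comes from the driver layer or from torch. The helper name cuda_driver_init_status is hypothetical, and the sketch assumes libcuda.so.1 is the driver library exposed inside the container.

```python
import ctypes


def cuda_driver_init_status(lib_name="libcuda.so.1"):
    """Call cuInit(0) from the CUDA driver API and return (code, detail).

    code is None when the driver library cannot be loaded, 0 on success,
    and a non-zero CUresult value otherwise.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        return None, f"could not load {lib_name}: {exc}"

    code = libcuda.cuInit(0)
    if code == 0:
        return 0, "CUDA_SUCCESS"

    # Ask the driver for the symbolic name of the error code.
    name = ctypes.c_char_p()
    libcuda.cuGetErrorName(code, ctypes.byref(name))
    detail = name.value.decode() if name.value else "unrecognized error code"
    return code, detail


if __name__ == "__main__":
    print(cuda_driver_init_status())
```

A code of 999 corresponds to CUDA_ERROR_UNKNOWN in the driver API, which would suggest the failure sits below PyTorch rather than in the torch build.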

Control / comparison (works)
Same exact template, but GPU: 1× H100 NVL

nvidia-smi reports a different stack:
- Driver Version: 550.107.02 CUDA Version: 12.4
Same script output:

torch: 2.8.0+cu128
cuda available: True
device_count: 1
device 0: NVIDIA H100 NVL

Conclusion
This seems specific to the H100 PCIe host driver/CUDA setup (570.195.03 / CUDA 12.8) or how the container is wired to the host on those nodes. NVL nodes (550.107.02 / CUDA 12.4) work fine with the same image.
Happy to provide a pod ID / logs if you tell me what you need.

rose pebble Feb 6, 2026, 6:43 PM

Since nvidia-smi works but torch CUDA init fails, this looks like a container/runtime mismatch rather than a driver issue. On H100 PCIe specifically, I’d double-check the CUDA compatibility matrix and make sure the host driver actually matches what the runtime inside the container expects.
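
One way to check the container side, as a rough sketch: list the NVIDIA device nodes and the related environment variables visible inside the pod. As far as I know, nvidia-smi only needs /dev/nvidiactl and the per-GPU node, while CUDA also needs /dev/nvidia-uvm, so a missing node would explain the split. The expected node list (single GPU at index 0) and the helper names are assumptions for illustration.

```python
import glob
import os

# Assumed minimum set of device nodes for a single-GPU CUDA workload.
EXPECTED_NODES = ("/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia0")

# Environment variables commonly set by the NVIDIA container runtime or the user.
ENV_KEYS = ("NVIDIA_VISIBLE_DEVICES", "NVIDIA_DRIVER_CAPABILITIES", "CUDA_VISIBLE_DEVICES")


def missing_nodes(present, expected=EXPECTED_NODES):
    """Return the expected device nodes that are absent from `present`."""
    present = set(present)
    return [node for node in expected if node not in present]


def container_gpu_report():
    """Collect device nodes and GPU-related environment variables."""
    present = sorted(glob.glob("/dev/nvidia*"))
    env = {key: os.environ.get(key) for key in ENV_KEYS}
    return {"present": present, "missing": missing_nodes(present), "env": env}


if __name__ == "__main__":
    for key, value in container_gpu_report().items():
        print(f"{key}: {value}")
```

Running this on both an H100 PCIe pod and an H100 NVL pod and comparing the two reports should show whether the containers are wired differently.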

humble skiff Feb 7, 2026, 12:37 PM

rose pebble Looks like Since nvidia-smi works but torch CUDA init fails, this smells like a ...

I tried all the official Runpod images/templates

mossy citrus Feb 7, 2026, 12:58 PM

I think I have the exact same issue on an L40S.

rose pebble Feb 7, 2026, 1:59 PM

humble skiff I tried all the official Runpod images/templates

Interesting. If nvidia-smi works but torch CUDA init fails across official images, that points to runtime wiring on the PCIe hosts. Might be worth checking the exact driver version against the CUDA 12.8 runtime those images expect.
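
A small sketch of that check, using the nvidia-smi values posted above. It compares the CUDA version the driver reports with the CUDA version the torch build targets. The function names and the three labels are made up for illustration; the "same major" case relies on CUDA's minor-version compatibility, which I believe applies within the 12.x line.

```python
import re


def parse_version(text):
    """Extract (major, minor) from a string such as '12.8'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def driver_cuda_version(smi_header):
    """Pull the 'CUDA Version: X.Y' field out of nvidia-smi header text."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_header)
    return parse_version(match.group(1)) if match else None


def classify(driver_cuda, runtime_cuda):
    """Label how the driver's CUDA version relates to the runtime's."""
    if driver_cuda >= runtime_cuda:
        return "driver-covers-runtime"
    if driver_cuda[0] == runtime_cuda[0]:
        return "same-major-minor-compat"
    return "driver-too-old"


pcie = driver_cuda_version("Driver Version: 570.195.03 CUDA Version: 12.8")
nvl = driver_cuda_version("Driver Version: 550.107.02 CUDA Version: 12.4")
runtime = parse_version("12.8")  # e.g. torch.version.cuda inside the container

print(classify(pcie, runtime))  # driver-covers-runtime
print(classify(nvl, runtime))   # same-major-minor-compat
```

On the values posted in this thread, neither host comes out as too old, which fits the observation that the failing PCIe host is the one with the newer driver.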

viscid solarBOT Feb 8, 2026, 2:56 PM

humble skiff Hey Runpod team, I’m seeing what looks like a driver/CUDA setup issue only on ...

@humble skiff

Escalated To Zendesk

The thread has been escalated to Zendesk!

neon spear Feb 8, 2026, 3:00 PM

Hmm, that's weird. 12.8 torch... oh, you were using 12.4 on the host.

Use the CUDA filter in Runpod's advanced pod filters to select 12.8 and above only.
