Hey Runpod team, I’m seeing what looks like a driver/CUDA setup issue only on H100 PCIe pods (not H100 NVL).
Setup
- Pod GPU: 1× H100 PCIe
- Official template: runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404
- pip freeze shows: torch==2.8.0+cu128
- nvidia-smi works and reports: Driver Version: 570.195.03 CUDA Version: 12.8
Problem
Even though nvidia-smi is OK, PyTorch CUDA init fails:
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("device_count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))
PY
Output on H100 PCIe:
torch: 2.8.0+cu128
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
  return torch._C._cuda_getDeviceCount() > 0
cuda available: False
device_count: 1
So torch.cuda.is_available() returns False even though torch.cuda.device_count() reports 1, and CUDA initialization fails with "CUDA unknown error".
I reproduced this several times, on a different H100 PCIe pod each time.
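PyTorch only surfaces a generic "unknown error" here, so one way to narrow it down might be to call the CUDA driver API directly and look at the raw result of cuInit. The snippet below is a minimal sketch, not something I have run on the failing pods. It assumes the driver library can be loaded as libcuda.so.1 and that /proc/self/maps is readable (Linux only). The helper names probe_cuda_driver and loaded_libcuda_paths are made up for illustration.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) via ctypes and return (code, name).

    Returns (None, reason) if the driver library cannot be loaded.
    lib_name is an assumption; adjust it if the library lives elsewhere.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        return None, f"could not load {lib_name}: {exc}"

    # CUresult cuInit(unsigned int flags)
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    code = libcuda.cuInit(0)

    # CUresult cuGetErrorName(CUresult error, const char **pStr)
    libcuda.cuGetErrorName.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
    libcuda.cuGetErrorName.restype = ctypes.c_int
    name = ctypes.c_char_p()
    if libcuda.cuGetErrorName(code, ctypes.byref(name)) == 0 and name.value:
        return code, name.value.decode()
    return code, "unrecognized CUresult"


def loaded_libcuda_paths(maps_path="/proc/self/maps"):
    """List the libcuda.so files mapped into this process (empty list if unreadable)."""
    try:
        with open(maps_path) as maps:
            return sorted({line.split()[-1] for line in maps if "libcuda.so" in line})
    except OSError:
        return []


if __name__ == "__main__":
    print("cuInit result:", probe_cuda_driver())
    print("libcuda mapped from:", loaded_libcuda_paths())
```

Running this on both an H100 PCIe pod and an H100 NVL pod would show whether cuInit itself fails outside PyTorch, and whether the two pods load libcuda from the same path.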
Control / comparison (works)
- Same exact template, but GPU: 1× H100 NVL
- nvidia-smi reports a different stack: Driver Version: 550.107.02 CUDA Version: 12.4
- Same script output:
torch: 2.8.0+cu128
cuda available: True
device_count: 1
device 0: NVIDIA H100 NVL
Conclusion
This seems specific to the H100 PCIe host driver/CUDA setup (570.195.03 / CUDA 12.8) or how the container is wired to the host on those nodes. NVL nodes (550.107.02 / CUDA 12.4) work fine with the same image.
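To compare the container-to-host wiring on the two node types, a small snapshot of the GPU-related environment could be collected on each pod and diffed. This is a rough sketch under assumptions: the environment variables listed are ones I am guessing are relevant, the /dev/nvidia* glob assumes the usual device node naming, and collect_gpu_env is a hypothetical helper name.

```python
import glob
import os
import shutil
import subprocess

# Variables that commonly influence GPU visibility in containers (an assumed list).
ENV_VARS = (
    "CUDA_VISIBLE_DEVICES",
    "NVIDIA_VISIBLE_DEVICES",
    "NVIDIA_DRIVER_CAPABILITIES",
    "LD_LIBRARY_PATH",
)


def collect_gpu_env():
    """Return a dict with GPU-related env vars, device nodes, and nvidia-smi output."""
    report = {
        "env": {name: os.environ.get(name) for name in ENV_VARS},
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        "nvidia_smi": None,
    }
    if shutil.which("nvidia-smi"):
        try:
            proc = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
                capture_output=True,
                text=True,
                timeout=30,
            )
            report["nvidia_smi"] = proc.stdout.strip() or proc.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            report["nvidia_smi"] = f"nvidia-smi failed: {exc}"
    return report


if __name__ == "__main__":
    for key, value in collect_gpu_env().items():
        print(f"{key}: {value}")
```

One thing that may be worth checking in the output is whether the same device nodes (for example /dev/nvidia-uvm) appear on both pod types, since nvidia-smi can work even when a node that CUDA needs is missing.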
Happy to provide a pod ID / logs if you tell me what you need.