#CUDA initialization fails on one RTX 5090 serverless worker but succeeds on another identical worker

surreal plinth

One serverless worker on an endpoint fails to initialize CUDA under PyTorch 2.11.0+cu128 on an RTX 5090, while a second worker in the same endpoint runs the identical workload successfully.
The only observed difference between the two workers is the NVIDIA driver build:

  • Failing worker: 580.126.09
  • Working worker: 580.126.20

On the failing worker, the pod accepts jobs but loops during startup because CUDA initialization never succeeds, so affected requests consume their retry budget without ever completing.
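One way to stop the loop from silently eating retries is to probe CUDA a bounded number of times at startup and give up explicitly. The sketch below is only an illustration: `wait_for_cuda` and `default_probe` are hypothetical helper names, not part of the handler or of any platform SDK, and what the worker should do after a failed probe (exit, report unhealthy, and so on) is left open. It relies on `torch.cuda.init()` raising `RuntimeError` when initialization fails, which is the error shown in the traceback further down.

```python
import sys
import time


def default_probe():
    """Force CUDA initialization; raises RuntimeError if the driver call fails."""
    import torch  # imported lazily so this helper has no hard dependency on torch

    torch.cuda.init()
    return torch.cuda.device_count() > 0


def wait_for_cuda(probe=default_probe, attempts=3, delay_s=2.0, sleep=time.sleep):
    """Return True as soon as the probe succeeds, False after `attempts` failures."""
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
            reason = "no CUDA devices visible"
        except RuntimeError as exc:
            reason = str(exc)
        print(f"CUDA probe {attempt}/{attempts} failed: {reason}", file=sys.stderr)
        if attempt < attempts:
            sleep(delay_s)  # brief pause before the next attempt
    return False
```

A handler entry point could call `wait_for_cuda()` once before accepting jobs and exit with a non-zero status when it returns `False`, assuming the platform treats that as a failed worker rather than a failed request.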

Affected resources

  • Data center: EUR-NO-1
  • GPU types enabled: RTX 4090, RTX 5090
  • Minimum CUDA version configured: 12.8

Observed workers

  • Failing pod: r4olk6c2f93dny
    • GPU: RTX 5090
    • Driver: 580.126.09
  • Working pod: h9q3efquin7ztc
    • GPU: RTX 5090
    • Driver: 580.126.20

Observation window

  • 2026-04-20 from 15:18 UTC to 15:35 UTC

Both workers were running:

  • the same container image
  • the same Python virtual environment on the shared network volume
  • the same handler code

Worker A (failing)

Pod: r4olk6c2f93dny

GPU Info: gpu_name=NVIDIA GeForce RTX 5090, compute_cap=12.0, vram_mb=32607,
  driver_version=580.126.09, torch_version=2.11.0+cu128, torch_cuda_version=12.8,
  device_capability=None, arch_list=[]

ComfyUI traceback on this worker (originating from torch._C._cuda_init() in torch/cuda/__init__.py:478):

RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment,
e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the
available devices to be zero.

Worker B (working)

Pod: h9q3efquin7ztc

GPU Info: gpu_name=NVIDIA GeForce RTX 5090, compute_cap=12.0, vram_mb=32607,
  driver_version=580.126.20, torch_version=2.11.0+cu128, torch_cuda_version=12.8,
  device_capability=[12, 0], arch_list=['sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120']
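To make the comparison explicit, the two GPU Info reports above can be diffed field by field. This is a small sketch using the values copied from the logs; `diff_reports` is a hypothetical helper written for this post, not something the handler ships with.

```python
def diff_reports(a, b):
    """Return {field: (value_a, value_b)} for every field whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Values copied from the GPU Info lines of the two workers.
common = {
    "gpu_name": "NVIDIA GeForce RTX 5090",
    "compute_cap": "12.0",
    "vram_mb": 32607,
    "torch_version": "2.11.0+cu128",
    "torch_cuda_version": "12.8",
}
worker_a = {**common, "driver_version": "580.126.09",
            "device_capability": None, "arch_list": []}
worker_b = {**common, "driver_version": "580.126.20",
            "device_capability": [12, 0],
            "arch_list": ["sm_75", "sm_80", "sm_86", "sm_90", "sm_100", "sm_120"]}

for field, (failing, working) in diff_reports(worker_a, worker_b).items():
    print(f"{field}: failing={failing!r} working={working!r}")
```

Three fields differ: `arch_list`, `device_capability`, and `driver_version`. The first two are most likely symptoms of the failed initialization rather than independent differences (`torch.cuda.get_arch_list()` returns an empty list when CUDA is unavailable), which leaves the driver build as the only input that differs.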

lone rain

I get the same error on some of the EUR-NO-1 RTX 5090 workers with the minimum CUDA version set to 13:

RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:180: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:119.)
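Note that the second line above is a `UserWarning`, not an exception, so code that only checks a return value can continue as if no GPU were installed. A possible way to surface it early is to escalate that specific warning to an error before touching CUDA. This is a sketch that assumes the message reaches Python's `warnings` machinery, as the `UserWarning` prefix in the log suggests; `escalate_cuda_init_warning` is a hypothetical helper name.

```python
import warnings


def escalate_cuda_init_warning():
    """Turn the 'CUDA initialization: ...' UserWarning into a raised exception."""
    # `message` is a regex matched against the start of the warning text.
    warnings.filterwarnings(
        "error",
        message=r"CUDA initialization",
        category=UserWarning,
    )
```

Calling this once at the top of the handler, before the first `torch.cuda` call, should make the failure raise where it happens instead of being logged and ignored.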