#Serverless instances are not assigned GPUs, resulting in job errors in Production. Assistance required

west sapphire
#

Error Message 1 with Stack Trace:
Task Failed [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=0220236a79a1 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=177 ; expr=cudnnCreate(&cudnn_handle_); \n

Error Message 2:
Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer

Will refreshing the worker help in this situation?

charred spindle
#

It says: Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104]

#

Connection reset by peer

#

That's not an NVIDIA-related error. Am I missing something?

#

It probably means the connection to some other host failed (over the network).
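
For the Errno 104 case, a client-side retry is one way to ride out a transient reset. Below is a minimal sketch, assuming the job fetch can be wrapped in a plain callable. `call_with_reset_retry` and its parameters are hypothetical names, not part of any worker SDK. If the `ClientOSError` here is aiohttp's, it subclasses `OSError`, so checking `errno` should be enough:

```python
import errno
import time


def call_with_reset_retry(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); retry with exponential backoff when the peer resets the connection.

    Hypothetical helper for illustration only.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except OSError as exc:
            # Retry only on "Connection reset by peer" (errno 104 on Linux).
            # Any other error, or a reset on the final attempt, is re-raised.
            if exc.errno != errno.ECONNRESET or attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

This only helps if the reset is transient. A reset on every attempt still surfaces as the original error.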

west sapphire
#

Got it, thanks, but Error Message 1 indicates a cuDNN error

#

It means that cuDNN couldn't initialize properly, which may be due to a driver issue, a memory allocation issue, or an internal cuDNN bug.
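
To narrow down which of these causes applies, it can help to log what the worker actually sees at startup. Below is a rough sketch, assuming `nvidia-smi` is on the PATH inside the container whenever a GPU is attached. `gpu_report` is a hypothetical helper name, and the `onnxruntime` import is treated as optional:

```python
import shutil
import subprocess


def gpu_report():
    """Collect basic GPU visibility facts; values stay None when unavailable.

    Hypothetical helper for illustration only.
    """
    report = {"nvidia_smi": None, "ort_providers": None}

    # Query GPU name and memory only if the nvidia-smi binary can be found.
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi",
                 "--query-gpu=name,memory.total,memory.used",
                 "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            report["nvidia_smi"] = f"failed: {exc}"
        else:
            if result.returncode == 0:
                report["nvidia_smi"] = result.stdout.strip()
            else:
                report["nvidia_smi"] = f"failed: {result.stderr.strip()}"

    # List the execution providers this onnxruntime build can use, if installed.
    try:
        import onnxruntime as ort
        report["ort_providers"] = ort.get_available_providers()
    except ImportError:
        pass

    return report


if __name__ == "__main__":
    print(gpu_report())
```

If `nvidia_smi` comes back `None`, the driver is likely not visible to the container. If `CUDAExecutionProvider` is missing from `ort_providers`, the installed onnxruntime build probably has no CUDA support. A high `memory.used` value before the model loads would point toward the memory allocation case.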

charred spindle
#

Can you try GPUs with more VRAM?

#

It's still not clear what is causing that error.

#

I guess restarting the worker can help, but not always. It depends on what's causing this, and that's not clear here.

tight mason
#

If this hasn't cleared up, can you share your worker or endpoint id and I'll take a look at it?