Serverless instances are not assigned GPUs, resulting in job errors in production. Assistance required.

Error Message 1 with Stack Trace:
Task Failed [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization:
/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void]
/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void]
CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=0220236a79a1 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=177 ; expr=cudnnCreate(&cudnnhandle);

Error Message 2:
Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer

Will refreshing the worker help in this situation?
5 Replies
Jason, 2d ago
It says "Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer", which isn't an NVIDIA-related error. Am I missing something? It probably means the worker failed to connect to some other host over the network.
Shubham Patel DJ (OP), 2d ago
Got it, thanks. But Error Message 1 is a cuDNN error: cuDNN couldn't initialize properly (the failing call is cudnnCreate), which may be due to a driver issue, a memory allocation issue, or an internal cuDNN bug.
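The two messages point at different layers (GPU initialization versus the network), so it can help to tag them separately in worker logs. A minimal sketch, assuming a hypothetical classify_error helper; the patterns are taken only from the two messages quoted in this thread and are not an exhaustive list:

```python
import re

# Hypothetical patterns, based only on the two error messages in this thread.
GPU_PATTERNS = [r"CUDNN_STATUS_\w+", r"cudnnCreate", r"providers/cuda"]
NETWORK_PATTERNS = [r"\[Errno 104\]", r"Connection reset by peer", r"ClientOSError"]


def classify_error(message: str) -> str:
    """Return 'gpu', 'network', or 'unknown' for a worker error message."""
    if any(re.search(p, message) for p in GPU_PATTERNS):
        return "gpu"
    if any(re.search(p, message) for p in NETWORK_PATTERNS):
        return "network"
    return "unknown"


if __name__ == "__main__":
    msg1 = "CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; expr=cudnnCreate(&cudnnhandle);"
    msg2 = "Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer"
    for msg in (msg1, msg2):
        print(classify_error(msg), "<-", msg[:40])
```

A message tagged "gpu" suggests looking at the worker's driver or memory state, while "network" suggests a connection problem that may be unrelated to the GPU.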
Jason, 2d ago
Can you try GPUs with more VRAM? It's still not clear what is causing that error. I guess restarting the worker can help, but not always; it depends on the cause, and that isn't clear here.
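One way to narrow this down is to log how much GPU memory is free when the worker starts, before the model loads. A rough sketch, assuming nvidia-smi is on the worker's PATH; the helper names parse_gpu_memory and query_gpus are hypothetical, and nothing here is specific to any serverless platform:

```python
import shutil
import subprocess

# nvidia-smi query that prints one CSV line per GPU: name, total MiB, used MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(output: str) -> list:
    """Parse nvidia-smi CSV output into a list of dicts, one per GPU."""
    gpus = []
    for line in output.strip().splitlines():
        if not line.strip():
            continue
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, total, used = [part.strip() for part in line.rsplit(",", 2)]
        try:
            total_mib, used_mib = int(total), int(used)
        except ValueError:
            continue  # skip devices that do not report numeric memory values
        gpus.append({"name": name, "total_mib": total_mib, "free_mib": total_mib - used_mib})
    return gpus


def query_gpus() -> list:
    """Return GPU memory info, or an empty list if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, timeout=10)
    except (subprocess.TimeoutExpired, OSError):
        return []
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("No GPU visible to this worker")
    for gpu in gpus:
        print(f"{gpu['name']}: {gpu['free_mib']} MiB free of {gpu['total_mib']} MiB")
```

An empty list would suggest the worker has no usable GPU at all, while a GPU with very little free memory would make the memory-allocation explanation more plausible.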
Dj, 2d ago
If this hasn't cleared up, can you share your worker or endpoint id and I'll take a look at it?
Shubham Patel DJ (OP), 19h ago
Thanks for the response. For now, I have refreshed the worker that was giving these errors. I will ping here if the error comes back.