Serverless instances are not assigned GPUs, resulting in job errors in Production. Assistance required
Error Message 1 with Stack Trace:
Task Failed [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=0220236a79a1 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=177 ; expr=cudnnCreate(&cudnnhandle); \n
Error Message 2:
Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer
Will refreshing the worker help in this situation?
It says "Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer"
no NVIDIA-related errors, am I missing something?
It says it probably failed to connect to some other host over the network.
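Since `[Errno 104] Connection reset by peer` is usually a transient network failure, wrapping the job-fetch call in a retry with backoff often papers over it. A minimal sketch, assuming the actual job-fetch function is whatever the worker SDK exposes (`flaky_fetch` below is a hypothetical stand-in):

```python
import random
import time


def with_retries(fn, attempts=4, base_delay=0.5):
    """Retry fn() on OSError (which includes ConnectionResetError,
    i.e. errno 104) using jittered exponential backoff.

    This is a sketch: the real call on the worker is whatever the
    platform SDK uses to fetch jobs, not shown here.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            # Give up after the final attempt; otherwise back off.
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

With `base_delay=0.5` the waits grow roughly 0.5s, 1s, 2s, which is usually enough for a reset connection to recover.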
Got it, thanks, but Error Message 1 indicates a cuDNN error.
It means that cuDNN couldn't initialize properly, which may be due to a driver issue, a memory allocation issue, or an internal cuDNN bug.
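One quick way to tell "instance has no GPU" apart from "cuDNN itself is broken" is a pre-flight check before creating the CUDA session. A minimal sketch, assuming `nvidia-smi` is on the worker's PATH (it normally is on GPU images):

```python
import subprocess


def gpu_visible() -> bool:
    """Return True if the worker can see at least one NVIDIA GPU.

    Runs `nvidia-smi -L`, which prints one line per visible GPU.
    If the binary is missing or the call fails, the instance was
    likely scheduled without a GPU, which would explain cuDNN
    failing at cudnnCreate().
    """
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return out.returncode == 0 and "GPU" in out.stdout
```

Logging the result of this check at worker startup would show immediately whether the failing jobs landed on GPU-less instances.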
Can you try a GPU with more VRAM?
It's still not clear why that error occurs.
I guess restarting the worker can help, but not always; it depends on what's causing this, and that's not clear here.
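Another way to fail soft instead of crashing at session creation is to only request the CUDA execution provider when ONNX Runtime reports it as available, keeping CPU as a fallback. A sketch, where the pure selection helper is ours and only the commented usage touches the real `onnxruntime` API (the model path is hypothetical):

```python
def pick_providers(available):
    """Build an ONNX Runtime provider list that degrades gracefully.

    `available` is the list returned by
    onnxruntime.get_available_providers(). Prefer CUDA when present,
    but always keep CPUExecutionProvider so a worker scheduled
    without a working GPU stack still has a chance to run instead of
    dying with CUDNN_STATUS_INTERNAL_ERROR.
    """
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]


# Usage (assumes onnxruntime is installed; "model.onnx" is a placeholder):
# import onnxruntime as ort
# sess = ort.InferenceSession(
#     "model.onnx",
#     providers=pick_providers(ort.get_available_providers()),
# )
```

This doesn't fix a genuinely broken cuDNN install, but it keeps the service up on CPU while you investigate, rather than erroring every job.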
If this hasn't cleared up, can you share your worker or endpoint ID and I'll take a look at it?
Thanks for the response. For now, I have refreshed the worker when these errors appear; I will ping here if the error comes back.