Serverless instances are not assigned GPUs, resulting in job errors in Production. Assistance required
Error Message 1 with Stack Trace:
Task Failed [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char, const char, ERRTYPE, const char, const char, int) [with ERRTYPE = cudnnStatus_t; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDNN failure 4: CUDNN_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=0220236a79a1 ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=177 ; expr=cudnnCreate(&cudnnhandle); \n
Error Message 2:
Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer
Will refreshing the worker help in this situation?
It says "Failed to get job. | Error Type: ClientOSError | Error Message: [Errno 104] Connection reset by peer"
no NVIDIA-related errors, am I missing something?
It says it probably failed to connect to some other host over the network.
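Since `[Errno 104] Connection reset by peer` is usually a transient network failure, wrapping the job-fetch call in a retry with backoff often papers over it. A minimal sketch, assuming the actual job-fetch function is whatever the worker SDK exposes (`flaky_fetch` below is a hypothetical stand-in):

```python
import random
import time


def with_retries(fn, attempts=4, base_delay=0.5):
    """Retry fn() on OSError (which includes ConnectionResetError,
    i.e. errno 104) using jittered exponential backoff.

    This is a sketch: the real call on the worker is whatever the
    platform SDK uses to fetch jobs, not shown here.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            # Give up after the final attempt; otherwise back off.
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

With `base_delay=0.5` the waits grow roughly 0.5s, 1s, 2s, which is usually enough for a reset connection to recover.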
Got it, thanks, but Error Message 1 indicates a cuDNN error.
It means that cuDNN couldn't initialize properly, which may be due to a driver issue, a memory allocation issue, or an internal cuDNN bug.
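One quick way to tell "instance has no GPU" apart from "cuDNN itself is broken" is a pre-flight check before creating the CUDA session. A minimal sketch, assuming `nvidia-smi` is on the worker's PATH (it normally is on GPU images):

```python
import subprocess


def gpu_visible() -> bool:
    """Return True if the worker can see at least one NVIDIA GPU.

    Runs `nvidia-smi -L`, which prints one line per visible GPU.
    If the binary is missing or the call fails, the instance was
    likely scheduled without a GPU, which would explain cuDNN
    failing at cudnnCreate().
    """
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return out.returncode == 0 and "GPU" in out.stdout
```

Logging the result of this check at worker startup would show immediately whether the failing jobs landed on GPU-less instances.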
Can you try a GPU with more VRAM?
It's still not clear why that error occurs.
I guess restarting the worker can help, but not always; it depends on what's causing this, and that's not clear here.
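Another way to fail soft instead of crashing at session creation is to only request the CUDA execution provider when ONNX Runtime reports it as available, keeping CPU as a fallback. A sketch, where the pure selection helper is ours and only the commented usage touches the real `onnxruntime` API (the model path is hypothetical):

```python
def pick_providers(available):
    """Build an ONNX Runtime provider list that degrades gracefully.

    `available` is the list returned by
    onnxruntime.get_available_providers(). Prefer CUDA when present,
    but always keep CPUExecutionProvider so a worker scheduled
    without a working GPU stack still has a chance to run instead of
    dying with CUDNN_STATUS_INTERNAL_ERROR.
    """
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [p for p in preferred if p in available]
    return chosen or ["CPUExecutionProvider"]


# Usage (assumes onnxruntime is installed; "model.onnx" is a placeholder):
# import onnxruntime as ort
# sess = ort.InferenceSession(
#     "model.onnx",
#     providers=pick_providers(ort.get_available_providers()),
# )
```

This doesn't fix a genuinely broken cuDNN install, but it keeps the service up on CPU while you investigate, rather than erroring every job.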
If this hasn't cleared up, can you share your worker or endpoint ID and I'll take a look at it?
Thanks for the response. For now, I have refreshed the worker when these errors appear; I will ping here if the error comes back.