RunPod
Created by jg on 2/22/2024 in #⚡|serverless
[URGENT] EU-RO region endpoint currently only processing one request at a time
No description
23 replies
RunPod
Created by jg on 2/17/2024 in #⚡|serverless
ECC errors on serverless workers using L4
We are currently using L4 machines in the eu-ro region for our production environment (30–70 workers). Based on the request data, we have seen an increasing number of hardware issues related to ECC errors and were wondering if we could get help mitigating these failures.
"handler: CUDA error: uncorrectable ECC error encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
"handler: CUDA error: uncorrectable ECC error encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Based on the "Requests" data from our endpoint, we see that the failures have increased starting from 2024.02.02. We do have a couple of questions, but ultimately, it would be great if we were provided some guidance to fully handling the failing requests. - Are we expected to terminate the instance with this issue? - Is there a way to handle this from the code (and not having to do it manually) - Difference between "terminate" and "refresh" - It seems that after terminating a worker that had an uncorrected ECC issue, a new pod is respawned on the same machine. Is there a way to avoid this - For example, the machine with the ID x4udv5lkhl7d was still getting assigned pods even after terminating workers - Any recommendations on monitoring for these occurrences in the workers we use (especially for those used in production)
13 replies