jg
RunPod
•Created by jg on 2/22/2024 in #⚡|serverless
[URGENT] EU-RO region endpoint currently only processing one request at a time
23 replies
RunPod
•Created by jg on 2/17/2024 in #⚡|serverless
ECC errors on serverless workers using L4
We are currently using L4 machines in the eu-ro region for our production environment (30~70 workers).
Based on the requests data, we have seen an increasing number of hardware issues related to ECC errors, and we were wondering if we could get help mitigating these failures.
According to the "Requests" data from our endpoint, the failures have been increasing since 2024.02.02. We have a few questions, but ultimately it would be great to get some guidance on fully handling the failing requests.
- Are we expected to terminate the instance with this issue?
- Is there a way to handle this from code, so we don't have to do it manually?
- What is the difference between "terminate" and "refresh"?
- It seems that after terminating a worker that had an uncorrected ECC error, a new pod is respawned on the same machine. Is there a way to avoid this?
- For example, the machine with the ID x4udv5lkhl7d was still getting assigned pods even after terminating workers.
- Any recommendations on monitoring for these occurrences in the workers we use (especially those used in production)?
13 replies
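To make the "handle this from code" and monitoring questions concrete, here is a minimal sketch of the kind of in-worker check we have in mind. It assumes `nvidia-smi` is available inside the worker container and reads the uncorrected volatile ECC counter. The function names and the "any non-zero count is unhealthy" policy are our own assumptions, not anything RunPod provides, and what the worker should do once a GPU is flagged (fail the job, exit, something else) is exactly what we are asking about.

```python
import subprocess

# nvidia-smi query fields: GPU UUID and uncorrected ECC errors since the last driver load.
QUERY = "uuid,ecc.errors.uncorrected.volatile.total"


def parse_ecc_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into {uuid: count or None}."""
    counts = {}
    for line in text.strip().splitlines():
        uuid, _, raw = line.partition(",")
        raw = raw.strip()
        # GPUs that do not report ECC counters return a non-numeric value such as "[N/A]".
        counts[uuid.strip()] = int(raw) if raw.isdigit() else None
    return counts


def read_uncorrected_ecc():
    """Run nvidia-smi and return the parsed uncorrected ECC counts per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_ecc_csv(out)


def gpu_is_healthy(counts):
    """Assumed policy: healthy only if no GPU reports a non-zero uncorrected count."""
    return all(c in (None, 0) for c in counts.values())
```

The idea would be to call `gpu_is_healthy(read_uncorrected_ecc())` at worker startup or before each job and log the GPU UUID whenever it returns `False`, so that failures can be correlated with specific machines.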