ECC errors on serverless workers using L4
We are currently using L4 machines in the eu-ro region for our production environment (30–70 workers).
Based on the "Requests" data from our endpoint, we have seen increasing hardware issues related to ECC errors and were wondering if we could get help mitigating these failures. The failures have been increasing since 2024.02.02. We do have a couple of questions, but ultimately it would be great to get some guidance on fully handling the failing requests.
- Are we expected to terminate the instance with this issue?
- Is there a way to handle this from our code (instead of having to do it manually)? A rough sketch of what we mean is included after this list.
- What is the difference between "terminate" and "refresh"?
- It seems that after terminating a worker that had an uncorrected ECC issue, a new pod is respawned on the same machine. Is there a way to avoid this?
- For example, the machine with the ID x4udv5lkhl7d was still getting assigned pods even after terminating workers.
- Any recommendations on monitoring for these occurrences in the workers we use (especially those used in production)?
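To expand on the second question above, this is roughly the kind of startup check we have in mind (a sketch only; it assumes the nvidia-ml-py / pynvml bindings are available in the worker image, and the helper name and exit-on-error behaviour are just our illustration, not an existing API):

```python
# Rough sketch of a pre-flight check: refuse to take requests if the GPU
# reports uncorrected ECC errors. Assumes nvidia-ml-py (pynvml) is installed.
import sys
import pynvml

def gpu_has_uncorrected_ecc_errors(device_index: int = 0) -> bool:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        # Volatile counter = errors since the last driver reload / reboot.
        errors = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_VOLATILE_ECC,
        )
        return errors > 0
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    if gpu_has_uncorrected_ecc_errors():
        # Exit non-zero so the worker fails fast instead of serving on a bad GPU.
        sys.exit("uncorrected ECC errors detected; refusing to start worker")
```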
we have an uncorrectable ECC check already built in, will look into that and see why that one server isn't being flagged
thanks! we keep seeing this particular machine (x4udv5lkhl7d) with ECC errors
Did you try terminating the worker? I usually terminate the worker when this kind of thing happens.
We've tried terminating, but at some later point some of our workers get spawned on the same machine that has been throwing ECC errors.
Even after refreshing, the machine might recover, but it starts failing again after some time.
@flash-singh I know you guys might be on holiday but do you have any updates for us?
is this the worker id?
i was able to find the gpu causing this, we were checking ecc.errors.uncorrected.volatile.total, and while that's 0, ecc.errors.uncorrected.aggregate.total shows a high number of faults
https://gist.github.com/sansmoraxz/8a98d987f12d7edc983d611b8326fc67
will have to roll an update to start flagging gpus with those errors
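for reference, here's a quick way to read both of those counters from a worker in the meantime, assuming nvidia-ml-py (pynvml) is in the image (purely a sketch, not our internal check):

```python
# Sketch: read both uncorrected-ECC counters discussed above. The volatile
# counter resets on reboot / driver reload, while the aggregate counter
# persists for the life of the GPU, so a bad card can show volatile=0 but a
# high aggregate count. Assumes nvidia-ml-py (pynvml) is installed.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    volatile = pynvml.nvmlDeviceGetTotalEccErrors(
        handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_VOLATILE_ECC
    )
    aggregate = pynvml.nvmlDeviceGetTotalEccErrors(
        handle, pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED, pynvml.NVML_AGGREGATE_ECC
    )
    print(f"uncorrected ECC: volatile={volatile} aggregate={aggregate}")
finally:
    pynvml.nvmlShutdown()
```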
this is solved now, took the server out of the pool
Awesome! Thank you very much for the help.
We're seeing no failures so far from our endpoint in production 👍