Created by Jaco on 10/15/2024 in #⚡|serverless
No container logs, container stopped, worker unhealthy.
Hello everyone.
We run custom images on RunPod to serve inference.
We have been having a hard time getting RunPod to behave consistently.
Our serverless workers go "unhealthy" with no indication or logs explaining why. Some images can't run on most GPU types, yet run just fine on 3090s.
* Our images run smoothly on 3090s, 4090s, and A100s on our own servers, Azure, and GCP.
* We cannot reproduce the unhealthy state with any of these images on our own servers.
Whilst 3090s usually work, with occasional unhealthy workers, A100 workers (both PCIe and SXM) are unable to run our images at all.
When a worker goes unhealthy, I can no longer SSH into it to figure out what is wrong. This, paired with the complete lack of logs, makes the issue impossible to debug.
Has anyone experienced this, or does anyone have a clue how to approach it?