Jaco (2mo ago)

No container logs, container stopped, worker unhealthy.

Hello everyone. We run custom images on RunPod to serve our inference, and we have been having a hard time getting RunPod to behave consistently. Our serverless workers go "unhealthy" with no indication or logs whatsoever as to why. Some images can't run on most GPUs, while running just fine on 3090s.

* Our images run smoothly on 3090s, 4090s, and A100s on our own servers, Azure, and GCP.
* We cannot reproduce any of these images becoming unhealthy on our own infrastructure.

While 3090s usually work, with only occasional unhealthy workers, A100 workers (both PCIe and SXM) are simply unable to run our images. When a worker goes unhealthy, I can't even SSH into it anymore to try to figure out what's wrong. Paired with the complete lack of logs, this makes the issue impossible to debug. Has anyone experienced this, or does anyone have a clue on how to approach it?
nerdylive (2mo ago)
Usually it goes unhealthy when the handler throws an error, or, I think, when the application stops after that. You can filter the logs from the Logs tab for that worker, can't you? But you could also contact RunPod with your endpoint ID to learn more about your error.
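(For illustration, here is a minimal sketch of a RunPod serverless handler that catches exceptions and returns them as a job error, so failures show up in the job result instead of silently taking down the worker. The `run_inference` function is a hypothetical stand-in for the thread's actual inference code, not something from the thread itself.)

```python
# Minimal sketch of a RunPod serverless handler with explicit error
# handling. run_inference() is a hypothetical placeholder for the
# actual inference code being served.
import traceback

import runpod


def run_inference(payload):
    # Hypothetical stand-in for the real model call.
    return {"echo": payload}


def handler(job):
    try:
        return {"output": run_inference(job["input"])}
    except Exception:
        # Returning an "error" key marks the job as failed while
        # keeping the full stack trace visible in the job result.
        return {"error": traceback.format_exc()}


runpod.serverless.start({"handler": handler})
```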
Poddy (2mo ago)
@Jaco
Escalated To Zendesk
The thread has been escalated to Zendesk!
Jaco (OP, 2mo ago)
In case anyone encounters this: the issue was specific to the python:3.10-slim base image. For some reason it only works on some 3090s. From RunPod's point of view, I think it would be enough to provide more detailed worker logs. The container logs do not appear at all in these cases, because the container fails to start outright.
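(If anyone wants to rule this out on their own image, a small pre-flight script run as the first step of the container entrypoint can surface this class of failure in the container logs, assuming the container gets far enough to execute Python at all. The sketch below is a hypothetical self-check, not RunPod-provided tooling; the filename and checks are illustrative.)

```python
# startup_check.py - hypothetical pre-flight script run as the first
# step of the container entrypoint, so basic environment info lands in
# stdout before the inference server starts (or fails to).
import platform
import shutil
import subprocess
import sys

print(f"python: {sys.version}", flush=True)
print(f"platform: {platform.platform()}", flush=True)

# nvidia-smi comes from the host driver injected by the container
# runtime; if it is missing or errors out, the GPU stack is the
# likely suspect rather than the application code.
if shutil.which("nvidia-smi"):
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
        print(result.stdout or result.stderr, flush=True)
    except Exception as exc:
        print(f"nvidia-smi failed: {exc}", flush=True)
else:
    print("nvidia-smi not found on PATH", flush=True)
```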