No container logs, container stopped, worker unhealthy.
Hello everyone.
We run custom images on RunPod to serve our inference.
We have been having a hard time getting RunPod to behave consistently.
Our serverless workers go "unhealthy" with no indication or logs whatsoever as to why. Some images can't run on most GPUs, whilst running just fine on 3090s.
* Our images run smoothly on 3090s, 4090s, and A100s on our own on-prem servers, Azure, and GCP.
* We cannot reproduce any of those images becoming unhealthy in any way on our own servers.
Whilst 3090s usually work, with only the occasional unhealthy worker, A100 workers (both PCIe and SXM) are simply unable to run our images.
When a worker goes unhealthy, I can't even SSH into it anymore to try to figure out what's wrong with it. This, paired with the complete lack of logs, makes the issue practically impossible to debug.
Has anyone experienced this, or does anyone have any clue how to approach it?
3 Replies
Usually it goes unhealthy when the handler throws an error, or I think when the application stops after that.
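For example, a rough sketch of a defensive handler (assuming the standard runpod Python SDK; run_inference here is a hypothetical placeholder for your model call) that catches exceptions and returns them in the response, so one bad request doesn't take the whole worker down:

```python
import runpod  # RunPod serverless SDK, assumed to be installed in the image


def run_inference(payload):
    # Hypothetical placeholder for the actual model call.
    return {"echo": payload}


def handler(event):
    # RunPod passes the request body under event["input"].
    try:
        result = run_inference(event["input"])
        return {"output": result}
    except Exception as exc:
        # Returning the error keeps the worker process alive and puts the
        # message in the endpoint's request logs instead of crashing the
        # container with no trace.
        return {"error": str(exc)}


runpod.serverless.start({"handler": handler})
```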
You can filter the logs from the logs tab for that worker, can't you?
But maybe you can also contact RunPod with your endpoint ID to learn more about your error.
@Jaco
Escalated To Zendesk
The thread has been escalated to Zendesk!
In case anyone encounters this, the issue was specific to the python:3.10-slim base image. For some reason it only works on some 3090s.
I think from RunPod's point of view it would be enough to provide more detailed worker logs.
The container logs do not appear at all in these cases, as the container fails to start outright.