No container logs, container stopped, worker unhealthy.
Hello everyone.
We run custom images on RunPod to serve our inference.
We have been having a hard time getting RunPod to behave consistently.
Our serverless workers go "unhealthy" with no indication or logs whatsoever as to why. Some images can't run on most GPUs, whilst running just fine on 3090s.
* Our images run smoothly on 3090s, 4090s, and A100s on our own on-prem servers, Azure, and GCP.
* We cannot reproduce any of those images becoming unhealthy in any way on our own servers.
Whilst 3090s usually work, with only the occasional unhealthy worker, A100 workers (both PCIe and SXM) are simply unable to run our images.
When a worker goes unhealthy, I can't even SSH into it anymore to try to figure out what's wrong with it. This, paired with the complete lack of logs, makes the issue practically impossible to debug.
Has anyone experienced this, or does anyone have any clue how to approach it?
3 Replies
Usually it goes unhealthy when the handler throws an error, or I think when the application stops after that.
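For example, a rough sketch of a defensive handler (assuming the standard runpod Python SDK; run_inference here is a hypothetical placeholder for your model call) that catches exceptions and returns them in the response, so one bad request doesn't take the whole worker down:

```python
import runpod  # RunPod serverless SDK, assumed to be installed in the image


def run_inference(payload):
    # Hypothetical placeholder for the actual model call.
    return {"echo": payload}


def handler(event):
    # RunPod passes the request body under event["input"].
    try:
        result = run_inference(event["input"])
        return {"output": result}
    except Exception as exc:
        # Returning the error keeps the worker process alive and puts the
        # message in the endpoint's request logs instead of crashing the
        # container with no trace.
        return {"error": str(exc)}


runpod.serverless.start({"handler": handler})
```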
You can filter the logs from the logs tab for that worker, can't you?
But maybe you can also contact RunPod with your endpoint ID to learn more about your error.
@Jaco
Escalated To Zendesk
The thread has been escalated to Zendesk!
In case anyone encounters this, the issue was specific to the python:3.10-slim base image. For some reason it only works on some 3090s.
I think from RunPod's point of view it would be enough to provide more detailed worker logs.
The container logs do not appear at all in these cases, as the container fails to start outright.