job timed out after 1 retries
Hello! I'm getting this on every job now on the 31py4h4d9ytybu serverless endpoint. My logs have zero messages or any indication of where this is happening; from the outside it looks as if the workers are totally paused or non-responsive. This silently hung work for over an hour. I'm on runpod 1.7.4. This is having a significant impact on production work, with no clear remediation (see screenshots: no logs for many, many minutes despite work happening constantly, and errors on every job). Would love some help!!


If you want logs to show up, you need to print from the main Python process, or from the process you run from your Dockerfile.
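Something like this, a minimal sketch assuming the standard runpod serverless handler pattern (`runpod.serverless.start` is the documented entrypoint; the echo payload is just a placeholder for your real workload):

```python
import runpod

def handler(job):
    # This runs in the main Python process started by runpod.serverless.start,
    # so prints here appear in the endpoint logs; flush=True avoids buffering.
    print(f"got job {job.get('id')}", flush=True)
    payload = job["input"]          # the job's input payload
    result = {"echo": payload}      # placeholder for your actual work
    print("job done", flush=True)
    return result

runpod.serverless.start({"handler": handler})
```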
Got any code or logs that would help?
Is it the "kill worker & finished" message, or the error below? The error below looks like an input error.
@nerdylive lol so it ended up being that I needed to bump my runpod package from 1.7.4 to 1.7.7. Very frustrating that a patch-level release fixes this. Like, how would I ever have found that out without spending three days blaming myself while trying to fix it, and then reaching out to customer service lol
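For anyone else hitting this: you can check which version is actually installed with the standard library, then pin the fixed version in your image (the `runpod>=1.7.7` pin is my assumption of the fixed range based on this thread):

```python
# Standard-library only: prints the installed runpod SDK version.
from importlib.metadata import version

print(version("runpod"))  # this thread's fix was 1.7.4 -> 1.7.7
# Then pin it, e.g. runpod>=1.7.7 in requirements.txt, so your image
# doesn't silently build with an older release.
```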
Ooh what
Hahah yeah, worth reporting it to RunPod too so they can check for bugs.