Worker frozen during long-running process
request ID: sync-f144b2f4-f9cd-4789-8651-491203e84175-u1
worker id: g9y8icaexnzrlr
I have a process that should, in theory, take no longer than 90 seconds.
The template is configured to not time out.
When I test the process via the Requests tab in the UI, the logs print smoothly until about halfway through the process, and then they disappear. The job never completes, and the worker goes idle after a minute or two. I can't see the logs to know whether there was a failure or error.
Does someone mind checking on this for me?
@yhlong00000
A little more context:
Given it's a relatively long-running process, I started testing with the /run endpoint, but I wasn't able to get a stream of logs past about 15 seconds.
So then I switched to /runsync, which I understand is meant for relatively quick processes, in an attempt to see more of the logs. I was able to see more of them, but then the original problem described above arose.
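For reference, here is roughly what my test calls look like; a minimal sketch against the serverless HTTP API, where the endpoint ID, API key, and input payload are placeholders rather than my real values:
import requests

ENDPOINT_ID = "your_endpoint_id"  # placeholder
API_KEY = "your_api_key"          # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
payload = {"input": {"example": "value"}}  # illustrative input only

# /run submits asynchronously and returns a job ID; poll /status/<job_id> afterwards.
run = requests.post(f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run", json=payload, headers=HEADERS)
job_id = run.json()["id"]
status = requests.get(f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}", headers=HEADERS)
print(status.json())

# /runsync blocks until the job finishes and is intended for short jobs.
sync = requests.post(f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync", json=payload, headers=HEADERS)
print(sync.json())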
Which runpodctl version are you using there?
Going to take me 10 mins to build the image again to run a version command and check, but I'll shoot it in here ASAP.
While I'm waiting to get you the exact version: I'm using Python 3.8, so whatever version gets installed automatically for that Python version.
import runpod
print(f"Runpod version: {runpod.__version__}")
Runpod version: 1.7.3
So after reading about runpodctl: that tool is meant for streaming logs to your machine, so you were not asking what version of the runpod library I was using in my code.
I have only been using the UI for testing. I will now try to use runpodctl to stream logs and see if I can get a better idea of what is happening later in my process.
runpodctl v1.14.4
Oh I meant the runpod library, my bad
I heard that newer versions of the runpod library have bugs, but I'm not sure if that's the cause here
Yeah it might be
No luck getting a log stream with runpodctl
@zfmoodydub
Escalated To Zendesk
The thread has been escalated to Zendesk!
@yhlong00000 not sure if you can help...
The requests are now getting lost as well: they get placed in a queue but are never returned as failed.
Here is another example:
pod id: h6uc0sa88m2n5t
request id: e2980df5-7861-46b7-8c9a-65a4171c30ad-u1
Hey, this is caused by SDK 1.7.3. Downgrading to 1.6.2 should solve your problem. We should have a 1.7.4 release soon, and it will fix this.
Great, thanks. Just to be clear, I'm not able to retrieve any logs via the ctl tool. Testing via the UI, the logs stop (perhaps they are too verbose), and I get a message in the container logs saying:
No Container logs yet, this usually means that the pod is still initializing
And my process halts about halfway through, and the worker goes idle. So even if I downgrade the ctl tool to try to extract the logs, I'm not sure they will even be available. I will check, though.
Also, on GitHub there is no release for 1.6.2, only 1.6.1 and 1.7.0.
brew install returns this:
brew install runpodctl@1.6.2
No available formula with the name "runpodctl@1.6.2". Did you mean runpodctl?
Ah yeah, the CLI tool isn't for retrieving logs
No, it's the "runpod" pip package
runpod is only the SDK; runpodctl is the CLI you can use for creating resources (pods, network storage), deleting resources, etc.
got it
Nice
I still think I'm unable to retrieve container logs via the runpod library, though, is that correct?
That's what I'm trying to do, as there seems to be a bug in the UI.
You can, via the website, while it's running
When your worker is running, click one of them and there will be a button called Logs
You can print messages from inside your worker and they will show up in those logs
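For example, a minimal handler sketch (the handler body and progress messages here are just illustrative); every print from inside the handler should show up in that worker's Logs view:
import time
import runpod

def handler(job):
    # Anything printed here should appear in the worker's Logs view.
    print(f"starting job {job['id']}")
    for step in range(1, 6):
        time.sleep(1)                 # stand-in for the real work
        print(f"step {step}/5 done")  # progress markers show where it stalls
    return {"status": "finished"}

runpod.serverless.start({"handler": handler})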
Right, but this whole thread I've been talking about how those logs disappear halfway through the execution.
~15 seconds in, and I can't see any of my debug logs because of that.
That's unexpected behavior. What do you see now after downgrading the runpod library to 1.6.2 and redeploying the worker image?
How much do you print, BTW? Is it spammy or a lot?
The whole process probably prints about 200 lines.
I see what you mean now: downgrade the version in the container I build. Sorry for my misunderstanding.
Ah yeah, it is inside the worker image. No problem, maybe I didn't make it clear enough.
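For anyone else hitting this: pin the SDK wherever it is installed in the worker image (for example runpod==1.6.2 in your requirements or pip install step), then rebuild and redeploy. A rough way to confirm the rebuilt image actually picked up the pinned version is to log it at worker startup, something like:
import runpod

# Expect this to print 1.6.2 after the downgrade and rebuild.
print(f"runpod SDK version: {runpod.__version__}")

# ...then start the serverless worker as usual, e.g.:
# runpod.serverless.start({"handler": handler})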
It seems as though downgrading the package worked. Thank you very much for your help.
Nice, you're welcome
SDK 1.7.4 has been released. Thank you for your patience.