RunPod · 12mo ago
Jack

Queued serverless workers not running and getting charged for it?

I woke up this morning to find that all the credits in my RunPod account are gone. I don't have any active pods and only have a single 100GB network volume. I didn't know why, but then noticed that there are 2 queued workers for one of my serverless endpoints. I was testing in Postman yesterday and sent a few requests, maybe 10 in total. I had assumed that requests that didn't get a response after some time were automatically terminated. As you can see, these 2 requests are still in queue after over 10 hours, and I'm guessing I'm being charged the whole time for them. Is this normal behavior? There are no other requests, just these 2 queued up. Why are they queued? Why aren't they returning a result, or at least an error, instead of just sitting stuck in queue? This has been a really bad noob experience with RunPod, and I'm hesitant to put more money into my account now.
flash-singh · 12mo ago
They are stuck in queue. If your worker isn't set up properly, this can run in a loop where your workers keep starting, aren't able to handle the job, get shut off again, and the cycle restarts. It's best to set max workers to 0 until you know your workload works end to end.
Solution
justin · 12mo ago
@Jack They can't tell whether your workers are actually working or not. There isn't a runtime timeout because you might, for example, genuinely be processing for that long, which is common for a use case like mine doing large video or audio processing. My recommendation is to go through the process on a GPU pod first with your handler.py and make sure it works as expected there; then you can send a request using the built-in testing endpoint on RunPod and monitor how it's going with the logs. With a GPU pod you can at least see in a Jupyter notebook whether your handler.py logic is doing what you expect, and invoke it by just calling the method normally.
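Rough sketch of what I mean by calling it normally (the handler body and input fields here are just placeholders, not your actual code):
```python
# handler.py — a hypothetical RunPod-style handler, invoked directly for testing
def handler(job):
    # job["input"] holds whatever JSON the client sends to the endpoint
    prompt = job["input"].get("prompt", "")
    # ... your real processing logic would go here ...
    return {"output": f"processed: {prompt}"}

if __name__ == "__main__":
    # On a GPU pod / in a Jupyter notebook, call the handler like a normal
    # function with a fake job to confirm the logic works before going serverless.
    fake_job = {"id": "local-test", "input": {"prompt": "hello"}}
    print(handler(fake_job))
```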
Jack (OP) · 12mo ago
An action that's stuck in queue for this long should be automatically ended, imo. It's stuck in queue, not even running, for 10 hours, and being charged the whole time. There needs to be a mechanism in place to deal with situations like this, because errors can occur not just in development but also in production. I sent the request in Postman. Postman had already given up on the request after half a minute, with an error message saying "Could not send request". But at that point it's still stuck in queue on the serverless endpoint, and I have to manually cancel the request.
justin · 12mo ago
Yeah, I know it sucks, but: the error in Postman is probably a network timeout on the client side, not the system itself knowing it failed. IMO your code potentially got caught in a loop; I can't imagine the worker itself being stuck in a loop without a logic error or a library failing to load (though even that usually shows up as a Failure crash in my responses). For example, I could send a request to a backend endpoint for some processing, and just because I don't get a message back doesn't mean it failed. Side note: if you do continue with RunPod, I recommend using /run instead of /runsync. /runsync has issues with network request timeouts even when the job succeeds; /run is more reliable to poll against, or getting a /webhook response back is another option. I have some example client-side code if you want a reference: https://github.com/justinwlin/runpod_whisperx_serverless_clientside_code
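Rough sketch of the /run-then-poll pattern (endpoint ID, API key, and input are placeholders; assumes the standard api.runpod.ai/v2 URLs — check the docs for your setup):
```python
import time
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"      # placeholder
ENDPOINT_ID = "your-endpoint-id"     # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit asynchronously with /run instead of blocking on /runsync
job = requests.post(f"{BASE}/run", json={"input": {"prompt": "hello"}}, headers=HEADERS).json()
job_id = job["id"]

# Poll /status until the job reaches a terminal state, so a client-side
# network timeout never leaves you guessing about what the endpoint is doing.
while True:
    status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
        print(status)
        break
    time.sleep(5)
```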
justin · 12mo ago
Also, it seems you can attach an execution timeout in the future.
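If I remember the docs right, there's also a per-request way to do it via a policy block on /run; treat the exact field names here as assumptions and double-check the current RunPod docs before relying on them:
```python
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"      # placeholder
ENDPOINT_ID = "your-endpoint-id"     # placeholder

# "policy" / "executionTimeout" are recalled from the docs and may differ in the
# current API — verify first. The value is assumed to be in milliseconds.
payload = {
    "input": {"prompt": "hello"},
    "policy": {"executionTimeout": 600_000},  # e.g. stop the job after 10 minutes
}
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
print(resp.json())
```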
flash-singh · 12mo ago
We plan to make further improvements. In your case the job likely never got picked up; my hunch is you didn't call the serverless.start Python function. In that case workers start and eventually get killed, but the job never gets picked up since the serverless SDK was never started.
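For reference, this is the shape the worker entrypoint needs, following the runpod Python SDK's documented pattern (the handler body itself is a placeholder):
```python
import runpod

def handler(job):
    # job["input"] is the JSON you POST to /run or /runsync
    name = job["input"].get("name", "world")
    return {"greeting": f"hello {name}"}

# Without this call the worker boots but never registers with the queue,
# so jobs sit in IN_QUEUE while the worker idles (and bills) until it's killed.
runpod.serverless.start({"handler": handler})
```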
MikeCalibos · 11mo ago
When I try to run a RunPod pod with InvokeAI, I just get a Server Error and a Runtime Error when I try to generate an image.
justin · 11mo ago
Wrong place. Feel free to open another ticket under #⛅|gpu-cloud or #⚡|serverless with more information 🙂