Serverless Endpoint failing occasionally
I'm pretty new to RunPod and started off with a serverless endpoint. When calling the API I sometimes get a failed response, but I can't really trace what exactly is wrong. Calling the same API with the same input again works, and the logs don't provide more information. How can I figure out what is causing this error? Is there a best practice to catch the FAILED calls and analyze why they occur? Happy for any help!
Addendum: it is my custom endpoint, which takes a list of text articles as input and outputs a list of labels.
@karinenavas are you able to provide the request you're making, the code you're using, and the response, if any?
this is the request I am making
I get the message "Job failed with status FAILED" and a None return value
also, the logs for the worker throwing the error don't give away much information
I think he wanted the handler code. You should use a try, except block like this in your handler too.
And then return the errors as a string in the "error" key. It doesn't support a list or a dict.
could you give me an example?
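A minimal sketch of what such a handler could look like, assuming the job payload carries the articles under `job["input"]["articles"]` (that key name and the `classify_articles` stub are placeholders, not the poster's actual code). Since the "error" value should be a plain string, any structured detail is JSON-encoded into one string first:

```python
import json


def classify_articles(articles):
    # Hypothetical stand-in for the real model call.
    if not isinstance(articles, list):
        raise TypeError("articles must be a list of strings")
    return ["label" for _ in articles]


def handler(job):
    try:
        articles = job["input"]["articles"]  # assumed input key
        return {"labels": classify_articles(articles)}
    except Exception as exc:
        # Keep the "error" value a single string; a dict or list is
        # serialized to JSON text instead of being returned directly.
        details = {"type": type(exc).__name__, "message": str(exc)}
        return {"error": json.dumps(details)}


# In the worker image this would be registered with:
# runpod.serverless.start({"handler": handler})
```

Calling `handler({"input": {}})` locally returns a dict whose "error" string names the `KeyError`, which is a quick way to check the error path before deploying.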
Hey
RunPod Endpoints
AI endpoints for Stable Diffusion, Dreambooth, Whisper, and many more.
or you are using your own models??
i am using my own model
open to suggestions for improvement 🙏
where is the model stored then?
i cant really see the problem there
Try to see the logs page too in serverless
try sending it here too
nah its ok bro
the rp start is normal
If you want, you can maybe sanity check and start small:
https://blog.runpod.io/serverless-create-a-basic-api/
Btw, if you're on a Mac, make sure you are targeting the right platform with --platform linux/amd64
Yeah haha, just deleted my comments, realized they got it right
oh ya
where ru from btw
its so late here lol
i fly around lol, currently in west coast
oooh nice
Actually, I started with a blog post and tried to follow the steps, so I thought this is a pretty simple API already
The model is from HF and works fine if I integrate it in a normal workbook, I just need the RunPod GPU
Could it possibly be due to memory overhead? Is there proper error handling to catch that?
Impossible to know without a full stack trace of the error
My quick fix for now is to catch the failed calls and retry after 5 seconds. This works, but still almost every fifth call fails
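The retry workaround described here could be sketched roughly as below. The `submit` callable is a hypothetical wrapper around whatever client code sends the request, and the assumption that the response is a dict with a "status" field is mine. The sketch also keeps every failed response so the failures can be inspected afterwards instead of being silently retried:

```python
import time


def call_with_retry(submit, payload, max_attempts=3, delay_s=5.0, sleep=time.sleep):
    """Call submit(payload) until the status is not FAILED.

    Returns (result, failures); result is None if every attempt failed.
    """
    failures = []
    for attempt in range(1, max_attempts + 1):
        result = submit(payload)
        if result.get("status") != "FAILED":
            return result, failures
        # Record the failed response for later analysis.
        failures.append({"attempt": attempt, "response": result})
        if attempt < max_attempts:
            sleep(delay_s)  # wait before the next attempt
    return None, failures
```

The `sleep` parameter only exists so the delay can be swapped out in tests. Retrying hides the symptom but not the cause, so the collected `failures` list is the more useful part.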
Best to use a try, except block and log the stack trace to your error response
Yeah, that's not good, going to waste a lot of money like that
Try and except the classify_articles call?
Yeah, basically your entire handler function
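One way to wrap the entire handler, as suggested, is a small decorator that puts the full stack trace into the "error" string. This is only a sketch: `report_errors` is a made-up helper name, and the handler body is a placeholder for the real classification call. It only catches Python-level exceptions; a worker process killed by the operating system for running out of memory would never reach the except block.

```python
import functools
import traceback


def report_errors(fn):
    """Wrap a handler so any exception comes back as a stack-trace string."""
    @functools.wraps(fn)
    def wrapper(job):
        try:
            return fn(job)
        except Exception:
            # format_exc() returns the whole traceback as one string,
            # which fits the string-only "error" key.
            return {"error": traceback.format_exc()}
    return wrapper


@report_errors
def handler(job):
    texts = job["input"]["articles"]  # assumed input key
    # Placeholder for the real classify_articles(texts) call.
    return {"labels": [len(t) for t in texts]}
```

With this in place, a failed job's response should carry the traceback itself, so the cause can be read from the API response even when the worker logs show nothing.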
I'll try, thanks for the hint