Status endpoint only returns "COMPLETED" but no answer to the question
I'm currently using the v2/model_id/status/run_id endpoint and the result I get is as follows:
{"delaytime": 26083, "executionTime":35737, "id": **, "status": "COMPLETED"}
My stream endpoint works fine but for my purposes I'd rather wait longer and retrieve the entire result at once, how am I supposed to do that?
Thank you
What kind of endpoint are you running? This is an issue with your endpoint, not with the status API.
https://docs.runpod.io/serverless/endpoints/invoke-jobs
Run and status should be correct
Ur main issue is maybe that ur handler is not returning output properly
If u want a reference, here are functions that I made to make a /run call and just keep polling the status:
https://github.com/justinwlin/runpod_whisperx_serverless_clientside_code/blob/main/runpod_client_helper.py
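the basic pattern is roughly this (a sketch with placeholder endpoint id / API key, not the exact helper in that repo):

import time
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"       # placeholder
ENDPOINT_ID = "your_endpoint_id"      # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def run_and_wait(payload, poll_seconds=2):
    # submit the job asynchronously
    job = requests.post(f"{BASE}/run", json={"input": payload}, headers=HEADERS).json()
    job_id = job["id"]
    # poll /status until the job leaves IN_QUEUE / IN_PROGRESS
    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
        if status["status"] in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
            return status             # on COMPLETED the result should be under "output"
        time.sleep(poll_seconds)

# result = run_and_wait({"prompt": "Hello"})
# print(result.get("output"))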
I was using runsync instead of run, is that incorrect? I changed it to run and now I'm receiving IN_QUEUE instead
So I'm supposed to keep polling that?
Yes, /run is asynchronous, but changing it will most likely not make any difference
if it does, then /runsync is broken
Just tested and both work fine for me.
/run is great b/c with /runsync I find I get a network timeout :))) but certainly /runsync is also great if it's short enough
but also /run gives u a 30 min cache on runpod's end to store ur answer vs /runsync I forget how long but it's <1 min i think
so i find the 30 min cache nice
also u can add a webhook if u want it to call back to ur webhook with the response when done, instead of polling
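e.g. a sketch of passing a webhook on the /run call (endpoint id, API key, and webhook URL below are placeholders):

import requests

resp = requests.post(
    "https://api.runpod.ai/v2/your_endpoint_id/run",
    headers={"Authorization": "Bearer YOUR_RUNPOD_API_KEY"},
    json={
        "input": {"prompt": "Hello"},
        "webhook": "https://your-server.example.com/runpod-callback",  # runpod POSTs the finished job here
    },
)
print(resp.json())  # {"id": "...", "status": "IN_QUEUE"}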
Yea im still not getting the output, just a value that says "COMPLETED"
How do you get a network timeout with runsync? You are doing something wrong; it eventually goes to IN_QUEUE or IN_PROGRESS if the request takes too long, it doesn't time out.
response:
{"delayTime":662,"executionTime":9823,"id":"1d227fac-78f9-4e22-bb2e-1ff79718704a-u1","status":"COMPLETED"}
Yes, I knew it would not make a difference
Your worker is most likely throwing an error, and you are most likely capturing a dict in the error key, which causes this to happen. error only accepts a str and not a dict; RunPod made a shitty breaking change to the SDK that causes this.
So now you have to do something like:
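(a rough sketch of the kind of change meant here; do_work stands in for whatever your worker actually runs)

def do_work(job_input):              # placeholder for your worker's real logic
    return {"result": job_input}

def handler(event):
    try:
        return do_work(event["input"])
    except Exception as e:
        # previously something like: return {"error": {"type": ..., "message": ...}}
        return {"error": str(e)}     # the error value has to be a plain string now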
I had this exact same issue and had to change my error handling to fix it.
Sorry, where does this change need to be made?
thank you for the response
in your endpoint handler file
Sorry I don't think I've ever modified that file, do I need the runpod python package to use it? I only have an endpoint that I set up
Are you using the vllm worker?
Im not sure, how can I find that out?
Not sure if that makes sense?
I get that response repeatedly
I can share my code, but as far as I can see looking from what you’ve posted, your output should be in the ’tokens’ part of the json that you get back. Try just printing everything you get back. If it’s completed, it should be there…
elif status == "COMPLETED":
    tokens = json_response['output'][0]['choices'][0]['tokens']
    return tokens
here's the relevant part of mine. if the status is COMPLETED, the output you want is in 'tokens'. hope this helps!
...so if I'm reading yours right, you'll want something like
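a rough guess, adapting my snippet to your status call (the exact nesting under "output" depends on the worker, so print the whole response first; URL and job id are placeholders):

import requests

url = "https://api.runpod.ai/v2/your_endpoint_id/status/your_job_id"   # placeholders
json_response = requests.get(url, headers={"Authorization": "Bearer YOUR_RUNPOD_API_KEY"}).json()

if json_response["status"] == "COMPLETED":
    print(json_response)                              # see what actually came back
    output = json_response.get("output")
    if output:                                        # vllm-style workers nest it like this
        tokens = output[0]['choices'][0]['tokens']
        print(tokens)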
I think, lol
…unless the problem really is that all you’re getting back is ’completed’ and no tokens at all anywhere. In which case forget all I said 😅
I will try this, thank you. Sorry, I didn't see this earlier
Hey the object I'm getting back doesn't have the "tokens" key. Did you use a handler function?
I just used the ready made vllm endpoint. 🤷♂️ I’m not really the one to ask. 👀
Hi @kingclimax7569 , what are you looking to deploy?
Hey I already have a serverless endpoint deployed
I'm just trying to use the status endpoint to retrieve the entire result of a query instead of using the stream endpoint to retrieve the results gradually
Is it for a LLM?
Yes
Have you tried our https://github.com/runpod-workers/worker-vllm?
We’re adding full OpenAI compatibility this week
I'm sorry how would that help? The problem seems to be with the runpod endpoints
Not the LLM
Ah i think ik why, do u have return_aggregate set to true?
U prob need return_aggregate_stream = true, so that if u are streaming, the streaming results become available on /run
also i think he's just sharing that if u wanna use vllm, runpod has a pretty good setup if it's not a custom model
It seems more like an issue in your worker code, because status should return the latest stream
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/stream_client_side.py
Here is an example of my own handler.py that I wrote for my own custom LLM stuff. You can ignore all the bash stuff, that is just for my own sake; but ik this works great for streaming / retrieving, + i have the clientside code that I've tested and validated
https://docs.runpod.io/serverless/workers/handlers/handler-async
Here is the doc for if you want to stream / have your end result available aggregated under the /status endpoint when the stream is done, which is what I based my handler off of.
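the basic shape from that doc is roughly this (just a sketch; the toy handler below only echoes the prompt back word by word):

import runpod

def handler(job):
    prompt = job["input"].get("prompt", "")
    # yield chunks as they are produced; swap this loop for your model's streamer
    for word in prompt.split():
        yield word + " "

runpod.serverless.start({
    "handler": handler,
    "return_aggregate_stream": True,   # makes the aggregated stream show up as the job output
})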
Thank you so much I'm gonna go through this and report back
Sorry I'm confused here, I don't see you using runpod's serverless endpoints, but instead you're using openllm?
I added return_aggregate_stream in my code but to no avail? Can you see what's wrong in the code I posted?
Even if I do this:
I get this:
sorry
i actually misunderstood ur question
I thought u wanted to deploy ur own LLM
submit_job_and_stream_output
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/stream_client_side.py
Is maybe what you want?
Sorry, I thought before, b/c of ur code, that u were sharing ur python handler; i dont use the runpod endpoints that often, but stream_client_side.py is prob something u can try
Ik it works for the way I defined stream, with yielding, so i imagine it should work for runpod
it looks like in the check_job_status function you're doing what I need. But the difference is I'm only receiving {"delaytime": 26083, "executionTime":35737, "id": **, "status": "COMPLETED"} back when I do that
And when I add the handler, I get this response
Sorry let me ask
is this an endpoint by runpod or by u?
Ive been very confused by this
this looks like an LLM
I guess I am confused bc u share this:
https://discord.com/channels/912829806415085598/1208117793925373983/1208143617814700032
Which is different than the below where u ask me:
https://discord.com/channels/912829806415085598/1208117793925373983/1209961564199845999
But if this is ur code
Like ur own deployed code
1) U dont wrap the runpod.start in main
2) Ur function is not a generator
Yea I'm not sure at all how to use the handler function lol
Ok
Yeah, could you please elaborate on what your end goal is and we can go from there
I just want to be able to retrieve the entire result of a query at once
Instead of streaming it
What llm do u wanna use?
What model would you like to deploy, quantized or not, if quantized what quantization
I think the problem is ur code isnt correct 😅 but there is existing code u can just deploy
and have it working
And only worry about calling it
The main problem is ur code isnt correctly defined as a generator, + also bc u printed and didnt return (or actually it should be yield), the return aggregate stream sees nothing, which is why u get nothing
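roughly the difference, as a tiny sketch (toy handler, not ur code):

import runpod

def handler(job):
    text = "generated text"
    print(text)    # only goes to the worker logs; /status output stays empty
    yield text     # this is what return_aggregate_stream actually collects

runpod.serverless.start({"handler": handler, "return_aggregate_stream": True})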
if you look up further I posted the original code
Yes but this doesnt tell us
what model u want
Do u want llama? mistral? so on
And if there is an end goal? Like u want a custom model, u wanna do custom logic later? so on
Also this is weird bc ur mixing up clientside and server side code
Ur defining a url to call inside of what looks to be the handler definition
vs it should just be calling the model
hommayushi3/exllama-runpod-serverless:latest is the docker container im using
If im mixing it up please correct it lol
im not sure exactly what server any of this is supposed to go on
Ok great! So, a few options:
1) If u have no reason to deploy ur own model I recommend using runpod's managed endpoint
https://doc.runpod.io/reference/llama2-13b-chat
2) Or u can use runpod's vllm worker instead and deploy that for official support
https://github.com/runpod-workers/worker-vllm
Use Option #1 in that repo and deploy it by just modifying the env variables
3) Use my model which is UNOFFICIAL but what I use:
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless
And i have a picture of how to set it up. Instead of serverlessllm u would point it at justinwlin/whatever i have the mistral image name as
Yeah here is a tutorial
https://discord.com/channels/912829806415085598/948767517332107274/1209990744094408774
I recommend doing the tutorial first
cause i think u have how u define ur server code (what gets put on runpod) confused with what u call from ur computer or client
ok i don't know if there's something im missing here but when my company set up a serverless endpoint it was on your website
im not sure what server im supposed to be uploading code to
All I want to do is query my existing endpoints so I can retrieve the result of a prompt all at once instead of streaming it
Im supposed to have access to these endpoints, the one I am trying to get working is
/status
Okay this is a veryyyy diff situation then
yes im getting that impression lol
I thought it was straight forward
I didnt realize ur at a company
bc that means u arent the one deploying it
https://github.com/hommayushi3/exllama-runpod-serverless/blob/master/handler.py
The problem is the handler ur code is using
when I say company it was just my boss who did it
return aggregate has to be defined here
On the server
correct
Yup so u can tell ur boss to add the return aggregate stuff we talked about
and ull be able to have the end result in the future
The problem isnt something u on the clientside (who is calling the function) can fix
The problem is what got deployed
I thought the problem is:
1) U deployed it
2) Ur looking for how to change the deployment to do what u want
but the problem is:
1) someone else deployed it
2) u want a different behavior than what is defined
well I have full access so I can do it. Are you saying there was an issue when setting up the new endpoint?
I could just set up a new one with new configs no?
Yes, the
runpod.serverless.start({"handler": inference})
needs to have the return aggregate stuff we talked about (see Generator Handler | RunPod Documentation: a handler that can stream fractional results)
It needs to be:
runpod.serverless.start({"handler": inference, "return_aggregate_stream": True})
Yea I was looking at that and was confused at where that goes
yea hopefully it's clear now
okay so does that code go somewhere in the config on the website??
Solution
Okay…
1) What is deployed to runpod is:
https://github.com/hommayushi3/exllama-runpod-serverless/blob/master/handler.py
2) U need to change the line i specified at the bottom of the file. u should have a copy of this github repo locally
3) U have to rebuild the image and redeploy it to runpod
4) When u call it in the future it will work
Run through the toy tutorial here if confused:
https://discord.com/channels/912829806415085598/948767517332107274/1209990744094408774
Okay forking the repo and editing the file then deploying on runpod makes sense
Hey sorry to bother you again, got it deployed and I'm not getting the same error but I'm continuously getting "IN_QUEUE" as a response
It means its in queue unless ur thing is responding
if its not responding u gotta check ur own logs
if its crashing somewhere or something
gotta check the UI console on runpod
yea im trying to view the logs but apparently there aren't any available
when building the container image I just reference my repo with the following format correct?: "<github-username>/exllama-runpod-serverless:latest"
yea it looks like ur stuff is still initializing
so u gotta check the initialization logs
also its usually not ur github username
its usually ur *dockerhub* username
it should be pushed to dockerhub
if its stuck initializing it usually means u didnt push it, u tagged it wrong, or u built for the wrong platform
ahh that makes more sense, I thought using git was too easy but I couldn't find the original repo on dockerhub so I tried git. Ill give that a shot thank you
Just to reiterate:
https://discord.com/channels/912829806415085598/948767517332107274/1209990744094408774
I highly recommend again to check out this tutorial
I think will be helpful 🙂
main thing is, if ur on a mac, as i said in that thread, append a --platform flag to your docker build command
Cause it goes through the process of taking code + building + shipping it to dockerhub + then using it on runpod
Thank you I'll commit it to docker hub and try it out
Does my docker hub repo need to be public? My endpoint still seems to be stuck on initializing
If it is not public u need to add docker registry credentials under settings
otherwise it is impossible for it to find it lol
If there is no reason to have it private, id just have it public too, personally
unless u bundled some sort of trademark secret sauce
but it seems ur just using like normal llama and modified the handler.py a bit
Hey I changed it to public right after I asked that and it's running now
Stupid question haha
Thank you for your patience bro
Same result unfortunately
Same when I use the console
Can you share ur github repo?
or ur handler.py
exllama-runpod-serverless/handler.py at master · enpro-github/exllama-runpod-serverless (GitHub)
what input are u sending to it? weird
try with /run so that the output is actually persisted for a bit longer.
When u do it, do u see the results coming out when u click the stream button?
yup im using run
ill try the stream one sec
I find it weird that its completed
but ur UI isn't green
results of stream
which part should be green? I assume it works for you?
Nvm, I get it, b/c it wont turn green till the stream is done
wtf why am I not getting that from /status lol
what variable are u passing down?
well this is my own stuff
can i see ur input?
Are you passing down a stream: True? variable?
def inference(event) -> Union[str, Generator[str, None, None]]:
    logging.info(event)
    job_input = event["input"]
    if not job_input:
        raise ValueError("No input provided")
    prompt: str = job_input.pop("prompt_prefix", prompt_prefix) + job_input.pop("prompt") + job_input.pop("prompt_suffix", prompt_suffix)
    max_new_tokens = job_input.pop("max_new_tokens", 100)
    stream: bool = job_input.pop("stream", False)
    generator, default_settings = load_model()
    settings = copy(default_settings)
    settings.update(job_input)
    for key, value in settings.items():
        setattr(generator.settings, key, value)
    if stream:
        output: Union[str, Generator[str, None, None]] = generate_with_streaming(prompt, max_new_tokens)
        for res in output:
            yield res
    else:
        output_text = generator.generate_simple(prompt, max_new_tokens=max_new_tokens)
        yield output_text[len(prompt):]

runpod.serverless.start({"handler": inference, "return_aggregate_stream": True})
you mean that?
It seems like ur code needs it
specifically this part
U might have been just running it as an all in one output
I find it weird that your output is not persisted still
but that is a great place to start first
sorry what line am I supposed to change exactly?
am I supposed to change it to equal True?
I assume something like the above
Since ur code is looking for a stream variable
Otherwise it says ur just going to get it all in one shot
Other than that, Idk why ur code is going wrong, u can refer to my code and try to break it down if you want:
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py
My code is a bit weird cause I wrote it to work on both serverless / gpu pod depending on env variables
but yeah other than that, I really dont know what else is going on with ur code
1) Should do /run
2) Need to pass down a stream variable that is true
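e.g. for #2, a sketch of the request (endpoint id / API key are placeholders):

import requests

requests.post(
    "https://api.runpod.ai/v2/your_endpoint_id/run",
    headers={"Authorization": "Bearer YOUR_RUNPOD_API_KEY"},
    json={"input": {"prompt": "Hello", "max_new_tokens": 100, "stream": True}},  # the handler pops "stream" from input
)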
I actually want to wait to get it all in one shot
ah got it
HM
I recommend to maybe just do a print statement on the output_text / what you are yielding before you slice it out
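something like this inside the else branch of inference(), reusing the names from the code u posted (just a sketch):

output_text = generator.generate_simple(prompt, max_new_tokens=max_new_tokens)
print("full output_text:", repr(output_text))             # raw generation, shows up in the logs
print("after slicing:", repr(output_text[len(prompt):]))  # make sure this isn't an empty string
yield output_text[len(prompt):]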
yea that sounds like a good idea
I find it weird your code can work like this tbh lol
idk how it is ending up in /stream
will I be able to see the print output in the logs
Or what is this output?
yea
Tbh, if you're just doing it all in one shot and u dont care about stream, u could just return directly and not yield
yea me neither I honestly just followed a tutorial and it used this
Ah.
Very weird
Or if ur company might use stream then nvm
But yeah honestly cannot say, my best bet is u can look to my repo for guidance, ik mine works
And mine implements both stream / one shot
it isnt the same library
but the structure will be the same
otherwise i really got no clue without diving deeper and debugging myself; but ull prob need to check thro all that
u can probably just return directly is my guess too for the one shot 🤔
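something like this, as a sketch (load_model() and generator come from the existing handler.py; the start call drops the generator/stream bits):

import runpod

def inference(event):
    job_input = event["input"]
    prompt = job_input["prompt"]
    max_new_tokens = job_input.get("max_new_tokens", 100)
    generator, _ = load_model()                      # load_model() is defined in the existing handler.py
    output_text = generator.generate_simple(prompt, max_new_tokens=max_new_tokens)
    return output_text[len(prompt):]                 # a plain return lands under "output" on /status

runpod.serverless.start({"handler": inference})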
So I printed off the output:
It does indeed look like it's actually generating the text result
Does handler.py need to return a dictionary or something? I think Ashley alluded to this, I noticed your handler.py does
Hm i think its that runpod doesnt handle dictionaries well but im not too sure. maybe try to json.dumps() what ur yielding out. but honestly not too sure
yield json.dumps(xxxx) or something like this.
but maybe make a new post and see if runpod staff can help u out with ur repo / handler, im not too sure what the issue may be.
i dont return a dictionary
i yield back out a string which gets put into the output by runpod automatically
honestly ur code doesnt seem too bad to me so not too sure without deep diving myself and trying diff structures out
RunPod's error field does not handle dictionaries; output is fine.
when ur printing, are u printing it with the output[len(prompt):] thing or just printing the output_text?
i wonder if maybe ur string slicing is yielding an empty string
but i find it weird it shows in stream for u but not normally
sounds like something to ask runpod tho if u can see in /stream
but not under /run
the only other advice i can give is strip the code to a simple ex from the docs and build it back up until it breaks
Hey sorry for the late reply, im trying this out now but the logs keep ending here
although im still getting the "COMPLETED" response, but sometimes the logs don't seem to finish all the way through
to be honest, im not too sure, this seems like something u should ask runpod in a new question / investigate and play around with different structures. I shared how my code works before and ik that works
But I dont know why urs doesnt work, it looks about the same to me
which is why i said maybe u just gotta run it in a GPU pod, and not just in serverless
and build it up step by step
hmm ok
thanks again you helped a lot. I feel like im a lot closer
Yeah, I would consider if u dont need streaming
just do a direct return
If u do a direct return
then less things to worry about
so dont yield
Tried that but I havent tried returning the output text without the slicing so ill try that next
Yeah, I mean theoretically, even if this ends up not working, you should be able to do like:
return "Hello World" and if that doesn't work something reallllllyyy is wrong somewhere. and there's something being missed
cause that'd be the most fundamental thing
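i.e. the bare-minimum version, as a sketch:

import runpod

def handler(job):
    return "Hello World"   # if even this never shows up under "output", the problem is outside the handler

runpod.serverless.start({"handler": handler})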
Yea that's exactly what I'm thinking, at the very least I should be able to affect the outcome, if it means breaking it
If not I'm just gonna use a new LLM lol
if u fall to that point can just use mine xD
Will do lol
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/README.md
Yeh, i mean my runpod stuff, as long as u set the env variable as my readme describes, u have the clientside code / docker image ready to go lol
just gotta change the way ur prompting it if u want to have that system / user thing, but that is an easy thing u can preprompt it with
anyways gl gl lol