How do I correctly stream results using runpod-python?
Currently, I'm doing the following:
-------
import runpod
runpod.api_key = 'yyyy'
endpoint = runpod.Endpoint('xxxx')
message = 'What is a synonym for purchase?'
run_request = endpoint.run({
    "input": {
        "prompt": message,
        "sampling_params": {
            "max_tokens": 5000,
            "max_new_tokens": 2000,
            "temperature": 0.7,
            "repetition_penalty": 1.15,
            "length_penalty": 10.0
        }
    }
})

for output in run_request.stream():
    print(output)
-------
However, stream() times out after 10 seconds and I don't see a way to increase the timeout. Also, once it does work, it seems like it sends back everything at once, instead of a chunk at a time, unless I'm doing something wrong?
I am not sure whether the RunPod SDK supports streaming properly yet, but @Justin Merrell should be able to advise.
Maybe instead of the SDK, use the Python requests module directly yourself so you can prevent the timeout.
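For illustration, here's a minimal sketch of calling the /run and /stream operations directly with requests. The https://api.runpod.ai/v2/{endpoint_id} base URL is the standard serverless REST path; API_KEY and ENDPOINT_ID are placeholders:
-------
import requests

API_KEY = "yyyy"      # placeholder, same key as above
ENDPOINT_ID = "xxxx"  # placeholder endpoint ID
base = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Submit the job via /run; the response includes the job ID.
job = requests.post(
    f"{base}/run",
    headers=headers,
    json={"input": {"prompt": "What is a synonym for purchase?"}},
    timeout=60,  # you control the HTTP timeout yourself here
).json()

# Ask /stream for whatever chunks are ready so far.
resp = requests.get(f"{base}/stream/{job['id']}", headers=headers, timeout=60).json()
print(resp["status"], resp["stream"])
-------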
What's a typical way to structure the requests? I tried using Postman to call the stream API endpoint and it still times out after 10 seconds, returning status "IN_QUEUE". Am I supposed to repeatedly call it every 10 seconds until I get a response?
1. Call /run API on your endpoint.
2. Call /stream API on your endpoint.
For example, the first time I call the stream API endpoint, it returns:
{
    "status": "IN_QUEUE",
    "stream": []
}
Every 10 seconds or so, I call it again, and it returns the same thing. Finally, after some time, when I call it, it returns:
{
    "status": "IN_PROGRESS",
    "stream": [
        {
            "output": {
                "finished": true,
                "tokens": [
                    "\nThe word you are looking for is buy, for example, \"Instead"
                ],
                "usage": {
                    "input": 9,
                    "output": 16
                }
            }
        }
    ]
}
This seems to be a chunk. But when I call it again after this, it seems like I miss the ending chunk, and just get this:
{
    "status": "COMPLETED",
    "stream": []
}
So now, I have to call the status API to get the full message, and subtract the chunk that I got before, in order to get the final chunk... seems pretty cumbersome, no?
https://docs.runpod.io/serverless/references/operations
Stream operation is this
You can't run /stream on Postman.
It needs to be through a curl request or a Python request.
Postman doesn't support /stream responses, as far as I know.
You can also use return_aggregate_stream:
https://docs.runpod.io/serverless/workers/handlers/handler-generator
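For context, here's a minimal sketch of a generator handler with that flag turned on, based on the generator-handler docs linked above (the handler body itself is made up for illustration):
-------
import runpod

def handler(job):
    # Yield fractional results; each yielded value becomes a chunk
    # returned by the /stream endpoint.
    for word in job["input"]["prompt"].split():
        yield word

runpod.serverless.start({
    "handler": handler,
    # Also aggregate the streamed chunks so the full result is
    # available from /status once the job completes.
    "return_aggregate_stream": True,
})
-------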
Would you happen to have example code that I can look at to see how to handle streaming properly in python?
The code that ashleyk shared is the right one,
where you can turn the key to true.
Or:
https://docs.runpod.io/serverless/references/operations
This you can just dump into Postman and it will convert.
GitHub: exllama-runpod-serverless/predict.py at master · ashleykleynhans/exllama-runpod-serverless
RunPod: Llama2 13B Chat
Retrieve Results & Status. Note: For information on how to check job status and retrieve results, please refer to our Status Endpoint Documentation.
Streaming Token Outputs: Make a POST request to the /llama2-13b-chat/run API endpoint. Retrieve the job ID. Make a GET request to /llama2-13b-chat/stream...
This is a pretty bad example; it will only loop for the range and not until the job is COMPLETED.
The link I gave is better; it does a while True and breaks from the loop when the job is COMPLETED.
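Something like this sketch, assuming the job was already submitted via /run (API_KEY, ENDPOINT_ID, and JOB_ID are placeholders):
-------
import time
import requests

API_KEY = "yyyy"      # placeholder
ENDPOINT_ID = "xxxx"  # placeholder
JOB_ID = "zzzz"       # the ID returned by the /run call
base = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Keep polling /stream and printing chunks until the job leaves the
# IN_QUEUE / IN_PROGRESS states (COMPLETED, FAILED, etc.).
while True:
    resp = requests.get(f"{base}/stream/{JOB_ID}", headers=headers, timeout=30).json()
    for chunk in resp.get("stream", []):
        print(chunk["output"])
    if resp["status"] not in ("IN_QUEUE", "IN_PROGRESS"):
        break
    time.sleep(1)
-------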
Got it, thanks! This aligns with what I was thinking, although there is still the possibility that the final call might miss the last chunk, and you'll have to call the status API or something to get the full response
Yes, run it in a loop until the status is no longer IN_PROGRESS or IN_QUEUE. We have an SDK that does this automatically. @Justin Merrell do we have docs for that?
It won't miss the last call if you check for the COMPLETED status.
We plan to support OpenAI SSE soon.
It missed it when I tested this in Postman. For example, I called it once and got this chunk:
{
    "status": "IN_PROGRESS",
    "stream": [
        {
            "output": {
                "finished": true,
                "tokens": [
                    "\nThe word you are looking for is buy, for example, \"Instead"
                ],
                "usage": {
                    "input": 9,
                    "output": 16
                }
            }
        }
    ]
}
Notice that it's in progress and the response in "tokens" is a partial one. Then I ran it again after waiting for 1 second and got this:
{
    "status": "COMPLETED",
    "stream": []
}
Notice that stream is empty, so I've missed the last chunk.
@Alpay Ariyak have we noticed this?
@wizardjoe make sure your serverless worker is using the latest SDK.
You need to have the input parameter "stream" set to True.
The reason you see one whole chunk and then it's completed is that you don't turn streaming on in the request input.
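For reference, here's the original request again with that flag added; this assumes, per the reply above, that the worker reads a "stream" flag from the job input:
-------
run_request = endpoint.run({
    "input": {
        "prompt": message,
        "stream": True,  # turn token-by-token streaming on
        "sampling_params": {
            "max_tokens": 5000,
            "max_new_tokens": 2000,
            "temperature": 0.7,
            "repetition_penalty": 1.15,
            "length_penalty": 10.0
        }
    }
})

for output in run_request.stream():
    print(output)
-------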
This solved it. Thanks!