RunPod, 12mo ago
wizardjoe

How do I correctly stream results using runpod-python?

Currently, I'm doing the following:
-------
import runpod

runpod.api_key = 'yyyy'
endpoint = runpod.Endpoint('xxxx')

message = 'What is a synonym for purchase?'

run_request = endpoint.run({
    "input": {
        "prompt": message,
        "sampling_params": {
            "max_tokens": 5000,
            "max_new_tokens": 2000,
            "temperature": 0.7,
            "repetition_penalty": 1.15,
            "length_penalty": 10.0
        }
    }
})

for output in run_request.stream():
    print(output)
-------
However, stream() times out after 10 seconds and I don't see a way to increase the timeout. Also, once it does work, it seems to send back everything at once instead of a chunk at a time, unless I'm doing something wrong?
19 Replies
ashleyk, 12mo ago
I am not sure whether the RunPod SDK supports streaming properly yet, but @Justin Merrell should be able to advise.
justin, 12mo ago
Maybe instead of the SDK, use the Python requests module directly, so you can control the timeout yourself.
wizardjoe (OP), 12mo ago
What's a typical way to structure the requests? I tried using Postman to call the stream API endpoint and it still times out after 10 seconds, returning status "IN_QUEUE". Am I supposed to repeatedly call it every 10 seconds until I get a response?
ashleyk, 12mo ago
1. Call the /run API on your endpoint.
2. Call the /stream API on your endpoint.
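A minimal sketch of the two steps above, using only the Python standard library (the requests module would work the same way). The base URL, the Bearer authorization header, the /stream/{job_id} path, and the "id" field in the /run response are assumptions, not details stated in this thread, and the helper names are hypothetical. Check them against the RunPod documentation before relying on this.

```python
import json
import urllib.request

# Assumed base URL for serverless endpoints; confirm against the RunPod docs.
API_BASE = "https://api.runpod.ai/v2"


def run_url(endpoint_id):
    """URL for step 1 (submit the job)."""
    return f"{API_BASE}/{endpoint_id}/run"


def stream_url(endpoint_id, job_id):
    """URL for step 2 (read streamed output for one job). Path is an assumption."""
    return f"{API_BASE}/{endpoint_id}/stream/{job_id}"


def call_api(url, api_key, payload=None, timeout=60):
    """POST when a payload is given, otherwise GET; returns the decoded JSON body."""
    data = json.dumps(payload).encode() if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST" if payload is not None else "GET",
    )
    # The timeout is under the caller's control here, unlike the SDK default.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def submit_job(endpoint_id, api_key, job_input):
    """Step 1: call /run and return the job ID (assumed to be in the "id" field)."""
    return call_api(run_url(endpoint_id), api_key, {"input": job_input})["id"]


def fetch_stream(endpoint_id, api_key, job_id):
    """Step 2: call /stream once for that job and return the JSON response."""
    return call_api(stream_url(endpoint_id, job_id), api_key)
```

A caller would run submit_job once, keep the returned job ID, and then call fetch_stream repeatedly with it.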
wizardjoe (OP), 12mo ago
For example, the first time I call the stream API endpoint, it returns:
    { "status": "IN_QUEUE", "stream": [] }
Every 10 seconds or so, I call it again, and it returns the same thing. Finally, after some time, when I call it, it returns:
    {
      "status": "IN_PROGRESS",
      "stream": [
        {
          "output": {
            "finished": true,
            "tokens": [ "\nThe word you are looking for is buy, for example, \"Instead" ],
            "usage": { "input": 9, "output": 16 }
          }
        }
      ]
    }
This seems to be a chunk. But when I call it again after this, it seems like I miss the ending chunk, and just get this:
    { "status": "COMPLETED", "stream": [] }
So now I have to call the status API to get the full message, and subtract the chunk that I got before, in order to get the final chunk... seems pretty cumbersome, no?
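As an illustration, the response shape quoted in this message could be unpacked as follows. This is a sketch that assumes every entry in "stream" carries an "output" dict with a "tokens" list, as in the example; other workers may return a different shape, and the function name is hypothetical.

```python
def tokens_from_stream(response):
    """Collect the token strings from one /stream response.

    Assumes the shape quoted in the thread:
    {"status": ..., "stream": [{"output": {"tokens": [...]}}]}
    Entries without an "output" or "tokens" key are skipped.
    """
    tokens = []
    for item in response.get("stream", []):
        tokens.extend(item.get("output", {}).get("tokens", []))
    return tokens
```

An empty "stream" list, as in the IN_QUEUE and COMPLETED responses above, simply yields an empty list.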
justin, 12mo ago
Endpoint operations | RunPod Documentation
Comprehensive guide on interacting with models using RunPod's API Endpoints without managing the pods yourself.
justin, 12mo ago
You can't run /stream in Postman; it needs to be done through a curl request or a Python request. Postman doesn't support /stream responses, as far as I know.
ashleyk, 12mo ago
Generator Handler | RunPod Documentation
A handler that can stream fractional results.
wizardjoe (OP), 12mo ago
Would you happen to have example code that I can look at to see how to handle streaming properly in python?
justin, 12mo ago
The code that ashleyk shared is the right one, where you can set the key to true. Or see https://docs.runpod.io/serverless/references/operations. You can just paste that into Postman and it will convert it.
ashleyk, 12mo ago
GitHub
exllama-runpod-serverless/predict.py at master · ashleykleynhans/ex...
Contribute to ashleykleynhans/exllama-runpod-serverless development by creating an account on GitHub.
justin, 12mo ago
RunPod
Llama2 13B Chat
Retrieve Results & Status. Note: For information on how to check job status and retrieve results, please refer to our Status Endpoint Documentation. Streaming Token Outputs: Make a POST request to the /llama2-13b-chat/run API endpoint. Retrieve the job ID. Make a GET request to /llama2-13b-chat/stream...
ashleyk, 12mo ago
This is a pretty bad example: it only loops for the given range, not until the job is COMPLETED. The link I gave is better: it does a while True and breaks from the loop when the job is COMPLETED.
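A rough sketch of the while True pattern described here, not the code from the linked repository. It takes a hypothetical fetch callable that returns one decoded /stream response per call, so it can be tried without a network connection. One deliberate deviation: it stops on any status other than IN_QUEUE or IN_PROGRESS rather than only on COMPLETED, so a job that ends in some other state does not loop forever.

```python
import time

# Statuses seen in this thread that mean "keep polling".
PENDING_STATUSES = {"IN_QUEUE", "IN_PROGRESS"}


def stream_until_done(fetch, poll_interval=1.0, sleep=time.sleep):
    """Yield every stream item until the job leaves the pending states.

    fetch: hypothetical zero-argument callable returning a dict such as
           {"status": "IN_PROGRESS", "stream": [...]}.
    sleep: injectable so the loop can be exercised without real delays.
    """
    while True:
        response = fetch()
        # Yield items before checking the status, so anything delivered
        # together with the final response is not dropped.
        for item in response.get("stream", []):
            yield item
        if response.get("status") not in PENDING_STATUSES:
            break
        sleep(poll_interval)
```

Usage would look like `for chunk in stream_until_done(my_fetch): print(chunk)`, where my_fetch wraps a single GET to the /stream URL for the job.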
wizardjoe (OP), 12mo ago
Got it, thanks! This aligns with what I was thinking, although there is still the possibility that the final call might miss the last chunk, and you'll have to call the status API or something to get the full response
flash-singh, 12mo ago
Yes, run it in a loop until the status is no longer IN_PROGRESS or IN_QUEUE. We have an SDK that does this automatically. @Justin Merrell, do we have docs for that? It won't miss the last call if you check for the COMPLETED status. We plan to support OpenAI SSE soon.
wizardjoe (OP), 12mo ago
It missed it when I tested this in Postman. For example, I called it once and got this chunk:
    {
      "status": "IN_PROGRESS",
      "stream": [
        {
          "output": {
            "finished": true,
            "tokens": [ "\nThe word you are looking for is buy, for example, \"Instead" ],
            "usage": { "input": 9, "output": 16 }
          }
        }
      ]
    }
Notice that it's in progress and the response in "tokens" is a partial one. Then I ran it again after waiting for 1 second and got this:
    { "status": "COMPLETED", "stream": [] }
Notice that stream is empty. So I've missed the last chunk.
flash-singh, 12mo ago
@Alpay Ariyak, have we noticed this? @wizardjoe, make sure your serverless (sls) worker is using the latest SDK.
Alpay Ariyak, 12mo ago
You need to have the input parameter “stream” set to True. The reason you see one whole chunk and then it’s completed is that you don’t turn streaming on in the request input.
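For illustration, the original request payload with streaming switched on might be built like this. It is a sketch that assumes "stream" sits directly inside "input" next to "prompt", as this reply suggests; the helper name is hypothetical.

```python
def build_streaming_payload(prompt, sampling_params=None):
    """Build a /run payload with streaming enabled in the request input.

    Assumption: the worker reads "stream" from the top level of "input".
    """
    job_input = {"prompt": prompt, "stream": True}
    if sampling_params:
        # Copy so the caller's dict is not shared with the payload.
        job_input["sampling_params"] = dict(sampling_params)
    return {"input": job_input}
```

With the SDK code from the question, this would be passed as `endpoint.run(build_streaming_payload(message, {"max_tokens": 5000, "temperature": 0.7}))`.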
wizardjoe (OP), 12mo ago
This solved it. Thanks!
