How do I correctly stream results using runpod-python?
Currently, I'm doing the following:
-------
import runpod
runpod.api_key = 'yyyy'
endpoint = runpod.Endpoint('xxxx')
message = 'What is a synonym for purchase?'
run_request = endpoint.run({
    "input": {
        "prompt": message,
        "sampling_params": {
            "max_tokens": 5000,
            "max_new_tokens": 2000,
            "temperature": 0.7,
            "repetition_penalty": 1.15,
            "length_penalty": 10.0
        }
    }
})

for output in run_request.stream():
    print(output)
-------
However, stream() times out after 10 seconds and I don't see a way to increase the timeout. Also, once it does work, it seems like it sends back everything at once, instead of a chunk at a time, unless I'm doing something wrong?
I am not sure whether the RunPod SDK supports streaming properly yet, but @Justin Merrell should be able to advise.
Maybe instead of the SDK, use the Python requests module directly yourself so you can prevent the timeout.
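For illustration, here's a minimal sketch of calling the /run and /stream operations directly with requests. The https://api.runpod.ai/v2/{endpoint_id} base URL is the standard serverless REST path; API_KEY and ENDPOINT_ID are placeholders:
-------
import requests

API_KEY = "yyyy"      # placeholder, same key as above
ENDPOINT_ID = "xxxx"  # placeholder endpoint ID
base = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Submit the job via /run; the response includes the job ID.
job = requests.post(
    f"{base}/run",
    headers=headers,
    json={"input": {"prompt": "What is a synonym for purchase?"}},
    timeout=60,  # you control the HTTP timeout yourself here
).json()

# Ask /stream for whatever chunks are ready so far.
resp = requests.get(f"{base}/stream/{job['id']}", headers=headers, timeout=60).json()
print(resp["status"], resp["stream"])
-------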
What's a typical way to structure the requests? I tried using Postman to call the stream API endpoint and it still times out after 10 seconds, returning status "IN_QUEUE". Am I supposed to repeatedly call it every 10 seconds until I get a response?
1. Call /run API on your endpoint.
2. Call /stream API on your endpoint.
For example, the first time I call the stream API endpoint, it returns:
{
    "status": "IN_QUEUE",
    "stream": []
}
Every 10 seconds or so, I call it again, and it returns the same thing. Finally, after some time, when I call it, it returns:
{
    "status": "IN_PROGRESS",
    "stream": [
        {
            "output": {
                "finished": true,
                "tokens": [
                    "\nThe word you are looking for is buy, for example, \"Instead"
                ],
                "usage": {
                    "input": 9,
                    "output": 16
                }
            }
        }
    ]
}
This seems to be a chunk. But when I call it again after this, it seems like I miss the ending chunk, and just get this:
{
    "status": "COMPLETED",
    "stream": []
}
So now, I have to call the status API to get the full message, and subtract the chunk that I got before, in order to get the final chunk... seems pretty cumbersome, no?
https://docs.runpod.io/serverless/references/operations
Stream operation is this
You can't run /stream on Postman.
It needs to be through a curl request or a Python request.
Postman doesn't support /stream responses, as far as I know.
You can also use return_aggregate_stream:
https://docs.runpod.io/serverless/workers/handlers/handler-generator
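For context, here's a minimal sketch of a generator handler with that flag turned on, based on the generator-handler docs linked above (the handler body itself is made up for illustration):
-------
import runpod

def handler(job):
    # Yield fractional results; each yielded value becomes a chunk
    # returned by the /stream endpoint.
    for word in job["input"]["prompt"].split():
        yield word

runpod.serverless.start({
    "handler": handler,
    # Also aggregate the streamed chunks so the full result is
    # available from /status once the job completes.
    "return_aggregate_stream": True,
})
-------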
Would you happen to have example code that I can look at to see how to handle streaming properly in python?
The code that ashleyk shared is the right one,
where you can turn the key to true.
Or:
https://docs.runpod.io/serverless/references/operations
This you can just dump into Postman and it will convert.
GitHub: exllama-runpod-serverless/predict.py at master · ashleykleynhans/exllama-runpod-serverless
RunPod: Llama2 13B Chat
Retrieve Results & Status. Note: For information on how to check job status and retrieve results, please refer to our Status Endpoint Documentation.
Streaming Token Outputs: Make a POST request to the /llama2-13b-chat/run API endpoint. Retrieve the job ID. Make a GET request to /llama2-13b-chat/stream...
This is a pretty bad example; it will only loop for the range and not until the job is COMPLETED.
The link I gave is better; it does a while True and breaks from the loop when the job is COMPLETED.
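Something like this sketch, assuming the job was already submitted via /run (API_KEY, ENDPOINT_ID, and JOB_ID are placeholders):
-------
import time
import requests

API_KEY = "yyyy"      # placeholder
ENDPOINT_ID = "xxxx"  # placeholder
JOB_ID = "zzzz"       # the ID returned by the /run call
base = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Keep polling /stream and printing chunks until the job leaves the
# IN_QUEUE / IN_PROGRESS states (COMPLETED, FAILED, etc.).
while True:
    resp = requests.get(f"{base}/stream/{JOB_ID}", headers=headers, timeout=30).json()
    for chunk in resp.get("stream", []):
        print(chunk["output"])
    if resp["status"] not in ("IN_QUEUE", "IN_PROGRESS"):
        break
    time.sleep(1)
-------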
Got it, thanks! This aligns with what I was thinking, although there is still the possibility that the final call might miss the last chunk, and you'll have to call the status API or something to get the full response
Yes, run it in a loop until the status is no longer IN_PROGRESS or IN_QUEUE. We have an SDK that does this automatically. @Justin Merrell do we have docs for that?
It won't miss the last call if you check for the COMPLETED status.
We plan to support OpenAI SSE soon.
It missed it when I tested this in Postman. For example, I called it once and got this chunk:
{
    "status": "IN_PROGRESS",
    "stream": [
        {
            "output": {
                "finished": true,
                "tokens": [
                    "\nThe word you are looking for is buy, for example, \"Instead"
                ],
                "usage": {
                    "input": 9,
                    "output": 16
                }
            }
        }
    ]
}
Notice that it's in progress and the response in "tokens" is a partial one. Then I ran it again after waiting for 1 second and got this:
{
    "status": "COMPLETED",
    "stream": []
}
Notice that stream is empty, so I've missed the last chunk.
@Alpay Ariyak have we noticed this?
@wizardjoe make sure your serverless worker is using the latest SDK.
You need to have the input parameter "stream" set to True.
The reason you see one whole chunk and then it's completed is that you don't turn streaming on in the request input.
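For reference, here's the original request again with that flag added; this assumes, per the reply above, that the worker reads a "stream" flag from the job input:
-------
run_request = endpoint.run({
    "input": {
        "prompt": message,
        "stream": True,  # turn token-by-token streaming on
        "sampling_params": {
            "max_tokens": 5000,
            "max_new_tokens": 2000,
            "temperature": 0.7,
            "repetition_penalty": 1.15,
            "length_penalty": 10.0
        }
    }
})

for output in run_request.stream():
    print(output)
-------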
This solved it. Thanks!