Serverless Endpoint Streaming
I'm currently working with Llama.cpp for my inference and have set up my handler.py file to be similar to this guide:
https://docs.runpod.io/docs/handler-generator
My input and handler file look like this:
My problem is that whenever I test this in the Requests tab on the dashboard, it keeps saying the stream is empty.
https://github.com/runpod-workers/worker-vllm
For streaming to work, your handler needs to yield. Take a look at https://github.com/runpod-workers/worker-vllm/blob/a247a3afe10a7d9002fb1f35971b7c5e29873950/src/handler.py#L13
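Roughly, a streaming handler looks something like this (a minimal sketch assuming the standard runpod Python SDK; `generate_tokens` is just a stand-in for your llama.cpp call):

```python
import runpod

def generate_tokens(prompt):
    # Stand-in for your llama.cpp streaming call -- replace with your own.
    for token in ["Hello", ",", " world"]:
        yield token

def handler(job):
    prompt = job["input"].get("query", "")
    # Yielding (instead of returning) is what makes the job streamable.
    for token in generate_tokens(prompt):
        yield token

runpod.serverless.start({
    "handler": handler,
    # Optional: also aggregate the streamed chunks into the final job result.
    "return_aggregate_stream": True,
})
```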
Thank you!
No problem, let me know if you run into any other issues 🙂
I'm assuming this wouldn't really work if I wanted to test locally, because the output would just be async generator objects, right?
How are you testing locally? If you are starting the API server with the `rp_serve_api` flag, there is a stream test endpoint.
Just `python rp_handler.py`
Could you show your local test run so I can see exactly what you are doing?
Not sure if I added yield correctly, but below is my output:
--- Starting Serverless Worker | Version 1.4.0 ---
INFO | Using test_input.json as job input.
DEBUG | Retrieved local job: {'input': {'query': 'Temp', 'stream': True}, 'id': 'local_test'}
INFO | local_test | Started.
DEBUG | local_test | Handler output: <async_generator object handler at 0x7f83774a6440>
DEBUG | local_test | run_job return: {'output': <async_generator object handler at 0x7f83774a6440>}
INFO | Job local_test completed successfully.
INFO | Job result: {'output': <async_generator object handler at 0x7f83774a6440>}
INFO | Local testing complete, exiting.
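That `Handler output: <async_generator object handler ...>` line means the local runner printed the generator object instead of consuming it. To see the actual chunks locally, you can drive the async generator yourself (a quick sketch, outside the RunPod test harness; `handler` is your own async generator handler):

```python
# Quick local check: iterate the async generator and print each yielded chunk.
import asyncio

async def main():
    job = {"id": "local_test", "input": {"query": "Temp", "stream": True}}
    async for chunk in handler(job):  # handler = your async generator handler
        print(chunk)

asyncio.run(main())
```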
I see, I will need to re-visit the testing for streaming. Calling your program with the `rp_serve_api` flag is going to be the current option for testing.
Tried using it. /stream only returns 'detail not found'.
I should be expecting my result like this?:
Correct, could you provide a screenshot of your test? I'll double check that it is working as expected.
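For anyone following along: once the endpoint is deployed, the streamed chunks come back from the endpoint's /stream route rather than /run. A rough polling sketch (endpoint ID and API key are placeholders, and the exact shape of each chunk depends on what your handler yields):

```python
import time
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-api-key"           # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit an async job to the serverless endpoint.
job = requests.post(f"{BASE}/run", headers=HEADERS,
                    json={"input": {"query": "Temp", "stream": True}}).json()
job_id = job["id"]

# Poll /stream until the job finishes; each response carries the chunks
# the handler has yielded so far.
while True:
    resp = requests.get(f"{BASE}/stream/{job_id}", headers=HEADERS).json()
    for chunk in resp.get("stream", []):
        print(chunk.get("output"), end="", flush=True)
    if resp.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(0.5)
```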
Hey Justin! I followed the readme above with the vLLM RunPod worker and got streaming to work. One of my questions is that my delay time on this new endpoint is significantly longer than on my original endpoint.
Seems like adding more workers speeds it up?
When you sent the first request in, were there any workers ready?
I think it was a cold start. Now I'm struggling to get the workers to turn off after running.
Ah active workers
I'm dumb.
I turned on the active worker since the cold boot was 50 seconds haha
Could you please tell me how you were able to stream the output? What changes did you make to your previous code, which already had "yield" for streaming and still wasn't working?
https://github.com/runpod-workers/worker-vllm
Check this out :)
I checked this repo but wasn't able to figure out what was needed. Could you please be a little more specific about what exactly helped with streaming, apart from yield?
I would fork this repo and adapt it to what you need.
Use this worker to build/init your LLM.
Hi, could you please provide the URL of your fork? I need this too.
Sure, that's why the build instructions are there in the README.md file.
Sorry, maybe I am reading the wrong README.md at https://github.com/runpod-workers/worker-vllm/blob/main/README.md. Can you please point me to the right section or link (if that link is wrong)?
This is the correct documentation