How to stream via the OpenAI Base URL?
Does the OpenAI Base URL support Server-Sent Events (SSE) streaming?
I was previously working with Ooba and streaming worked fine. Since we switched to vLLM/Serverless, it no longer works.
If this is not done via SSE, is there perhaps a tutorial you could recommend on how to achieve streaming, please?
SSE is the recommended way for an OpenAI(-compatible) API: https://platform.openai.com/docs/api-reference/streaming
I have a bad feeling RunPod doesn't support this yet. If not, please put this on the roadmap. Thanks
Actually I think this was a false alarm. I was looking at this documentation, which I believe is outdated:
https://doc.runpod.io/reference/llama2-13b-chat
It is no longer necessary to do a loop. It seems the OpenAI Base URL:
https://api.runpod.ai/v2/vllm-{{endpoint_id}}/openai/v1
with stream: true
already supports the SSE specification.
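For anyone landing on this later, here is a minimal sketch of what that looks like with the official openai Python client. The endpoint ID, API key and model name below are placeholders, so adjust them for your own deployment:

```python
# Minimal streaming sketch against the RunPod OpenAI-compatible base URL.
# <ENDPOINT_ID>, <RUNPOD_API_KEY> and <MODEL_NAME> are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/vllm-<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

stream = client.chat.completions.create(
    model="<MODEL_NAME>",  # whatever model your vLLM worker serves
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # ask for SSE chunks instead of a single final response
)

# The client handles the SSE connection; we just iterate the chunks.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```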
Amazing.
It's a different endpoint, sorry.
Yes, it's the right endpoint for OpenAI; it supports the client library well.
Check out the docs for the right OpenAI endpoint usage.
Sorry, maybe I wasn't able to find the most up-to-date docs. The ones I found keep mentioning the Run endpoint with a loop.
RunPod needs to do better with the docs and explain the OpenAI endpoint better. In fact, the whole Run and SyncRun flow is not needed when using the OpenAI endpoints. The docs need to highlight that.
vLLM Endpoint | RunPod Documentation
Run any LLM model with RunPod's vLLM Worker.
This one
Get started | RunPod Documentation
RunPod provides a simple way to run large language models (LLMs) as a Serverless Endpoint.
Thanks
Yes, but if you look at the Streaming section, it assumes the user is utilising Python with the OpenAI library.
And it suggests doing a loop:
This is not needed, because the OpenAI endpoint already supports SSE. An SSE implementation in any programming language will take care of the looping automatically.
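To illustrate the point: the chat-completions route just emits standard `data: ...` SSE lines ending with a `[DONE]` marker, so a single streaming connection replaces the poll loop. Here's a rough sketch using plain requests, with the endpoint ID, API key and model name as placeholders:

```python
# Rough sketch: consuming the SSE stream directly, without the openai client.
# <ENDPOINT_ID>, <RUNPOD_API_KEY> and <MODEL_NAME> are placeholders.
import json
import requests

url = "https://api.runpod.ai/v2/vllm-<ENDPOINT_ID>/openai/v1/chat/completions"
headers = {"Authorization": "Bearer <RUNPOD_API_KEY>"}
payload = {
    "model": "<MODEL_NAME>",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}

with requests.post(url, json=payload, headers=headers, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip blank separator / keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":  # OpenAI-style end-of-stream marker
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)
```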
It would be good to mention that. Thanks
Ohh yeah yeah
Would you mind helping me by creating a #🧐|feedback post on that? Don't forget to also mention the docs link please 🙂
Sure 🙂
Done
noice thanks bro 🙂
that's great
I'm curious what the code will look like if we use the looping that is handled automatically by the SSE client
can you send it here? I wanna see it
@houmie
Sure.
Server-sent events (SSE) clients simplify the process of handling asynchronous data. Unlike traditional methods, where a POST request and repeated polling are necessary to retrieve and monitor job data until completion, SSE automates this process by managing asynchronous events behind the scenes.

For Swift implementations, you can use https://github.com/launchdarkly/swift-eventsource. This library allows replacing semaphores with a more modern async/continuation pattern, eliminating the need for polling. The client listens to the incoming data stream and triggers onClosed() when the stream ends. Code example: https://github.com/launchdarkly/swift-eventsource/issues/75#issuecomment-2032533650

Although I have limited experience with Python's SSE clients, I have tested the sseclient-py package available at https://github.com/mpetazzoni/sseclient. Another notable Python SSE client is detailed at https://github.com/btubbs/sseclient, with a useful guide here: https://maxhalford.github.io/blog/flask-sse-no-deps/. I am not sure if Python's implementation is non-blocking, as I have not used it extensively. However, ideally it should be asynchronous to serve its purpose effectively. In Swift, I can confirm that it functions perfectly without blocking.
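As a rough illustration of the sseclient-py route against the RunPod OpenAI-compatible URL (a sketch only; the endpoint ID, API key and model name are placeholders):

```python
# Sketch: streaming chat completions with sseclient-py (https://github.com/mpetazzoni/sseclient).
# <ENDPOINT_ID>, <RUNPOD_API_KEY> and <MODEL_NAME> are placeholders.
import json
import requests
import sseclient

url = "https://api.runpod.ai/v2/vllm-<ENDPOINT_ID>/openai/v1/chat/completions"
headers = {
    "Authorization": "Bearer <RUNPOD_API_KEY>",
    "Accept": "text/event-stream",
}
payload = {
    "model": "<MODEL_NAME>",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}

response = requests.post(url, json=payload, headers=headers, stream=True)
client = sseclient.SSEClient(response)  # the client parses the SSE framing for us

for event in client.events():
    if event.data == "[DONE]":  # OpenAI-style end-of-stream marker
        break
    chunk = json.loads(event.data)
    delta = chunk["choices"][0]["delta"].get("content")
    if delta:
        print(delta, end="", flush=True)
```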
I feel like that's already a nice way to demonstrate how to receive the stream, but is there some other way you can think of? Like without using any library
That looks fine already (the for loop) and doesn't seem to need to be updated, unless there is some other great alternative, right
for an example
Sure, I don't mind. The ones above are SSE libraries; I just thought I'd point it out. But yeah, you can keep it as it is, just mention that SSE is supported, and maybe it would be beneficial to mention the difference between the OpenAI Base URL and Run/SyncRun. But it's really up to you.
Yeah that difference has to be further explained
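For what it's worth, a rough, hypothetical side-by-side of the two calling styles (placeholders throughout; the exact input schema for the native route depends on the worker, so check the worker-vllm docs):

```python
# Hypothetical comparison of the two RunPod calling styles (placeholders throughout).
import requests

API_KEY = "<RUNPOD_API_KEY>"
ENDPOINT = "vllm-<ENDPOINT_ID>"
headers = {"Authorization": f"Bearer {API_KEY}"}

# 1) Native serverless route: the payload is wrapped in "input" (worker-specific schema)
#    and you get one JSON result back; the async /run variant returns a job id to poll.
native = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT}/runsync",
    json={"input": {"prompt": "Hello!"}},
    headers=headers,
).json()

# 2) OpenAI-compatible route: a standard chat-completions request; add "stream": True
#    (or use an OpenAI/SSE client) to get SSE streaming.
openai_style = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT}/openai/v1/chat/completions",
    json={
        "model": "<MODEL_NAME>",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    headers=headers,
).json()

print(native, openai_style)
```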
Please use the GitHub worker-vllm docs for now; the docs.runpod.io ones are being updated.