RunPod3mo ago
houmie

How to stream via OPENAI BASE URL?

Does the OPENAI BASE URL support Server-sent events (SSE) streaming? I was previously working with Ooba and streaming worked fine. Since we switched to vLLM/Serverless, it is no longer working. If this is not done via SSE, is there perhaps a tutorial you could recommend on how to achieve streaming, please?
15 Replies
houmie
houmie3mo ago
SSE is the recommended way for an OpenAI (compatible) API: https://platform.openai.com/docs/api-reference/streaming I have a bad feeling RunPod doesn't support this yet. If not, please put this on the roadmap. Thanks. Actually, I think this was a false alarm. I was looking at this documentation, which I believe is outdated: https://doc.runpod.io/reference/llama2-13b-chat It is no longer necessary to do a polling loop. It seems the OpenAI Base URL https://api.runpod.ai/v2/vllm-{{endpoint_id}}/openai/v1 with stream: true already supports the SSE specification. Amazing.
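For reference, here is a minimal Python sketch of that setup with the official openai client (not taken from the RunPod docs; the endpoint ID, API key, and model name are placeholders, use whatever model your endpoint actually serves):

# Minimal sketch: streaming chat completions from a RunPod vLLM Serverless
# endpoint via its OpenAI-compatible base URL.
# "ENDPOINT_ID", "RUNPOD_API_KEY", and the model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
    api_key="RUNPOD_API_KEY",
)

# stream=True makes the server respond with Server-sent events (SSE);
# the OpenAI client exposes them as an iterator of chunks.
response_stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
)

for chunk in response_stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)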
nerdylive
nerdylive3mo ago
It's a different endpoint, sorry. Yes, it's the right endpoint for OpenAI and it supports the client library well. Check out the docs for the right OpenAI endpoint usage.
houmie
houmie3mo ago
Sorry, maybe I wasn't able to find the most up-to-date docs. The ones I found keep mentioning the Run endpoint with a loop. RunPod needs to do better with the docs and explain the OpenAI endpoint better. In fact, Run and SyncRun are not needed at all when using the OpenAI endpoint. The docs should highlight that.
nerdylive
nerdylive3mo ago
This one
nerdylive
nerdylive3mo ago
Get started | RunPod Documentation
RunPod provides a simple way to run large language models (LLMs) as a Serverless Endpoint.
houmie
houmie3mo ago
Yes, but if you look at the Streaming section, it assumes the user is using Python with the OpenAI library, and it suggests doing a loop:
for chunk in response_stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
This is not needed, because the OpenAI endpoint already supports SSE. An SSE implementation in any programming language will take care of the looping automatically. It would be good to mention that.
Thanks
nerdylive
nerdylive2mo ago
Ohh yeah, would you mind helping me by creating a #🧐|feedback post on that? Don't forget to also mention the docs link, please 🙂
houmie
houmie2mo ago
Sure 🙂 Done
nerdylive
nerdylive2mo ago
Noice, thanks bro 🙂 That's great. I'm curious what the code will look like when the looping is handled automatically. Can you send it here? I wanna see it @houmie
houmie
houmie2mo ago
Sure.
Server-sent events (SSE) clients simplify the process of handling asynchronous data. Unlike traditional methods, where a POST request and repeated polling are necessary to retrieve and monitor job data until completion, SSE automates this process by managing asynchronous events behind the scenes.

For Swift implementations, you can use https://github.com/launchdarkly/swift-eventsource. This library allows replacing semaphores with a more modern async/continuation pattern, eliminating the need for polling. The client listens to the incoming data stream and triggers onClosed() when the stream ends. Code example: https://github.com/launchdarkly/swift-eventsource/issues/75#issuecomment-2032533650

Although I have limited experience with Python's SSE clients, I have tested the sseclient-py package available at https://github.com/mpetazzoni/sseclient. Another notable Python SSE client is detailed at https://github.com/btubbs/sseclient, with a useful guide here: https://maxhalford.github.io/blog/flask-sse-no-deps/. I am not sure if Python's implementation is non-blocking, as I have not used it extensively. However, ideally it should be asynchronous to serve its purpose effectively. In Swift, I can confirm that it functions perfectly without blocking.
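As a rough illustration (not taken from the RunPod docs; the endpoint ID, API key, and model name are placeholders), this is roughly how sseclient-py can consume the same OpenAI-compatible chat completions endpoint:

# Sketch: consuming the OpenAI-compatible endpoint with a generic SSE client
# (sseclient-py, https://github.com/mpetazzoni/sseclient) instead of the openai library.
# ENDPOINT_ID, RUNPOD_API_KEY, and the model name are placeholders.
import json
import requests
import sseclient

url = "https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1/chat/completions"
headers = {"Authorization": "Bearer RUNPOD_API_KEY", "Accept": "text/event-stream"}
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Say hello"}],
    "stream": True,
}

response = requests.post(url, headers=headers, json=payload, stream=True)
client = sseclient.SSEClient(response)

# The SSE client yields one event per "data:" frame; the stream ends with "[DONE]".
for event in client.events():
    if event.data == "[DONE]":
        break
    chunk = json.loads(event.data)
    print(chunk["choices"][0]["delta"].get("content") or "", end="", flush=True)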
nerdylive
nerdylive2mo ago
I feel like that's already a nice way to demonstrate how to receive the stream, but is there some other way you can think of, like without using any library? The for loop already looks fine and doesn't seem to need updating, unless there is some other great alternative for the example.
houmie
houmie2mo ago
Sure, I don't mind. The ones above are SSE libraries; I just thought I'd point them out. But yeah, you can keep it as it is, just mention that SSE is supported, and maybe it would be beneficial to mention the difference between the OpenAI Base URL and Run/SyncRun. But it's really up to you.
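For completeness, here is a dependency-light sketch of what parsing the raw SSE stream could look like without a dedicated SSE library, using only requests (illustrative only; the endpoint ID, API key, and model name are placeholders):

# Sketch: parsing the SSE stream by hand with requests, no SSE library.
# ENDPOINT_ID, RUNPOD_API_KEY, and the model name are placeholders.
import json
import requests

url = "https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1/chat/completions"
headers = {"Authorization": "Bearer RUNPOD_API_KEY"}
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Say hello"}],
    "stream": True,
}

with requests.post(url, headers=headers, json=payload, stream=True) as response:
    # Each SSE frame arrives as a line of the form "data: {...}".
    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content") or "", end="", flush=True)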
nerdylive
nerdylive2mo ago
Yeah that difference has to be further explained
Alpay Ariyak
Alpay Ariyak2mo ago
Please use the GitHub worker-vllm docs for now; the docs.runpod.io ones are being updated.