RunPod3mo ago
houmie

How to stream via OPENAI BASE URL?

Does the OPENAI BASE URL support Server-sent events (SSE) streaming? I was previously working with Ooba and streaming worked fine. Since we switched to vLLM/Serverless, it is no longer working. If this is not done via SSE, is there perhaps a tutorial you could recommend on how to achieve streaming, please?
15 Replies
houmie
houmie3mo ago
SSE is the recommended way for an OpenAI (compatible) API: https://platform.openai.com/docs/api-reference/streaming I have a bad feeling RunPod doesn't support this yet. If not, please put this on the roadmap. Thanks. Actually, I think this was a false alarm. I was looking at this documentation, which I believe is outdated: https://doc.runpod.io/reference/llama2-13b-chat It is no longer necessary to do a polling loop. It seems the OpenAI Base URL https://api.runpod.ai/v2/vllm-{{endpoint_id}}/openai/v1 with stream: true already supports the SSE specification. Amazing.
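For reference, here is a minimal Python sketch of that setup with the official openai client (not taken from the RunPod docs; the endpoint ID, API key, and model name are placeholders, use whatever model your endpoint actually serves):

# Minimal sketch: streaming chat completions from a RunPod vLLM Serverless
# endpoint via its OpenAI-compatible base URL.
# "ENDPOINT_ID", "RUNPOD_API_KEY", and the model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
    api_key="RUNPOD_API_KEY",
)

# stream=True makes the server respond with Server-sent events (SSE);
# the OpenAI client exposes them as an iterator of chunks.
response_stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
)

for chunk in response_stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)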
nerdylive
nerdylive3mo ago
It's a different endpoint, sorry. Yes, it's the right endpoint for OpenAI and it supports the client library well. Check out the docs for the right OpenAI endpoint usage.
houmie
houmie3mo ago
Sorry, maybe I wasn't able to find the most up-to-date docs. The ones I found keep mentioning the Run endpoint with a loop. RunPod needs to do better with the docs and explain the OpenAI endpoint better. In fact, Run and SyncRun are not needed at all when using the OpenAI endpoint. The docs should highlight that.
nerdylive
nerdylive3mo ago
This one
nerdylive
nerdylive3mo ago
Get started | RunPod Documentation
RunPod provides a simple way to run large language models (LLMs) as a Serverless Endpoint.
houmie
houmie3mo ago
Yes, but if you look at the Streaming section, it assumes the user is using Python with the OpenAI library, and it suggests doing a loop:
for chunk in response_stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
This is not needed, because the OpenAI endpoint already supports SSE. An SSE implementation in any programming language will take care of the looping automatically. It would be good to mention that.
Thanks
nerdylive
nerdylive2mo ago
Ohh yeah, would you mind helping me by creating a #🧐|feedback post on that? Don't forget to also mention the docs link, please 🙂
houmie
houmie2mo ago
Sure 🙂 Done
nerdylive
nerdylive2mo ago
Noice, thanks bro 🙂 That's great. I'm curious what the code will look like when the looping is handled automatically. Can you send it here? I wanna see it @houmie
houmie
houmie2mo ago
Sure.
Server-sent events (SSE) clients simplify the process of handling asynchronous data. Unlike traditional methods, where a POST request and repeated polling are necessary to retrieve and monitor job data until completion, SSE automates this process by managing asynchronous events behind the scenes.

For Swift implementations, you can use https://github.com/launchdarkly/swift-eventsource. This library allows replacing semaphores with a more modern async/continuation pattern, eliminating the need for polling. The client listens to the incoming data stream and triggers onClosed() when the stream ends. Code example: https://github.com/launchdarkly/swift-eventsource/issues/75#issuecomment-2032533650

Although I have limited experience with Python's SSE clients, I have tested the sseclient-py package available at https://github.com/mpetazzoni/sseclient. Another notable Python SSE client is detailed at https://github.com/btubbs/sseclient, with a useful guide here: https://maxhalford.github.io/blog/flask-sse-no-deps/. I am not sure if Python's implementation is non-blocking, as I have not used it extensively. However, ideally it should be asynchronous to serve its purpose effectively. In Swift, I can confirm that it functions perfectly without blocking.
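As a rough illustration (not taken from the RunPod docs; the endpoint ID, API key, and model name are placeholders), this is roughly how sseclient-py can consume the same OpenAI-compatible chat completions endpoint:

# Sketch: consuming the OpenAI-compatible endpoint with a generic SSE client
# (sseclient-py, https://github.com/mpetazzoni/sseclient) instead of the openai library.
# ENDPOINT_ID, RUNPOD_API_KEY, and the model name are placeholders.
import json
import requests
import sseclient

url = "https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1/chat/completions"
headers = {"Authorization": "Bearer RUNPOD_API_KEY", "Accept": "text/event-stream"}
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Say hello"}],
    "stream": True,
}

response = requests.post(url, headers=headers, json=payload, stream=True)
client = sseclient.SSEClient(response)

# The SSE client yields one event per "data:" frame; the stream ends with "[DONE]".
for event in client.events():
    if event.data == "[DONE]":
        break
    chunk = json.loads(event.data)
    print(chunk["choices"][0]["delta"].get("content") or "", end="", flush=True)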
nerdylive
nerdylive2mo ago
I feel like that's already a nice way to demonstrate how to receive the stream, but is there some other way you can think of, like without using any library? The for loop already looks fine and doesn't seem to need updating, unless there is some other great alternative for the example.
houmie
houmie2mo ago
Sure, I don't mind. The ones above are SSE libraries; I just thought I'd point them out. But yeah, you can keep it as it is, just mention that SSE is supported, and maybe it would be beneficial to mention the difference between the OpenAI Base URL and Run/SyncRun. But it's really up to you.
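For completeness, here is a dependency-light sketch of what parsing the raw SSE stream could look like without a dedicated SSE library, using only requests (illustrative only; the endpoint ID, API key, and model name are placeholders):

# Sketch: parsing the SSE stream by hand with requests, no SSE library.
# ENDPOINT_ID, RUNPOD_API_KEY, and the model name are placeholders.
import json
import requests

url = "https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1/chat/completions"
headers = {"Authorization": "Bearer RUNPOD_API_KEY"}
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "Say hello"}],
    "stream": True,
}

with requests.post(url, headers=headers, json=payload, stream=True) as response:
    # Each SSE frame arrives as a line of the form "data: {...}".
    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content") or "", end="", flush=True)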
nerdylive
nerdylive2mo ago
Yeah that difference has to be further explained
Alpay Ariyak
Alpay Ariyak2mo ago
Please use the GitHub worker-vllm docs for now; the docs.runpod.io ones are being updated.