Cannot stream OpenAI-compatible response out
I have the code below for streaming the response; the generator is working, but I cannot stream the response out:
import runpod
from collections.abc import Generator
from llama_cpp import Llama

llm = Llama(
    model_path="Phi-3-mini-4k-instruct-q4.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
)

class JobInput:
    def __init__(self, job):
        self.openai_route = job.get("openai_route")
        self.openai_input = job.get("openai_input", {})
        self.is_completion = "v1completions" in self.openai_route
        self.is_embedding = "embeddings" in self.openai_route
        self.embedding_format = self.openai_input.get("encoding_format", "unknown")
        self.is_chatcompletion = "chat" in self.openai_route

def infer(job_params):
    # llama-cpp-python does not accept the OpenAI "n" parameter, so drop it
    if "n" in job_params.openai_input:
        del job_params.openai_input["n"]
    if job_params.openai_route and job_params.is_embedding:
        # ErrorResponse is a pydantic model defined elsewhere in this worker
        yield [ErrorResponse(
            message="The embedding endpoint is not supported on this URL.",
            type="unsupported_endpoint",
            code=501,  # Not Implemented
        ).model_dump()]
    else:
        if job_params.openai_route and job_params.is_chatcompletion:
            llm_engine = llm.create_chat_completion
        else:
            llm_engine = llm.create_completion
        if not job_params.openai_input.get("stream", False):
            # non-streaming: a single completion dict
            yield llm_engine(**job_params.openai_input)
        else:
            # streaming: llama-cpp returns a generator of chunk dicts
            llm_op = llm_engine(**job_params.openai_input)
            yield llm_op

async def handler(event):
    inp = event["input"]
    job_input = JobInput(inp)
    for line in infer(job_input):
        if isinstance(line, Generator):
            for l in line:
                yield l
        else:
            yield line

if __name__ == "__main__":
    runpod.serverless.start({
        "handler": handler,
        "return_aggregate_stream": True,
    })
Need help to fix!
5 Replies
Use the OpenAI SDK if you are not already doing so.
Any directions on how to use it?
Oh, it looks like you are implementing it yourself. I suggest using RunPod's vLLM worker instead; it's also available as a template in the Explore section.
https://github.com/runpod-workers/worker-vllm
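Once it is deployed, you can stream from the endpoint with the plain OpenAI SDK. A minimal sketch (untested; ENDPOINT_ID, RUNPOD_API_KEY and the model name are placeholders, and it assumes the endpoint exposes RunPod's OpenAI-compatible /openai/v1 route, which worker-vllm does):

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
    api_key="RUNPOD_API_KEY",
)

# stream=True makes the SDK iterate over server-sent chunks
stream = client.chat.completions.create(
    model="YOUR_MODEL_NAME",  # whatever model the endpoint serves
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)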
But it is specifically for vLLM; I am trying to use llama-cpp.
Try reading the docs on how to stream with that.
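For llama-cpp-python, the streaming part roughly looks like this (untested sketch, reusing the model settings from the question). With stream=True, create_chat_completion returns a generator of OpenAI-style chunk dicts that your handler has to yield one by one:

from llama_cpp import Llama

llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", n_gpu_layers=-1, n_ctx=4096)

# stream=True turns the return value into a generator of chunk dicts
chunks = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in chunks:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:  # the first chunk only carries the role
        print(delta["content"], end="", flush=True)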
And is it compatible with OpenAI-compatible streaming?