RunPod · 2mo ago
Orca234

Batch processing of chats

Processing a batch of chat completions. Hi, I'm new to RunPod and am trying to adapt my project to the serverless interface. The project works fine on AWS using offline vLLM inference via the langchain library. My understanding is that to use RunPod Serverless, I have to go through the OpenAI-compatible API. What I don't understand is how to do vLLM batch processing through this API, the way I do now with offline inference. The client.chat.completions.create() method seems to take only one chat (with multiple messages) at a time, not multiple independent chats (each consisting of multiple messages). The RunPod documentation also only covers the single-chat case. Is there a way, and if so how, to send a batch of chats at once? This matters for my pipeline because the prefix-caching-enabled option makes a big difference. Hope this question isn't too dumb, cheers.
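[For context, a minimal sketch of the usual pattern here, assuming the RunPod vLLM worker's OpenAI-compatible endpoint: the API accepts one chat per request, so a "batch" is sent as many concurrent requests, and vLLM's continuous batching merges the in-flight requests into shared GPU batches server-side. The endpoint ID, API key, and model name below are placeholders.]

```python
import asyncio
from openai import AsyncOpenAI

# Placeholders: substitute your RunPod endpoint ID, API key, and model.
client = AsyncOpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

# Multiple independent chats, each a list of messages.
chats = [
    [{"role": "user", "content": "Summarize document A."}],
    [{"role": "user", "content": "Summarize document B."}],
]

async def run_one(messages):
    # One request = one chat; the server batches concurrent requests.
    resp = await client.chat.completions.create(
        model="<model-name>",
        messages=messages,
    )
    return resp.choices[0].message.content

async def main():
    # Fire all chats at once so the endpoint can batch them together.
    results = await asyncio.gather(*(run_one(m) for m in chats))
    for r in results:
        print(r)

asyncio.run(main())
```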
3 Replies
3WaD · 2mo ago
Perhaps you're looking for vLLM continuous batching and the RunPod Concurrent Handler?
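[For reference, a minimal sketch of the concurrent-handler pattern those docs describe, assuming the runpod Python SDK: an async handler plus a concurrency_modifier lets one worker hold several jobs in flight, which is what gives vLLM concurrent requests to batch. The generate() call is a hypothetical stand-in for your model invocation.]

```python
import runpod

async def handler(job):
    # Each job carries one chat; awaiting the model call lets the
    # worker interleave other jobs while this one is on the GPU.
    messages = job["input"]["messages"]
    result = await generate(messages)  # hypothetical async vLLM call
    return result

def concurrency_modifier(current_concurrency):
    # Allow up to 8 jobs in flight on this worker; vLLM's continuous
    # batching then serves the concurrent requests together.
    return 8

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```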
Orca234 (OP) · 2mo ago
Fantastic resources, cheers!
nerdylive · 2mo ago
Have you found the code to queue up multiple messages like you described before?