RunPod · 2mo ago
Orca234

Batch processing of chats

Processing a batch of chat completions. Hi, I'm new to RunPod and am trying to adapt my project to the serverless interface. The project works fine on AWS using offline vLLM inference via the langchain library. My understanding is that to use RunPod Serverless, I have to go through the OpenAI-compatible API. What I don't understand is how to do vLLM batch processing through this API, the way I do now with offline inference. The client.chat.completions.create() method seems to take only one chat (with multiple messages) at a time, not multiple independent chats (each consisting of multiple messages). The RunPod documentation also only covers the single-chat case. Is there a way, and if so how, to send a batch of chats at once? This matters for my pipeline because the prefix-caching-enabled option makes a big difference. Hope this question isn't too dumb, cheers.
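[For context, a minimal sketch of the usual pattern here, assuming the RunPod vLLM worker's OpenAI-compatible endpoint: the API accepts one chat per request, so a "batch" is sent as many concurrent requests, and vLLM's continuous batching merges the in-flight requests into shared GPU batches server-side. The endpoint ID, API key, and model name below are placeholders.]

```python
import asyncio
from openai import AsyncOpenAI

# Placeholders: substitute your RunPod endpoint ID, API key, and model.
client = AsyncOpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

# Multiple independent chats, each a list of messages.
chats = [
    [{"role": "user", "content": "Summarize document A."}],
    [{"role": "user", "content": "Summarize document B."}],
]

async def run_one(messages):
    # One request = one chat; the server batches concurrent requests.
    resp = await client.chat.completions.create(
        model="<model-name>",
        messages=messages,
    )
    return resp.choices[0].message.content

async def main():
    # Fire all chats at once so the endpoint can batch them together.
    results = await asyncio.gather(*(run_one(m) for m in chats))
    for r in results:
        print(r)

asyncio.run(main())
```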
3 Replies
3WaD · 2mo ago
Perhaps you're looking for vLLM continuous batching and the RunPod Concurrent Handler?
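[For reference, a minimal sketch of the concurrent-handler pattern those docs describe, assuming the runpod Python SDK: an async handler plus a concurrency_modifier lets one worker hold several jobs in flight, which is what gives vLLM concurrent requests to batch. The generate() call is a hypothetical stand-in for your model invocation.]

```python
import runpod

async def handler(job):
    # Each job carries one chat; awaiting the model call lets the
    # worker interleave other jobs while this one is on the GPU.
    messages = job["input"]["messages"]
    result = await generate(messages)  # hypothetical async vLLM call
    return result

def concurrency_modifier(current_concurrency):
    # Allow up to 8 jobs in flight on this worker; vLLM's continuous
    # batching then serves the concurrent requests together.
    return 8

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```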
Orca234 (OP) · 2mo ago
Fantastic resources, cheers!
nerdylive · 2mo ago
Have you found the code to queue up multiple messages like you described before?