•Created by Orca234 on 10/21/2024 in #⚡|serverless
Batch processing of chats
Processing a batch of chat completions.
Hi,
I am new to RunPod and am trying to adapt my project so I can use it with the serverless interface. My project works fine on AWS using offline vLLM inference via the langchain library. My understanding is that to use RunPod Serverless, I will have to use the OpenAI interface.

What I don't understand is how to implement vLLM batch processing, like I do now with offline inference, using this API. The client.chat.completions.create() method seems to only take one chat (with multiple messages) at a time, not multiple independent chats (each consisting of multiple messages). The RunPod documentation also only covers the single-chat case. Is there a way to send a batch of chats at once, and if so, how? This matters a lot for my logic, since enabling prefix caching makes a big difference.
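In case it helps, here is roughly what my current loop looks like against the RunPod OpenAI-compatible endpoint; the endpoint ID, API key, and model name are placeholders, and the chats are just toy examples:

```python
from openai import OpenAI

# RunPod serverless exposes an OpenAI-compatible endpoint at this URL
# pattern; <ENDPOINT_ID> and <RUNPOD_API_KEY> are placeholders.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

# Several independent chats, each a list of messages.
chats = [
    [{"role": "user", "content": "First independent chat..."}],
    [{"role": "user", "content": "Second independent chat..."}],
]

# This sends one chat per request -- what I'd like instead is to
# submit the whole list as a single batch, like offline vLLM does.
for messages in chats:
    response = client.chat.completions.create(
        model="<MODEL_NAME>",
        messages=messages,
    )
    print(response.choices[0].message.content)
```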
Hope this question is not too dumb, cheers.