How to queue requests to vLLM pods?
Hi there, I run an AI chat site (https://www.hammerai.com) with ~100k users.
I was previously using vLLM serverless, but switched over to using dedicated Pods with the vLLM template (Container Image: vllm/vllm-openai:latest) because serverless was getting very expensive.
Currently I have three pods spun up and a Next.js API which uses the Vercel AI SDK to call one of the three pods (I just choose one of the three randomly). This works okay as a fake load balancer, but sometimes all of the pods are busy and the request fails with an error.
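For reference, here is roughly what my current setup looks like in the Next.js API route (a simplified sketch: the pod URLs and model name are placeholders, and the exact AI SDK method names depend on the SDK version):

```ts
import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';

// Placeholder pod URLs; each pod exposes the OpenAI-compatible /v1 API from vllm/vllm-openai.
const PODS = [
  'https://pod-1.example.com/v1',
  'https://pod-2.example.com/v1',
  'https://pod-3.example.com/v1',
];

export async function POST(req: Request) {
  const { messages } = await req.json();

  // "Fake load balancer": pick one of the three pods at random.
  const baseURL = PODS[Math.floor(Math.random() * PODS.length)];
  const vllm = createOpenAI({ baseURL, apiKey: 'not-used' });

  const result = await streamText({
    // Placeholder model id; vLLM serves whatever model the pod was started with.
    model: vllm('my-model'),
    messages,
  });

  return result.toTextStreamResponse();
}
```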
A few questions:
1. Is there any suggested way to handle queueing requests?
2. Is there any suggested way to distribute requests between pods? (There's a rough sketch of what I mean below these questions.)
3. Are there any nice libraries or example projects which show how to do this?
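To make questions 1 and 2 concrete, this is the kind of thing I'm wondering whether I need to build myself: tracking in-flight requests per pod and routing each new request to the least busy one. It's a rough, untested sketch and assumes a single API server instance:

```ts
// Rough, untested sketch of least-busy routing with a simple in-process counter.
// Assumes a single Next.js server instance; with multiple instances the counts
// would need to live somewhere shared (e.g. Redis), or in a real proxy instead.
const PODS = [
  'https://pod-1.example.com/v1',
  'https://pod-2.example.com/v1',
  'https://pod-3.example.com/v1',
];

const inFlight = new Map<string, number>();
for (const url of PODS) inFlight.set(url, 0);

function pickLeastBusyPod(): string {
  let best = PODS[0];
  for (const url of PODS) {
    if ((inFlight.get(url) ?? 0) < (inFlight.get(best) ?? 0)) best = url;
  }
  return best;
}

// Wrap a request so the in-flight count stays accurate even if the call throws.
export async function withPod<T>(fn: (baseURL: string) => Promise<T>): Promise<T> {
  const baseURL = pickLeastBusyPod();
  inFlight.set(baseURL, (inFlight.get(baseURL) ?? 0) + 1);
  try {
    return await fn(baseURL);
  } finally {
    inFlight.set(baseURL, (inFlight.get(baseURL) ?? 0) - 1);
  }
}
```

The API route would then call withPod((baseURL) => ...) instead of picking a random URL, but I'd rather use an established approach or library if one exists.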
Thank you for any help!
Model Maximum Context Length Error
Hi there, I run an AI chat site (https://www.hammerai.com). I was previously using vLLM serverless, but switched over to using dedicated Pods with the vLLM template (Container Image: vllm/vllm-openai:latest). Here is my configuration:
I then call it with: