How to queue requests to vLLM pods?

j · 6d ago

Hi there, I run an AI chat site (https://www.hammerai.com) with ~100k users. I was previously using vLLM serverless, but switched to dedicated Pods with the vLLM template (Container Image: vllm/vllm-openai:latest) because serverless was getting very expensive. Currently I have three pods spun up and a Next.js API that uses the Vercel ai SDK to call one of the three pods, chosen at random (see the sketch below). This works okay as a fake load balancer, but sometimes all the pods are busy and requests fail with:
Error RetryError [AI_RetryError]: Failed after 3 attempts. Last error: Bad Gateway
    at _retryWithExponentialBackoff (/var/task/apps/web/.next/server/chunks/8499.js:5672:19)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async startStep (/var/task/apps/web/.next/server/chunks/8499.js:9353:171)
    at async fn (/var/task/apps/web/.next/server/chunks/8499.js:9427:99)
    at async /var/task/apps/web/.next/server/chunks/8499.js:5808:28
    at async POST (/var/task/apps/web/.next/server/app/api/cloud/chat/route.js:238:26)
    at async /var/task/apps/web/.next/server/chunks/9854.js:5600:37 {
  cause: undefined,
  reason: 'maxRetriesExceeded',
  errors: [
    APICallError [AI_APICallError]: Bad Gateway
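For reference, the route handler looks roughly like this (a minimal sketch: the pod URLs and model name are placeholders, and the API shapes follow recent versions of the ai SDK):

import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';

// Placeholder pod URLs -- the vLLM template exposes an OpenAI-compatible /v1 API.
const PODS = [
  'https://POD_ID_A-8000.proxy.runpod.net/v1',
  'https://POD_ID_B-8000.proxy.runpod.net/v1',
  'https://POD_ID_C-8000.proxy.runpod.net/v1',
];

export async function POST(req: Request) {
  const { messages } = await req.json();

  // "Fake load balancer": pick one of the three pods at random. If the
  // chosen pod is saturated, the request 502s even if another pod is idle.
  const baseURL = PODS[Math.floor(Math.random() * PODS.length)];
  const vllm = createOpenAI({ baseURL, apiKey: 'EMPTY' });

  const result = streamText({
    model: vllm('MODEL_NAME'), // whatever model the pods are serving
    messages,
  });
  return result.toTextStreamResponse();
}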
A few questions:
1. Is there any suggested way to handle queueing requests?
2. Is there any suggested way to distribute requests between pods?
3. Are there any nice libraries or example projects that show how to do this?
Thank you for any help!
nerdylive · 5d ago
3. API gateways for LLMs, I guess. I don't have any specific recommendations, but I'm sure you can search for similar projects. 2. Load balancing: weight pods by in-flight request count, or by request length combined with request count (a sketch of this is below). I don't really know what I'm talking about here, so make sure to do more research on this.
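To make the load-balancing idea concrete, here is a minimal in-process sketch combining least-in-flight pod selection (question 2) with a wait queue once every pod hits a concurrency cap (question 1). It assumes a single long-lived Next.js server process -- on serverless or multi-instance deployments the counters would need to live in shared state such as Redis -- and the pod URLs and cap are placeholders:

// Least-connections balancer with a FIFO wait queue, kept in module scope.
type Pod = { baseURL: string; inFlight: number };

const pods: Pod[] = [
  { baseURL: 'https://POD_ID_A-8000.proxy.runpod.net/v1', inFlight: 0 },
  { baseURL: 'https://POD_ID_B-8000.proxy.runpod.net/v1', inFlight: 0 },
  { baseURL: 'https://POD_ID_C-8000.proxy.runpod.net/v1', inFlight: 0 },
];

// Placeholder cap: tune to how many concurrent requests one pod handles
// before it starts returning 502s.
const MAX_PER_POD = 8;

// Requests waiting for a free slot, woken in FIFO order.
const waiters: Array<() => void> = [];

function leastLoaded(): Pod {
  // Pick the pod with the fewest in-flight requests.
  return pods.reduce((a, b) => (a.inFlight <= b.inFlight ? a : b));
}

export async function acquirePod(): Promise<Pod> {
  for (;;) {
    const pod = leastLoaded();
    if (pod.inFlight < MAX_PER_POD) {
      pod.inFlight++;
      return pod;
    }
    // Every pod is at its cap: queue instead of failing with Bad Gateway.
    await new Promise<void>((resolve) => waiters.push(resolve));
  }
}

export function releasePod(pod: Pod): void {
  pod.inFlight--;
  waiters.shift()?.(); // hand the freed slot to the oldest waiter, if any
}

In the route handler you would wrap each call as `const pod = await acquirePod()` and call `releasePod(pod)` in a `finally` block; for streamed responses, release in the stream's completion callback rather than immediately, or the slot frees before the pod has actually finished. Note that vLLM also queues requests internally, so the cap here is mainly about keeping each pod's backlog short.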
