RunPod
Created by j on 1/23/2025 in #⛅|pods
How to queue requests to vLLM pods?
Hi there, I run an AI chat site (https://www.hammerai.com) with ~100k users. I was previously using vLLM serverless, but switched over to dedicated Pods with the vLLM template (Container Image: vllm/vllm-openai:latest) because serverless was getting very expensive. Currently I have three pods spun up and a Next.js API route which uses the Vercel AI SDK to call one of the three pods (I just pick one of the three at random). This works okay as a makeshift load balancer, but sometimes the pods are all busy and requests fail with:
Error RetryError [AI_RetryError]: Failed after 3 attempts. Last error: Bad Gateway
    at _retryWithExponentialBackoff (/var/task/apps/web/.next/server/chunks/8499.js:5672:19)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async startStep (/var/task/apps/web/.next/server/chunks/8499.js:9353:171)
    at async fn (/var/task/apps/web/.next/server/chunks/8499.js:9427:99)
    at async /var/task/apps/web/.next/server/chunks/8499.js:5808:28
    at async POST (/var/task/apps/web/.next/server/app/api/cloud/chat/route.js:238:26)
    at async /var/task/apps/web/.next/server/chunks/9854.js:5600:37 {
  cause: undefined,
  reason: 'maxRetriesExceeded',
  errors: [
    APICallError [AI_APICallError]: Bad Gateway
A few questions:
1. Is there any suggested way to handle queueing requests?
2. Is there any suggested way to distribute requests between pods? (A rough sketch of what I'm imagining is below.)
3. Are there any nice libraries or example projects which show how to do this?
Thank you for any help!
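For context, here is a minimal sketch of the kind of client-side failover I have in mind (the pod URLs and API key are placeholders, not my real values):

// Placeholder pod endpoints -- swap in the real RunPod proxy URLs.
const POD_URLS = [
  'https://<pod-id-1>-8000.proxy.runpod.net/v1',
  'https://<pod-id-2>-8000.proxy.runpod.net/v1',
  'https://<pod-id-3>-8000.proxy.runpod.net/v1',
]

let cursor = 0

// Round-robin across the pods and fall through to the next one on an error,
// instead of retrying the same busy pod.
async function callWithFailover(body: unknown): Promise<Response> {
  for (let i = 0; i < POD_URLS.length; i++) {
    const url = POD_URLS[(cursor + i) % POD_URLS.length]
    try {
      const res = await fetch(`${url}/chat/completions`, {
        method: 'POST',
        headers: {'Content-Type': 'application/json', Authorization: 'Bearer <api-key>'},
        body: JSON.stringify(body),
      })
      if (res.ok) {
        cursor = (cursor + i + 1) % POD_URLS.length
        return res
      }
      // Non-2xx (e.g. 502 Bad Gateway) -- this pod is likely saturated, try the next one.
    } catch {
      // Network error -- also try the next pod.
    }
  }
  throw new Error('All pods failed or are busy')
}

This still has no real queue on my side, which is why I'm asking whether there is a more standard way to queue and balance requests across pods.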
3 replies
RunPod
Created by j on 1/22/2025 in #⛅|pods
Model Maximum Context Length Error
Hi there, I run an AI chat site (https://www.hammerai.com). I was previously using vLLM serverless, but switched over to dedicated Pods with the vLLM template (Container Image: vllm/vllm-openai:latest). Here is my configuration:
--host 0.0.0.0 --port 8000 --model LoneStriker/Fimbulvetr-11B-v2-AWQ --enforce-eager --gpu-memory-utilization 0.95 --api-key foo --max-model-len 4096 --max-seq-len-to-capture 4096 --trust-remote-code --chat-template "{{ (messages|selectattr('role', 'equalto', 'system')|list|last).content|trim if (messages|selectattr('role', 'equalto', 'system')|list) else '' }} {% for message in messages %} {% if message['role'] == 'user' %} ### Instruction: {{ message['content']|trim -}} {% if not loop.last %} {% endif %} {% elif message['role'] == 'assistant' %} ### Response: {{ message['content']|trim -}} {% if not loop.last %} {% endif %} {% elif message['role'] == 'user_context' %} ### Input: {{ message['content']|trim -}} {% if not loop.last %} {% endif %} {% endif %} {% endfor %} {% if add_generation_prompt and messages[-1]['role'] != 'assistant' %} ### Response: {% endif %}"
I then call it with:
import {convertToCoreMessages, streamText} from 'ai' // the vercel ai sdk
import type {NextRequest} from 'next/server'

export async function POST(req: NextRequest): Promise<Response> {
  ...
  // Depending on whether it is a chat or a completion, send `messages` or `prompt`:
  const response = await streamText({
    ...(generateChat
      ? {messages: convertToCoreMessages(generateChat.messages)}
      : {prompt: generateCompletion?.prompt}),
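The model itself points at the pod's OpenAI-compatible endpoint, roughly like this (placeholder URL, key, and maxTokens; assuming the @ai-sdk/openai provider):

import {createOpenAI} from '@ai-sdk/openai'

// Placeholder endpoint/key -- the real baseURL is the pod's proxy URL on port 8000,
// and the apiKey matches the --api-key flag passed to vLLM above.
const vllm = createOpenAI({
  baseURL: 'https://<pod-id>-8000.proxy.runpod.net/v1',
  apiKey: '<api-key>',
})

const response = await streamText({
  model: vllm('LoneStriker/Fimbulvetr-11B-v2-AWQ'),
  // ...messages/prompt spread as above...
  // Prompt tokens + maxTokens must stay under --max-model-len (4096), otherwise
  // vLLM rejects the request with a maximum context length error.
  maxTokens: 512,
})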
20 replies