TTL for vLLM endpoint
Is there a way to specify a TTL value when calling a vLLM endpoint via the OpenAI-compatible API?
You can set a timeout value on the endpoint.
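A rough sketch of what that could look like, assuming it means the executionTimeout execution policy sent with a regular request (field names per RunPod's execution-policies docs; the values here are illustrative):

```python
# Hypothetical request body for POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run
payload = {
    "input": {"prompt": "Hello"},             # illustrative worker input
    "policy": {"executionTimeout": 600000},   # max execution time, in milliseconds
}
```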
But that is the execution timeout. Time spent waiting in the queue does not count, as far as I can tell. What I'd like to achieve is to discard a task that has been sitting in the queue longer than its TTL. In my case there is a timeout on the caller's side, so the response from such a task will not be received anyway.
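For context, my caller looks roughly like this (endpoint ID, model name, and timeout value are placeholders); the timeout sits on the OpenAI client, which is why any response arriving later than that is useless to me:

```python
import os
from openai import OpenAI

ENDPOINT_ID = "your-endpoint-id"  # placeholder

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1",
    timeout=30.0,  # caller-side timeout in seconds; anything slower gets discarded
)

completion = client.chat.completions.create(
    model="your-model-name",  # placeholder for the model served by the worker
    messages=[{"role": "user", "content": "Hello"}],
)
print(completion.choices[0].message.content)
```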
I don't think you can time out based on time in the queue.
There is a policy.ttl parameter for regular tasks (https://docs.runpod.io/serverless/endpoints/send-requests#execution-policies), but not for the OpenAI-compatible API powered by vLLM (https://github.com/runpod-workers/worker-vllm).
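For a regular request, that policy goes next to input, roughly like this (a sketch; endpoint ID and values are illustrative, and ttl is in milliseconds per the docs):

```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run"

payload = {
    "input": {"prompt": "Hello"},  # illustrative worker input
    "policy": {"ttl": 60000},      # max time the job may sit in the queue, in milliseconds
}

resp = requests.post(
    url,
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    timeout=30,
)
print(resp.status_code, resp.json())
```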
When I use the https://api.runpod.ai/v2/{ID}/openai/v1 endpoint, OpenAI's input format is enforced, so I cannot pass policy there. Based on the worker-vllm code, it seems that at some point the (OpenAI-compatible) payload is wrapped in the input field so that the rest of the scheduling and handling can happen. I assume the capability to handle TTL is there; I just cannot figure out how to pass the config. Am I missing something?
Have you tried posting the data you are looking to pass?
Yes, I'm getting a 400 status and validation errors.
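Roughly what I tried (the field placement is just my guess; the extra policy key is what triggers the validation error):

```python
# Same POST as above, but against the OpenAI-compatible route:
# https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1/chat/completions
payload = {
    "model": "your-model-name",  # placeholder for the model served by the worker
    "messages": [{"role": "user", "content": "Hello"}],
    "policy": {"ttl": 60000},    # guessed placement; this is what gets rejected with 400
}
```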
Can you show the code for your handler?
Sure, I'm using the runpod's vLLM worker: https://github.com/runpod-workers/worker-vllm
If you do not modify the source code, you cannot pass any additional arguments.
After digging, I think it cannot be done even by modifying the vLLM worker's code. I've reached out to support to clarify.
Can't be done.