TTL for vLLM endpoint
Is there a way to specify a TTL value when calling a vLLM endpoint via the OpenAI-compatible API?
You can set a timeout value on the endpoint.
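A rough sketch of what that could look like, assuming it means the executionTimeout execution policy sent with a regular request (field names per RunPod's execution-policies docs; the values here are illustrative):

```python
# Hypothetical request body for POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run
payload = {
    "input": {"prompt": "Hello"},             # illustrative worker input
    "policy": {"executionTimeout": 600000},   # max execution time, in milliseconds
}
```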
But that is the execution timeout. Time spent waiting in the queue does not count, as far as I can tell. What I'd like to achieve is to discard a task that has been sitting in the queue longer than its TTL. In my case there is a timeout on the caller's side, so the response from such a task will not be received anyway.
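For context, my caller looks roughly like this (endpoint ID, model name, and timeout value are placeholders); the timeout sits on the OpenAI client, which is why any response arriving later than that is useless to me:

```python
import os
from openai import OpenAI

ENDPOINT_ID = "your-endpoint-id"  # placeholder

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1",
    timeout=30.0,  # caller-side timeout in seconds; anything slower gets discarded
)

completion = client.chat.completions.create(
    model="your-model-name",  # placeholder for the model served by the worker
    messages=[{"role": "user", "content": "Hello"}],
)
print(completion.choices[0].message.content)
```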
I don't think you can time out based on time in the queue.
There is a policy.ttl parameter for regular tasks (https://docs.runpod.io/serverless/endpoints/send-requests#execution-policies), but not for the OpenAI-compatible API powered by vLLM (https://github.com/runpod-workers/worker-vllm).
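For a regular request, that policy goes next to input, roughly like this (a sketch; endpoint ID and values are illustrative, and ttl is in milliseconds per the docs):

```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run"

payload = {
    "input": {"prompt": "Hello"},  # illustrative worker input
    "policy": {"ttl": 60000},      # max time the job may sit in the queue, in milliseconds
}

resp = requests.post(
    url,
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    timeout=30,
)
print(resp.status_code, resp.json())
```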
When I use the https://api.runpod.ai/v2/{ID}/openai/v1 endpoint, OpenAI's input format is enforced, so I cannot pass policy there. Based on the worker-vllm code, it seems that at some point the (OpenAI-compatible) payload is wrapped in the input field so that the rest of the scheduling and handling can happen. I assume the capability to handle TTL is there; I just cannot figure out how to pass the config. Am I missing something?
Have you tried posting the data you are looking to pass?
Yes, I'm getting a 400 status and validation errors.
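Roughly what I tried (the field placement is just my guess; the extra policy key is what triggers the validation error):

```python
# Same POST as above, but against the OpenAI-compatible route:
# https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1/chat/completions
payload = {
    "model": "your-model-name",  # placeholder for the model served by the worker
    "messages": [{"role": "user", "content": "Hello"}],
    "policy": {"ttl": 60000},    # guessed placement; this is what gets rejected with 400
}
```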
Can you show the code for your handler?
Sure, I'm using the runpod's vLLM worker: https://github.com/runpod-workers/worker-vllm
If you do not modify the source code, you cannot pass any additional arguments.
After digging, I think it cannot be done even by modifying the vLLM worker's code. I've reached out to support to clarify.
Can't be done.