Worker handling multiple requests concurrently
I have an application where a single worker can handle multiple requests concurrently.
I can't find a way to allow this in RunPod serverless. Multiple requests are always queued when using a single worker. Is this possible?
9 Replies
You can search here; we have answered this multiple times. Also use #🤖|ask-ai, it should be able to answer it.
Thanks @flash-singh. I did search, but it didn't return any results.
Tried different keywords, and now I found one post that points me towards this: https://github.com/runpod-workers/worker-vllm/blob/main/src/handler.py
So I guess the magic bit is the `concurrency_modifier` argument passed to `runpod.serverless.start`. FYI, this argument isn't documented anywhere in the runpod.io docs, at least I couldn't find it.
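For anyone finding this later, here's a minimal sketch of the pattern, based on the worker-vllm handler linked above and assuming the Python SDK; the handler body and the `MAX_CONCURRENCY` value are placeholders, not RunPod defaults:

```python
import runpod

# Placeholder cap; tune to what your model/server can actually handle.
MAX_CONCURRENCY = 4

async def handler(job):
    # Process a single job. With an async handler, the worker can
    # interleave several jobs while each one awaits I/O.
    prompt = job["input"].get("prompt", "")
    return {"output": f"processed: {prompt}"}

def concurrency_modifier(current_concurrency: int) -> int:
    # Called by the SDK to decide how many jobs this worker may take
    # at once; returning a value greater than 1 enables concurrency.
    return MAX_CONCURRENCY

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```

Note that the handler needs to be async (or otherwise non-blocking) for concurrent jobs to actually overlap.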
It would be useful to have it documented, I agree.
Yes, that's it. @Justin let's document this in the serverless docs and git.
Is it possible? One worker handling more than one request concurrently?
Yes, he shared the link to a worker which uses that.
Got it on the backlog, will work with @PatrickR to get this implemented.
It also seems that `concurrency_modifier` doesn't work in this example. Please see this issue: https://github.com/runpod-workers/worker-vllm/issues/36 ("MAX_CONCURRENCY parameter doesn't work": when sending multiple requests with a short interval, e.g. 1 second, to an endpoint with 1 worker enabled, all the requests skip the queue and are passed to the worker.)
Justin, is this documented yet, please? I mean the way to have one worker handle more than one request concurrently.