Issue with worker-vllm and multiple workers
I'm using the previous version of worker-vllm (https://github.com/runpod-workers/worker-vllm/tree/4f792062aaea02c526ee906979925b447811ef48). There is an issue when more than one worker is running: since vLLM has an internal queue, all requests are immediately passed to the first worker, and the second worker doesn't receive any requests. Is it possible to solve this? I've tried the new version of worker-vllm, but it has some other issues. Thanks!
9 Replies
Did you open an issue in the repo? We are going to get that resolved for the new worker.
As for your current problem, is the single worker unable to handle the requests?
@propback
You may set the environment variable MAX_CONCURRENCY, which controls how many jobs each worker can take at a time before requests are sent to the next worker.
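For reference, here is a minimal sketch of how a worker can honor that limit through the RunPod Python SDK's concurrency hook; the default value and the handler body are placeholders, and the exact wiring inside worker-vllm may differ:

```python
import os

import runpod

# Read the per-worker job limit from the environment; the default of 1 here
# is only an illustrative assumption, not worker-vllm's actual default.
MAX_CONCURRENCY = int(os.environ.get("MAX_CONCURRENCY", "1"))


async def handler(job):
    # Placeholder for the real vLLM generation logic.
    prompt = job["input"].get("prompt", "")
    return {"output": f"echo: {prompt}"}


def concurrency_modifier(current_concurrency: int) -> int:
    # Tell the SDK how many jobs this worker may hold at once; anything
    # beyond this should stay in the endpoint queue for other workers.
    return MAX_CONCURRENCY


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```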
Hey!
Yes, I have opened an issue in the repo: https://github.com/runpod-workers/worker-vllm/issues/22
Nope, it can't 😦
GitHub · Sampling parameter "stop" doesn't work with the new worker-vllm · Issue #22 · runpod-workers/worker-vllm (issue preview)
It's probably related to the new worker, right? I asked about the previous one.
Fixed this issue and bumped to vllm version 0.2.6, will be merging into main soon
Thanks!
Is it possible to use a different version of vLLM, e.g. 0.2.2?
I believe changing https://github.com/runpod/[email protected]#egg=vllm; in the Dockerfile to https://github.com/runpod/[email protected]#egg=vllm should work?
Fixed in the latest version. The only thing you can't do atm is build from a machine without GPUs.
Hey @Justin and @Alpay Ariyak! I just tried the latest version of worker-vllm, and there's still an issue with concurrent requests: MAX_CONCURRENCY doesn't seem to work. See here: https://github.com/runpod-workers/worker-vllm/issues/36
GitHub · MAX_CONCURRENCY parameter doesn't work · Issue #36 · runpod-workers/worker-vllm (issue preview) — Current behaviour: when sending multiple requests with a short interval (e.g. 1 second) to the endpoint with 1 worker enabled, all the requests skip the queue and are passed to that worker.
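For anyone trying to reproduce this, here is a minimal sketch of the kind of test described in the issue; it assumes the standard RunPod /run endpoint, and the endpoint ID, API key, and prompt payload are placeholders:

```python
import os
import time

import requests

# Placeholders: substitute your own endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]
RUN_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Fire several async requests about one second apart, then check in the
# RunPod console whether they stay queued or all land on a single worker.
for i in range(5):
    resp = requests.post(
        RUN_URL,
        headers=HEADERS,
        json={"input": {"prompt": f"Request number {i}"}},
        timeout=30,
    )
    print(i, resp.status_code, resp.json().get("id"))
    time.sleep(1)
```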
This has now been resolved in the latest vLLM version we released.