Serverless vLLM concurrency issue
Hello everyone, I deployed a serverless vLLM endpoint (Gemma 12B model) through the RunPod UI, with 2 workers on A100 80GB VRAM.
If I send two requests at the same time, they both become IN PROGRESS, but I receive the output stream of one first; the second always waits for the first to finish before I start receiving its token stream. Why is it behaving like this?
47 Replies
Because it's processing them using batching. Is it the same worker?
Are both requests in the same worker? If yes, then a possible explanation is that one gets processed first, then the second one, and your GPU isn't fast enough to bring the delay between them down to close to 0 ms
Maybe your application is also configured to optimize for certain things, so it processes the way it does (one request, then another) across multiple stages of processing. Can't provide technical details right now, but that's my guess
@Jason it is the same worker, is there any way I can make it respond to both of them at the same time?
you can't without changing code
because of the nature of LLMs
if you give them the same input, the output may vary in length, which affects generation time
You can use faster GPUs to reduce the delay
Or set the batch size to 1, so it just spawns a new worker, which will eventually process in parallel, but I think that's a waste of resources
By setting ENV VARIABLE:
BATCH_SIZE = 1
I am using an A100 with 80GB VRAM and it is supposed to be very fast!
before, I used to deploy the same model on an A100 40GB VRAM on GCP with vLLM and it had no problem handling concurrent requests
DEFAULT_BATCH_SIZE or BATCH_SIZE ?
Same request, same prompt, same configurations?
I was talking about BATCH_SIZE, why?
yes same everything
Can you quantify how slow it is compared to hosting on GCP?
Like the stream delay between the first and second request, on RunPod vs GCP
my issue is not really the speed, the speed is decent when there is no cold start; my issue is handling more than one request at the same time
How long is the delay for the output stream in the second one vs the first one?
Yes, I'm asking about the problem you described, the delay between the first and second request streams, isn't it?
yes
first request starts streaming, second request from another client always starts after the first one finishes
with two workers?
I'll do some benchmarks and provide you with the numbers
2 and 3
tried both
can you check the vLLM logs
they should show metrics like
currently running reqs, waiting reqs
etc
and tok/s
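something like this (the exact wording and fields differ between vLLM versions, so this is just roughly what to look for, not the exact format):
```
INFO ... metrics.py ... Avg prompt throughput: 120.4 tokens/s, Avg generation throughput: 43.1 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.5%
```
if Running stays at 1 while the other request sits in Waiting (or in the endpoint queue), the worker isn't batching them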
With the vLLM worker, the batch size is usually more than 1, so one worker can handle multiple requests
do we need to set batch size with vllm workers?
I don't know why your endpoint is set to run multiple workers for only 2 requests
vLLM intelligently does batching until its KV cache is full
No it's the default that I was talking about
In the endpoint's vLLM worker image
It's 300 reqs batched by default
Yup, and this is about the endpoint's requests, not vLLM, but that's why RunPod configured batching for the endpoint too
Can you screenshot your whole Edit Endpoint details @Abdelrhman Nile
no, I mean I was configuring the endpoint to scale up to multiple workers if needed
logs when sending 2 requests


right now it is configured to only have one worker
try setting the default batch size to 10
I am setting the default batch size to 1 because I noticed streaming used to send very big chunks of tokens
Oh.. It's because
lol
Wait, is it just one request?
I tried it with 50 and 256
that setting means only 1 request should be processed concurrently
In your logs?
If you just remove the batch size thingy, do they get processed in the same worker?
same behavior of not handling multiple requests with default batch size set to 50 and 256
And maybe there is some overhead from RunPod's queue system
When you send a request it goes through RunPod first, then to the worker
Might introduce a bit of delay even if everything is right
no no
sorry for the misinformation
but both requests' statuses appear as IN PROGRESS
it's the batch size for streaming tokens
This is the real one, but you didn't set it, so it should be fine

are you sure? I tried it with 5, 10, 50, 256 and I got the same behaviour
but let me try it one more time to confirm
uhh I mean it doesn't matter if you set it to 5 / 10 / etc
Oh ya.. forgot about the name, I meant this one
So maybe it's this
because it is related to token streaming, not the actual requests
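i.e. it only controls how many tokens get grouped into each chunk you receive back, not how many requests run in parallel. You can see that if you poll the job's stream yourself, roughly like this (a rough sketch; the /run and /stream routes are the standard serverless ones, but field names and the exact input payload shape are from memory and depend on the worker version, and ENDPOINT_ID / API_KEY are placeholders):
```python
import time
import requests

API_KEY = "<RUNPOD_API_KEY>"      # placeholder
ENDPOINT_ID = "<ENDPOINT_ID>"     # placeholder
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# submit one job (payload shape depends on the worker version)
job = requests.post(f"{BASE}/run", headers=HEADERS,
                    json={"input": {"prompt": "Hello", "stream": True}}).json()

# poll the stream: each poll returns a batch of chunks; a bigger
# DEFAULT_BATCH_SIZE just means each chunk carries more tokens at once
while True:
    r = requests.get(f"{BASE}/stream/{job['id']}", headers=HEADERS).json()
    for chunk in r.get("stream", []):
        print(len(str(chunk.get("output", ""))), "chars in this chunk")
    if r.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(0.5)
```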
@Abdelrhman Nile maybe can you try spamming requests? like 50+?
😅😅
set the max workers to 1 and then
spam requests
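something like this would do it (a rough sketch: it assumes the endpoint's OpenAI-compatible route at .../openai/v1 is enabled, and the endpoint ID, API key and model name are placeholders). If the worker really batches, the time-to-first-token of all requests should be close together instead of serialized:
```python
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# placeholders - fill in your own endpoint ID, API key and model name
client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

def one_request(i: int) -> None:
    t0 = time.time()
    stream = client.chat.completions.create(
        model="<MODEL_NAME>",
        messages=[{"role": "user", "content": f"Write a short poem #{i}"}],
        max_tokens=128,
        stream=True,
    )
    first = None
    for chunk in stream:
        if first is None and chunk.choices and chunk.choices[0].delta.content:
            first = time.time() - t0  # time to first token
    total = time.time() - t0
    print(f"req {i}: first token after {first or total:.1f}s, done after {total:.1f}s")

# fire 50 requests at once and watch whether the streams overlap
with ThreadPoolExecutor(max_workers=50) as pool:
    list(pool.map(one_request, range(50)))
```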
I kinda did that with the vLLM benchmark serving script, let me share the results with you
configuration was max workers = 3
and I was NOT setting the default batch size, it was left at the default, which I believe is 50
also the script sent 1000 requests
only 857 were successful
same model, same benchmark, but on a GCP A100 40GB VRAM machine
will test that
When you initialize the vLLM engine (on cold start) you should see a log similar to this:
Maximum concurrency for 32768 tokens per request: 5.42x
as part of vLLM's memory profiling. Make sure that the engine can achieve concurrency > 2.
That being said, the official RunPod vLLM image unfortunately does not handle concurrency dynamically (it's hardcoded to 300 or a static value), which will bottleneck the jobs anyway. But it's definitely possible to stream multiple responses concurrently from a single serverless worker. Or at least it's working in my implementation.
actually that doesn't matter, you can batch even if you have less than 2x concurrency, as long as the requests fit in the KV cache
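the "Maximum concurrency ... 5.42x" line is basically KV-cache capacity divided by max_model_len, so you can sanity-check how much cache two requests actually need. A rough back-of-envelope using the numbers from the log line above (the ~2,000 tokens per request is just an assumed figure for a short chat request):
```python
# back-of-envelope from the "Maximum concurrency for 32768 tokens per request: 5.42x" log line
max_model_len = 32768                 # tokens per request the engine planned for
reported_concurrency = 5.42           # the "5.42x" from the log

kv_cache_tokens = reported_concurrency * max_model_len   # ~177k tokens of KV cache
print(f"KV cache fits roughly {kv_cache_tokens:,.0f} tokens")

# two chat requests of ~2,000 tokens each (prompt + generated) -- an assumption
tokens_in_flight = 2 * 2000
print(f"two such requests use about {tokens_in_flight / kv_cache_tokens:.1%} of the cache")
# -> a couple of percent, so the cache is nowhere near the limit
```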
anyways he has enough cache (the requests don't even use 5 percent of the cache)
idk why it doesn't work either, because everything looks right