How to optimize batch processing performance?

Use serverless to deploy the Qwen/Qwen2-7B model.
GPU: Nvidia A40 48G
Environment variables:
MODEL_NAME=Qwen/Qwen2-7B
HF_TOKEN=xxx
ENABLE_LORA=True
LORA_MODULES={"name": "cn_writer", "path": "{huggingface_model_name}", "base_model_name": "Qwen/Qwen2-7B"}
MAX_LORA_RANK=64
MIN_BATCH_SIZE=384
ENABLE_PREFIX_CACHING=1
My problem: batch processing takes too long, 3-4 times the time of a single request. How can I reduce the time consumption of this batch processing? My code is in the attachment.
Observed behaviour: a batch of 64 requests takes 4 times as long as a single request. What I expect to learn is how to make the time of a 64-request batch close to the time of a single request.
Jason
Jason4w ago
does this happen too when you use the OpenAI SDK? set the url to RunPod's endpoint
url = f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1"
try 32 batch only, what time do you get? do you get
Sequence group X is preempted due to insufficient KV cache space
too? maybe with 64 batch on that gpu it's fully utilized, so it might experience more latency / slowdowns. try setting ENABLE_CHUNKED_PREFILL to true in your env, tell me how it goes after you try it bro
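For reference, a minimal sketch of the SDK suggestion above, assuming the RunPod vLLM worker's OpenAI-compatible route and using placeholders for the endpoint ID and API key:

```python
# Minimal sketch: point the OpenAI SDK at the RunPod serverless endpoint's
# OpenAI-compatible route. ENDPOINT_ID and RUNPOD_API_KEY are placeholders.
from openai import OpenAI

ENDPOINT_ID = "your_endpoint_id"        # placeholder
RUNPOD_API_KEY = "your_runpod_api_key"  # placeholder

client = OpenAI(
    base_url=f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1",
    api_key=RUNPOD_API_KEY,
)

# Qwen2-7B is a base model, so the plain completions route is used here.
resp = client.completions.create(
    model="Qwen/Qwen2-7B",
    prompt="Write a short product description.",
    max_tokens=256,
)
print(resp.choices[0].text)
```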
柠檬板烧鸡
柠檬板烧鸡OP4w ago
I didn't use the OpenAI SDK; my request code was generated by Postman. Does the OpenAI SDK offer any optimization in this regard?
【try 32 batch only, what happens with the time】 There is not much fluctuation; the time consumption increases with the batch size.
【do you get Sequence group X is preempted due to insufficient KV cache space too?】 I don't quite understand this sentence.
Added ENABLE_CHUNKED_PREFILL=true and MAX_NUM_BATCHED_TOKENS=4000. The time for 64 and 32 batches is still much longer than for a single batch.
ENABLE_CHUNKED_PREFILL=true and MAX_NUM_BATCHED_TOKENS=4000:
1 sample time: 11.15 s
16 samples time: 22.64 s
32 samples time: 21.78 s
64 samples time: 35.29 s
I am trying to keep increasing the value of MAX_NUM_BATCHED_TOKENS.
MAX_NUM_BATCHED_TOKENS=8000:
1 sample time: 11.02 s
16 samples time: 22.84 s
32 samples time: 23.52 s
64 samples time: 45.85 s
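For context, a rough benchmark sketch showing one way to time these batches so that all N requests are actually in flight at the same time (the server can only batch requests that overlap); the endpoint URL, API key, model name, and prompt are placeholders:

```python
# Rough benchmark sketch: fire N requests concurrently and time the whole
# batch, so the server has a chance to batch them together. Placeholders only.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",  # placeholder
    api_key="<runpod_api_key>",                                   # placeholder
)

def one_request(i: int) -> str:
    resp = client.completions.create(
        model="Qwen/Qwen2-7B",
        prompt=f"Sample #{i}: write a short paragraph.",
        max_tokens=256,
    )
    return resp.choices[0].text

for n in (1, 16, 32, 64):
    start = time.time()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(one_request, range(n)))
    print(f"{n} samples: {time.time() - start:.2f} s")
```

If the requests were instead sent one after another, the measured batch time would grow linearly with N regardless of any server-side tuning.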
Jason
Jason4w ago
I don't think there's any difference in optimization, just neater code, so that's fine. Yes, I think it's normal for 64 to take longer, since the GPU is more loaded and strained the higher you go. Check your worker logs, maybe share them here? The full log. Why do you think it's a problem, that it shouldn't be like that?
柠檬板烧鸡
柠檬板烧鸡OP4w ago
This is the serverless worker's log.
Jason
Jason4w ago
Ok thanks, but what's the problem with the time? More batch size = more time, isn't it? Especially if your GPU has small VRAM compared to the usage.
柠檬板烧鸡
柠檬板烧鸡OP4w ago
I think the time it takes to process 64 batches should be close to the time it takes to process a single batch; currently, 64 batches take too long. If I switch to a better GPU, can I achieve that goal?
Jason
Jason4w ago
I see. Yeah, you can try that: clone the endpoint or set a new GPU.
柠檬板烧鸡
柠檬板烧鸡OP4w ago
OK, thanks, but serverless can only set up RTX A6000 and A40, and there is no way to choose a better GPU.
Jason
Jason4w ago
More vram
柠檬板烧鸡
柠檬板烧鸡OP4w ago
select this?
柠檬板烧鸡
柠檬板烧鸡OP4w ago
(image attachment)
Jason
Jason4w ago
I'm not sure how to make it faster; maybe more VRAM, or you can try other env variables (vLLM config). Yeah, check the others.
柠檬板烧鸡
柠檬板烧鸡OP4w ago
OK, thank you very much. With a 141GB GPU and the same configuration, the test results are as follows:
1 sample time: 2.95 s
16 samples time: 7.50 s
32 samples time: 8.59 s
64 samples time: 12.02 s
Compared with 48GB it reduces the time consumption, but 64 samples still take much longer than 1 sample.
Jason
Jason4w ago
i see, yeah the gap seems to close up on bigger gpus. bigger gpus can use bigger batch sizes
柠檬板烧鸡
柠檬板烧鸡OP4w ago
Yes, the overall speed is improved, but there is still a large gap between 64 samples and 1 sample, so it seems 【How to optimize batch processing performance】 cannot be solved just by switching to a better GPU.
Jason
Jason4w ago
hmm, of course there are gaps. may i know where you found the idea that the 64-batch time should be close to the single-batch time? like here it's 2.95 seconds for a single batch, so how many seconds are you hoping the 64 batch to take? any references for that? im curious, but i think i might not be able to provide solutions 🙂
柠檬板烧鸡
柠檬板烧鸡OP4w ago
Our technical team leader told me this. I remained skeptical and am verifying this 【How to optimize batch processing performance】 claim myself. If a single batch is 2.95 s, then I expect the time for 64 batches to be in the range of 2.0~3.95 s.
Jason
Jason4w ago
i see, and any references for that hope/expectation, or why do you think so?
柠檬板烧鸡
柠檬板烧鸡OP4w ago
My idea is to upgrade the GPU, modify the RunPod environment variables, and modify the vLLM environment variables, then test the time consumption. If the time of 64 samplings cannot get close to the time of 1 sampling, then what my technical team leader said is wrong. Our technical team leader did not provide reference materials, but another colleague of mine is looking for relevant papers for verification. If the time of 64 samplings is close to the time of 1 sampling, then multiple samplings can be used to select the optimal result, which can improve product quality. Our technical team leader's view is that 【there should only be a millisecond difference between the time of 64 samplings and the time of 1 sampling】. The tests I am doing now are to verify that view: if it holds, great; if not, I have to come up with a conclusion that convinces him.
Jason
Jason4w ago
i see yeah good luck on that too!
柠檬板烧鸡
柠檬板烧鸡OP4w ago
thanks
Jason
Jason4w ago
but also if you can, please share the result here i wanna know if its true or not 🙂
riverfog7
riverfog74w ago
low batch sizes are limited by VRAM bandwidth and high batch sizes are limited by core compute. low batch: low throughput, low latency. high batch: high throughput, high latency. you have to find the middle ground
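A rough back-of-the-envelope sketch of why this happens, assuming an A40 (roughly 696 GB/s memory bandwidth, roughly 150 TFLOPS FP16) and a 7B-parameter model in FP16; all numbers are approximations for illustration only, not measurements:

```python
# Back-of-the-envelope decode-step estimate for a 7B FP16 model on an A40.
# All numbers are rough approximations for illustration.
WEIGHT_BYTES = 7e9 * 2   # ~14 GB of FP16 weights
MEM_BW = 696e9           # A40 memory bandwidth, bytes/s (approx.)
PEAK_FLOPS = 150e12      # A40 FP16 tensor throughput, FLOP/s (approx.)

def decode_step_time(batch_size: int) -> float:
    # Per decode step the weights are read once regardless of batch size,
    # while compute grows roughly linearly with batch size (~2 FLOPs per
    # weight per sequence). The step takes whichever bound is larger.
    t_memory = WEIGHT_BYTES / MEM_BW
    t_compute = 2 * 7e9 * batch_size / PEAK_FLOPS
    return max(t_memory, t_compute)

for b in (1, 16, 32, 64):
    t = decode_step_time(b)
    print(f"batch {b:>2}: ~{t*1e3:.1f} ms/step, ~{b / t:,.0f} tok/s aggregate")
```

Under these rough assumptions the decode phase stays memory-bound well past batch 64, so its per-step cost barely changes with batch size; prefill, by contrast, is compute-bound and grows with the total number of prompt tokens, which is where much of the extra batch time comes from.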
Jason
Jason4w ago
oh i see. do you know if that statement "there should be a millisecond difference between the time of 64 samplings and 1 sampling" is true? assuming a 7-8B model only
riverfog7
riverfog74w ago
probably not, because that gpu has to do prompt processing, which is 64 times more at batch size 64 than at batch size 1. but on 70B models with 4x A40 / A6000 gpus i get roughly this:
batch 1: about 40 tok/s
batch 70: about 500 tok/s (not accurate, i forgot the actual number)
prompt processing: ~1000 tok/s
btw an A40 is just an underclocked datacenter A6000
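A quick worked example of the "64 times more prompt processing" point, using the rough rate quoted above for the 70B setup and an assumed prompt length; both figures are illustrative only:

```python
# Worked example: prefill cost scales with the total number of prompt tokens,
# so batch 64 pays ~64x the prefill cost of batch 1. Numbers are illustrative.
PREFILL_TOKPS = 1000   # rough prompt-processing rate quoted above, tok/s
PROMPT_TOKENS = 1000   # assumed prompt length, illustrative only

for batch in (1, 64):
    prefill_s = batch * PROMPT_TOKENS / PREFILL_TOKPS
    print(f"batch {batch:>2}: ~{prefill_s:.0f} s of prefill before decoding finishes warming up")
```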
Jason
Jason4w ago
the throughput boost in tok/s at higher batch counts is crazy. i see, interesting fact. can i add you on discord?
riverfog7
riverfog74w ago
already accepted 🙂
Jason
Jason4w ago
oh haha okay
柠檬板烧鸡
柠檬板烧鸡OP2w ago
The conclusion we reached is that with a batch size of 64 you need a mixture of different paragraphs, so that some prompts are being processed in the prefill stage (high arithmetic intensity) while others are in the decode stage (low arithmetic intensity); that way the tensor cores' parallelism can, in theory, be fully utilized. We did not continue further in this direction.
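A minimal sketch of what that mixed prefill/decode scheduling looks like when enabling chunked prefill directly in vLLM (the ENABLE_CHUNKED_PREFILL and MAX_NUM_BATCHED_TOKENS env vars used earlier appear to map to these engine arguments); the exact argument names are based on recent vLLM versions and may differ between releases:

```python
# Minimal sketch (assumes a recent vLLM version that exposes chunked prefill).
# Chunked prefill lets the scheduler mix prefill chunks with decode steps in
# the same batch, which is the behaviour described in the conclusion above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B",
    enable_chunked_prefill=True,   # mix prefill and decode in one batch
    max_num_batched_tokens=8192,   # cap on tokens scheduled per step
)

params = SamplingParams(max_tokens=256)
outputs = llm.generate(["prompt A", "prompt B"], params)
for out in outputs:
    print(out.outputs[0].text)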
Jason
Jason2w ago
Oh, thanks for sharing back! So the claim about only a millisecond (or even a few seconds) of difference is not true?
柠檬板烧鸡
柠檬板烧鸡OP2w ago
I'm not sure; we didn't test it with a mix of different paragraphs.
