RunPod
Created by 柠檬板烧鸡 on 3/12/2025 in #⚡|serverless
How to optimize batch processing performance?
I'm not sure; we didn't test it with a mix of different prompts.
The conclusion we reached is that, at a batch size of 64, you need a mix of different prompts, so that some are in the prefill stage (high arithmetic intensity) while others are in the decode stage (low arithmetic intensity); in theory that lets the tensor cores' parallelism be fully utilized.
We did not continue to pursue this direction.
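For reference, vLLM's chunked prefill is the built-in mechanism that mixes prefill chunks and decode steps in the same scheduler batch, which is essentially the "prompts in different stages" idea above. Below is a minimal sketch using the offline vLLM API; the model name and prompts are placeholders, and it assumes a vLLM version that accepts enable_chunked_prefill and max_num_batched_tokens as engine arguments.
```python
# Minimal sketch (assumed recent vLLM; model and prompts are placeholders).
# Chunked prefill lets the scheduler put prefill chunks and decode steps
# into the same batch.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder model
    enable_chunked_prefill=True,        # mix prefill and decode in one batch
    max_num_batched_tokens=4096,        # per-step token budget for the scheduler
)

prompts = ["Write a short product description."] * 8   # placeholder prompts
params = SamplingParams(n=64, temperature=0.8, max_tokens=256)

outputs = llm.generate(prompts, params)
for out in outputs:
    print(f"{len(out.outputs)} completions for prompt: {out.prompt[:40]!r}")
```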
thanks
My idea is to upgrade the GPU, adjust the RunPod environment variables, and adjust the vLLM environment variables, then measure the time consumption. If the time for 64 samplings cannot get close to the time for 1 sampling, that means what my technical team leader said is wrong.
Our technical team leader did not provide any reference material, but another colleague of mine is looking for relevant papers to verify it.
If the time for 64 samplings is close to the time for 1 sampling, then multiple samplings can be used to pick the best output, which would improve product quality.
Our technical team leader's view is that 【there should only be a millisecond-level difference between the time for 64 samplings and the time for 1 sampling】.
The tests I am doing now are to verify the team leader's view. If it holds, great; if not, I have to come up with a conclusion that convinces him.
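If it helps, here is a rough sketch of how that comparison could be scripted. It assumes the serverless endpoint is the RunPod vLLM worker's OpenAI-compatible route; ENDPOINT_ID and MODEL are placeholders you would replace, and the timings include network latency on top of inference time.
```python
# Rough timing sketch for "64 samplings vs 1 sampling".
# Assumptions: RunPod vLLM worker with the OpenAI-compatible route;
# ENDPOINT_ID and MODEL are placeholders.
import os
import time

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)

def timed_request(n: int) -> float:
    """Send one chat completion asking for n samples and return wall time."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="MODEL",
        messages=[{"role": "user", "content": "Write one paragraph about GPUs."}],
        n=n,                    # number of samplings in a single request
        max_tokens=256,
        temperature=0.8,
    )
    return time.perf_counter() - start

for n in (1, 16, 32, 64):
    print(f"n={n}: {timed_request(n):.2f} s")
```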
If a single batch takes 2.95 s, then I would expect the time for 64 batches to fall in the range of 2.0~3.95 s.
what?
Our technical team leader told me this. I remained skeptical, which is why I'm verifying it in this thread 【How to optimize batch processing performance】.
Yes, the overall speed improved, but there is still a large gap between the time for 64 samplings and for 1 sampling, and 【how to optimize batch processing performance】 doesn't look like something that can be closed just by switching to a better GPU.
141 GB GPU, same configuration; the test results are as follows:
1 sampling time: 2.95 s
16 samplings time: 7.50 s
32 samplings time: 8.59 s
64 samplings time: 12.02 s
Compared with the 48 GB GPU this reduces the time consumption, but 64 samplings still take far longer than 1 sampling.
ok thank you very much
select this?
OK, thanks, but serverless only lets me pick the RTX A6000 and A40, so there is no way to choose a better GPU.
If I switch to a better GPU, can I reach the goal of 【the time to process 64 batches should be close to the time to process a single batch; currently 64 batches take too long】?
I think the time it takes to process 64 batches should be close to the time it takes to process a single batch. Currently, 64 batches take too long.
This is the log from the serverless worker.
MAX_NUM_BATCHED_TOKENS = 8000
1 sampling time: 11.02 s
16 samplings time: 22.84 s
32 samplings time: 23.52 s
64 samplings time: 45.85 s
I am going to keep increasing the value of MAX_NUM_BATCHED_TOKENS.
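In case it is useful to reproduce this outside the serverless worker: assuming the worker simply forwards these environment variables to vLLM's engine arguments (that mapping is an assumption here), the equivalent offline configuration would look roughly like this. The model name is a placeholder, and gpu_memory_utilization is an extra, optional knob relevant to the KV-cache preemption warning mentioned below rather than something from this thread.
```python
# Sketch of the (assumed) mapping from the worker's env vars to vLLM engine args.
from vllm import LLM

llm = LLM(
    model="MODEL",                   # placeholder
    max_num_batched_tokens=8000,     # MAX_NUM_BATCHED_TOKENS
    enable_chunked_prefill=True,     # ENABLE_CHUNKED_PREFILL
    gpu_memory_utilization=0.95,     # give the KV cache more GPU memory (default 0.9)
)
```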
I didn't use the OpenAI SDK; my request code was generated by Postman. Does the OpenAI SDK have any optimization in this regard?
【try 32 batch only, what happens with the time】 There is not much fluctuation; the time consumption still grows with the batch size across the different batches.
【do you get Sequence group X is preempted due to insufficient KV cache space too?】 I don't quite understand this sentence.
I added ENABLE_CHUNKED_PREFILL = true and MAX_NUM_BATCHED_TOKENS = 4000. The time consumption for 64 batches and 32 batches is still much higher than for a single batch.
ENABLE_CHUNKED_PREFILL = true and MAX_NUM_BATCHED_TOKENS = 4000
1 sampling time: 11.15 s
16 samplings time: 22.64 s
32 samplings time: 21.78 s
64 samplings time: 35.29 s
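On the OpenAI SDK question: the SDK is a thin client over the same HTTP endpoint, so by itself it should not change how the server batches the work. Below is a sketch of the raw request, roughly what Postman would generate; ENDPOINT_ID and MODEL are placeholders, and the URL shape assumes the RunPod vLLM worker's OpenAI-compatible route.
```python
# Raw HTTP sketch of the same call the OpenAI SDK would make (placeholders:
# ENDPOINT_ID, MODEL). Server-side batching is decided by vLLM either way.
import os

import requests

url = "https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}",
    "Content-Type": "application/json",
}
payload = {
    "model": "MODEL",
    "messages": [{"role": "user", "content": "Write one paragraph about GPUs."}],
    "n": 64,              # number of samplings requested in one call
    "max_tokens": 256,
    "temperature": 0.8,
}

resp = requests.post(url, headers=headers, json=payload, timeout=300)
resp.raise_for_status()
print(len(resp.json()["choices"]), "choices returned")
```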
RunPod
Created by 柠檬板烧鸡 on 3/5/2025 in #⚡|serverless
how can I check the logs to see if my request uses the lora model
Thanks for your suggestion.
The serverless image is packaged by RunPod and is not open to the public, so I can't modify the log-printing logic.
The model list returned by the models endpoint contains the LoRA model.
Requesting cn_writer returns results without any problem.
It's just that the log does not record which model each request used, so I can't be sure whether the LoRA part is actually being applied.
When deployed in pod mode, every request is logged with the LoRA it used.
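For what it's worth, here is a sketch of checking the LoRA through the OpenAI-compatible API rather than the logs. It assumes the RunPod vLLM worker endpoint (ENDPOINT_ID is a placeholder) and uses the cn_writer adapter name from this thread; vLLM's OpenAI server normally rejects model names it does not know, so a successful response for model="cn_writer" is itself evidence that the adapter was resolved.
```python
# Sketch: verify the LoRA adapter via the API instead of the worker logs.
# Assumptions: RunPod vLLM worker with the OpenAI-compatible route; "cn_writer"
# is the served LoRA name from this thread; ENDPOINT_ID is a placeholder.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)

# 1) The model list should show both the base model and the LoRA adapter.
for model in client.models.list():
    print(model.id)

# 2) vLLM normally rejects unknown model names, so getting a completion back
#    for "cn_writer" indicates the request was routed to the LoRA adapter.
resp = client.chat.completions.create(
    model="cn_writer",
    messages=[{"role": "user", "content": "Write one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```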