Abdelrhman Nile
RRunPod
Created by Abdelrhman Nile on 4/16/2025 in #⚡|serverless
Serverless VLLM concurrency issue
Will test that.
81 replies
Same model, same benchmark, but on a GCP A100 40 GB VRAM machine:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 346.74
Total input tokens: 1024000
Total generated tokens: 70328
Request throughput (req/s): 2.88
Output token throughput (tok/s): 202.83
Total Token throughput (tok/s): 3156.09
---------------Time to First Token----------------
Mean TTFT (ms): 172033.53
Median TTFT (ms): 178518.65
P99 TTFT (ms): 326714.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 357.45
Median TPOT (ms): 271.08
P99 TPOT (ms): 1728.97
---------------Inter-token Latency----------------
Mean ITL (ms): 263.52
Median ITL (ms): 151.98
P99 ITL (ms): 1228.35
==================================================
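As a sanity check (an illustrative sketch, not part of the original benchmark output), the headline throughput figures follow directly from the request count, token totals, and duration reported above:

```python
# Recompute the throughput numbers from the raw totals in the
# benchmark report above.
successful_requests = 1000
duration_s = 346.74
total_input_tokens = 1_024_000
total_generated_tokens = 70_328

request_throughput = successful_requests / duration_s          # req/s
output_token_throughput = total_generated_tokens / duration_s  # tok/s
total_token_throughput = (
    total_input_tokens + total_generated_tokens
) / duration_s

print(f"{request_throughput:.2f} req/s")       # ≈ 2.88
print(f"{output_token_throughput:.2f} tok/s")  # ≈ 202.83
# ≈ 3156; tiny differences from the reported value come from the
# duration being rounded to two decimals in the report.
print(f"{total_token_throughput:.2f} tok/s")
```

Also note that 1,024,000 input tokens over 1000 requests means each request carried a 1024-token prompt.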
Also, the script sent 1000 requests and only 857 were successful.
============ Serving Benchmark Result ============
Successful requests: 857
Benchmark duration (s): 95.82
Total input tokens: 877568
Total generated tokens: 68965
Request throughput (req/s): 8.94
Output token throughput (tok/s): 719.70
Total Token throughput (tok/s): 9877.74
---------------Time to First Token----------------
Mean TTFT (ms): 42451.61
Median TTFT (ms): 42317.61
P99 TTFT (ms): 77811.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 472.19
Median TPOT (ms): 190.87
P99 TPOT (ms): 3881.05
---------------Inter-token Latency----------------
Mean ITL (ms): 182.12
Median ITL (ms): 0.01
P99 ITL (ms): 4703.27
==================================================
The configuration was max workers = 3, and I was NOT setting the default batch size; it was left at the default, which I believe is 50.
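For reference, a sketch of the endpoint configuration described above, assuming RunPod's vLLM worker template exposes these as environment variables (variable names are per that template; treat them as assumptions if your template differs):

```shell
# Endpoint environment variables (RunPod vLLM worker template).
# DEFAULT_BATCH_SIZE controls how many tokens are flushed per
# streamed chunk; it defaults to 50 when unset, as in this run.
DEFAULT_BATCH_SIZE=50

# Max workers is set in the endpoint's scaling settings in the
# console, not via an env var: max workers = 3 for this benchmark.
```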
I kind of did that with the vLLM benchmark serving script; let me share the results with you.
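For anyone reproducing this, an invocation of vLLM's `benchmark_serving.py` consistent with the numbers above (1000 prompts of 1024 input tokens each) would look roughly like the following. This is a sketch: exact flag names vary by vLLM version, and the base URL and model name are placeholders, not values from the thread:

```shell
# Sketch of a vLLM serving benchmark run against an OpenAI-compatible
# endpoint; --base-url and --model are placeholders.
python benchmarks/benchmark_serving.py \
    --backend openai \
    --base-url http://localhost:8000 \
    --model my-model \
    --dataset-name random \
    --random-input-len 1024 \
    --num-prompts 1000
```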
But let me try it one more time to confirm.
Are you sure? I tried it with 5, 10, 50, and 256, and I got the same behaviour.
But both requests' status appears as IN PROGRESS.
Same behavior of not handling multiple requests with the default batch size set to 50 and to 256.
I tried it with 50 and 256.
I am setting the default batch size to 1 because I noticed that streaming used to send very big chunks of tokens.
Right now it is configured to only have one worker.
Logs when sending 2 requests:
No, I mean I was configuring the endpoint to scale up to multiple workers if needed.
Tried both.
2 and 3
I'll do some benchmarks and provide you with the numbers.
Yes: the first request starts streaming, and a second request from another client always starts after the first one finishes.
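A quick way to distinguish true concurrency from serialized handling is to time two overlapping requests from separate clients: if the wall-clock time for both is close to one request's latency, they ran in parallel; if it is close to the sum, they were serialized. A minimal sketch of that timing logic, where `send_request` is a hypothetical stand-in simulated with `asyncio.sleep` (replace it with a real call to the endpoint):

```python
import asyncio
import time

async def send_request(delay: float = 0.5) -> None:
    # Stand-in for a real streaming request to the endpoint;
    # swap the sleep for an actual HTTP call.
    await asyncio.sleep(delay)

async def probe() -> float:
    start = time.monotonic()
    # Fire two "requests" at the same time, as two clients would.
    await asyncio.gather(send_request(), send_request())
    return time.monotonic() - start

elapsed = asyncio.run(probe())
# ~0.5 s means the two ran concurrently; ~1.0 s would mean they
# were serialized, which is the behavior described above.
print(f"elapsed: {elapsed:.2f}s")
```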