Thibaud
RunPod
•Created by Thibaud on 8/20/2024 in #⚡|serverless
SGLang
Still nothing, @Tim aka NERDDISCO?
120 replies
Any news from your internal team?
Coding with async seems a little helpful: I think I got 2 "in progress" requests at the same time, but never more, even when flooding with dozens of lightweight OpenAI requests on 2 x H100s (2 GPUs/worker) with an 8B model.
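For reference, a flood test like the one described can be sketched with plain asyncio: fire many requests at once and track the peak number in flight. This is a self-contained sketch with a stubbed request (`fake_request` is a placeholder, not the real worker call); swapping in a real async OpenAI-style call against the serverless endpoint is what would reproduce the "never more than 2 in progress" observation.

```python
import asyncio

in_flight = 0      # requests currently awaiting a response
max_in_flight = 0  # peak concurrency observed during the flood

async def fake_request(i: int) -> int:
    # Stand-in for one lightweight OpenAI-style request to the worker.
    global in_flight, max_in_flight
    in_flight += 1
    max_in_flight = max(max_in_flight, in_flight)
    await asyncio.sleep(0.01)  # simulated server latency
    in_flight -= 1
    return i

async def flood(n: int = 24) -> int:
    # Launch all n requests at once, like the client side of the test.
    await asyncio.gather(*(fake_request(i) for i in range(n)))
    return max_in_flight
```

With the stub, all requests overlap; against the serverless endpoint the same harness reportedly peaked at about 2.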
I did some more tests today without much success... just burned a few dozen bucks.
I hope you'll find a way to solve the remaining issue.
Of course. I'll try to summarize all of that.
120 replies
RRunPod
•Created by Thibaud on 8/20/2024 in #⚡|serverless
SGLang
But the issue doesn't happen with vLLM.
I compared the code between the vLLM and SGLang workers and I don't see what could be wrong in the SGLang one.
I tried that, without success yet.
If you find something, lmk.
Yes, it's "good" news: the issue is known now.
=> I found the issue (no batching, requests are handled sequentially) but it's impossible for me to fix.
I fixed my repo.
It's the "working" version with the most up-to-date versions (Python, SGLang, FlashInfer), to reproduce as closely as possible the SGLang Docker image used on a pod.
I think the issue is related to the RunPod serverless layer (and/or may be related to async/await).
You can take a look at "my" version here:
https://github.com/supa-thibaud/worker-sglang
Docker image here:
supathibaud/sglang-runpod-serverless
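If the bottleneck really is the serverless layer taking one job at a time, the usual lever in RunPod's Python SDK is the handler's concurrency setting. A minimal sketch, assuming the SDK's `concurrency_modifier` option as described in RunPod's docs (the cap value and the stubbed forwarding are hypothetical; the real handler would await an HTTP call to the local SGLang OpenAI endpoint, `http://0.0.0.0:30000/v1/chat/completions` in the logs below):

```python
import asyncio

MAX_CONCURRENCY = 8  # hypothetical per-worker cap, to be tuned

def concurrency_modifier(current_concurrency: int) -> int:
    # Tells the serverless runtime how many jobs this worker may run at once.
    return MAX_CONCURRENCY

async def handler(job: dict) -> dict:
    # Real code would forward job["input"] to the local SGLang server and
    # await the response; stubbed here so the sketch is self-contained.
    await asyncio.sleep(0)
    return {"echo": job["input"]}

# In the actual worker this would be wired up roughly as:
# runpod.serverless.start({"handler": handler,
#                          "concurrency_modifier": concurrency_modifier})
```

An async handler alone isn't enough: without a concurrency modifier the runtime can still feed the worker one job at a time, which matches the sequential behavior described above.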
Thanks a lot.
I don't think I can do much on my side.
So it doesn't use its optimizations to decode/encode in batches. It's completely useless if we don't find the correct setup.
On a pod, requests are handled in batches;
on serverless, one after the other.
Serverless:
"message":"[gpu=0] Decode batch. #running-req: 1, #token: 2303, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0"
Finished running generator.
2024-08-23 10:50:19.804 [96xlptgpmheem3] [info] _client.py :1026 2024-08-23 01:50:19,803 HTTP Request: POST http://0.0.0.0:30000/v1/chat/completions "HTTP/1.1 200 OK"
2024-08-23 10:50:19.803 [96xlptgpmheem3] [info] INFO: 127.0.0.1:55604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-08-23 10:50:19.635 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2294, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
2024-08-23 10:50:18.544 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2254, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
2024-08-23 10:50:17.454 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2214, token usage: 0.01, gen throughput (token/s): 16.40, #queue-req: 0
2024-08-23 10:50:17.273 [96xlptgpmheem3] [info] [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 95.59%, #running-req: 0, #queue-req: 0
Pod:
2024-08-23T08:53:25.799936753Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 74.92%, #running-req: 0, #queue-req: 0
2024-08-23T08:53:25.856204372Z [gpu=0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 4416, cache hit rate: 83.26%, #running-req: 1, #queue-req: 0
2024-08-23T08:53:26.118457662Z [gpu=0] Prefill batch. #new-seq: 3, #new-token: 3, #cached-token: 6624, cache hit rate: 88.82%, #running-req: 3, #queue-req: 0
2024-08-23T08:53:26.148959295Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 89.94%, #running-req: 6, #queue-req: 0
2024-08-23T08:53:26.173732167Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 90.85%, #running-req: 7, #queue-req: 0
2024-08-23T08:53:26.418726222Z [gpu=0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 4416, cache hit rate: 92.25%, #running-req: 8, #queue-req: 0
[gpu=0] Decode batch. #running-req: 10, #token: 2799, token usage: 0.01, gen throughput (token/s): 397.74, #queue-req: 0
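A quick way to quantify the difference in the two log excerpts above is to pull the `#running-req` counts out of the SGLang log lines and compare the peaks (the two multi-line strings here just hold abbreviated copies of the logs above):

```python
import re

SERVERLESS = """\
[gpu=0] Decode batch. #running-req: 1, #token: 2303, token usage: 0.01
[gpu=0] Decode batch. #running-req: 1, #token: 2294, token usage: 0.01
"""

POD = """\
[gpu=0] Prefill batch. #new-seq: 2, #cached-token: 4416, #running-req: 8, #queue-req: 0
[gpu=0] Decode batch. #running-req: 10, #token: 2799, token usage: 0.01
"""

def max_running(log: str) -> int:
    # Peak number of simultaneously running requests reported by SGLang.
    return max(int(m) for m in re.findall(r"#running-req: (\d+)", log))
```

On these excerpts the serverless peak is 1 while the pod reaches 10, which is exactly the sequential-vs-batched behavior (and the ~36 vs ~397 tok/s throughput gap) being discussed.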