Thibaud
RunPod
Created by Thibaud on 8/20/2024 in #⚡|serverless
SGLang
120 replies
Still nothing, @Tim aka NERDDISCO?
ok, let me know!
Any news from your internal team?
Coding with async seems a little helpful (I think I got 2 "IN_PROGRESS" requests at the same time, but never more, even when flooding with dozens of light requests (OpenAI-compatible) against 2x H100s with 2 GPUs per worker and an 8B model).
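For context, a minimal sketch of the kind of async setup being tested here, assuming the RunPod Python SDK's async handler support and its `concurrency_modifier` option; the forwarding logic and the concurrency cap are illustrative, not the repo's actual code:

```python
import runpod
import httpx

# Local SGLang server started alongside the handler
# (port 30000 matches the logs later in this thread).
SGLANG_URL = "http://0.0.0.0:30000/v1/chat/completions"

async def handler(job):
    # Forward the job payload to the local OpenAI-compatible SGLang server.
    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post(SGLANG_URL, json=job["input"])
        return resp.json()

def concurrency_modifier(current_concurrency):
    # Allow several jobs in flight per worker so SGLang can batch them.
    return 8  # illustrative cap, not a recommended value

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```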
I did some more tests today without much success... just burned a few dozen bucks.
I hope you'll find a way to solve the remaining issue.
done!
Of course. I'll try to summarize all of that.
But the issue isn't there with vLLM.
yes
I compared the code between vLLM and SGLang, and I don't see what could be wrong in the SGLang one.
I tried that without success so far. If you find something, lmk.
exactly
Yes, it's "good" news in the sense that the issue is known now. => I found the issue (no batching; requests run sequentially), but it's impossible for me to fix.
I fixed up my repo. It's the version that "works" with the most up-to-date versions (Python, SGLang, FlashInfer), to reproduce as closely as possible the SGLang Docker image used on Pods. I think the issue is related to the RunPod serverless layer (and/or may be related to async/await). You can take a look at "my" version here: https://github.com/supa-thibaud/worker-sglang (Docker image: supathibaud/sglang-runpod-serverless).
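To illustrate the async/await suspicion: if the worker awaits each job to completion before starting the next, requests serialize even though SGLang could batch them. A hypothetical sketch of the difference (the function names and the sleep stub are illustrative, not from the repo):

```python
import asyncio

async def serve_one(job):
    # Stand-in for forwarding one request to SGLang and awaiting the reply.
    await asyncio.sleep(1)
    return job

# Serializing pattern: each job is awaited before the next one starts,
# so the SGLang server only ever sees #running-req: 1.
async def serial_loop(jobs):
    return [await serve_one(job) for job in jobs]

# Concurrent pattern: jobs overlap in flight, so SGLang can batch
# their prefill/decode steps together (#running-req grows).
async def concurrent_loop(jobs):
    return await asyncio.gather(*(serve_one(job) for job in jobs))

# 10 jobs: ~10 s with serial_loop vs ~1 s with concurrent_loop.
print(asyncio.run(concurrent_loop(list(range(10)))))
```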
Thanks a lot. I don't think I can do much on my side.
So it doesn't use its optimizations to decode/encode in batches. It's completely useless if we don't find the correct setup.
On Pods, requests are handled in batches; on serverless, one after the other.
Serverless:
```
[gpu=0] Decode batch. #running-req: 1, #token: 2303, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
Finished running generator.
2024-08-23 10:50:19.804 [96xlptgpmheem3] [info] _client.py:1026 2024-08-23 01:50:19,803 HTTP Request: POST http://0.0.0.0:30000/v1/chat/completions "HTTP/1.1 200 OK"
2024-08-23 10:50:19.803 [96xlptgpmheem3] [info] INFO: 127.0.0.1:55604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-08-23 10:50:19.635 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2294, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
2024-08-23 10:50:18.544 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2254, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
2024-08-23 10:50:17.454 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2214, token usage: 0.01, gen throughput (token/s): 16.40, #queue-req: 0
2024-08-23 10:50:17.273 [96xlptgpmheem3] [info] [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 95.59%, #running-req: 0, #queue-req: 0
```
Pod:
```
2024-08-23T08:53:25.799936753Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 74.92%, #running-req: 0, #queue-req: 0
2024-08-23T08:53:25.856204372Z [gpu=0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 4416, cache hit rate: 83.26%, #running-req: 1, #queue-req: 0
2024-08-23T08:53:26.118457662Z [gpu=0] Prefill batch. #new-seq: 3, #new-token: 3, #cached-token: 6624, cache hit rate: 88.82%, #running-req: 3, #queue-req: 0
2024-08-23T08:53:26.148959295Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 89.94%, #running-req: 6, #queue-req: 0
2024-08-23T08:53:26.173732167Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 90.85%, #running-req: 7, #queue-req: 0
2024-08-23T08:53:26.418726222Z [gpu=0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 4416, cache hit rate: 92.25%, #running-req: 8, #queue-req: 0
[gpu=0] Decode batch. #running-req: 10, #token: 2799, token usage: 0.01, gen throughput (token/s): 397.74, #queue-req: 0
```
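A minimal sketch of the kind of flood test that produces the Pod-style logs above, assuming a local SGLang server exposing the OpenAI-compatible API on port 30000 (matching the logs); the model name and request count are illustrative:

```python
import asyncio
from openai import AsyncOpenAI

# Points at the local SGLang server's OpenAI-compatible endpoint.
client = AsyncOpenAI(base_url="http://0.0.0.0:30000/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model
        messages=[{"role": "user", "content": f"Question #{i}: say hi."}],
    )
    return resp.choices[0].message.content

async def main():
    # Fire 10 requests at once; on a Pod, SGLang prefills and decodes them
    # as one batch (#running-req climbs to 10 in the logs above).
    answers = await asyncio.gather(*(one_request(i) for i in range(10)))
    print(answers)

asyncio.run(main())
```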