RunPod3mo ago
Thibaud

SGLang

SGLang works very well in a pod but is impossible to run in serverless: the API route keeps returning error 404. I use the exact same config (Docker image, command line, port) in pod and serverless.
85 Replies
NERDDISCO
NERDDISCO3mo ago
@Thibaud have you tried https://github.com/runpod-workers/worker-sglang by any chance?
Thibaud
Thibaud3mo ago
Yes. It launches, but sending a request from the RunPod UI or via an OpenAI call gives zero results.
NERDDISCO
NERDDISCO3mo ago
Would you mind sending me a screenshot of a deployed worker with the request + result? Then I will open an issue on our repo and get an engineer looking at the problem.
Thibaud
Thibaud3mo ago
Some screenshots, I hope they can help.
Thibaud
Thibaud3mo ago
I don't get any results, just nothing happens. Maybe I missed something in my serverless config. Do you need anything else?
NERDDISCO
NERDDISCO3mo ago
Nope this looks fine, thank you very much!
Thibaud
Thibaud3mo ago
Thanks. I hope your team will find a solution, or a tutorial if the error is between the keyboard and the chair. Any news about that?
NERDDISCO
NERDDISCO3mo ago
nope not yet, sorry! Will keep you updated once we have something 🙏
NERDDISCO
NERDDISCO3mo ago
Hey @Thibaud, we just released a preview of the worker, would you mind testing this out to see if this has the same problems? https://hub.docker.com/r/runpod/worker-sglang/tags
Thibaud
Thibaud3mo ago
of course. I'll do that right now
NERDDISCO
NERDDISCO3mo ago
awesome, thank you very much!!!
Thibaud
Thibaud3mo ago
So it's almost OK! But maybe the last issue I have is a bad configuration. TL;DR: I can't ping the cluster, only a single instance.
My configuration:
Container Image: runpod/worker-sglang:preview-cuda12.1.0
Container Start Command: python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-8B --context-length 8192 --host 0.0.0.0 --port 8000
Expose HTTP Ports: 8000
Once launched, if I click on the running instance, click "Connect Web" and use the URL (something like https://xxx-8000.proxy.runpod.net/v1), it works. But I don't have something like an OPENAI BASE URL (https://api.runpod.ai/v2/vllm-xxxx/openai/v1) to ping the cluster.
nerdylive
nerdylive3mo ago
What if you build the URL yourself: https://api.runpod.ai/v2/your-endpoint-id/openai/v1
Thibaud
Thibaud3mo ago
No, and none of the URLs work: error 401 each time.
Thibaud
Thibaud3mo ago
I think the endpoint is not correctly "connected" to its instances.
nerdylive
nerdylive3mo ago
401 is unauthorized though. Does /health also return 401? Maybe check if your API key is valid, and create a new endpoint.
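(For illustration: checking a hand-built endpoint URL with the API key could look like the sketch below. It assumes the standard RunPod serverless routes and Bearer-token auth, and that the worker exposes the same OpenAI-compatible routes as the vLLM worker; <endpoint_id> and the RUNPOD_API_KEY environment variable are placeholders.)

```python
# Hedged sketch: probing a hand-built RunPod serverless endpoint URL.
# Assumes standard RunPod routes (/health, /openai/v1/...) and Bearer auth.
import os
import requests

ENDPOINT_ID = "<endpoint_id>"                      # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]             # assumed env var
HEADERS = {"Authorization": f"Bearer {API_KEY}"}   # a missing/invalid key yields 401

# Endpoint health (workers, jobs in queue, etc.)
r = requests.get(f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health", headers=HEADERS)
print("health:", r.status_code, r.text)

# OpenAI-compatible route, if the worker exposes it like the vLLM worker does
r = requests.get(f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1/models", headers=HEADERS)
print("models:", r.status_code, r.text)
```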
NERDDISCO
NERDDISCO3mo ago
@Thibaud thank you very much for testing this in depth, I will report this back to the team
Thibaud
Thibaud3mo ago
Tim, the "connection" error between serverless and the instance seems to have been an API key error; I'm using a new one now... But I have found one other issue:
"OpenAIRequest.request_chat_completions() got an unexpected keyword argument 'temperature'"
"OpenAIRequest.request_chat_completions() got an unexpected keyword argument 'stop'"
NERDDISCO
NERDDISCO3mo ago
Ok, makes sense! How can we reproduce the error you are seeing?
nerdylive
nerdylive3mo ago
Are you using the OpenAI client? Yeah, maybe send the code.
Thibaud
Thibaud3mo ago
Give me a minute, I'm cleaning up my code.
NERDDISCO
NERDDISCO3mo ago
AHH so when you use temperature or any of the other params, it will fail?
Thibaud
Thibaud3mo ago
yes
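(For illustration: a client call along the lines of the sketch below hits the failing path, because the worker's request_chat_completions() did not accept the extra sampling kwargs at the time. The endpoint ID, API key and model name are placeholders, not values from the thread.)

```python
# Hedged repro sketch: passing `temperature` and `stop` through the
# OpenAI-compatible route; the worker forwarded them as unexpected kwargs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",  # placeholder
    api_key="<runpod_api_key>",                                   # placeholder
)

resp = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-8B",             # placeholder model
    messages=[{"role": "user", "content": "Hello there!"}],
    temperature=0.7,   # triggers: unexpected keyword argument 'temperature'
    stop=["\n\n"],     # triggers: unexpected keyword argument 'stop'
    max_tokens=100,
)
print(resp.choices[0].message.content)
```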
NERDDISCO
NERDDISCO3mo ago
OK, awesome. I reported this to the team, and based on their feedback this will either be fixed today or I will create an issue on GitHub so it doesn't get lost. Thanks for helping out!
Thibaud
Thibaud3mo ago
Thanks! The big part was finding the issue; solving it will be faster.
NERDDISCO
NERDDISCO3mo ago
We appreciate your debugging skills very much ❤️ @Thibaud May I ask what kind of use case you have for SGLang?
Thibaud
Thibaud3mo ago
Of course. I'm launching (for now in beta) a SaaS where users can talk to AI characters. vLLM works well but it's slower than SGLang, and I can't find good (fast) settings to run a 70B model with vLLM. With SGLang it's a HUGE difference (at least on a pod; I don't have data for serverless yet).
nerdylive
nerdylive3mo ago
Expect it to be a little 2-3x faster, some docs say.
NERDDISCO
NERDDISCO3mo ago
If you want to have a beta tester, then I would be happy to help!
Thibaud
Thibaud3mo ago
For now we're focusing our beta on French; as soon as we launch in English, yes!
nerdylive
nerdylive3mo ago
Ohh, my bad, I misread.
Thibaud
Thibaud3mo ago
BTW, I made a PR on GitHub; it'll help your team @Tim aka NERDDISCO keep working on SGLang. I did a test:
- 2x H100 (2 GPUs / worker)
- 10 concurrent users (each doing 3 requests of 2000 tokens in / 200 out)
Serverless: worker-sglang
CONTEXT_LENGTH 8192
TENSOR_PARALLEL_SIZE 2
MODEL_PATH NousResearch/Hermes-3-Llama-3.1-70B-FP8
Average inference time: 30s
Pod: lmsysorg/sglang:latest
python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B-FP8 --context-length 8192 --host 0.0.0.0 --port 8000 --tp 2
Average inference time: 3s
(Tests done after the model was loaded into VRAM, of course.)
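(For reference: a load generator matching that test setup, 10 concurrent users each sending 3 requests of roughly 2000 tokens in / 200 out, could be sketched as below. The URL, key, model name and filler prompt are assumptions, not the script actually used for these numbers.)

```python
# Hedged sketch of the load pattern: 10 concurrent users, 3 requests each.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://xxx-8000.proxy.runpod.net/v1",  # placeholder (pod proxy or serverless URL)
    api_key="<runpod_api_key>",                       # placeholder
)

PROMPT = "lorem ipsum " * 1000  # crude stand-in for ~2000 input tokens

async def one_user(user_id: int) -> None:
    for i in range(3):  # 3 requests per user
        start = time.perf_counter()
        await client.chat.completions.create(
            model="NousResearch/Hermes-3-Llama-3.1-70B-FP8",
            messages=[{"role": "user", "content": PROMPT}],
            max_tokens=200,  # ~200 tokens out
        )
        print(f"user {user_id}, request {i}: {time.perf_counter() - start:.1f}s")

async def main() -> None:
    # All 10 users fire concurrently; inference time is measured per request.
    await asyncio.gather(*(one_user(u) for u in range(10)))

asyncio.run(main())
```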
nerdylive
nerdylive3mo ago
How much faster than vLLM in a pod? Wow, roughly 10x faster on the average inference time.
NERDDISCO
NERDDISCO3mo ago
I just saw it, thank you VERY MUCH!!! Oh, so the performance on serverless is actually very, very bad compared to pods.
Thibaud
Thibaud3mo ago
Yes. I think I found the cause of that: the CUDA version is not the same and flashinfer is outdated.
NERDDISCO
NERDDISCO3mo ago
you mean in the image from RunPod?
Thibaud
Thibaud3mo ago
yes
nerdylive
nerdylive3mo ago
If you want, you can make a PR to update that 👍 Wait, don't the new RunPod vLLM or SGLang workers build in multiple CUDA versions too?
NERDDISCO
NERDDISCO3mo ago
The PR from @Thibaud already contains the correct CUDA version, I think we just need to make sure to also update flashinfer right?
nerdylive
nerdylive3mo ago
Oh, I haven't seen it yet, but yeah, that might be worth trying.
Thibaud
Thibaud3mo ago
No, my commit doesn't have the correct version yet; I'm trying it locally.
NERDDISCO
NERDDISCO3mo ago
Is this not the Dockerfile from the official image? https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile#L1
Thibaud
Thibaud3mo ago
This one, yes. But I don't think I've made a PR with the correct version yet? I haven't built/tested it yet (my bandwidth is slow 🥲 and the Docker build on GitHub failed, so I have to do it locally and upload).
Thibaud
Thibaud3mo ago
Comparing runpod-workers:main...supa-thibaud:main · runpod-workers/worker-sglang
Thibaud
Thibaud3mo ago
OK, this version works with updated CUDA/flashinfer (I made a PR). But this version is still slower (30s vs 3s) than the version on a pod.
nerdylive
nerdylive3mo ago
Maybe you want to test what's making the difference; if it's the versions, you can try downgrading the versions on your pod too.
Thibaud
Thibaud3mo ago
i'll try to compare the pip freeze of each
nerdylive
nerdylive3mo ago
Nice. Or it might be other deps/packages affecting performance too, maybe from the Dockerfile or other scripts.
Thibaud
Thibaud3mo ago
Can't say. BTW, would it be possible to get some credits to make testing faster? (I stop/launch the instance every time to reduce cost, but it's time-consuming.) I'm currently building a new Docker image with updated versions of some dependencies.
nerdylive
nerdylive3mo ago
Maybe I'll launch what you need. Yeah, I know, just saying the range of possibilities that could affect that is wide if it's from deps, etc.
Thibaud
Thibaud3mo ago
OK, I updated the Dockerfile with a more recent version of Python, but no gain! My next hypothesis is this one: on serverless, the different requests don't use the same instance (process) of the SGLang server.
On a pod, the server is launched like this:
python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B-FP8 --context-length 8192 --host 0.0.0.0 --port 8000 --tp 2
and stays running, so when a new request comes in, it's handled by that SGLang server.
On serverless, the code is:

import requests
import runpod

# SGlangEngine and OpenAIRequest are defined elsewhere in the worker's code.

# Initialize the engine
engine = SGlangEngine()
engine.start_server()
engine.wait_for_server()


async def async_handler(job):
    """Handle the requests asynchronously."""
    job_input = job["input"]
    print(f"JOB_INPUT: {job_input}")
    if job_input.get("openai_route"):
        openai_route, openai_input = job_input.get("openai_route"), job_input.get("openai_input")
        openai_request = OpenAIRequest()
        if openai_route == "/v1/chat/completions":
            async for chunk in openai_request.request_chat_completions(**openai_input):
                yield chunk
        elif openai_route == "/v1/completions":
            async for chunk in openai_request.request_completions(**openai_input):
                yield chunk
        elif openai_route == "/v1/models":
            models = await openai_request.get_models()
            yield models
    else:
        generate_url = f"{engine.base_url}/generate"
        headers = {"Content-Type": "application/json"}
        generate_data = {
            "text": job_input.get("prompt", ""),
            "sampling_params": job_input.get("sampling_params", {})
        }
        response = requests.post(generate_url, json=generate_data, headers=headers)
        if response.status_code == 200:
            yield response.json()
        else:
            yield {"error": f"Generate request failed with status code {response.status_code}",
                   "details": response.text}


runpod.serverless.start({"handler": async_handler, "return_aggregate_stream": True})

I just compared the handlers for SGLang and vLLM; one big difference is that the SGLang one doesn't have the concurrency_modifier param.
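(For comparison: wiring in a concurrency_modifier the way the vLLM worker does might look like the sketch below. The MAX_CONCURRENCY env var and its default are assumptions, and as the rest of the thread shows, this alone did not make serverless batch requests.)

```python
# Hedged sketch: let the worker accept several jobs at once so the SGLang
# server can batch them, mirroring the vLLM worker's concurrency_modifier.
import os
import runpod

def adjust_concurrency(current_concurrency: int) -> int:
    # Ignore the current value and allow up to MAX_CONCURRENCY in-flight jobs.
    return int(os.getenv("MAX_CONCURRENCY", "10"))  # assumed env var and default

runpod.serverless.start({
    "handler": async_handler,            # the handler defined above
    "concurrency_modifier": adjust_concurrency,
    "return_aggregate_stream": True,
})
```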
nerdylive
nerdylive3mo ago
Do the logs support that? Like, was a new SGLang server just started?
Thibaud
Thibaud3mo ago
In the logs, only one SGLang server seems to be started. OK, I know the reason now, but I have no idea how to solve it; it's too tied to the RunPod serverless architecture for me. Maybe @Tim aka NERDDISCO could help.
Serverless:
"message":"[gpu=0] Decode batch. #running-req: 1, #token: 2303, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0"
Finished running generator.
2024-08-23 10:50:19.804 [96xlptgpmheem3] [info] _client.py :1026 2024-08-23 01:50:19,803 HTTP Request: POST http://0.0.0.0:30000/v1/chat/completions "HTTP/1.1 200 OK"
2024-08-23 10:50:19.803 [96xlptgpmheem3] [info] INFO: 127.0.0.1:55604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-08-23 10:50:19.635 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2294, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
2024-08-23 10:50:18.544 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2254, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
2024-08-23 10:50:17.454 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2214, token usage: 0.01, gen throughput (token/s): 16.40, #queue-req: 0
2024-08-23 10:50:17.273 [96xlptgpmheem3] [info] [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 95.59%, #running-req: 0, #queue-req: 0
Pod:
2024-08-23T08:53:25.799936753Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 74.92%, #running-req: 0, #queue-req: 0
2024-08-23T08:53:25.856204372Z [gpu=0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 4416, cache hit rate: 83.26%, #running-req: 1, #queue-req: 0
2024-08-23T08:53:26.118457662Z [gpu=0] Prefill batch. #new-seq: 3, #new-token: 3, #cached-token: 6624, cache hit rate: 88.82%, #running-req: 3, #queue-req: 0
2024-08-23T08:53:26.148959295Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 89.94%, #running-req: 6, #queue-req: 0
2024-08-23T08:53:26.173732167Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 90.85%, #running-req: 7, #queue-req: 0
2024-08-23T08:53:26.418726222Z [gpu=0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 4416, cache hit rate: 92.25%, #running-req: 8, #queue-req: 0
[gpu=0] Decode batch. #running-req: 10, #token: 2799, token usage: 0.01, gen throughput (token/s): 397.74, #queue-req: 0
On the pod, requests are handled in batches; on serverless, one after the other. So it doesn't use SGLang's batched prefill/decode optimizations. It's completely useless if we don't find the correct setup.
NERDDISCO
NERDDISCO3mo ago
Thanks for pointing this out, I will check this out myself and see if we can somehow get around doing this. I remember that I heard something about this at some point, but I need to dig deeper
Thibaud
Thibaud3mo ago
Thanks a lot. I don't think I can do much more on my side. I fixed my repo: it's the version "working" with the most up-to-date versions (Python, SGLang, flashinfer), to reproduce the SGLang Docker image used on pods as closely as possible. I think the issue is related to the RunPod serverless side (and/or may be related to async/await). You can take a look at "my" version here: https://github.com/supa-thibaud/worker-sglang and the Docker image here: supathibaud/sglang-runpod-serverless
nerdylive
nerdylive3mo ago
Yeah, if you reproduce everything and it's still slow, it's probably serverless.
Thibaud
Thibaud3mo ago
Yes, it's "good" news: the issue is known now. I found the issue (no batching, requests are sequential), but it's impossible for me to fix.
nerdylive
nerdylive3mo ago
Ooh nice, perhaps a little comparison to the vllm-worker can fix that. Oh, you mean after adding the "concurrency" settings it's still sequential?
Thibaud
Thibaud3mo ago
exactly
nerdylive
nerdylive3mo ago
I see.
Thibaud
Thibaud3mo ago
I tried that, without success yet. If you find something, let me know.
nerdylive
nerdylive3mo ago
Yup, I haven't tested it yet; that was before I saw your commit on the concurrency. I think it's on RunPod's side, if you've tested it.
Thibaud
Thibaud3mo ago
I compared the code between the vLLM and SGLang workers and I don't see what could be hurting the SGLang one.
nerdylive
nerdylive3mo ago
Hmm, so this means the time is spent in the queue, right? Not inference time.
Thibaud
Thibaud3mo ago
yes
Thibaud
Thibaud3mo ago
When I launched it, I had 9 queued and 1 in progress, never more than one in progress.
nerdylive
nerdylive3mo ago
Ahh, I thought it was the inference time all this time, hahah. Yeah, RunPod needs to fix their queue to be faster at assigning jobs.
Thibaud
Thibaud3mo ago
But the issue isn't there with vLLM.
nerdylive
nerdylive3mo ago
yea
NERDDISCO
NERDDISCO3mo ago
@Thibaud so we were talking about this internally and we are already working on some things to make this happen. Would you mind opening a new issue on the worker-sglang repo with your findings? We would love to keep track of where this came from. You did a great job debugging already and we would love to keep you in the loop on that.
Thibaud
Thibaud3mo ago
Of course, I'll try to summarize all of that. Done!
NERDDISCO
NERDDISCO3mo ago
Thank you very very much!!!
Thibaud
Thibaud3mo ago
i hope you'll find a way to solve the remaining issue
NERDDISCO
NERDDISCO3mo ago
that would be really nice indeed.
Thibaud
Thibaud3mo ago
I did some more tests today without big success... just burned a few dozen bucks. Coding with async seems a little helpful (I think I got 2 "in progress" at the same time, but never more, even when flooding with dozens of light OpenAI requests on 2x H100 with 2 GPUs/worker and an 8B model). Any news from your internal team?
NERDDISCO
NERDDISCO3mo ago
@Thibaud it's still under investigation! Possible first result this week.
Thibaud
Thibaud2mo ago
OK, let me know! Still nothing, @Tim aka NERDDISCO?
jackson
jackson2mo ago
Hi guys, can I run multiple nodes on RunPod? Example: two nodes with 4 GPUs on each node.
nerdylive
nerdylive2mo ago
Yes, but for private networking between them, not right now.
jackson
jackson2mo ago
I ran with this command:
/bin/bash -c "python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-8B --context-length 8192 --quantization fp8 --host 0.0.0.0 --port 8000 --tp 8 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 0"
and got this error message:
[W909 10:25:24.935506246 socket.cpp:697] [c10d] The IPv6 network addresses of (sgl-dev-0, 50000) cannot be retrieved (gai error: -2 - Name or service not known)
This was run on Community Cloud.
nerdylive
nerdylive2mo ago
Well, it's not the correct IP, is it?
NERDDISCO
NERDDISCO2mo ago
We are definitely still working on this; I made sure that you'll receive an update via the issue.