SGLang
SGLang works very well in a pod, but I can't get it to run in serverless.
The API route keeps returning error 404.
I use the exact same config (Docker image, command line, port) in the pod and in serverless.
GitHub: runpod-workers/worker-sglang — SGLang is yet another fast serving framework for large language models and vision language models.
Yes, but even though it launches, sending a request via the RunPod UI or an OpenAI call gives zero results.
Would you mind sending me a screenshot of a deployed worker with the request + result? Then I will open an issue on our repo and get an engineer looking at the problem.
Some screenshots, I hope they can help.
I don't get any results, just nothing happens.
Maybe I have missed something in my serverless config.
Do you need anything else?
Nope this looks fine, thank you very much!
Thanks.
I hope your team will find a solution, or a tutorial if the error turns out to be between the keyboard and the chair.
Any news about that?
nope not yet, sorry! Will keep you updated once we have something 🙏
Hey @Thibaud, we just released a preview of the worker, would you mind testing this out to see if this has the same problems? https://hub.docker.com/r/runpod/worker-sglang/tags
of course. I'll do that right now
awesome, thank you very much!!!
So it's almost OK!
But maybe the last issue I have is just a bad configuration.
tl;dr: I can't reach the cluster, only individual instances.
My configuration:
container Image:
runpod/worker-sglang:preview-cuda12.1.0
Container Start Command:
python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-8B --context-length 8192 --host 0.0.0.0 --port 8000
Expose HTTP Ports: 8000
Once launched, if I click on the running instance, connect via the web port, and use the URL (something like https://xxx-8000.proxy.runpod.net/v1), it works.
But I don't get something like an OPENAI BASE URL (https://api.runpod.ai/v2/vllm-xxxx/openai/v1) to reach the cluster as a whole.
What if you build the URL yourself?
https://api.runpod.ai/v2/your-endpoint-id/openai/v1
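For example, something like this should then work against the whole endpoint (a minimal sketch; endpoint ID, API key, and model name are placeholders):

from openai import OpenAI

# Placeholders: use your own endpoint ID, RunPod API key, and model name.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/your-endpoint-id/openai/v1",
    api_key="YOUR_RUNPOD_API_KEY",
)

response = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)
print(response.choices[0].message.content)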
No.
And none of the URLs work: error 401 each time.
I think the endpoint is not correctly "connected" to its instances.
401 is Unauthorized though.
Does /health also return 401?
Maybe check if your API key is valid, and create a new endpoint.
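For example, a quick check against the endpoint's health route (a sketch; endpoint ID and API key are placeholders):

import requests

# Placeholders: replace with your endpoint ID and RunPod API key.
resp = requests.get(
    "https://api.runpod.ai/v2/your-endpoint-id/health",
    headers={"Authorization": "Bearer YOUR_RUNPOD_API_KEY"},
)
print(resp.status_code, resp.text)  # a 401 here points at the API key, not the worker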
@Thibaud thank you very much for testing this in depth, I will report this back to the team
Tim, the "connection" error between serverless and the instance seems to be an API key error. I'm using a new one now...
But I have found one other issue:
OpenAIRequest.request_chat_completions() got an unexpected keyword argument 'temperature'
OpenAIRequest.request_chat_completions() got an unexpected keyword argument 'stop'
Ok, makes sense!
How can we reproduce the error you are seeing?
Are you using the OpenAI client?
Yeah, maybe send the code.
Give me 1 minute, I'll clean up my code.
The issue is here:
https://github.com/runpod-workers/worker-sglang/blob/32a669858d61f1b80dc79195e4d6b61d656b4241/src/engine.py#L103
Not enough parameters are accepted.
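For reference, a rough sketch of the idea (not the actual engine.py implementation): accept and forward the whole OpenAI payload instead of a fixed parameter list, assuming the method simply proxies to the local SGLang server with httpx; the default base URL is a placeholder.

import httpx

class OpenAIRequest:
    def __init__(self, base_url="http://0.0.0.0:30000"):
        # Placeholder: the local SGLang server address used by the worker.
        self.base_url = base_url

    async def request_chat_completions(self, **openai_input):
        # Forward every field of the OpenAI request body (temperature, stop, ...)
        # instead of naming each keyword argument explicitly.
        async with httpx.AsyncClient(timeout=None) as client:
            resp = await client.post(
                f"{self.base_url}/v1/chat/completions",
                json=openai_input,
            )
            yield resp.json()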
AHH, so when you use temperature or any of the other params, it will fail?
Yes.
ok awesome. I reported this to the team and based on their feedback this will either be fixed today or I will create an issue on GitHub to not lose focus.
thanks for helping out!
thanks!
The big part was finding the issue; fixing it will be faster.
We appreciate your debugging skills very much ❤️
@Thibaud May I ask what kind of use case you have for SGLang?
Of course. I'm launching (for now in beta) a SaaS where users can talk to AI characters.
vLLM works well but it's slower than SGLang; I couldn't find good (fast) settings to run a 70B model with vLLM. With SGLang it's a HUGE difference (at least on the pod, not serverless; I don't have data for serverless yet).
Some docs say to expect it to be roughly 2-3x faster.
If you want to have a beta tester, then I would be happy to help!
For now we're focusing our beta on French; as soon as we launch in English, yes!
Ohh my bad, misread that.
BTW, I opened a PR on GitHub.
It'll help your team.
@Tim aka NERDDISCO to continue about SGLang:
I did a test:
- 2x H100 (2 GPUs / worker)
- 10 concurrent users (each doing 3 requests of 2000 tokens in / 200 out)
Serverless:
worker-sglang
CONTEXT_LENGTH 8192
TENSOR_PARALLEL_SIZE 2
MODEL_PATH NousResearch/Hermes-3-Llama-3.1-70B-FP8
Average inference time: 30s
Pod:
lmsysorg/sglang:latest
python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B-FP8 --context-length 8192 --host 0.0.0.0 --port 8000 --tp 2
Average inference time: 3s
(tests done after the model was already loaded in VRAM, of course)
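For anyone wanting to reproduce this kind of load, a minimal sketch of a harness (base URL, API key, and model name are placeholders; against serverless you would point it at the https://api.runpod.ai/v2/your-endpoint-id/openai/v1 route with your RunPod API key instead):

import asyncio
import time
from openai import AsyncOpenAI

# Placeholder base URL (pod proxy shown); swap in the serverless OpenAI route to compare.
client = AsyncOpenAI(base_url="https://xxx-8000.proxy.runpod.net/v1", api_key="EMPTY")
PROMPT = "word " * 2000  # roughly 2000 tokens in

async def one_user():
    times = []
    for _ in range(3):  # 3 requests per user
        start = time.perf_counter()
        await client.chat.completions.create(
            model="NousResearch/Hermes-3-Llama-3.1-70B-FP8",
            messages=[{"role": "user", "content": PROMPT}],
            max_tokens=200,  # roughly 200 tokens out
        )
        times.append(time.perf_counter() - start)
    return times

async def main():
    results = await asyncio.gather(*[one_user() for _ in range(10)])  # 10 concurrent users
    all_times = [t for user in results for t in user]
    print(f"Average inference time: {sum(all_times) / len(all_times):.1f}s")

asyncio.run(main())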
How much faster than vLLM in a pod?
Wow, roughly 3 times faster on the average inference time.
I just saw it, thank you VERY MUCH!!!
Oh, so the performance on serverless is actually very, very bad compared to pods.
Yes. I think I found the cause of that:
the CUDA version is not the same,
and FlashInfer is outdated.
you mean in the image from RunPod?
yes
If you want, you can make a PR to update that 👍
Wait, don't the new RunPod vLLM or SGLang builds come in multiple CUDA versions too?
The PR from @Thibaud already contains the correct CUDA version, I think we just need to make sure to also update flashinfer right?
Oh, I haven't seen it yet, but yeah, that might be worth trying.
No, my commit doesn't have the correct version.
I'm trying it locally.
Is this not the Dockerfile from the official image? https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile#L1
This one, yes.
But I don't think I have made a PR with the correct version yet?
I haven't built / tested it yet (my bandwidth is slow 🥲 and the Docker build on GitHub failed, so I have to do it locally and upload).
But if you can test it before me, here's my latest version: https://github.com/runpod-workers/worker-sglang/compare/main...supa-thibaud:worker-sglang:main
OK, this version works with updated CUDA/FlashInfer (I made a PR).
But this version is still slower (30s vs 3s) than the version on the pod.
Maybe you want to test what's making the difference; if it comes down to versions, you can try downgrading versions on your pod too.
I'll try to compare the pip freeze of each.
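Something like this can diff the two environments once both freezes are saved to files (a small sketch; the file names are placeholders):

# Diff two `pip freeze` outputs saved to text files (file names are placeholders).
pod = dict(line.strip().split("==", 1) for line in open("pod_freeze.txt") if "==" in line)
sls = dict(line.strip().split("==", 1) for line in open("serverless_freeze.txt") if "==" in line)

for pkg in sorted(set(pod) | set(sls)):
    if pod.get(pkg) != sls.get(pkg):
        print(f"{pkg}: pod={pod.get(pkg, '-')} serverless={sls.get(pkg, '-')}")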
Nice.
Or it might be other deps/packages too that affect performance, maybe from the Dockerfile or other scripts.
Can't say.
BTW, would it be possible to get some credits to make testing faster? (I stop/launch the instance every time to reduce costs, but it's time-consuming.)
I'm currently building a new Docker image with updated versions of some dependencies.
I'll launch whatever you need, maybe.
Ya I know, just saying the range of possibilities that may affect that is wide if it's from deps, etc.
OK, I updated the Dockerfile with a more recent version of Python.
But no gain!
My next hypothesis is this one:
on serverless, the different requests don't hit the same instance (process) of the SGLang server.
On the pod, the server is launched like this:
python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B-FP8 --context-length 8192 --host 0.0.0.0 --port 8000 --tp 2
and keeps running, so when a new request comes in, it's handled by this SGLang server.
On serverless, the code is:

import requests
import runpod
# SGlangEngine and OpenAIRequest live in the worker's engine.py
from engine import SGlangEngine, OpenAIRequest

# Initialize the engine and wait for the local SGLang server to come up
engine = SGlangEngine()
engine.start_server()
engine.wait_for_server()

async def async_handler(job):
    """Handle the requests asynchronously."""
    job_input = job["input"]
    print(f"JOB_INPUT: {job_input}")
    if job_input.get("openai_route"):
        # Proxy OpenAI-compatible routes to the local SGLang server
        openai_route, openai_input = job_input.get("openai_route"), job_input.get("openai_input")
        openai_request = OpenAIRequest()
        if openai_route == "/v1/chat/completions":
            async for chunk in openai_request.request_chat_completions(**openai_input):
                yield chunk
        elif openai_route == "/v1/completions":
            async for chunk in openai_request.request_completions(**openai_input):
                yield chunk
        elif openai_route == "/v1/models":
            models = await openai_request.get_models()
            yield models
    else:
        # Fallback: raw /generate route
        generate_url = f"{engine.base_url}/generate"
        headers = {"Content-Type": "application/json"}
        generate_data = {
            "text": job_input.get("prompt", ""),
            "sampling_params": job_input.get("sampling_params", {})
        }
        response = requests.post(generate_url, json=generate_data, headers=headers)
        if response.status_code == 200:
            yield response.json()
        else:
            yield {"error": f"Generate request failed with status code {response.status_code}", "details": response.text}

runpod.serverless.start({"handler": async_handler, "return_aggregate_stream": True})
I just compared the handlers for SGLang and vLLM; one big difference is that the SGLang one doesn't have the concurrency_modifier param.
Do the logs support that? Like a new SGLang just started?
In the logs, only one SGLang server seems to be started.
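For illustration, wiring a concurrency_modifier into this handler would look roughly like this (a sketch based on how I understand the vLLM worker does it; the limit of 10 is an arbitrary placeholder):

import runpod

def concurrency_modifier(current_concurrency: int) -> int:
    # Let the worker pull several jobs at once so SGLang can batch them.
    # The limit of 10 is an arbitrary placeholder.
    max_concurrency = 10
    return min(current_concurrency + 1, max_concurrency)

# async_handler is the handler shown above.
runpod.serverless.start({
    "handler": async_handler,
    "concurrency_modifier": concurrency_modifier,
    "return_aggregate_stream": True,
})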
OK, I know the reason now.
But I don't have any idea how to solve it; it's too tied to the RunPod serverless architecture for me.
Maybe @Tim aka NERDDISCO could help.
Serverless:
"message":"[gpu=0] Decode batch. #running-req: 1, #token: 2303, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0"
Finished running generator.
2024-08-23 10:50:19.804 [96xlptgpmheem3] [info] _client.py :1026 2024-08-23 01:50:19,803 HTTP Request: POST http://0.0.0.0:30000/v1/chat/completions "HTTP/1.1 200 OK"
2024-08-23 10:50:19.803 [96xlptgpmheem3] [info] INFO: 127.0.0.1:55604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-08-23 10:50:19.635 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2294, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
2024-08-23 10:50:18.544 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2254, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
2024-08-23 10:50:17.454 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2214, token usage: 0.01, gen throughput (token/s): 16.40, #queue-req: 0
2024-08-23 10:50:17.273 [96xlptgpmheem3] [info] [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 95.59%, #running-req: 0, #queue-req: 0
Pod:
2024-08-23T08:53:25.799936753Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 74.92%, #running-req: 0, #queue-req: 0
2024-08-23T08:53:25.856204372Z [gpu=0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 4416, cache hit rate: 83.26%, #running-req: 1, #queue-req: 0
2024-08-23T08:53:26.118457662Z [gpu=0] Prefill batch. #new-seq: 3, #new-token: 3, #cached-token: 6624, cache hit rate: 88.82%, #running-req: 3, #queue-req: 0
2024-08-23T08:53:26.148959295Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 89.94%, #running-req: 6, #queue-req: 0
2024-08-23T08:53:26.173732167Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 90.85%, #running-req: 7, #queue-req: 0
2024-08-23T08:53:26.418726222Z [gpu=0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 4416, cache hit rate: 92.25%, #running-req: 8, #queue-req: 0
[gpu=0] Decode batch. #running-req: 10, #token: 2799, token usage: 0.01, gen throughput (token/s): 397.74, #queue-req: 0
On the pod, requests are handled in batches.
On serverless, one after the other.
So it doesn't use its optimizations to decode/encode in batches; it's completely useless if we don't find the correct setup.
Thanks for pointing this out, I will check this out myself and see if we can somehow get around doing this. I remember that I heard something about this at some point, but I need to dig deeper.
Thanks a lot.
I don't think I can do much on my side.
I fixed my repo.
It's the version "working" with the most up-to-date versions (Python, SGLang, FlashInfer) to reproduce, as closely as possible, the SGLang Docker image used on the pod.
I think the issue is related to the RunPod serverless side of things (and/or may be related to async/await).
you can take a look at "my" version here:
https://github.com/supa-thibaud/worker-sglang
docker here:
supathibaud/sglang-runpod-serverless
Yeah, if you reproduce everything and it's still slow, it's probably serverless.
Yes. It's "good" news, the issue is known now.
=> I found the issue (no batching, just sequential) but it's impossible for me to fix.
ooh nice
Perhaps a little comparing to the vllm-worker can fix that.
Oh, you mean after adding the "concurrency" settings it's still sequential?
Exactly.
I see.
I tried that, without success yet.
If you find something, let me know.
Yup, I haven't tested it yet; that was before I saw your commit on the concurrency. I think it's on RunPod's side if you've tested it.
I compared the code between vLLM and SGLang and I don't see what could be hurting the SGLang one.
Hmm, so this means the time is spent in the queue, right? Not inference time.
Yes.
When I launch, I get
queued: 9 and in progress: 1
and never more than one in progress.
Ahh, I thought it was the inference time all this time, haha.
Yeah, RunPod needs to fix their queue to be faster at applying/assigning jobs.
But the issue isn't there with vLLM.
Yea.
@Thibaud so we were talking about this internally and we are already working on some things to make this happen. Would you mind opening a new issue on the worker-sglang repo with your findings? We would love to keep track of where this came from. You already did a great job on the debugging, and we would love to keep you in the loop on that.
Of course. I'll try to summarize all of that.
done!
Thank you very very much!!!
I hope you'll find a way to solve the remaining issue.
That would be really nice indeed.
I did some more tests today without much success... just burned a few dozen bucks.
Coding with async seems a little helpful (I think I got 2 "in progress" at the same time, but never more, even when flooding with dozens of light requests (OpenAI) on 2x H100 with 2 GPUs/worker and an 8B model).
Any news from your internal team?
@Thibaud it's still under investigation! Possible first result this week.
OK, let me know!
Still nothing, @Tim aka NERDDISCO?
Hi guys, can I run multiple nodes on RunPod? For example: two nodes with 4 GPUs on each node.
Yes, but private networking between them isn't available right now.
I ran with this command:
/bin/bash -c "python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-8B --context-length 8192 --quantization fp8 --host 0.0.0.0 --port 8000 --tp 8 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 0"
and got this error message: [W909 10:25:24.935506246 socket.cpp:697] [c10d] The IPv6 network addresses of (sgl-dev-0, 50000) cannot be retrieved (gai error: -2 - Name or service not known)
Running on Community Cloud.
Well, it's not the correct IP, is it?
We are totally still working on this, I made sure that you receive an update via the issue.