RunPod4mo ago
Thibaud

SGLang

SGLang works very well in a pod but is impossible to run in serverless: the API route keeps returning error 404. I use the exact same config (Docker image, command line, port) in the pod and in serverless.
85 Replies
NERDDISCO
NERDDISCO4mo ago
@Thibaud have you tried https://github.com/runpod-workers/worker-sglang by any chance?
Thibaud
ThibaudOP4mo ago
Yes. It launches, but sending a request from the RunPod UI or an OpenAI call gives zero results.
NERDDISCO
NERDDISCO4mo ago
Would you mind sending me a screenshot of a deployed worker with the request + result? Then I will open an issue on our repo and get an engineer looking at the problem.
Thibaud
ThibaudOP4mo ago
Some screenshots, I hope they can help.
(4 screenshots attached)
Thibaud
ThibaudOP4mo ago
I don't get any results, just nothing happens. Maybe I missed something in my serverless config. Do you need anything else?
NERDDISCO
NERDDISCO4mo ago
Nope this looks fine, thank you very much!
Thibaud
ThibaudOP4mo ago
Thanks. I hope your team will find a solution, or a tutorial if the error is between the keyboard and the chair. Any news about that?
NERDDISCO
NERDDISCO4mo ago
nope not yet, sorry! Will keep you updated once we have something 🙏
NERDDISCO
NERDDISCO4mo ago
Hey @Thibaud, we just released a preview of the worker, would you mind testing this out to see if this has the same problems? https://hub.docker.com/r/runpod/worker-sglang/tags
Thibaud
ThibaudOP4mo ago
of course. I'll do that right now
NERDDISCO
NERDDISCO4mo ago
awesome, thank you very much!!!
Thibaud
ThibaudOP4mo ago
So it's almost OK! But maybe the last issue I have is a bad configuration. TL;DR: I can ping individual instances but not the cluster.

My configuration:
- Container Image: runpod/worker-sglang:preview-cuda12.1.0
- Container Start Command: python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-8B --context-length 8192 --host 0.0.0.0 --port 8000
- Expose HTTP Ports: 8000

Once launched, if I click on the running instance, click "Connect" on the web UI and use the URL (something like https://xxx-8000.proxy.runpod.net/v1), it works. But I don't have anything like an OPENAI BASE URL (https://api.runpod.ai/v2/vllm-xxxx/openai/v1) to ping the cluster.
nerdylive
nerdylive4mo ago
What if you build the URL yourself: https://api.runpod.ai/v2/your-endpoint-id/openai/v1?
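(For reference, pointing the standard OpenAI client at that self-built URL would look roughly like this — a sketch; the endpoint ID, API key, and model name are placeholders:)

```python
from openai import OpenAI

# Sketch: endpoint ID and API key are placeholders, not real values.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/your-endpoint-id/openai/v1",
    api_key="YOUR_RUNPOD_API_KEY",  # RunPod API key, not an OpenAI key
)

response = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```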
Thibaud
ThibaudOP4mo ago
No, and none of the URLs work: error 401 each time.
(screenshot attached)
Thibaud
ThibaudOP4mo ago
I think the endpoint is not correctly "connected" to its instances.
nerdylive
nerdylive4mo ago
401 is Unauthorized though. Does /health also return 401? Maybe check if your API key is valid, and create a new endpoint.
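(A quick way to separate a key problem from a worker problem is to hit the endpoint's /health route directly with the same key — a sketch, assuming a placeholder endpoint ID; a 401 here too points at the key rather than the worker:)

```python
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "YOUR_RUNPOD_API_KEY"    # placeholder

resp = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
# A valid key should return 200 with worker/job counts; 401 means the key is rejected.
print(resp.status_code, resp.text)
```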
NERDDISCO
NERDDISCO4mo ago
@Thibaud thank you very much for testing this in depth, I will report this back to the team
Thibaud
ThibaudOP4mo ago
Tim, the "connection" error between serverless and the instances turned out to be an API key error; I'm using a new key now... But I have found another issue:
"OpenAIRequest.request_chat_completions() got an unexpected keyword argument 'temperature'"
"OpenAIRequest.request_chat_completions() got an unexpected keyword argument 'stop'"
NERDDISCO
NERDDISCO4mo ago
Ok, makes sense! How can we reproduce the error you are seeing?
nerdylive
nerdylive4mo ago
Are you using the OpenAI client? Yeah, maybe send the code.
Thibaud
ThibaudOP4mo ago
Give me 1 minute, I'll clean up my code.
NERDDISCO
NERDDISCO4mo ago
AHH so when you use temperature or any of the other params, it will fail?
Thibaud
ThibaudOP4mo ago
yes
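(For context, a call along these lines reproduces it — a sketch with a placeholder endpoint and model; the point is simply that temperature and stop get forwarded to the worker's request_chat_completions():)

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/your-endpoint-id/openai/v1",  # placeholder
    api_key="YOUR_RUNPOD_API_KEY",                                   # placeholder
)

# Passing sampling parameters like temperature/stop is what triggered the
# "unexpected keyword argument" errors on the worker side.
response = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    temperature=0.7,
    stop=["\n\n"],
)
print(response.choices[0].message.content)
```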
NERDDISCO
NERDDISCO4mo ago
ok awesome. I reported this to the team and based on their feedback this will either be fixed today or I will create an issue on GitHub to not lose focus. thanks for helping out!
Thibaud
ThibaudOP4mo ago
Thanks! The big part was finding the issue; solving it will be faster.
NERDDISCO
NERDDISCO4mo ago
We appreciate your debugging skills very much ❤️ @Thibaud May I ask what kind of use case you have for SGLang?
Thibaud
ThibaudOP4mo ago
Of course. I'm launching (for now in beta) a SaaS where users can talk to AI characters. vLLM works well but it's slower than SGLang; I can't find good (fast) settings to run a 70B model with vLLM. With SGLang (at least on the pod; I don't have data yet for serverless), it's a HUGE difference.
nerdylive
nerdylive4mo ago
Expect like 2-3x faster, some docs say.
NERDDISCO
NERDDISCO4mo ago
If you want to have a beta tester, then I would be happy to help!
Thibaud
ThibaudOP4mo ago
For now we're focusing our beta on French; as soon as we launch in English, yes!
nerdylive
nerdylive4mo ago
Ohh my bad, I misread.
Thibaud
ThibaudOP4mo ago
Btw, I did a PR on GitHub; it'll help your team @Tim aka NERDDISCO to continue on SGLang. I did a test:
- 2x H100 (2 GPUs / worker)
- 10 concurrent users (each doing 3 requests of 2000 tokens in / 200 out)

Serverless: worker-sglang
CONTEXT_LENGTH 8192
TENSOR_PARALLEL_SIZE 2
MODEL_PATH NousResearch/Hermes-3-Llama-3.1-70B-FP8
Average inference time: 30s

Pod: lmsysorg/sglang:latest
python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B-FP8 --context-length 8192 --host 0.0.0.0 --port 8000 --tp 2
Average inference time: 3s

(Tests done after the model was loaded in VRAM, of course.)
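(The load pattern above can be approximated with a small async driver like this sketch — concurrency, prompt, and endpoint are placeholders, not the exact benchmark script:)

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.runpod.ai/v2/your-endpoint-id/openai/v1",  # placeholder
    api_key="YOUR_RUNPOD_API_KEY",                                   # placeholder
)

async def user_session(requests_per_user: int = 3) -> list[float]:
    """One simulated user sending a few sequential requests."""
    durations = []
    for _ in range(requests_per_user):
        start = time.perf_counter()
        await client.chat.completions.create(
            model="NousResearch/Hermes-3-Llama-3.1-70B-FP8",
            messages=[{"role": "user", "content": "<~2000-token prompt goes here>"}],
            max_tokens=200,
        )
        durations.append(time.perf_counter() - start)
    return durations

async def main(concurrent_users: int = 10) -> None:
    # Fire all simulated users concurrently and average their request times.
    results = await asyncio.gather(*(user_session() for _ in range(concurrent_users)))
    all_times = [t for user in results for t in user]
    print(f"Average inference time: {sum(all_times) / len(all_times):.1f}s")

asyncio.run(main())
```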
nerdylive
nerdylive4mo ago
How much faster than vLLM in a pod? Wow, roughly 3 times faster on the average inference time.
NERDDISCO
NERDDISCO4mo ago
I just saw it, thank you VERY MUCH!!! Oh, so the performance on serverless is actually very, very bad compared to pods.
Thibaud
ThibaudOP4mo ago
Yes. I think I found the cause of that: the CUDA version is not the same and flashinfer is outdated.
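(A quick way to compare the two images is to print the relevant versions from inside each container — a sketch; the flashinfer distribution name is an assumption and varies by release channel, so both candidates are checked:)

```python
import importlib.metadata as md

import torch

print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)

# Distribution names are assumptions; flashinfer has shipped under
# slightly different names depending on how it was installed.
for pkg in ("sglang", "flashinfer", "flashinfer-python"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not found")
```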
NERDDISCO
NERDDISCO4mo ago
you mean in the image from RunPod?
Thibaud
ThibaudOP4mo ago
yes
nerdylive
nerdylive4mo ago
If you want, you can make a PR to update that 👍 Wait, doesn't the new RunPod vLLM or SGLang worker build in multiple CUDA versions too?
NERDDISCO
NERDDISCO4mo ago
The PR from @Thibaud already contains the correct CUDA version, I think we just need to make sure to also update flashinfer right?
nerdylive
nerdylive4mo ago
Oh, I haven't seen it yet, but yeah, that might be worth trying.
Thibaud
ThibaudOP4mo ago
No, my commit doesn't have the correct version yet; I'm trying it locally.
NERDDISCO
NERDDISCO4mo ago
Is this not the Dockerfile from the official image? https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile#L1
Thibaud
ThibaudOP4mo ago
This one, yes. But I don't think I have made a PR with the correct version yet? I haven't built it / tested it yet (my bandwidth is slow 🥲 and the Docker build on GitHub failed, so I have to do it locally and upload).
Thibaud
ThibaudOP4mo ago
GitHub: Comparing runpod-workers:main...supa-thibaud:main · runpod-workers/worker-sglang
Thibaud
ThibaudOP4mo ago
OK, this version works with updated CUDA/flashinfer (I made a PR). But this version is still slower (30s vs 3s) than the version on the pod.
nerdylive
nerdylive4mo ago
Maybe you want to test what's making the difference; if it's the versions, you can try downgrading the versions on your pod too.
Thibaud
ThibaudOP4mo ago
i'll try to compare the pip freeze of each
nerdylive
nerdylive4mo ago
Nice. Or it might be other deps/packages affecting performance too, maybe from the Dockerfile or other scripts.
Thibaud
ThibaudOP4mo ago
Can't say. Btw, would it be possible to have some credit to make tests faster? (I stop/launch the instance every time to reduce cost, but it's time-consuming.) I'm currently building a new Docker image with updated versions of some dependencies.
nerdylive
nerdylive4mo ago
I'll launch what you need maybe. Ya, I know, just saying the range of things that may affect that is wide if it's from deps, etc.
Thibaud
ThibaudOP4mo ago
OK, I updated the Dockerfile with a more recent version of Python, but no gain! My next hypothesis is this: on serverless, the different requests don't use the same instance (process) of the SGLang server.

On a pod, the server is launched like this and keeps running, so a new request is handled by that SGLang server:

python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-70B-FP8 --context-length 8192 --host 0.0.0.0 --port 8000 --tp 2

On serverless, the code is:

    # Initialize the engine
    engine = SGlangEngine()
    engine.start_server()
    engine.wait_for_server()

    async def async_handler(job):
        """Handle the requests asynchronously."""
        job_input = job["input"]
        print(f"JOB_INPUT: {job_input}")
        if job_input.get("openai_route"):
            openai_route, openai_input = job_input.get("openai_route"), job_input.get("openai_input")
            openai_request = OpenAIRequest()
            if openai_route == "/v1/chat/completions":
                async for chunk in openai_request.request_chat_completions(**openai_input):
                    yield chunk
            elif openai_route == "/v1/completions":
                async for chunk in openai_request.request_completions(**openai_input):
                    yield chunk
            elif openai_route == "/v1/models":
                models = await openai_request.get_models()
                yield models
        else:
            generate_url = f"{engine.base_url}/generate"
            headers = {"Content-Type": "application/json"}
            generate_data = {
                "text": job_input.get("prompt", ""),
                "sampling_params": job_input.get("sampling_params", {})
            }
            response = requests.post(generate_url, json=generate_data, headers=headers)
            if response.status_code == 200:
                yield response.json()
            else:
                yield {"error": f"Generate request failed with status code {response.status_code}", "details": response.text}

    runpod.serverless.start({"handler": async_handler, "return_aggregate_stream": True})

I just compared the handlers for SGLang and vLLM; one big difference is that the SGLang one doesn't have the concurrency_modifier param.
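(For comparison, the vLLM worker passes a concurrency_modifier to runpod.serverless.start so one worker can pull several jobs at once. A minimal sketch of wiring the same thing into the SGLang handler above — the value 50 is an arbitrary placeholder, not a tuned number:)

```python
import runpod

def adjust_concurrency(current_concurrency: int) -> int:
    # Allow a single worker to accept up to N jobs in parallel so SGLang
    # can batch them; 50 is a placeholder, not a tuned value.
    return 50

# async_handler is the handler defined in the worker code above.
runpod.serverless.start({
    "handler": async_handler,
    "concurrency_modifier": adjust_concurrency,
    "return_aggregate_stream": True,
})
```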
nerdylive
nerdylive4mo ago
Do the logs support that? Like a new SGLang just started?
Thibaud
ThibaudOP4mo ago
In the logs, only one SGLang server seems to be started. OK, I know the reason now, but I have no idea how to solve it; it's too tied to the RunPod serverless architecture for me. Maybe @Tim aka NERDDISCO could help.

Serverless:
"message":"[gpu=0] Decode batch. #running-req: 1, #token: 2303, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0" Finished running generator.
2024-08-23 10:50:19.804 [96xlptgpmheem3] [info] _client.py :1026 2024-08-23 01:50:19,803 HTTP Request: POST http://0.0.0.0:30000/v1/chat/completions "HTTP/1.1 200 OK"
2024-08-23 10:50:19.803 [96xlptgpmheem3] [info] INFO: 127.0.0.1:55604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-08-23 10:50:19.635 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2294, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
2024-08-23 10:50:18.544 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2254, token usage: 0.01, gen throughput (token/s): 36.67, #queue-req: 0
2024-08-23 10:50:17.454 [96xlptgpmheem3] [info] [gpu=0] Decode batch. #running-req: 1, #token: 2214, token usage: 0.01, gen throughput (token/s): 16.40, #queue-req: 0
2024-08-23 10:50:17.273 [96xlptgpmheem3] [info] [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 95.59%, #running-req: 0, #queue-req: 0

Pod:
2024-08-23T08:53:25.799936753Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 74.92%, #running-req: 0, #queue-req: 0
2024-08-23T08:53:25.856204372Z [gpu=0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 4416, cache hit rate: 83.26%, #running-req: 1, #queue-req: 0
2024-08-23T08:53:26.118457662Z [gpu=0] Prefill batch. #new-seq: 3, #new-token: 3, #cached-token: 6624, cache hit rate: 88.82%, #running-req: 3, #queue-req: 0
2024-08-23T08:53:26.148959295Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 89.94%, #running-req: 6, #queue-req: 0
2024-08-23T08:53:26.173732167Z [gpu=0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 2208, cache hit rate: 90.85%, #running-req: 7, #queue-req: 0
2024-08-23T08:53:26.418726222Z [gpu=0] Prefill batch. #new-seq: 2, #new-token: 2, #cached-token: 4416, cache hit rate: 92.25%, #running-req: 8, #queue-req: 0
[gpu=0] Decode batch. #running-req: 10, #token: 2799, token usage: 0.01, gen throughput (token/s): 397.74, #queue-req: 0

On the pod, requests are handled in batches; on serverless, one after the other. So it doesn't use its batched prefill/decode optimizations. It's completely useless if we don't find the correct setup.
NERDDISCO
NERDDISCO4mo ago
Thanks for pointing this out, I will check this out myself and see if we can somehow get around doing this. I remember that I heard something about this at some point, but I need to dig deeper
Thibaud
ThibaudOP4mo ago
Thanks a lot. I don't think I can do much on my side. I fixed my repo: it's the version "working" with the most up-to-date versions (Python, SGLang, flashinfer) to reproduce the SGLang Docker image used on the pod as closely as possible. I think the issue is related to the RunPod serverless side (and/or may be related to async/await). You can take a look at "my" version here: https://github.com/supa-thibaud/worker-sglang, Docker image here: supathibaud/sglang-runpod-serverless
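(One more possible contributor — an assumption, not something confirmed in the thread: the non-OpenAI /generate branch of the handler calls the blocking requests.post from inside an async handler, which stalls the event loop and can serialize jobs even when concurrency is allowed. A non-blocking variant could look like this sketch using httpx:)

```python
import httpx

async def generate(engine_base_url: str, job_input: dict) -> dict:
    """Sketch: non-blocking call to the SGLang /generate route."""
    payload = {
        "text": job_input.get("prompt", ""),
        "sampling_params": job_input.get("sampling_params", {}),
    }
    # Await the POST instead of blocking the event loop with requests.post,
    # so other in-flight jobs keep running while this one waits on SGLang.
    async with httpx.AsyncClient(timeout=None) as client:
        response = await client.post(f"{engine_base_url}/generate", json=payload)
    if response.status_code == 200:
        return response.json()
    return {
        "error": f"Generate request failed with status code {response.status_code}",
        "details": response.text,
    }
```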
nerdylive
nerdylive4mo ago
Yeah, if you reproduce everything and it's still slow, it's probably serverless.
Thibaud
ThibaudOP4mo ago
Yes. It's "good" news: the issue is known now. I found the cause (no batching, only sequential), but it's impossible for me to fix.
nerdylive
nerdylive4mo ago
Ooh nice, perhaps a little comparing against the vllm-worker can fix that. Oh, you mean after adding the "concurrency" settings it's still sequential?
Thibaud
ThibaudOP4mo ago
exactly
nerdylive
nerdylive4mo ago
Ic
Thibaud
ThibaudOP4mo ago
I tried that, without success yet. If you find something, let me know.
nerdylive
nerdylive3mo ago
Yup, I haven't tested it yet; that was before I saw your commit on the concurrency. I think it's on RunPod's side if you've tested it.
Thibaud
ThibaudOP3mo ago
I compared the code between vLLM and SGLang and I don't see what could be hurting the SGLang one.
nerdylive
nerdylive3mo ago
Hmm, so this means it's long in the queue, right? Not inference time?
Thibaud
ThibaudOP3mo ago
yes
Thibaud
ThibaudOP3mo ago
When I launched, I had 9 queued and 1 in progress — never more than one in progress.
(screenshot attached)
nerdylive
nerdylive3mo ago
Ahh, I thought it was about the inference time all this time hahah. Yeah, RunPod needs to fix their queue to be faster at assigning jobs.
Thibaud
ThibaudOP3mo ago
But the issue doesn't happen with vLLM.
nerdylive
nerdylive3mo ago
yea
NERDDISCO
NERDDISCO3mo ago
@Thibaud so we were talking about this internally and we are already working on some things to make this happen. Would you mind opening a new issue on the worker-sglang repo with your findings? We would love to keep track of where this came from. You did a great job on the debugging already and we would love to keep you in the loop on that.
Thibaud
ThibaudOP3mo ago
Of course. I'll try to summarize all of that. Done!
NERDDISCO
NERDDISCO3mo ago
Thank you very very much!!!
Thibaud
ThibaudOP3mo ago
i hope you'll find a way to solve the remaining issue
NERDDISCO
NERDDISCO3mo ago
that would be really nice indeed.
Thibaud
ThibaudOP3mo ago
I did some more tests today without big success... just burned a few dozen bucks. Coding with async seems a little helpful (I think I got 2 "in progress" at the same time, but never more, even when flooding with dozens of light requests (OpenAI) on 2x H100 with 2 GPUs/worker and an 8B model). Any news from your internal team?
NERDDISCO
NERDDISCO3mo ago
@Thibaud it is still under investigation! Possible first result this week.
Thibaud
ThibaudOP3mo ago
OK, let me know! Still nothing, @Tim aka NERDDISCO?
jackson
jackson3mo ago
Hi guys, can I run multiple nodes on RunPod? Example: two nodes with 4 GPUs on each node.
nerdylive
nerdylive3mo ago
Yes, but private networking between them is not available right now.
jackson
jackson3mo ago
I ran with the command:

/bin/bash -c "python3 -m sglang.launch_server --model-path NousResearch/Hermes-3-Llama-3.1-8B --context-length 8192 --quantization fp8 --host 0.0.0.0 --port 8000 --tp 8 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 0"

and got the error message:

[W909 10:25:24.935506246 socket.cpp:697] [c10d] The IPv6 network addresses of (sgl-dev-0, 50000) cannot be retrieved (gai error: -2 - Name or service not known)

This was run on Community Cloud.
nerdylive
nerdylive3mo ago
Well, it's not the correct IP, is it?
NERDDISCO
NERDDISCO3mo ago
We are definitely still working on this; I made sure that you'll receive an update via the issue.