RunPod•4mo ago

Settings to reduce delay time using sglang for 4bit quantized models?

I'm deploying a 4bit AWQ quantized model: casperhansen/llama-3.3-70b-instruct-awq. The delay time for parallel requests increases exponentially when using the sglang template. What settings do I need to use to keep the delay time manageable?

1 Reply

flash-singh•4mo ago

which git repo or template are you using? can you share the link?


