Settings to reduce delay time using sglang for 4-bit quantized models?
I'm deploying a 4-bit AWQ quantized model: casperhansen/llama-3.3-70b-instruct-awq
The delay time for parallel requests increases exponentially when using the sglang template. What settings do I need to use to keep the delay time manageable?
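To put numbers on how the delay grows with parallel requests, a minimal measurement sketch along these lines may help. It is only an illustration: `fake_request` is a hypothetical stand-in that just sleeps, and would need to be replaced with a real call to the deployed endpoint; the concurrency levels and prompt are arbitrary assumptions.

```python
import asyncio
import statistics
import time


async def fake_request(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real HTTP call to the deployed endpoint.
    await asyncio.sleep(0.01)
    return "ok"


async def timed(send, prompt: str) -> float:
    # Time a single request from start to completion, in seconds.
    start = time.perf_counter()
    await send(prompt)
    return time.perf_counter() - start


async def measure(send, concurrency: int, prompt: str = "Hello") -> dict:
    # Fire `concurrency` requests at once and collect per-request delay.
    latencies = await asyncio.gather(
        *(timed(send, prompt) for _ in range(concurrency))
    )
    return {
        "concurrency": concurrency,
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
    }


def sweep(send, levels=(1, 2, 4, 8)) -> list:
    # Run one measurement per concurrency level.
    return [asyncio.run(measure(send, n)) for n in levels]


if __name__ == "__main__":
    for row in sweep(fake_request):
        print(row)
```

Comparing `mean_s` and `max_s` across the levels should show whether the delay grows roughly linearly with the number of parallel requests or faster than that.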
1 Reply
Which git repo or template are you using? Can you share the link?