Not getting 100s of req/sec serving Llama 3 70B with the default vLLM serverless template
I'm deploying Llama 3 70B without quantization on 2x80GB workers, but once I go past about 10 parallel requests the execution and delay times climb to 10-50 seconds per request. I'm not sure if I'm doing something wrong with my setup. I'm pretty much using the default vLLM template, only setting MAX_MODEL_LEN to 4096 and ENFORCE_EAGER to true.
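For reference, here is roughly what those template settings correspond to in plain vLLM. The model id, tensor_parallel_size=2 (inferred from the 2x80GB workers), and the prompt are assumptions for illustration, not the exact worker config:

```python
# Approximate offline-vLLM equivalent of the setup described above.
# Model id, tensor_parallel_size, and the prompt are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumed model id
    tensor_parallel_size=2,   # split the unquantized 70B across both 80GB GPUs
    max_model_len=4096,       # MAX_MODEL_LEN=4096 from the template
    enforce_eager=True,       # ENFORCE_EAGER=true, skips CUDA graph capture
)

outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Note that enforce_eager=True trades some decode throughput for lower startup memory/time, which may matter when chasing high req/sec.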