Not getting 100s of req/sec serving Llama 3 70B with the default vLLM serverless template
I'm deploying Llama 3 70B without quantization on 2x80GB workers, but once I go past about 10 parallel requests the execution and delay times climb to 10-50 seconds per request. I'm not sure if I'm doing something wrong with my setup. I'm pretty much using the default vLLM template, only setting MAX_MODEL_LEN to 4096 and ENFORCE_EAGER to true.
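For reference, here is roughly what those template settings correspond to in plain vLLM. The model id, tensor_parallel_size=2 (inferred from the 2x80GB workers), and the prompt are assumptions for illustration, not the exact worker config:

```python
# Approximate offline-vLLM equivalent of the setup described above.
# Model id, tensor_parallel_size, and the prompt are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumed model id
    tensor_parallel_size=2,   # split the unquantized 70B across both 80GB GPUs
    max_model_len=4096,       # MAX_MODEL_LEN=4096 from the template
    enforce_eager=True,       # ENFORCE_EAGER=true, skips CUDA graph capture
)

outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Note that enforce_eager=True trades some decode throughput for lower startup memory/time, which may matter when chasing high req/sec.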