Are there benchmarks available for Llama 3.1 8B running on MAX?

@ModularBot I am trying to benchmark the performance of MAX on CPU. Are there any available benchmarks, or is there code I can use to run the benchmarks myself? I want to run inference on the Llama 3.1 8B model.
Ehsan M. Kermani (Modular)
A lot was improved on the CPU front too when 24.5 was released a few months back (refer to https://www.modular.com/blog/max-24-5-with-sota-cpu-performance-for-llama-3-1), but the focus for the recent 24.6 release has been GPU: https://www.modular.com/blog/max-gpu-state-of-the-art-throughput-on-a-new-genai-platform
Sai Saurab Scorelabs
@Ehsan M. Kermani I am trying to run Modular MAX on CPU using
magic run serve --huggingface-repo-id=meta-llama/Llama-3.1-8B-Instruct
But I am getting this error:
Quantization encodings are not supported in safetensor format. Got: QuantizationEncoding.Q4_K
Could you please help me resolve this? Can I disable quantization?
Brad Larson (11h ago)
@Sai Saurab Scorelabs For serving on CPU, I'd recommend using the q4_k quantized weights, which we have hosted on our Hugging Face repository and that you can serve using
magic run serve --huggingface-repo-id=modularai/llama-3.1
The bfloat16 weights hosted at the main meta-llama/Llama-3.1-8B-Instruct repository are intended for running on GPU, and if you do want to serve those on GPU you can use
magic run serve --huggingface-repo-id=meta-llama/Llama-3.1-8B-Instruct --use-gpu --quantization-encoding bfloat16
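Once the server is up (either variant), you can do a quick smoke test against the OpenAI-compatible endpoint it exposes. This is just a rough sketch assuming the default address of http://localhost:8000 and that the model name matches the repo id you served; check the server's startup logs for the exact values:

import requests

# Hypothetical smoke test: send one chat completion to the local MAX server.
# Assumes the default OpenAI-compatible endpoint at http://localhost:8000/v1
# and that the model name is the Hugging Face repo id passed to magic run serve.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "modularai/llama-3.1",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])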
Sai Saurab Scorelabs
@Brad Larson But I want to benchmark performance on unquantized weights. Is there a way to do that on CPU? I am able to run the bfloat16 weights on CPU using IPEX. Is it not possible to do that using MAX?
Brad Larson (9h ago)
In that case, you can remove the --use-gpu flag from the second invocation I listed above and it'll use the bfloat16 weights. I'll caution that the bfloat16 calculations on CPU are incompatible with ARM processors and can only run on x86_64 systems. We also haven't optimized for bfloat16 on CPU; we've targeted GPUs for that datatype and quantized datatypes for CPUs.
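If the end goal is benchmarking, a simple starting point is to time requests against the running server and compute tokens per second from the usage field in the response. This is only a rough single-request sketch, not an official benchmark harness; it assumes the default endpoint on localhost:8000 and that the server reports token usage:

import time
import requests

URL = "http://localhost:8000/v1/chat/completions"  # assumed default MAX serve address
MODEL = "meta-llama/Llama-3.1-8B-Instruct"         # whichever repo id you served

def time_one_request(prompt, max_tokens=256):
    # Time a single non-streaming generation and report rough decode throughput.
    start = time.perf_counter()
    resp = requests.post(
        URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    usage = resp.json().get("usage", {})
    completion_tokens = usage.get("completion_tokens", max_tokens)
    print(f"{elapsed:.2f}s total, ~{completion_tokens / elapsed:.1f} generated tokens/s")

time_one_request("Explain the difference between bfloat16 and float32 weights.")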
ModularBot (9h ago)
Congrats @Brad Larson, you just advanced to level 3!
Sai Saurab Scorelabs
@Brad Larson Thanks a lot, that worked. Are there any other parameters I can configure, such as how many CPU cores to use or how much memory to allocate?
Brad Larson (3h ago)
There are a couple of parameters you can use to tune serving performance, at a tradeoff in memory, such as --max-cache-batch-size, which sets the maximum number of simultaneous batched requests that can be processed, and --max-length, which sets the context window size. A full readout of available parameters can be seen via magic run serve --help.
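For example, to trade memory for more concurrency with a smaller context window, you could run something like
magic run serve --huggingface-repo-id=modularai/llama-3.1 --max-cache-batch-size 16 --max-length 4096
(the values here are just illustrative; tune them to your hardware, and check magic run serve --help for the authoritative list of options).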