Are there benchmarks available for Llama 3.1 8B running on MAX?

@ModularBot I am trying to benchmark the performance of MAX on CPU. Are there any available benchmarks, or is there code I can use to run the benchmarks myself? I want to run inference on the Llama 3.1 8B model.
Ehsan M. Kermani (Modular)
A lot was improved on the CPU front too when 24.5 was released a few months back (refer to https://www.modular.com/blog/max-24-5-with-sota-cpu-performance-for-llama-3-1), but the focus for the recent 24.6 release has been GPU: https://www.modular.com/blog/max-gpu-state-of-the-art-throughput-on-a-new-genai-platform
Sai Saurab Scorelabs
@Ehsan M. Kermani I am trying to run Modular MAX on CPU using
magic run serve --huggingface-repo-id=meta-llama/Llama-3.1-8B-Instruct
But I am getting this error:
Quantization encodings are not supported in safetensor format. Got: QuantizationEncoding.Q4_K
Could you please help me resolve this? Can I disable quantization?
Brad Larson (11h ago)
@Sai Saurab Scorelabs For serving on CPU, I'd recommend using the q4_k quantized weights, which we have hosted on our Hugging Face repository and that you can serve using
magic run serve --huggingface-repo-id=modularai/llama-3.1
The bfloat16 weights hosted at the main meta-llama/Llama-3.1-8B-Instruct repository are intended for running on GPU, and if you do want to serve those on GPU you can use
magic run serve --huggingface-repo-id=meta-llama/Llama-3.1-8B-Instruct --use-gpu --quantization-encoding bfloat16
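Once the server is up (either variant), you can do a quick smoke test against the OpenAI-compatible endpoint it exposes. This is just a rough sketch assuming the default address of http://localhost:8000 and that the model name matches the repo id you served; check the server's startup logs for the exact values:

import requests

# Hypothetical smoke test: send one chat completion to the local MAX server.
# Assumes the default OpenAI-compatible endpoint at http://localhost:8000/v1
# and that the model name is the Hugging Face repo id passed to magic run serve.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "modularai/llama-3.1",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])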
Sai Saurab Scorelabs
@Brad Larson But I want to benchmark performance on unquantized weights. Is there a way to do that on CPU? I am able to run the bfloat16 weights on CPU using IPEX. Is it not possible to do that using MAX?
Brad Larson (9h ago)
In that case, you can remove the --use-gpu flag from the second invocation I listed above and it'll use the bfloat16 weights. I'll caution that the bfloat16 calculations on CPU are incompatible with ARM processors and can only run on x86_64 systems. We also haven't optimized for bfloat16 on CPU; we've targeted GPUs for that datatype and quantized datatypes for CPUs.
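If the end goal is benchmarking, a simple starting point is to time requests against the running server and compute tokens per second from the usage field in the response. This is only a rough single-request sketch, not an official benchmark harness; it assumes the default endpoint on localhost:8000 and that the server reports token usage:

import time
import requests

URL = "http://localhost:8000/v1/chat/completions"  # assumed default MAX serve address
MODEL = "meta-llama/Llama-3.1-8B-Instruct"         # whichever repo id you served

def time_one_request(prompt, max_tokens=256):
    # Time a single non-streaming generation and report rough decode throughput.
    start = time.perf_counter()
    resp = requests.post(
        URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    usage = resp.json().get("usage", {})
    completion_tokens = usage.get("completion_tokens", max_tokens)
    print(f"{elapsed:.2f}s total, ~{completion_tokens / elapsed:.1f} generated tokens/s")

time_one_request("Explain the difference between bfloat16 and float32 weights.")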
ModularBot (9h ago)
Congrats @Brad Larson, you just advanced to level 3!
Sai Saurab Scorelabs
@Brad Larson Thanks a lot, that worked. Are there any other parameters I can configure, such as how many CPU cores to use or how much memory to allocate?
Brad Larson (3h ago)
There are a couple of parameters you can use to tune serving performance, at a tradeoff in memory, such as --max-cache-batch-size, which sets the maximum number of simultaneous batched requests that can be processed, and --max-length, which sets the context window size. A full readout of available parameters can be seen via magic run serve --help.
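For example, to trade memory for more concurrency with a smaller context window, you could run something like
magic run serve --huggingface-repo-id=modularai/llama-3.1 --max-cache-batch-size 16 --max-length 4096
(the values here are just illustrative; tune them to your hardware, and check magic run serve --help for the authoritative list of options).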