Are there benchmarks available for Llama 3.1 8B running on MAX?
@ModularBot I am trying to benchmark the performance of MAX on CPU. Are there any benchmarks available? Or is there code I can use to run the benchmarks myself? I want to run inference with the Llama 3.1 8B model.
8 Replies
A lot has improved on the CPU front since 24.5 was released a few months back (refer to https://www.modular.com/blog/max-24-5-with-sota-cpu-performance-for-llama-3-1), but the focus for the recent 24.6 release has been GPUs: https://www.modular.com/blog/max-gpu-state-of-the-art-throughput-on-a-new-genai-platform
@Ehsan M. Kermani I am trying to run Modular MAX on CPU using
But I am getting an error.
Could you please help me resolve this?
Can I disable quantization?
@Sai Saurab Scorelabs For serving on CPU, I'd recommend using the `q4_k` quantized weights, which we have hosted on our Hugging Face repository and which you can serve using
The bfloat16 weights hosted at the main `meta-llama/Llama-3.1-8B-Instruct` repository are intended for running on GPU, and if you do want to serve those on GPU you can use
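As a rough sketch (the repo ids and flag names here are assumptions, not the exact commands; check `magic run serve --help` for the actual options), the two invocations might look something like:

```sh
# Sketch only -- repo ids and flag names are assumptions, verify with `magic run serve --help`.

# 1) CPU serving with Modular's q4_k quantized weights (hypothetical repo id)
magic run serve \
  --huggingface-repo-id modularai/llama-3.1 \
  --quantization-encoding q4_k

# 2) GPU serving with the original bfloat16 weights
magic run serve \
  --huggingface-repo-id meta-llama/Llama-3.1-8B-Instruct \
  --use-gpu
```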
@Brad Larson But I want to benchmark performance on unquantized weights. Is there a way to do that on CPU? I am able to run bfloat16 weights using IPEX on CPU. Is it not possible to do that using MAX?
In that case, you can remove the `--use-gpu` flag from the second invocation I list above and it'll use the bfloat16 weights. I'll caution that the bfloat16 calculations on CPU are incompatible with ARM processors and can only run on x86_64 systems. We also haven't optimized for bfloat16 on CPU; we've targeted GPUs for that datatype and quantized datatypes for CPUs.
@Brad Larson Thanks a lot. That worked.
Are there any other parameters that I can configure? For example, how many CPU cores to use, how much memory to allocate, etc.?
There are a couple of parameters you can use to affect serving performance, at a tradeoff of memory, such as `--max-cache-batch-size`, which sets the maximum number of simultaneous batched requests that can be processed, and `--max-length`, which sets the context window size. A full readout of available parameters can be seen via `magic run serve --help`.
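To tie this back to the original benchmarking question, a rough sketch combining those flags and timing a request against the server (the flag values, port, and OpenAI-compatible endpoint path are assumptions; verify with `magic run serve --help` and the serving docs):

```sh
# Sketch only -- flag values, port, and endpoint path are assumptions.
magic run serve \
  --huggingface-repo-id modularai/llama-3.1 \
  --quantization-encoding q4_k \
  --max-cache-batch-size 16 \
  --max-length 2048 &

# Rough single-request latency check against the (assumed) OpenAI-compatible endpoint
time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "modularai/llama-3.1", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 128}'
```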