Brad Larson
Modular
•Created by Driven on 1/12/2025 in #questions
More NVIDIA GPU Support coming?
As Owen said, the NVIDIA GPUs we currently support officially are the ones we run regular CI and benchmarks against: A10, A100, L4, and L40. Unofficially, the Ampere and Lovelace architectures should be supported, and a number of people have run RTX 30XX and 40XX GPUs successfully with MAX as of the 24.6 release.
Make sure you're using at least NVIDIA driver version 555, and keep in mind that an 8-billion-parameter model with bfloat16 weights will need a GPU with 24 GB of memory.
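As a rough back-of-the-envelope check on where that figure comes from (a minimal sketch in Mojo; it counts only the weights, and the KV cache plus runtime overhead sit on top of this):

def main():
    # 8 billion parameters at 2 bytes per bfloat16 weight.
    var params: Float64 = 8.0e9
    var bytes_per_param: Float64 = 2.0
    var weight_gib = params * bytes_per_param / (1024.0 * 1024.0 * 1024.0)
    # Roughly 14.9 GiB for the weights alone, which is why a 24 GB card is the comfortable floor.
    print("weights alone: ~", weight_gib, "GiB")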
3 replies
Modular
•Created by ErrorLoadingUsername on 1/11/2025 in #questions
Hey there,
Regarding the unknown memory measurement: in the 24.6 release, the MAX Driver API couldn't yet read the memory statistics for a local CPU host. This has been added in the recent MAX nightlies, so if you switch to the nightly branch for our examples you should now get accurate memory measurements when running on CPU.
While MAX currently will distribute work across the CPU cores within a NUMA node, MAX graphs at present will only run on an individual NUMA node. You can manually dispatch multiple compute graphs on multiple NUMA nodes using the CPU(id: [node]) constructor for a Driver API Device. Multi-device distribution of a MAX graph is on our roadmap.
Apologies for not properly documenting this (we are working to do so), but the minimum NVIDIA driver version MAX supports is 555. That version or newer is needed for some of the PTX features used in MAX. If you upgrade to that version, you should be able to access your GPU via MAX.
3 replies
Modular
•Created by Noam Y on 12/10/2024 in #questions
When can we expect gpu kernels in mojo?
To circle back on this, we did release some initial simple experimental examples for how to write GPU kernels in the nightlies alongside the 24.6 release: https://forum.modular.com/t/experimental-examples-of-custom-cpu-gpu-operations-in-mojo/348 and plan to continue to iterate on those to show off more capabilities. I'll warn that this is an early preview, and we'll have much more to say about GPU programming in MAX via Mojo throughout early 2025.
5 replies
Modular
•Created by Sai Saurab Scorelabs on 12/21/2024 in #questions
Are there benchmarks available for llama 3.1 8b running on max?
There are a couple of parameters you can use to impact serving performance, at a tradeoff of memory, such as --max-cache-batch-size, which will set the maximum number of simultaneous batched requests that can be processed, and --max-length, which sets the context window size. A full readout of available parameters can be seen via magic run serve --help.
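For illustration only (the values below are arbitrary and the flag that selects the model is omitted; consult magic run serve --help for the exact set on your version):
magic run serve ... --max-cache-batch-size 16 --max-length 4096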
14 replies
Modular
•Created by Sai Saurab Scorelabs on 12/21/2024 in #questions
Are there benchmarks available for llama 3.1 8b running on max?
In that case, you can remove the --use-gpu flag from the second invocation I list above and it'll use the bfloat16 weights. I'll caution that the bfloat16 calculations on CPU are incompatible with ARM processors, and can only run on X86_64 systems. We also haven't optimized for bfloat16 on CPU; we've targeted GPUs for that datatype and quantized datatypes for CPUs.
14 replies
Modular
•Created by Sai Saurab Scorelabs on 12/21/2024 in #questions
Are there benchmarks available for llama 3.1 8b running on max?
@Sai Saurab Scorelabs For serving on CPU, I'd recommend using the q4_k quantized weights, which we have hosted on our Hugging Face repository and that you can serve using
The bfloat16 weights hosted at the main meta-llama/Llama-3.1-8B-Instruct repository are intended for running on GPU, and if you do want to serve those on GPU you can use
14 replies
Modular
•Created by Darkmatter on 12/18/2024 in #questions
How should I be loading the `get_scalar_from_managed_tensor_slice` kernel?
Are you setting the environment variable MODULAR_ONLY_USE_NEW_EXTENSIBILITY_API to true when running this? The Magic commands do this in the current examples, and if that's missing you might not be going down the right path to support a custom op with our new extensibility API.
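For example, when launching the example directly rather than through the Magic commands, the variable can be exported in the shell first:
export MODULAR_ONLY_USE_NEW_EXTENSIBILITY_API=true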
6 replies
Modular
•Created by Manuel Saelices on 12/17/2024 in #questions
Tried the new Max custom_ops examples with my RTX 3050 and using CPU
I started a Discourse thread to at least provide a place to list officially supported hardware and to discuss anything beyond that list which people have found to work: https://forum.modular.com/t/nvidia-hardware-support-in-max-24-6/340
17 replies
Modular
•Created by Manuel Saelices on 12/17/2024 in #questions
Tried the new Max custom_ops examples with my RTX 3050 and using CPU
We've updated the system requirements (Linux tab) with the officially supported GPUs: https://docs.modular.com/max/faq/#system-requirements . We can open up a Discourse thread for more informal discussions about GPU support.
17 replies
Modular
•Created by Manuel Saelices on 12/17/2024 in #questions
Tried the new Max custom_ops examples with my RTX 3050 and using CPU
I will say that the A10, A100, L4, and L40 are our initial officially supported NVIDIA GPUs. Other Ampere and newer GPUs may work, but they're "use at your own risk" in terms of support right now.
17 replies
Modular
•Created by Manuel Saelices on 12/17/2024 in #questions
Tried the new Max custom_ops examples with my RTX 3050 and using CPU
We should support Ampere and newer GPUs (with the possible exception of the Jetson Orin), and the RTX 3050 should fall in the sm_80 CUDA capabilities that we support. If the GPU was found, but was an older CUDA architecture than we support, you'd get a different error message. It seems like it's somehow not finding the GPU at all.
17 replies
Modular
•Created by staycia930 on 7/2/2024 in #questions
I do not know why the output is like this!
In the first, you've provided a type to the name argument (String), whereas in the second you've left name untyped. In the latter case, Mojo will default to object for the argument in a def function. This causes the slightly different printing behavior between the two types.
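A minimal sketch of the difference (hypothetical functions, not the original poster's code; the untyped argument is wrapped in object and therefore prints through object's generic formatting):

def greet_typed(name: String):
    print(name)  # name is a String here

def greet_untyped(name):
    print(name)  # with no annotation in a def, name defaults to object

def main():
    greet_typed("Mojo")
    greet_untyped("Mojo")  # goes through object, so the printed form differs slightly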
3 replies
Modular
•Created by sa-code on 6/11/2024 in #questions
Importing package in test
There is an open feature request to at least address the need for the -I . import: https://github.com/modularml/mojo/issues/2916 , although the LSP issues wouldn't be covered by that.
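In the meantime, that flag can be passed explicitly when running a file that imports the local package, e.g. mojo run -I . my_test.mojo (the filename here is just illustrative).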
3 replies
Modular
•Created by noahlt on 6/8/2024 in #questions
list of pre-implemented models?
There are a few different ways to define a model for inference via MAX: in TorchScript, in ONNX, or by constructing it in Mojo via the Graph API. We show several examples of TorchScript and ONNX models here: https://github.com/modularml/max/tree/main/examples/inference , which currently include BERT, Mistral 7B, ResNet-50, Stable Diffusion, and YOLOv8.
New in 24.4 are end-to-end pipelines that we've defined in Mojo and that use the MAX Graph API to construct the computational graph: https://github.com/modularml/max/tree/main/examples/graph-api/pipelines . We're referring to them as pipelines because the idea is that you can define all pre- and post-processing in Mojo as well (such as the tokenizer used in Llama 3) and easily incorporate them into a larger Mojo application. We've seeded this group with a few representative pipelines, and Llama 3 is the lead example among those.
We're extremely interested in having the community build upon these, as well as hearing what you'd like to see as additional examples, so please let us know how we can make this a better resource. We plan to regularly expand these examples.
3 replies