Brad Larson
Modular
•Created by Driven on 1/12/2025 in #questions
More NVIDIA GPU Support coming?
As Owen said, the NVIDIA GPUs we currently support officially are the ones we run regular CI and benchmarks against: A10, A100, L4, and L40. Unofficially, the Ampere and Lovelace architectures should be supported, and a number of people have run RTX 30XX and 40XX GPUs successfully with MAX as of the 24.6 release.
Make sure you're using at least NVIDIA driver version 555, and keep in mind that an 8-billion-parameter model with bfloat16 weights will need a GPU with 24 GB of memory.
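As a rough back-of-the-envelope check on where that figure comes from (a minimal sketch in Mojo; it counts only the weights, and the KV cache plus runtime overhead sit on top of this):

def main():
    # 8 billion parameters at 2 bytes per bfloat16 weight.
    var params: Float64 = 8.0e9
    var bytes_per_param: Float64 = 2.0
    var weight_gib = params * bytes_per_param / (1024.0 * 1024.0 * 1024.0)
    # Roughly 14.9 GiB for the weights alone, which is why a 24 GB card is the comfortable floor.
    print("weights alone: ~", weight_gib, "GiB")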
3 replies
Modular
•Created by ErrorLoadingUsername on 1/11/2025 in #questions
Hey there,
Regarding the unknown memory measurement: in the 24.6 release, the MAX Driver API couldn't yet read the memory statistics for a local CPU host. This has been added in the recent MAX nightlies, so if you switch to the nightly branch for our examples you should now get accurate memory measurements when running on CPU.
While MAX currently will distribute work across the CPU cores within a NUMA node, MAX graphs at present will only run on an individual NUMA node. You can manually dispatch multiple compute graphs on multiple NUMA nodes using the CPU(id: [node]) constructor for a Driver API Device. Multi-device distribution of a MAX graph is on our roadmap.
Apologies for not properly documenting this (we are working to do so), but the minimum NVIDIA driver version MAX supports is 555. That version or newer is needed for some of the PTX features used in MAX. If you upgrade to that version, you should be able to access your GPU via MAX.
3 replies
Modular
•Created by Noam Y on 12/10/2024 in #questions
When can we expect gpu kernels in mojo?
To circle back on this, we did release some initial simple experimental examples for how to write GPU kernels in the nightlies alongside the 24.6 release: https://forum.modular.com/t/experimental-examples-of-custom-cpu-gpu-operations-in-mojo/348 and plan to continue to iterate on those to show off more capabilities. I'll warn that this is an early preview, and we'll have much more to say about GPU programming in MAX via Mojo throughout early 2025.
5 replies
Modular
•Created by Sai Saurab Scorelabs on 12/21/2024 in #questions
Are there benchmarks available for llama 3.1 8b running on max?
There are a couple of parameters you can use to impact serving performance, at a tradeoff of memory, such as --max-cache-batch-size, which will set the maximum number of simultaneous batched requests that can be processed, and --max-length, which sets the context window size. A full readout of available parameters can be seen via magic run serve --help.
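For illustration only (the values below are arbitrary and the flag that selects the model is omitted; consult magic run serve --help for the exact set on your version):
magic run serve ... --max-cache-batch-size 16 --max-length 4096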
14 replies
Modular
•Created by Sai Saurab Scorelabs on 12/21/2024 in #questions
Are there benchmarks available for llama 3.1 8b running on max?
In that case, you can remove the --use-gpu flag from the second invocation I list above and it'll use the bfloat16 weights. I'll caution that the bfloat16 calculations on CPU are incompatible with ARM processors, and can only run on X86_64 systems. We also haven't optimized for bfloat16 on CPU; we've targeted GPUs for that datatype and quantized datatypes for CPUs.
14 replies
Modular
•Created by Sai Saurab Scorelabs on 12/21/2024 in #questions
Are there benchmarks available for llama 3.1 8b running on max?
@Sai Saurab Scorelabs For serving on CPU, I'd recommend using the q4_k quantized weights, which we have hosted on our Hugging Face repository and that you can serve using
The bfloat16 weights hosted at the main meta-llama/Llama-3.1-8B-Instruct repository are intended for running on GPU, and if you do want to serve those on GPU you can use
14 replies
Modular
•Created by Darkmatter on 12/18/2024 in #questions
How should I be loading the `get_scalar_from_managed_tensor_slice` kernel?
Are you setting the environment variable MODULAR_ONLY_USE_NEW_EXTENSIBILITY_API to true when running this? The Magic commands do this in the current examples, and if that's missing you might not be going down the right path to support a custom op with our new extensibility API.
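For example, when launching the example directly rather than through the Magic commands, the variable can be exported in the shell first:
export MODULAR_ONLY_USE_NEW_EXTENSIBILITY_API=true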
6 replies
Modular
•Created by Manuel Saelices on 12/17/2024 in #questions
Tried the new Max custom_ops examples with my RTX 3050 and using CPU
I started a Discourse thread to at least provide a place to list officially supported hardware and to discuss anything beyond that list which people have found to work: https://forum.modular.com/t/nvidia-hardware-support-in-max-24-6/340
17 replies
Modular
•Created by Manuel Saelices on 12/17/2024 in #questions
Tried the new Max custom_ops examples with my RTX 3050 and using CPU
We've updated the system requirements (Linux tab) with the officially supported GPUs: https://docs.modular.com/max/faq/#system-requirements . We can open up a Discourse thread for more informal discussions about GPU support.
17 replies
Modular
•Created by Manuel Saelices on 12/17/2024 in #questions
Tried the new Max custom_ops examples with my RTX 3050 and using CPU
I will say that the A10, A100, L4, and L40 are our initial officially supported NVIDIA GPUs. Other Ampere and newer GPUs may work, but they're "use at your own risk" in terms of support right now.
17 replies
Modular
•Created by Manuel Saelices on 12/17/2024 in #questions
Tried the new Max custom_ops examples with my RTX 3050 and using CPU
We should support Ampere and newer GPUs (with the possible exception of the Jetson Orin), and the RTX 3050 should fall in the sm_80 CUDA capabilities that we support. If the GPU was found, but was an older CUDA architecture than we support, you'd get a different error message. It seems like it's somehow not finding the GPU at all.
17 replies
Modular
•Created by staycia930 on 7/2/2024 in #questions
I do not know why the output is like this!
In the first, you've provided a type to the name argument (String), whereas in the second you've left name untyped. In the latter case, Mojo will default to object for the argument in a def function. This causes the slightly different printing behavior between the two types.
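A minimal sketch of the difference (hypothetical functions, not the original poster's code; the untyped argument is wrapped in object and therefore prints through object's generic formatting):

def greet_typed(name: String):
    print(name)  # name is a String here

def greet_untyped(name):
    print(name)  # with no annotation in a def, name defaults to object

def main():
    greet_typed("Mojo")
    greet_untyped("Mojo")  # goes through object, so the printed form differs slightly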
3 replies
Modular
•Created by sa-code on 6/11/2024 in #questions
Importing package in test
There is an open feature request to at least address the need for the -I . import: https://github.com/modularml/mojo/issues/2916 , although the LSP issues wouldn't be covered by that.
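In the meantime, that flag can be passed explicitly when running a file that imports the local package, e.g. mojo run -I . my_test.mojo (the filename here is just illustrative).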
3 replies
Modular
•Created by noahlt on 6/8/2024 in #questions
list of pre-implemented models?
There are a few different ways to define a model for inference via MAX: in TorchScript, in ONNX, or by constructing it in Mojo via the Graph API. We show several examples of TorchScript and ONNX models here: https://github.com/modularml/max/tree/main/examples/inference , which currently include BERT, Mistral 7B, ResNet-50, Stable Diffusion, and YOLOv8.
New in 24.4 are end-to-end pipelines that we've defined in Mojo and that use the MAX Graph API to construct the computational graph: https://github.com/modularml/max/tree/main/examples/graph-api/pipelines . We're referring to them as pipelines because the idea is that you can define all pre- and post-processing in Mojo as well (such as the tokenizer used in Llama 3) and easily incorporate them into a larger Mojo application. We've seeded this group with a few representative pipelines, and Llama 3 is the lead example among those.
We're extremely interested in having the community build upon these, as well as hearing what you'd like to see as additional examples, so please let us know how we can make this a better resource. We plan to regularly expand these examples.
3 replies