serve: command not found
I tried to run the command below; any ideas on the issue?
❯ magic run serve --huggingface-repo-id deepseek-ai/DeepSeek-R1
serve: command not found
Are you looking to follow this tutorial? https://docs.modular.com/max/tutorials/deploy-pytorch-llm
If so, make sure you cloned the max repository and are in the max/pipelines/python/ directory. The magic run serve command is defined in that directory here: https://github.com/modular/max/blob/main/pipelines/python/pixi.toml#L13. I suspect you are not in the correct directory, so magic cannot find that command.
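For reference, the full sequence is roughly this (a sketch assuming a fresh clone; adjust paths to your setup):
❯ git clone https://github.com/modular/max.git   # clone the max repository
❯ cd max/pipelines/python                        # directory whose pixi.toml defines the serve task
❯ magic run serve --huggingface-repo-id deepseek-ai/DeepSeek-R1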
Also, I'll point out that the base DeepSeek-R1 model is a 685-billion-parameter model and will need multiple GPUs to run. To get the best performance for serving on MAX, I highly recommend using
--huggingface-repo-id deepseek-ai/DeepSeek-R1-Distill-Llama-8B
which will use our highly optimized LlamaForCausalLM architecture with the 8B distilled version of DeepSeek R1.
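For example, the earlier command would become something like:
❯ magic run serve --huggingface-repo-id deepseek-ai/DeepSeek-R1-Distill-Llama-8B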
Sheesh, the community is already benchmarking newly released models :mojonightly: I want to benchmark the open-source Grok 1 model on GitHub
The repo is posted on GitHub if anyone missed its release 🤔
@Josh Peterson Ah thanks that makes sense.
@Brad Larson Hmm ok 🤙
@Robert 🤙
Out of curiosity, is there a specific reason you want to use that model over other LLMs? As a 314B mixture-of-experts model implemented in JAX, it could be a challenging model to get up and running. We have MAX Graph implementations of LlamaForCausalLM-family models, as well as MistralForCausalLM and MPTForCausalLM, and we're adding more all the time. Multi-GPU support is on our roadmap, but it's not currently generally available for MAX Graph models.
Hey Brad 👋 I might get remote access to my school’s HPC cluster soon; my request is still pending. But I’m open to testing other frameworks. Another reason is to analyze industry-built models. I recently learned about quantizing models down to smaller sizes. But no strong preference, honestly 👍
The great thing about LLMs is that they're very much scalable: capabilities increase with model size, but it's more of a smooth spectrum than hard cutoffs. I always try to start small with locally manageable systems and scale up from there. There's a lot you can learn about running LLMs from serving a quantized Llama 3.2 1B on a local CPU, then Llama 3.1 8B in bfloat16 on GPU, then more on the way up from there. MAX has great support today for the first two items on that sequence, with more coming in the near future.
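Concretely, with the same serve pattern from earlier in the thread, that progression might look something like this (the repo IDs are my own illustrative picks, and the quantized CPU run needs extra options that vary by MAX version, so check the docs):
❯ magic run serve --huggingface-repo-id meta-llama/Llama-3.2-1B-Instruct   # start small, e.g. on a local CPU
❯ magic run serve --huggingface-repo-id meta-llama/Llama-3.1-8B-Instruct   # then 8B in bfloat16 on a GPU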
685B parameters, holy cow. 314B-parameter models. If I do grad school, maybe I can incorporate this into a thesis :SquirtleSquadCool: