serve: command not found

Tried to run the command below; any ideas on the issue?
❯ magic run serve --huggingface-repo-id deepseek-ai/DeepSeek-R1
serve: command not found
Josh Peterson (2w ago)
Are you looking to follow this tutorial? https://docs.modular.com/max/tutorials/deploy-pytorch-llm If so, make sure you cloned the max repository and are in the max/pipelines/python/ directory. The magic run serve command is defined in that directory here: https://github.com/modular/max/blob/main/pipelines/python/pixi.toml#L13. I suspect you are not in the correct directory, so magic cannot find that command.
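Roughly, that looks something like this (the clone URL and directory come from the links above; adjust if the repo layout has changed):
❯ git clone https://github.com/modular/max.git
❯ cd max/pipelines/python
❯ magic run serve --huggingface-repo-id deepseek-ai/DeepSeek-R1
Running magic run from inside max/pipelines/python lets magic pick up the serve task defined in that directory's pixi.toml.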
Brad Larson (2w ago)
Also, I'll point out that the base DeepSeek-R1 model is a 685-billion-parameter one and will need multiple GPUs to run. To get the best serving performance on MAX, I highly recommend using --huggingface-repo-id deepseek-ai/DeepSeek-R1-Distill-Llama-8B, which will use our highly optimized LlamaForCausalLM architecture with the 8B distilled version of DeepSeek-R1.
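In other words, the same command with only the repo ID swapped out, run from the same max/pipelines/python directory:
❯ magic run serve --huggingface-repo-id deepseek-ai/DeepSeek-R1-Distill-Llama-8B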
Robert (2w ago)
Sheesh, the community is already benchmarking newly released models :mojonightly: I want to benchmark the open-source Grok-1 model; the repo is posted on GitHub if anyone missed its release 🤔
Weldon Antoine III
@Josh Peterson Ah thanks that makes sense. @Brad Larson Hmm ok 🤙 @Robert 🤙
Brad Larson (2w ago)
Out of curiosity, is there a specific reason you want to use that model over other LLMs? As a 314B mixture-of-experts model implemented in JAX, it could be a challenging model to get up and running. We have MAX Graph implementations of LlamaForCausalLM-family models, as well as MistralForCausalLM and MPTForCausalLM, and we're adding more all the time. Multi-GPU support is on our roadmap, but it's not yet generally available for MAX Graph models.
Robert (2w ago)
Hey Brad 👋 I might get remote access to my school’s HPC cluster soon; my request is still pending. But I’m open to testing other frameworks. Another reason is to analyze industry-built models. I recently learned about quantizing models down. But no preference, honestly 👍
Brad Larson (2w ago)
The great thing about LLMs is that they're very scalable: capabilities increase with model size, but it's more of a smooth spectrum than hard cutoffs. I always try to start small with locally manageable systems and scale up from there. There's a lot you can learn about running LLMs by serving a quantized Llama 3.2 1B on a local CPU, then Llama 3.1 8B in bfloat16 on a GPU, then more on the way up from there. MAX has great support today for the first two steps in that sequence, with more coming in the near future.
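As a rough sketch, using the same --huggingface-repo-id flag as above (the repo IDs here are my guess at the standard Hugging Face names, and any quantization or device options depend on the pipeline's pixi.toml, so check that or the tutorial first):
❯ magic run serve --huggingface-repo-id meta-llama/Llama-3.2-1B-Instruct
❯ magic run serve --huggingface-repo-id meta-llama/Llama-3.1-8B-Instruct
Start with the 1B on CPU, then move up to the 8B in bfloat16 on a GPU once that works.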
Robert (2w ago)
685B parameters, holy cow. 314B-parameter models. If I do grad school, maybe I can incorporate this into a thesis :SquirtleSquadCool:
