serve: command not found
I tried to run the command below; any ideas on the issue?
❯ magic run serve --huggingface-repo-id deepseek-ai/DeepSeek-R1
serve: command not found
Are you looking to follow this tutorial? https://docs.modular.com/max/tutorials/deploy-pytorch-llm
If so, make sure you cloned the max repository and are in the max/pipelines/python/ directory. The magic run serve command is defined in that directory here: https://github.com/modular/max/blob/main/pipelines/python/pixi.toml#L13. I suspect you are not in the correct directory, so magic cannot find that command.
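For reference, the full sequence is roughly this (a sketch assuming a fresh clone; adjust paths to your setup):
❯ git clone https://github.com/modular/max.git   # clone the max repository
❯ cd max/pipelines/python                        # directory whose pixi.toml defines the serve task
❯ magic run serve --huggingface-repo-id deepseek-ai/DeepSeek-R1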
Also, I'll point out that the base DeepSeek-R1 model is a 685-billion-parameter model and will need multiple GPUs to run. To get the best performance for serving on MAX, I highly recommend using
--huggingface-repo-id deepseek-ai/DeepSeek-R1-Distill-Llama-8B
which will use our highly optimized LlamaForCausalLM architecture with the 8B distilled version of DeepSeek R1.
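For example, the earlier command would become something like:
❯ magic run serve --huggingface-repo-id deepseek-ai/DeepSeek-R1-Distill-Llama-8B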
Sheesh, the community is already benchmarking newly released models :mojonightly: I want to benchmark the open-source Grok 1 model on GitHub
The repo is posted on GitHub if anyone missed its release 🤔
@Josh Peterson Ah thanks that makes sense.
@Brad Larson Hmm ok 🤙
@Robert 🤙
Out of curiosity, is there a specific reason you want to use that model over other LLMs? As a 314B mixture-of-experts model implemented in JAX, it could be a challenging model to get up and running. We have MAX Graph implementations of LlamaForCausalLM-family models, as well as MistralForCausalLM and MPTForCausalLM, and we're adding more all the time. Multi-GPU support is on our roadmap, but it's not currently generally available for MAX Graph models.
Hey Brad 👋 I might get remote access to my school’s HPC cluster soon; my request is still pending. But I’m open to testing other frameworks. Another reason is to analyze industry-built models. I recently learned about quantizing models down to smaller sizes. But no strong preference, honestly 👍
The great thing about LLMs is that they're very much scalable: capabilities increase with model size, but it's more of a smooth spectrum than hard cutoffs. I always try to start small with locally manageable systems and scale up from there. There's a lot you can learn about running LLMs from serving a quantized Llama 3.2 1B on a local CPU, then Llama 3.1 8B in bfloat16 on GPU, then more on the way up from there. MAX has great support today for the first two items on that sequence, with more coming in the near future.
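Concretely, with the same serve pattern from earlier in the thread, that progression might look something like this (the repo IDs are my own illustrative picks, and the quantized CPU run needs extra options that vary by MAX version, so check the docs):
❯ magic run serve --huggingface-repo-id meta-llama/Llama-3.2-1B-Instruct   # start small, e.g. on a local CPU
❯ magic run serve --huggingface-repo-id meta-llama/Llama-3.1-8B-Instruct   # then 8B in bfloat16 on a GPU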
685B parameters, holy cow. 314B-parameter models. If I do grad school, maybe I can incorporate this into a thesis :SquirtleSquadCool: