LLM inference on a serverless solution
Hi, I need some suggestions on serving an LLM on serverless. I have several questions:
1. Is there any guide or example project I can follow so that I can run inference effectively on RunPod serverless?
2. Is it recommended to use frameworks like TGI or vLLM with RunPod? If so, why? I'd like maximum control over the inference code, so I haven't tried any of those frameworks.
Thanks!
RunPod has created a vLLM worker that you can use for serverless:
https://github.com/runpod-workers/worker-vllm
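If you'd rather keep full control of the inference code, a custom handler is also an option. Here's a rough, untested sketch of the bare-bones shape, assuming the `runpod` Python SDK and a plain transformers model (the model name and generation settings are placeholders, not something taken from the worker-vllm repo):

```python
# Rough sketch of a custom RunPod serverless handler (not production-ready).
# Assumes the `runpod` Python SDK and a transformers model; the model name
# and generation settings are placeholders.
import runpod
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model

# Load once at module level so warm workers reuse the model across requests.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def handler(job):
    # RunPod passes the request payload in job["input"].
    prompt = job["input"]["prompt"]
    max_new_tokens = job["input"].get("max_new_tokens", 256)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return {"output": text}

runpod.serverless.start({"handler": handler})
```

The key point is loading the model outside the handler, so only the handler body runs per request once a worker is warm.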
Thanks! But I heard that vLLM does not support quantized models?
One of the reasons I'd want maximum control over the inference code is that I want to run quantized models with a library other than transformers (ExLlama, etc.).
It does support some quantization types, e.g. AWQ and GPTQ.
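For example, here's a small sketch of loading an AWQ checkpoint with vLLM's offline `LLM` API (the model name is just a placeholder; any AWQ-quantized checkpoint should work):

```python
# Sketch: loading a quantized (AWQ) model with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain serverless inference in one sentence."], params)
print(outputs[0].outputs[0].text)
```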
I see, it seems like ExLlama isn't supported yet. What are the real advantages of using vLLM though?
Concurrency: vLLM's continuous batching lets one worker serve many requests in parallel, see the sketch below.
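To illustrate, a small sketch of what the batching buys you: passing many prompts to `generate()` lets vLLM schedule them together instead of running them one at a time, which is where the throughput advantage over a plain `generate()` loop comes from (model name and prompts are placeholders):

```python
# Sketch: vLLM batches many prompts in one call via continuous batching.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # placeholder model
params = SamplingParams(max_tokens=64)

prompts = [f"Write a haiku about GPU number {i}." for i in range(32)]
outputs = llm.generate(prompts, params)  # scheduled together, not one-by-one

for out in outputs:
    print(out.outputs[0].text)
```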
I see, I'll explore that more, thanks!