LLM inference on a serverless solution
Hi, I need some suggestions on serving an LLM on serverless. I have several questions:
1. Is there any guide or example project I can follow so that I can run inference effectively on RunPod serverless?
2. Is it recommended to use frameworks like TGI or vLLM with RunPod? If so, why? I'd like maximum control over the inference code, so I haven't tried any of those frameworks.
Thanks!
RunPod has created a vLLM worker that you can use for serverless:
https://github.com/runpod-workers/worker-vllm
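If you'd rather keep full control of the inference code, a custom handler is also an option. Here's a rough, untested sketch of the bare-bones shape, assuming the `runpod` Python SDK and a plain transformers model (the model name and generation settings are placeholders, not something taken from the worker-vllm repo):

```python
# Rough sketch of a custom RunPod serverless handler (not production-ready).
# Assumes the `runpod` Python SDK and a transformers model; the model name
# and generation settings are placeholders.
import runpod
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model

# Load once at module level so warm workers reuse the model across requests.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def handler(job):
    # RunPod passes the request payload in job["input"].
    prompt = job["input"]["prompt"]
    max_new_tokens = job["input"].get("max_new_tokens", 256)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return {"output": text}

runpod.serverless.start({"handler": handler})
```

The key point is loading the model outside the handler, so only the handler body runs per request once a worker is warm.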
Thanks! But I heard that vLLM does not support quantized models?
One of the reasons I'd want maximum control over the inference code is that I want to run quantized models with a library other than transformers (ExLlama, etc.).
It does support some quantization types, e.g. AWQ and GPTQ.
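For example, here's a small sketch of loading an AWQ checkpoint with vLLM's offline `LLM` API (the model name is just a placeholder; any AWQ-quantized checkpoint should work):

```python
# Sketch: loading a quantized (AWQ) model with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain serverless inference in one sentence."], params)
print(outputs[0].outputs[0].text)
```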
I see, it seems like ExLlama isn't supported yet. What are the real advantages of using vLLM though?
Concurrency: vLLM's continuous batching lets one worker serve many requests in parallel, see the sketch below.
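To illustrate, a small sketch of what the batching buys you: passing many prompts to `generate()` lets vLLM schedule them together instead of running them one at a time, which is where the throughput advantage over a plain `generate()` loop comes from (model name and prompts are placeholders):

```python
# Sketch: vLLM batches many prompts in one call via continuous batching.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # placeholder model
params = SamplingParams(max_tokens=64)

prompts = [f"Write a haiku about GPU number {i}." for i in range(32)]
outputs = llm.generate(prompts, params)  # scheduled together, not one-by-one

for out in outputs:
    print(out.outputs[0].text)
```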
I see, I'll explore that more, thanks!