Worker configuration for Serverless vLLM endpoints: 1-hour lecture with 50 students
Hey there, I need to show 50 students how to do RAG with open-source LLMs (e.g., Llama 3). What kind of configuration do you suggest? I want to make sure they have a smooth experience. Thanks!
Depends on which Llama 3 model.
For the 70B non-quantized model you would need at least 2x 80GB.
Or 8x 24GB works.
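As a rough sketch (the model name and GPU split below are illustrative, not a tested RunPod worker config), loading the 70B model with vLLM split across the worker's GPUs looks something like this:

```python
# Minimal vLLM sketch for Llama 3 70B spread across multiple GPUs.
# tensor_parallel_size must match the number of GPUs on the worker
# (e.g. 2 for 2x 80GB A100/H100, or 8 for 8x 24GB cards).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # gated repo, needs HF access
    tensor_parallel_size=2,       # 2x 80GB here; use 8 for an 8x 24GB setup
    gpu_memory_utilization=0.90,  # leave headroom for the KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Explain RAG in one sentence."], params)
print(out[0].outputs[0].text)
```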
Why not use pods, btw?
Pods are expensive
Ic
The 8B-parameter model can also suffice.
1x 24GB VRAM GPU works; 16GB might work as well.
Solution
16GB isn't enough; you need 24GB.
Unless you use a quantized version
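If you do want to squeeze onto 16GB, something roughly like this should work; the repo name below is only a placeholder for whichever AWQ-quantized Llama 3 8B checkpoint you pick:

```python
# Minimal vLLM sketch for Llama 3 8B on a single smaller GPU.
# On 24GB the full-precision model fits; on 16GB point at an
# AWQ/GPTQ-quantized checkpoint instead (placeholder name below).
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3-8b-instruct-awq",  # placeholder: any AWQ-quantized Llama 3 8B
    quantization="awq",
    max_model_len=4096,           # cap context length to keep the KV cache small
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["What GPU do I need for Llama 3 8B?"], params)
print(out[0].outputs[0].text)
```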
You can also use this model if you want it uncensored:
https://huggingface.co/cognitivecomputations/dolphin-2.9.1-llama-3-8b
Oh