Worker configuration for Serverless vLLM endpoints: 1-hour lecture with 50 students
Hey there, I need to show 50 students how to do RAG with open-source LLMs (e.g., Llama 3). What kind of configuration do you suggest? I want to make sure they have a smooth experience. Thanks!
Depends on which Llama 3 model.
For the 70B non-quantized model you would need at least 2x 80GB.
Or 8x 24GB works.
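As a rough sketch (the model name and GPU split below are illustrative, not a tested RunPod worker config), loading the 70B model with vLLM split across the worker's GPUs looks something like this:

```python
# Minimal vLLM sketch for Llama 3 70B spread across multiple GPUs.
# tensor_parallel_size must match the number of GPUs on the worker
# (e.g. 2 for 2x 80GB A100/H100, or 8 for 8x 24GB cards).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # gated repo, needs HF access
    tensor_parallel_size=2,       # 2x 80GB here; use 8 for an 8x 24GB setup
    gpu_memory_utilization=0.90,  # leave headroom for the KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Explain RAG in one sentence."], params)
print(out[0].outputs[0].text)
```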
Why not use pods, btw?
Pods are expensive
Ic
The 8B-parameter model can also suffice.
1x 24GB VRAM GPU works; 16GB might work as well.
Solution
16GB isn't enough; you need 24GB.
Unless you use a quantized version
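If you do want to squeeze onto 16GB, something roughly like this should work; the repo name below is only a placeholder for whichever AWQ-quantized Llama 3 8B checkpoint you pick:

```python
# Minimal vLLM sketch for Llama 3 8B on a single smaller GPU.
# On 24GB the full-precision model fits; on 16GB point at an
# AWQ/GPTQ-quantized checkpoint instead (placeholder name below).
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3-8b-instruct-awq",  # placeholder: any AWQ-quantized Llama 3 8B
    quantization="awq",
    max_model_len=4096,           # cap context length to keep the KV cache small
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["What GPU do I need for Llama 3 8B?"], params)
print(out[0].outputs[0].text)
```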
You can also use this model if you want it uncensored:
https://huggingface.co/cognitivecomputations/dolphin-2.9.1-llama-3-8b
Oh