RunPod4mo ago
Emad

Can't run a 70B Llama 3.1 model on 2 A100 80 GB GPUs.

Hey, so I tried running the 70B Llama model on 2 GPUs/worker, but it keeps getting stuck at the same place every time. If I switch to the 8B model on 1 GPU/worker with a 48 GB GPU, it works easily. The issue only comes up with the 70B-parameter model on 2 GPUs/worker.
37 Replies
nerdylive
nerdylive4mo ago
Maybe 70B needs 192 GB or something like that
Emad
EmadOP4mo ago
RunPod Blog
Run Larger LLMs on RunPod Serverless Than Ever Before - Llama-3 70B...
Up until now, RunPod has only supported using a single GPU in Serverless, with the exception of using two 48GB cards (which honestly didn't help, given the overhead involved in multi-GPU setups for LLMs.) You were effectively limited to what you could fit in 80GB, so you would essentially be
Emad
EmadOP4mo ago
This blogpost said that 2 80GB are enough
nerdylive
nerdylive4mo ago
Yeah, I'm not sure about the minimum requirements, let me check
Emad
EmadOP4mo ago
Alright, also how much network volume do you think I need for this?
nerdylive
nerdylive4mo ago
Maybe around 150 GB
Emad
EmadOP4mo ago
alright thanks let me know about the requirements
nerdylive
nerdylive4mo ago
Can you try another GPU setup, 4x?
Emad
EmadOP4mo ago
Alright, let me try that. 4090?
nerdylive
nerdylive4mo ago
4x 48 GB, sorry*
Emad
EmadOP4mo ago
ok np
Emad
EmadOP4mo ago
It got stuck here again
Emad
EmadOP4mo ago
It's always at this place. What do you think could be the problem, @nerdylive?
Emad
EmadOP4mo ago
It went a bit further now
[Image attachment, no description]
Emad
EmadOP4mo ago
and now it just shifted to a different worker
Emad
EmadOP4mo ago
[Image attachment, no description]
nerdylive
nerdylive4mo ago
Still loading, maybe. If you feel like loading is taking too long, just stop it first. What GPU setup are you using?
Emad
EmadOP4mo ago
It's 4x 48 GB, not the pro ones, just normal
yhlong00000
yhlong000004mo ago
[Image attachment, no description]
yhlong00000
yhlong000004mo ago
I suggest opting for a GPU setup with 200 GB+ of VRAM. You can try a lower option, but performance may suffer. 🥲
nerdylive
nerdylive4mo ago
Yeah, it should fit, but why does it create a new worker, @yhlong00000?
Emad
EmadOP4mo ago
But it says you need 140 GB and I gave it 192 (4x 48 GB). @yhlong00000 how do you think we can fix this?
Emad
EmadOP4mo ago
But we gave it 192, much more than 140.
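The 140 GB figure quoted above matches a rough weights-only estimate: 70 billion parameters at 2 bytes each (16-bit weights). The sketch below reproduces that arithmetic for the two setups discussed in this thread. The function names are hypothetical, and the estimate deliberately ignores KV cache, activations, and per-GPU runtime overhead, which is why some headroom beyond 140 GB is needed in practice.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    Assumes every parameter is stored at the same precision
    (2 bytes per parameter for fp16/bf16).
    """
    return params_billion * bytes_per_param


def fits(params_billion: float, gpu_count: int, gpu_vram_gb: float,
         bytes_per_param: float = 2.0):
    """Return (needed_gb, total_vram_gb, headroom_gb) for a multi-GPU setup."""
    need = weight_memory_gb(params_billion, bytes_per_param)
    total = gpu_count * gpu_vram_gb
    return need, total, total - need


# The two configurations mentioned in the thread.
for label, count, vram in [("2x 80 GB", 2, 80), ("4x 48 GB", 4, 48)]:
    need, total, headroom = fits(70, count, vram)
    print(f"{label}: need ~{need:.0f} GB for weights, "
          f"have {total} GB, headroom {headroom:.0f} GB")
```

By this estimate 4x 48 GB gives 192 GB in total, and both setups hold the weights on paper. Whatever is left over has to cover everything else the engine allocates, so a setup that fits on paper can still fail or stall at load time.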
yhlong00000
yhlong000004mo ago
Do you mind trying more VRAM to see if that helps?
Emad
EmadOP4mo ago
What do you think I should put?
yhlong00000
yhlong000004mo ago
😂 I usually start with the highest memory I can get and keep reducing it until it stops working.
Emad
EmadOP4mo ago
ValueError: Total number of attention heads (64) must be divisible by tensor parallel size (6).
What does this mean, @yhlong00000?
Now I have 384 GB but it is still getting stuck there.
yhlong00000
yhlong000004mo ago
Same logs output?
Emad
EmadOP4mo ago
Yeah, same: Using model weights format ['*.safetensors']
It always gets stuck at this spot.
@yhlong00000 it worked, I have 4x 48 GB. I had to wait 6 minutes the first time and now it's working quickly.
Never mind, it became slow again.
@yhlong00000 it becomes slow when it has to load again after a while.
Yeah, it's a cold start issue.
yhlong00000
yhlong000004mo ago
Cool 👍🏻 Try setting 1 active worker, that makes sure there is no cold start when testing.
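To get a feel for why the cold start is so noticeable, the numbers from this thread can be turned into a back-of-the-envelope load rate. This is only a sketch under two assumptions that the thread does not confirm: that roughly 140 GB of weights are read on each cold start, and that the whole 6-minute wait was spent loading them. The function name is hypothetical.

```python
def implied_load_throughput_gb_s(weights_gb: float, load_seconds: float) -> float:
    """Average rate (GB/s) needed to load `weights_gb` in `load_seconds`."""
    return weights_gb / load_seconds


# Assumed inputs: ~140 GB of weights, 6 minutes observed for the first request.
rate = implied_load_throughput_gb_s(140, 6 * 60)
print(f"Implied average load rate: {rate:.2f} GB/s")
```

If those assumptions roughly hold, every cold start pays a wait of the same order, which is consistent with the advice to keep one active worker while testing.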
nerdylive
nerdylive4mo ago
Don't use 6 GPUs.
Totally unrelated, I've tried 4x 48 GB and it works, and this proves it too.
I strongly believe this "creating new workers on model loading" thing has to do with RunPod's autoscaling.
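The ValueError quoted earlier explains why 6 GPUs cannot work: the model's 64 attention heads are split evenly across the tensor-parallel GPUs, and 64 is not divisible by 6. A minimal sketch of that divisibility check follows. The helper name is hypothetical, and it only encodes the condition stated in the error message, so the engine may impose further constraints.

```python
def valid_tensor_parallel_sizes(num_attention_heads: int, max_gpus: int) -> list[int]:
    """GPU counts up to `max_gpus` that divide the attention heads evenly."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]


# 64 heads is taken from the error message; 8 GPUs is an arbitrary upper bound.
print(valid_tensor_parallel_sizes(64, 8))  # [1, 2, 4, 8]
```

With 64 heads, GPU counts of 2, 4, or 8 per worker pass this check, while 3, 5, 6, and 7 do not.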
Emad
EmadOP4mo ago
I tried 4x 48 GB and it seems to work, it just takes very long to load
nerdylive
nerdylive4mo ago
Yeah, I was talking about this, sorry. Thankfully it works now, nice
Emad
EmadOP4mo ago
Yea
Thibaud
Thibaud4mo ago
what token speed did you achieve? what cost per token?
Emad
EmadOP4mo ago
Didn't calculate