Can't run the 70B Llama 3.1 model on 2x A100 80GB GPUs.
Hey, so I tried running the 70B Llama model with 2 GPUs per worker, but it keeps getting stuck at the same place every time. If I switch to the 8B model on 1 GPU per worker with a 48GB GPU, it works easily. The issue only happens with the 70B-parameter model on 2 GPUs per worker.
Maybe 70B needs 192GB or something like that
RunPod Blog: Run Larger LLMs on RunPod Serverless Than Ever Before - Llama-3 70B...
Up until now, RunPod has only supported using a single GPU in Serverless, with the exception of using two 48GB cards (which honestly didn't help, given the overhead involved in multi-GPU setups for LLMs). You were effectively limited to what you could fit in 80GB.
This blog post says 2x 80GB should be enough
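For context, running a 70B model across two 80GB GPUs generally means sharding it with tensor parallelism set to 2. A minimal sketch of what that looks like with vLLM (the model name and settings here are illustrative assumptions, not taken from this thread):

```python
# Minimal sketch: loading a 70B model across 2 GPUs with vLLM tensor parallelism.
# Model name and settings are illustrative assumptions, not taken from this thread.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=2,        # shard the weights across 2 GPUs
    dtype="bfloat16",              # ~2 bytes per parameter, roughly 140GB of weights total
    gpu_memory_utilization=0.95,   # fraction of each GPU's VRAM vLLM is allowed to use
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

At bf16 the weights alone are roughly 140GB, so 2x 80GB leaves only modest headroom for activations and the KV cache.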
Yeah, I'm not sure about the minimum requirements, let me check
Alright, also how much network volume do you think I need for this?
Maybe around 150GB
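As a rough sanity check (an estimate, not a figure from this thread): 70B parameters at 2 bytes each in bf16/fp16 comes to about 140GB of weights, which is why ~150GB of network volume is about the minimum that makes sense.

```python
# Back-of-envelope estimate of the weight footprint (assumption: bf16/fp16 weights).
params = 70e9          # 70B parameters
bytes_per_param = 2    # 2 bytes per parameter in bf16/fp16
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~140 GB, so a 150GB volume is a tight fit
```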
alright thanks
let me know about the requirements
Can you try another GPU setup, 4x?
Alright, let me try that
4090?
4x 48GB
sorry*
ok
np
It got stuck here again
It's always at the same place
What do you think could be the problem @nerdylive
It went a bit further now
and now it just shifted to a different worker
still loading
Maybe... loading took too long
Just stop it first if you feel like it's taking too long
what gpu setup are you using?
It's 4x 48GB, not Pro, just the normal ones
I suggest opting for a GPU configuration with 200GB+ VRAM. You can try a lower option, but performance may suffer. 🥲
Yeah, it should fit, but it creates a new worker?
why is that @yhlong00000
But it says you need 140GB
I gave it 196GB
@yhlong00000 how do you think we can fix this?
But we gave it 196
Much more than 140
Do you mind trying more VRAM, to see if that helps?
What do you think I should put?
😂 I usually start with the highest memory I can and keep reducing it until it won't work anymore.
ValueError: Total number of attention heads (64) must be divisible by tensor parallel size (6).
What does this mean?
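For context: vLLM shards the attention heads evenly across GPUs when using tensor parallelism, so the head count has to be divisible by the tensor parallel size. Llama 70B has 64 heads, and 64 isn't divisible by 6, so a 6-GPU worker fails; 1, 2, 4, or 8 GPUs would be fine. A small illustrative sketch of the constraint (not vLLM's actual code):

```python
# Illustrative check of the tensor-parallel divisibility constraint (not vLLM's actual code).
num_attention_heads = 64            # Llama 70B uses 64 attention heads
for tp_size in (1, 2, 4, 6, 8):     # candidate GPU counts per worker
    ok = num_attention_heads % tp_size == 0
    print(f"tensor_parallel_size={tp_size}: {'ok' if ok else 'not divisible'}")
# 6 fails because 64 % 6 != 0; 1, 2, 4, and 8 all divide 64 evenly.
```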
@yhlong00000 Now I have 384GB but it is still getting stuck there
Same log output?
yea
same
Using model weights format ['*.safetensors']
It always gets stuck at this spot
@yhlong00000 it worked
I have 4x 48GB
I had to wait 6 minutes the first time
and now it's working quickly
Nvm it became slow again
@yhlong00000 it becomes slow when it has to load after a while
Yeah, it's a cold start issue
Cool 👍🏻 Try setting 1 active worker; that makes sure there's no cold start when testing.
Don't use 6 GPUs
Totally unrelated: I've tried 4x 48GB and it works, and this proved it works too
I strongly believe this "creating new workers on model loading" thing has to do with RunPod's autoscaling
I tried with 48GB and it seems to work
4x
It just takes a very long time to load
yeah
I was talking about this, sorry. Thankfully it works now, nice.
Yea
what token speed did you achieve?
what cost per token?
Didn't calculate