Can't run the 70B Llama 3.1 model on 2x A100 80GB GPUs.
Hey, so I tried running the 70B Llama model with 2 GPUs per worker, but it keeps getting stuck at the same place every time. If I switch to the 8B model on 1 GPU per worker with a 48GB GPU, it works easily. The issue only happens with the 70B-parameter model on 2 GPUs per worker.
Maybe 70B needs 192GB or something like that
RunPod Blog: Run Larger LLMs on RunPod Serverless Than Ever Before - Llama-3 70B...
Up until now, RunPod has only supported using a single GPU in Serverless, with the exception of using two 48GB cards (which honestly didn't help, given the overhead involved in multi-GPU setups for LLMs). You were effectively limited to what you could fit in 80GB.
This blog post says 2x 80GB should be enough
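For context, running a 70B model across two 80GB GPUs generally means sharding it with tensor parallelism set to 2. A minimal sketch of what that looks like with vLLM (the model name and settings here are illustrative assumptions, not taken from this thread):

```python
# Minimal sketch: loading a 70B model across 2 GPUs with vLLM tensor parallelism.
# Model name and settings are illustrative assumptions, not taken from this thread.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=2,        # shard the weights across 2 GPUs
    dtype="bfloat16",              # ~2 bytes per parameter, roughly 140GB of weights total
    gpu_memory_utilization=0.95,   # fraction of each GPU's VRAM vLLM is allowed to use
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

At bf16 the weights alone are roughly 140GB, so 2x 80GB leaves only modest headroom for activations and the KV cache.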
Yeah, I'm not sure about the minimum requirements, let me check
Alright, also how much network volume do you think I need for this?
Maybe around 150GB
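As a rough sanity check (an estimate, not a figure from this thread): 70B parameters at 2 bytes each in bf16/fp16 comes to about 140GB of weights, which is why ~150GB of network volume is about the minimum that makes sense.

```python
# Back-of-envelope estimate of the weight footprint (assumption: bf16/fp16 weights).
params = 70e9          # 70B parameters
bytes_per_param = 2    # 2 bytes per parameter in bf16/fp16
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~140 GB, so a 150GB volume is a tight fit
```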
alright thanks
let me know about the requirements
Can you try another GPU setup, 4x?
Alright, let me try that
4090?
4x 48GB
sorry*
ok
np
It got stuck here again
It's always at the same place
What do you think could be the problem @nerdylive
It went a bit further now
and now it just shifted to a different worker
still loading
Maybe... loading took too long
Just stop it first if you feel like it's taking too long
what gpu setup are you using?
It's 4x 48GB, not Pro, just the normal ones
I suggest opting for a GPU configuration with 200GB+ VRAM. You can try a lower option, but performance may suffer. 🥲
Yeah, it should fit, but it creates a new worker?
why is that @yhlong00000
But it says you need 140GB
I gave it 196GB
@yhlong00000 how do you think we can fix this?
But we gave it 196
Much more than 140
Do you mind trying more VRAM, to see if that helps?
What do you think I should put?
😂 I usually start with the highest memory I can and keep reducing it until it won't work anymore.
ValueError: Total number of attention heads (64) must be divisible by tensor parallel size (6).
What does this mean?
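For context: vLLM shards the attention heads evenly across GPUs when using tensor parallelism, so the head count has to be divisible by the tensor parallel size. Llama 70B has 64 heads, and 64 isn't divisible by 6, so a 6-GPU worker fails; 1, 2, 4, or 8 GPUs would be fine. A small illustrative sketch of the constraint (not vLLM's actual code):

```python
# Illustrative check of the tensor-parallel divisibility constraint (not vLLM's actual code).
num_attention_heads = 64            # Llama 70B uses 64 attention heads
for tp_size in (1, 2, 4, 6, 8):     # candidate GPU counts per worker
    ok = num_attention_heads % tp_size == 0
    print(f"tensor_parallel_size={tp_size}: {'ok' if ok else 'not divisible'}")
# 6 fails because 64 % 6 != 0; 1, 2, 4, and 8 all divide 64 evenly.
```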
@yhlong00000 Now I have 384GB but it is still getting stuck there
Same log output?
yea
same
Using model weights format ['*.safetensors']
It always gets stuck at this spot
@yhlong00000 it worked
I have 4x 48GB
I had to wait 6 minutes the first time
and now it's working quickly
Nvm it became slow again
@yhlong00000 it becomes slow when it has to load after a while
Yeah, it's a cold start issue
Cool 👍🏻 Try setting 1 active worker; that makes sure there's no cold start when testing.
Don't use 6 GPUs
Totally unrelated: I've tried 4x 48GB and it works, and this proved it works too
I strongly believe this "creating new workers on model loading" thing has to do with RunPod's autoscaling
I tried with 48GB and it seems to work
4x
It just takes a very long time to load
yeah
I was talking about this, sorry. Thankfully it works now, nice.
Yea
what token speed did you achieve?
what cost per token?
Didn't calculate