Run Mixtral 8x22B Instruct on vLLM worker
Hello everybody, is it possible to run Mixtral 8x22B on the vLLM worker? I tried running it on the default configuration with 48 GB GPUs (A6000, A40), but it's taking too long. What are the requirements for running Mixtral 8x22B successfully? This is the model I'm trying to run: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
Sorry, I'm new to using GPUs for LLM models
oh, it actually needs a bunch of VRAM to run
you can try using half to run it with less VRAM
all good
thanks for the reply, what do you mean by half?
in the quantization part
use any of them to run it with less VRAM
let me check. Which GPU would be suitable to run this, btw?
oh wait
the DTYPE i mean*
also this too
try experimenting with those in the env variables of your endpoint
sure
Environment variables | RunPod Documentation
Environment variables configure your vLLM Worker by providing control over model selection, access credentials, and operational parameters necessary for optimal Worker performance.
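(As a rough illustration of the settings being discussed, here is a minimal sketch of what an endpoint's environment configuration might look like. The variable names and values are assumptions on my part; verify them against the RunPod vLLM worker docs linked above.)

```python
# Hypothetical endpoint environment configuration for the vLLM worker.
# Names and values are assumptions; check the worker documentation.
endpoint_env = {
    "MODEL_NAME": "mistralai/Mixtral-8x22B-Instruct-v0.1",
    "DTYPE": "half",                   # fp16 weights ("half"), as suggested above
    "QUANTIZATION": "awq",             # only valid if the checkpoint is actually AWQ-quantized
    "MAX_MODEL_LEN": "8192",           # shorter context -> smaller KV cache
    "GPU_MEMORY_UTILIZATION": "0.95",  # fraction of VRAM vLLM is allowed to reserve
}
```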
i think this would also be a good option to set, right? Since it will divide the memory
never tried the vLLM worker yet tbh, but sure, if you want to try it go ahead
yeah seems like a good option to try
ah I see, would there be any substantial decrease in quality if I ran the model in half precision?
maybe try browsing around on quantization, etc.
cool thanks
I don't know much about them, but yeah, I think it will if you use a lower dtype
yeah makes sense
@nerdylive looks like Mixtral 8x22B requires up to 300 GB of VRAM, and the highest available GPU is 80 GB of VRAM. If it used half the memory, which would be 150 GB, it should be possible to divide that as 50 GB of VRAM between 3 workers. Idk if that's possible. Do you know somebody from the team who can help me out here? My company actually wants to deploy this model for our product.
Wew, where did you get that estimate from?
Yeah it's a huge model
from the Mistral Discord
Nah, it's not possible yet to divide them onto 3 workers I think
I c
Pods can use multiple GPUs, though
Or maybe explore accelerate for this (not sure)
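(For reference, a rough sketch of the accelerate idea on a multi-GPU pod, not the serverless vLLM worker: with transformers, device_map="auto" shards the layers across every visible GPU. This assumes the pod's GPUs have enough combined VRAM for the weights; treat it as a starting point, not a tested recipe.)

```python
# Sketch: shard the model across a multi-GPU pod using accelerate's
# device_map="auto" (requires the `accelerate` package to be installed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x22B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # "half", as discussed above
    device_map="auto",          # let accelerate place layers across all GPUs
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```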
I see, let me search. Though, could somebody from the RunPod team confirm this?
Maybe contact support for that
Like from the website
ah ok, I misunderstood; this is a community server
sorry
Yeah, there are some staff here, but it's easier for them to handle support requests via the website support
It's fine
Yeah, I will contact them through official channels. Thanks for all the help, appreciate it
how do I mark this post as solved?
Up to you. Did you find a way to run that model yet?
Without the worker splitting idea
I'd like to know your updates too haha, I think don't mark it yet
the only way I have right now is to use a VM with 300 GB of VRAM, but it would be costly, and I'm not sure I can find a VM like that. I opted for RunPod because it had cheap pricing and easy deployments
sure, I will post updates here
one guy in the Mistral Discord also wanted to split memory in order to run the model across 4x GPUs
they suggested vLLM for this, which is what RunPod workers are using, I think
vLLM supports that?
I haven't looked into it yet, but they suggested it and TGI
try asking which feature it is, or how it works
yeah
Thanks
@nerdylive this is what I got
https://docs.mistral.ai/deployment/self-deployment/vllm/ In this guide they set the tensor parallel size to 4; I wonder if RunPod does it as well
vLLM | Mistral AI Large Language Models
vLLM can be deployed using a docker image we provide, or directly from the python package.
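(To see the knob itself: here is the guide's tensor-parallel setting via the vLLM Python API. On the RunPod worker the equivalent is presumably an environment variable, something like TENSOR_PARALLEL_SIZE, but that name is an assumption; check the worker docs.)

```python
# Sketch of the tensor-parallel setting from the Mistral guide, using the
# vLLM Python API. tensor_parallel_size must match the number of GPUs
# visible to the process.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",
    tensor_parallel_size=4,  # shard the model across 4 GPUs
    dtype="half",
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```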
Oh I think it's configurable
Check the vllm docs on runpod
let me check
oh, this was the option lol
I set it to 3 but still ran out of memory
I used 2 GPUs per worker as well, 80 GB ones actually
Wow it works?
no, I ran out of memory
even with the above config
I'm trying to do the same thing as you right now lol will update if I figure something out
Thanks a lot
@Alpay Ariyak maybe you could help here
Hi, you need at least 2x 80 GB GPUs afaik
Hey, yes, I used 2x 80 GB GPUs per worker with 3 workers, but I got an error
torch.cuda ran out of memory while trying to allocate
Wait, what, there are 2x 80 GB?
I thought it was 48 GB GPUs only
How did you get that?
Oof still need more memory huh, try sending the full logs
yeah i will try soon
I just selected the option for 2 GPUs per worker and the 80 GB H100
Oh? I can only do 2 GPUs per worker with 48GB GPUs, not 80GB GPUs. Are you sure?
Unless you're doing a pod instead of serverless
In which case ignore me
My apologies, you actually need 4x 80 GB for 8x22B
is that actually possible in serverless?
Not with the current limits, no
ah that sucks
alright
btw, what are streams transported in? SSE? How do I retrieve it in Python and iterate the responses asynchronously?
With OpenAI compatibility?
No the stream endpoint
default one
{{URL}}/stream/:id
Not SSE, a regular GET request
Will return yielded outputs from the worker since last /stream call
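(For anyone wondering what that polling looks like from Python, here is a minimal sketch. The URL shape and the response fields, "status", "stream", and "output", are assumptions based on the serverless docs; double-check them for your endpoint.)

```python
# Minimal polling loop against the /stream/:id endpoint (plain GET, not SSE).
# Endpoint ID, job ID, and response field names are placeholders/assumptions.
import time
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"
ENDPOINT_ID = "YOUR_ENDPOINT_ID"
job_id = "YOUR_JOB_ID"  # returned by the initial /run request

url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/stream/{job_id}"
headers = {"Authorization": f"Bearer {API_KEY}"}

while True:
    resp = requests.get(url, headers=headers, timeout=30).json()
    # Each poll returns whatever the worker has yielded since the last call.
    for chunk in resp.get("stream", []):
        print(chunk.get("output"), end="", flush=True)
    if resp.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(0.5)
```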
wait, so I poll the stream endpoint?
What goal do you have in mind?
Like websockets probably? I was hoping the stream endpoint would be like that
OpenAI compatibility streaming is through SSE
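(A sketch of that OpenAI-compatible streaming path, where the client library handles the SSE parsing. The base_url pattern below is an assumption; check the vLLM worker docs for the exact OpenAI-compatible route of your endpoint.)

```python
# Streaming via the OpenAI-compatible route; the `openai` client consumes the
# SSE stream and yields chunks. The base_url below is an assumed pattern.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
)

stream = client.chat.completions.create(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Say hello"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```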
Hey, sorry to hijack the thread, I'm also looking into deploying vLLM on RunPod serverless. The landing page indicates that it should be possible to bring your own container, not pay for any idle time, and have <250ms cold boot. Is this true? It sounds too good to be true.
Oh, what about the stream I'm talking about?
Yes, through flash boot
That one is strictly polled
Oh alright
Does this 250ms cold boot time really include everything? Or does it only contain some things, such that the actual cold boot time might be 30 seconds or something? For example, the time to load LLM weights into memory typically takes more than 10 seconds.
Everything, due to not needing to reload weights
That's just insane if it really works
Haha try it out!
Yeah, reading the docs right now to figure out everything I need to do to try it... I currently have a Docker image that spins up a fork of the oobabooga web UI; I'm thinking about setting that up for the serverless experiment.
Yeah, you're actually right, I confused it with 80 GB
my bad guys
even using dtype half? We need 4x 80 GB?
8x22B = 176B parameters. At 16bit, 2 bytes per parameter, that's 352GB just for the model parameters
At 8bit (1 byte per parameter) it's still 176GB
I could be mistaken around this, I'm not an expert on this for sure
But my understanding is that you can just fit 8x22B on 4x80GB with 8bit quantization
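(Back-of-the-envelope version of those numbers, weights only; the real requirement is higher once the KV cache and runtime overhead are added, which is why 8-bit still needs around 4x 80 GB.)

```python
# Rough weights-only estimate, using the 8 x 22B = ~176B figure from above
# (it ignores shared layers, KV cache, and runtime overhead).
params = 8 * 22e9

for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.0f} GB for the weights alone")

# fp16/bf16: ~352 GB, int8: ~176 GB, int4: ~88 GB
```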
I see yeah that makes sense
i will revisit this in the future
316 GB for 16-bit according to https://huggingface.co/spaces/Vokturz/can-it-run-llm
We're raising the serverless GPU count limits around next week
I believe even up to 10x A40 per worker
WOOOO
on other gpus too?
Yes, 2x of everything at the very least iirc
yay~!
nice, this will be useful, thanks a lot
Hey @Alpay Ariyak, just wondering, is it really normal for vLLM to load big models very slowly every time?
Like every request takes 100+ seconds
Like this Mixtral or Llama 3 70B
Any solutions to make that loading faster?