Help: Serverless Mixtral OutOfMemory Error
I can't get Mixtral-8x7B-Instruct to run on Serverless using the vLLM RunPod worker, neither with the base model from Mistral nor with any of the quantized models.
Settings I'm using:
GPU: 48GB (also tried 80GB)
Container Image: runpod/worker-vllm:0.3.0-cuda11.8.0
Env 1: MODEL_NAME=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
(also tried casperhansen/mixtral-instruct-awq, TheBloke/firefly-mixtral-8x7b-GPTQ, and mistralai/Mixtral-8x7B-Instruct-v0.1)
Env 2: TRUST_REMOTE_CODE=1
Env 3: QUANTIZATION=awq (or gptq for the GPTQ models)
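If it helps, the GPTQ attempt maps to roughly this bare vLLM OpenAI-server command outside the worker (illustrative only; the port/host values are arbitrary and I haven't confirmed it behaves any differently there):
```
# Sanity check outside the RunPod worker: same model and flags against vLLM directly
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 8000 \
    --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
    --trust-remote-code \
    --quantization gptq
```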
What am I doing wrong?? @Alpay Ariyak
ERROR Log:
I feel like the devs should put out a tutorial for loading mixtral use cases. Lots of people seem to be having trouble with it.
@Concept were you able to run Mixtral with ExLlama as the loader in vLLM?
Yeah, a blog post might be useful. A client of mine wanted me to set it up for them too, but I failed.
I'm wondering how some of the people who raised the issue on GitHub were eventually able to run it?
I took vLLM completely out of the equation.
Do you by any chance have a GitHub repo you can share?
I'll dm you the tutorial
Can you send it to me too? It would be good to put it here in case others come across the same issue.
@Concept
It's from a competitor, so I'm going to hold off on posting it.
Try MAX_SEQUENCE_LENGTH=8192
@Alpay Ariyak still getting the same error :/
I’ll try to get it running tonight
plz thank you!
I only got a Mixtral working by putting the context a lot lower than I'd hoped to…
Edit: actually looking at my template I didn’t set that environment variable. 🤷♂️
At least you got it working though! What value did you use? By context you mean adding MAX_SEQUENCE_LENGTH to the env vars, right?
I just looked, all I have is model name and quantization
And yeah, I guess I tried a few times before getting lucky with that.
ENFORCE_EAGER?
I've had trouble with some quantized models if I don't use eager mode.
I wonder if there is some kind of bug with CUDA graphs on quantized models? They always take up way more memory than I'm expecting
Sorry folks, I got caught up and haven't had time yet to get to the config
Yeah that would be another good option to try enabling
@Alpay Ariyak any updates about this?
It seems like the vLLM worker is not working with any of the models. It keeps giving the same OOM error.
On this now
Hugging Face is still down, can't test 🥲
HF is up now, but btw I'm seeing this error for all models, not just Mixtral.
The OOM error?
GitHub issue: "Mixtral AWQ uses massive amount of memory when using its long conte..."
vllm 0.2.7 with cuda 12.1: python -m vllm.entrypoints.openai.api_server --port=5002 --host=0.0.0.0 --model=TheBloke/dolphin-2.7-mixtral-8x7b-AWQ --seed 1234 --trust-remote-code --quantization awq -...
I'm using a quantized version though. I also tried a non-Mixtral model and it still gave the same error. Is the template working for you for any large models?
Yeah the issue is referring to the quantized version specifically
I'm trying to load it with a few different settings now
Got it to work with the following configuration:
Ohh cool! I’ll try
@Alpay Ariyak can you please try with this model:
LoneStriker/Air-Striker-Mixtral-8x7B-Instruct-ZLoss-3.75bpw-h6-exl2
Still getting OOM error for it
What kind of quantization is that?
ExLlama2
vLLM doesn't support it
Exllamav2_HF is not supported?
It's the loader; I'm not sure about the quantization.
In general, for OOM, you just have to keep playing around with the env vars in the following order:
1. Lower MAX_MODEL_LENGTH (lowers the number of tokens)
2. Lower GPU_MEMORY_UTILIZATION (lowers the number of concurrent requests)
3. Set ENFORCE_EAGER to 1 (disables CUDA graphs, reducing throughput)
Once you find something that works, you can start optimizing by experimenting with higher values to find the most you can get away with.
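For example, mapped onto the worker's env vars that ladder might look like this (the values are just illustrative starting points, not a known-good config; the length variable also appears as MAX_SEQUENCE_LENGTH earlier in this thread):
```
# Step 1: cap the context length (fewer tokens per sequence)
MAX_MODEL_LENGTH=4096
# Step 2: lower the GPU memory fraction (fewer concurrent requests)
GPU_MEMORY_UTILIZATION=0.85
# Step 3: disable CUDA graphs (less memory overhead, lower throughput)
ENFORCE_EAGER=1
```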
There is this, but it doesn't support concurrency:
https://github.com/ashleykleynhans/runpod-worker-exllamav2
GitHub: ashleykleynhans/runpod-worker-exllamav2 (RunPod Serverless worker for ExLlamaV2)
These are the only quantization options supported by vLLM
Cool! Yeah, the casperhansen/mixtral-instruct-awq worked with your settings.
If you are using ENFORCE_EAGER, you should be able to increase GPU_MEMORY_UTILIZATION and MAX_MODEL_LENGTH on a 48GB endpoint.
Awesome! Thanks!
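For anyone finding this later, a 48GB-endpoint setup along those lines would look roughly like this (the exact values that worked weren't posted in the thread, so treat the numbers as placeholders to tune using the steps above):
```
MODEL_NAME=casperhansen/mixtral-instruct-awq
QUANTIZATION=awq
TRUST_REMOTE_CODE=1
ENFORCE_EAGER=1
# Placeholders: with eager mode on, these can likely be pushed higher on 48GB
GPU_MEMORY_UTILIZATION=0.95
MAX_MODEL_LENGTH=8192
```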