octopus
octopus5mo ago

Help: Serverless Mixtral OutOfMemory Error

I can't get Mixtral-8x7B-Instruct to run on Serverless using the vLLM RunPod worker, neither for the original Mistral model nor for any of the quantized models.

Settings I'm using:
GPU: 48GB (also tried 80GB)
Container Image: runpod/worker-vllm:0.3.0-cuda11.8.0
Env 1: MODEL_NAME=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ (also tried: casperhansen/mixtral-instruct-awq, TheBloke/firefly-mixtral-8x7b-GPTQ, and mistralai/Mixtral-8x7B-Instruct-v0.1)
Env 2: TRUST_REMOTE_CODE=1
Env 3: QUANTIZATION=awq (or gptq for the GPTQ models)

What am I doing wrong?? @Alpay Ariyak

ERROR Log:
WARNING 02-26 18:08:57 config.py:186] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
llm_engine.py:79] Initializing an LLM engine with config: model='casperhansen/mixtral-instruct-awq', tokenizer='casperhansen/mixtral-instruct-awq', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/huggingface-cache/hub', load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Using model weights format ['*.safetensors']
llm_engine.py:337] # GPU blocks: 7488, # CPU blocks: 2048
Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.

Error initializing vLLM engine: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Concept
Concept5mo ago
I feel like the devs should put out a tutorial for loading mixtral use cases. Lots of people seem to be having trouble with it.
octopus
octopus5mo ago
@Concept were you able to run Mixtral with Exllama as loader in vLLM?
ashleyk
ashleyk5mo ago
Yeah, a blog post might be useful. A client of mine wanted me to set it up for them too, but I failed.
octopus
octopus5mo ago
I'm wondering how some of the people who raised the issue on GitHub were eventually able to run it?
Concept
Concept5mo ago
Took vllm completely out of the equation.
ashleyk
ashleyk5mo ago
Do you by any chance have a Github repo you can share?
Concept
Concept5mo ago
I'll dm you the tutorial
octopus
octopus5mo ago
Can you send it to me too? Would be good to put it here in case others come across the same issue @Concept
Concept
Concept5mo ago
It's from a competitor so I'm going to hold off from posting
Alpay Ariyak
Alpay Ariyak5mo ago
Try MAX_SEQUENCE_LENGTH=8192
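For reference, a minimal sketch of the endpoint env vars with that suggestion applied, reusing values from the original post (whether the worker reads MAX_SEQUENCE_LENGTH or a differently named variable is an assumption here):

MODEL_NAME=casperhansen/mixtral-instruct-awq
QUANTIZATION=awq
TRUST_REMOTE_CODE=1
MAX_SEQUENCE_LENGTH=8192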
octopus
octopus5mo ago
@Alpay Ariyak still getting the same error :/
Alpay Ariyak
Alpay Ariyak5mo ago
I’ll try to get it running tonight
octopus
octopus5mo ago
plz thank you!
JJonahJ
JJonahJ5mo ago
I only got a mixtral working by putting the context a lot lower than I’d hoped to… Edit: actually looking at my template I didn’t set that environment variable. 🤷‍♂️
octopus
octopus5mo ago
at least you got it working though! what value did you put? By context you mean adding MAX_SEQUENCE_LENGTH in env vars right?
JJonahJ
JJonahJ5mo ago
I just looked, all I have is model name and quantization. And yeah, I guess I tried a few times before hitting it lucky with that.
dudicious
dudicious5mo ago
ENFORCE_EAGER? I've had trouble with some quantized models if I don't use eager mode. I wonder if there is some kind of bug with CUDA graphs on quantized models? They always take up way more memory than I'm expecting.
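As a rough local sketch of what eager mode changes, assuming the vLLM Python API and the AWQ model from this thread (values are illustrative, not a definitive reproduction of the worker's setup):

from vllm import LLM, SamplingParams

# Eager mode skips CUDA graph capture, which the log above says can cost
# an extra 1-3 GiB per GPU on top of the weights and KV cache.
llm = LLM(
    model="casperhansen/mixtral-instruct-awq",
    quantization="awq",
    trust_remote_code=True,
    dtype="float16",
    max_model_len=8192,          # well below the 32768 default seen in the log
    gpu_memory_utilization=0.9,
    enforce_eager=True,          # presumably what ENFORCE_EAGER=1 maps to
)

# Quick smoke test to confirm the engine actually came up.
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)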
Alpay Ariyak
Alpay Ariyak5mo ago
Sorry folks, got caught up and haven't had the time yet for the config. Yeah, that would be another good option to try enabling.
octopus
octopus5mo ago
@Alpay Ariyak any updates about this? It seems like the vLLM worker is not working with any of the models, it keeps giving the same OOM error.
Alpay Ariyak
Alpay Ariyak5mo ago
On this now. Hugging Face is still down, can't test 🥲
octopus
octopus5mo ago
HF is up now but btw I’m seeing this error for all models not just Mixtral
Alpay Ariyak
Alpay Ariyak5mo ago
The OOM error?
Alpay Ariyak
Alpay Ariyak5mo ago
GitHub
Mixtral AWQ uses massive amount of memory when using its long conte...
vllm 0.2.7 with cuda 12.1. python -m vllm.entrypoints.openai.api_server --port=5002 --host=0.0.0.0 --model=TheBloke/dolphin-2.7-mixtral-8x7b-AWQ --seed 1234 --trust-remote-code --quantization awq -...
octopus
octopus5mo ago
I'm using the quantized version though. Also tried with non-Mixtral models and it still gave the same error. Is the template working for you for any large models?
Alpay Ariyak
Alpay Ariyak5mo ago
Yeah, the issue is referring to the quantized version specifically. I'm trying to load it with a few different settings now. Got it to work with the following configuration:
ENFORCE_EAGER=1
MAX_MODEL_LENGTH=8192
GPU_MEMORY_UTILIZATION=0.9
QUANTIZATION=awq
TRUST_REMOTE_CODE=1
MODEL_NAME=casperhansen/mixtral-instruct-awq
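Once the endpoint comes up with that config, a sketch of a test request, assuming the standard RunPod /runsync API (the endpoint ID is a placeholder and the exact input schema depends on the worker-vllm version):

import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
headers = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

payload = {
    "input": {
        "prompt": "[INST] Write a haiku about GPUs. [/INST]",
        "sampling_params": {"max_tokens": 128, "temperature": 0.7},
    }
}

resp = requests.post(url, json=payload, headers=headers, timeout=300)
print(resp.json())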
octopus
octopus5mo ago
Ohh cool, I'll try! @Alpay Ariyak can you please try with this model: LoneStriker/Air-Striker-Mixtral-8x7B-Instruct-ZLoss-3.75bpw-h6-exl2? Still getting an OOM error for it.
Alpay Ariyak
Alpay Ariyak5mo ago
What kind of quantization is that?
ashleyk
ashleyk5mo ago
ExLlama2
Alpay Ariyak
Alpay Ariyak5mo ago
vLLM doesn't support it
octopus
octopus5mo ago
Exllamav2_HF is not supported? It's the loader, I'm not sure about the quantization.
Alpay Ariyak
Alpay Ariyak5mo ago
In general, for OOM, you just have to keep playing around with the env vars in the following order:
1. Lower MAX_MODEL_LENGTH (lowers the number of tokens)
2. Lower GPU_MEMORY_UTILIZATION (lowers the number of concurrent requests)
3. Set ENFORCE_EAGER to 1 (disables CUDA graphs, reducing throughput)
Once you find something that works, you can start optimizing by experimenting with higher values to find the most you can get away with.
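As a sketch of that progression in env-var form (variable names as used in this thread, values purely illustrative):

# 1. Lower the context length first
MAX_MODEL_LENGTH=8192
# 2. Then reserve less GPU memory for the KV cache (fewer concurrent requests)
GPU_MEMORY_UTILIZATION=0.85
# 3. Finally disable CUDA graphs
ENFORCE_EAGER=1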
ashleyk
ashleyk5mo ago
There is this, but it doesn't support concurrency https://github.com/ashleykleynhans/runpod-worker-exllamav2
GitHub
GitHub - ashleykleynhans/runpod-worker-exllamav2: RunPod Serverless...
RunPod Serverless worker for ExllamaV2. Contribute to ashleykleynhans/runpod-worker-exllamav2 development by creating an account on GitHub.
Alpay Ariyak
Alpay Ariyak5mo ago
These are the only quantization options supported by vLLM
[image: screenshot of vLLM's supported quantization options]
octopus
octopus5mo ago
Cool! Yeah, the casperhansen/mixtral-instruct-awq model worked with your settings.
dudicious
dudicious5mo ago
If you are using ENFORCE_EAGER, you should be able to increase GPU_MEMORY_UTILIZATION and MAX_MODEL_LENGTH on a 48GB endpoint.
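For example (purely illustrative numbers for a 48GB endpoint, not tested values):

ENFORCE_EAGER=1
GPU_MEMORY_UTILIZATION=0.95
MAX_MODEL_LENGTH=16384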
octopus
octopus5mo ago
awesome! thanks!