octopus
octopus5mo ago

Help: Serverless Mixtral OutOfMemory Error

I can't get Mixtral-8x7B-Instruct to run on Serverless using the vLLM RunPod worker, neither for the original Mistral model nor for any of the quantized models.

Settings I'm using:
GPU: 48GB (also tried 80GB)
Container Image: runpod/worker-vllm:0.3.0-cuda11.8.0
Env 1: MODEL_NAME=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ (also tried: casperhansen/mixtral-instruct-awq, TheBloke/firefly-mixtral-8x7b-GPTQ, and mistralai/Mixtral-8x7B-Instruct-v0.1)
Env 2: TRUST_REMOTE_CODE=1
Env 3: QUANTIZATION=awq (or gptq for the GPTQ models)

What am I doing wrong?? @Alpay Ariyak

ERROR Log:
WARNING 02-26 18:08:57 config.py:186] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
llm_engine.py:79] Initializing an LLM engine with config: model='casperhansen/mixtral-instruct-awq', tokenizer='casperhansen/mixtral-instruct-awq', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/huggingface-cache/hub', load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Using model weights format ['*.safetensors']
llm_engine.py:337] # GPU blocks: 7488, # CPU blocks: 2048
Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.

Error initializing vLLM engine: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Concept
Concept5mo ago
I feel like the devs should put out a tutorial for loading mixtral use cases. Lots of people seem to be having trouble with it.
octopus
octopus5mo ago
@Concept were you able to run Mixtral with Exllama as loader in vLLM?
ashleyk
ashleyk5mo ago
Yeah, a blog post might be useful. A client of mine wanted me to set it up for them too, but I failed.
octopus
octopus5mo ago
I'm wondering how some of the people who raised the issue on GitHub were eventually able to run it?
Concept
Concept5mo ago
Took vllm completely out of the equation.
ashleyk
ashleyk5mo ago
Do you by any chance have a Github repo you can share?
Concept
Concept5mo ago
I'll dm you the tutorial
octopus
octopus5mo ago
Can you send it to me too? Would be good to put it here in case others come across the same issue @Concept
Concept
Concept5mo ago
It's from a competitor so I'm going to hold off from posting
Alpay Ariyak
Alpay Ariyak5mo ago
Try MAX_SEQUENCE_LENGTH=8192
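For reference, a minimal sketch of the endpoint env vars with that suggestion applied, reusing values from the original post (whether the worker reads MAX_SEQUENCE_LENGTH or a differently named variable is an assumption here):

MODEL_NAME=casperhansen/mixtral-instruct-awq
QUANTIZATION=awq
TRUST_REMOTE_CODE=1
MAX_SEQUENCE_LENGTH=8192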
octopus
octopus5mo ago
@Alpay Ariyak still getting the same error :/
Alpay Ariyak
Alpay Ariyak5mo ago
I’ll try to get it running tonight
octopus
octopus5mo ago
plz thank you!
JJonahJ
JJonahJ5mo ago
I only got a mixtral working by putting the context a lot lower than I’d hoped to… Edit: actually looking at my template I didn’t set that environment variable. 🤷‍♂️
octopus
octopus5mo ago
at least you got it working though! what value did you put? By context you mean adding MAX_SEQUENCE_LENGTH in env vars right?
JJonahJ
JJonahJ5mo ago
I just looked, all I have is model name and quantization. And yeah, I guess I tried a few times before hitting it lucky with that.
dudicious
dudicious5mo ago
ENFORCE_EAGER? I've had trouble with some quantized models if I don't use eager mode. I wonder if there is some kind of bug with CUDA graphs on quantized models? They always take up way more memory than I'm expecting.
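As a rough local sketch of what eager mode changes, assuming the vLLM Python API and the AWQ model from this thread (values are illustrative, not a definitive reproduction of the worker's setup):

from vllm import LLM, SamplingParams

# Eager mode skips CUDA graph capture, which the log above says can cost
# an extra 1-3 GiB per GPU on top of the weights and KV cache.
llm = LLM(
    model="casperhansen/mixtral-instruct-awq",
    quantization="awq",
    trust_remote_code=True,
    dtype="float16",
    max_model_len=8192,          # well below the 32768 default seen in the log
    gpu_memory_utilization=0.9,
    enforce_eager=True,          # presumably what ENFORCE_EAGER=1 maps to
)

# Quick smoke test to confirm the engine actually came up.
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)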
Alpay Ariyak
Alpay Ariyak5mo ago
Sorry folks, got caught up and haven't had the time yet for the config. Yeah, that would be another good option to try enabling.
octopus
octopus5mo ago
@Alpay Ariyak any updates about this? It seems like the vLLM worker is not working with any of the models, it keeps giving the same OOM error.
Alpay Ariyak
Alpay Ariyak5mo ago
On this now. Hugging Face is still down, can't test 🥲
octopus
octopus5mo ago
HF is up now but btw I’m seeing this error for all models not just Mixtral
Alpay Ariyak
Alpay Ariyak5mo ago
The OOM error?
Alpay Ariyak
Alpay Ariyak5mo ago
GitHub
Mixtral AWQ uses massive amount of memory when using its long conte...
vllm 0.2.7 with cuda 12.1. python -m vllm.entrypoints.openai.api_server --port=5002 --host=0.0.0.0 --model=TheBloke/dolphin-2.7-mixtral-8x7b-AWQ --seed 1234 --trust-remote-code --quantization awq -...
octopus
octopus5mo ago
I'm using the quantized version though. Also tried with non-Mixtral models and it still gave the same error. Is the template working for you for any large models?
Alpay Ariyak
Alpay Ariyak5mo ago
Yeah, the issue is referring to the quantized version specifically. I'm trying to load it with a few different settings now. Got it to work with the following configuration:
ENFORCE_EAGER=1
MAX_MODEL_LENGTH=8192
GPU_MEMORY_UTILIZATION=0.9
QUANTIZATION=awq
TRUST_REMOTE_CODE=1
MODEL_NAME=casperhansen/mixtral-instruct-awq
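Once the endpoint comes up with that config, a sketch of a test request, assuming the standard RunPod /runsync API (the endpoint ID is a placeholder and the exact input schema depends on the worker-vllm version):

import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
headers = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}

payload = {
    "input": {
        "prompt": "[INST] Write a haiku about GPUs. [/INST]",
        "sampling_params": {"max_tokens": 128, "temperature": 0.7},
    }
}

resp = requests.post(url, json=payload, headers=headers, timeout=300)
print(resp.json())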
octopus
octopus5mo ago
Ohh cool, I'll try! @Alpay Ariyak can you please try with this model: LoneStriker/Air-Striker-Mixtral-8x7B-Instruct-ZLoss-3.75bpw-h6-exl2? Still getting an OOM error for it.
Alpay Ariyak
Alpay Ariyak5mo ago
What kind of quantization is that?
ashleyk
ashleyk5mo ago
ExLlama2
Alpay Ariyak
Alpay Ariyak5mo ago
vLLM doesn't support it
octopus
octopus5mo ago
Exllamav2_HF is not supported? It's the loader, I'm not sure about the quantization.
Alpay Ariyak
Alpay Ariyak5mo ago
In general, for OOM, you just have to keep playing around with the env vars in the following order:
1. Lower MAX_MODEL_LENGTH (lowers the number of tokens)
2. Lower GPU_MEMORY_UTILIZATION (lowers the number of concurrent requests)
3. Set ENFORCE_EAGER to 1 (disables CUDA graphs, reducing throughput)
Once you find something that works, you can start optimizing by experimenting with higher values to find the most you can get away with.
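As a sketch of that progression in env-var form (variable names as used in this thread, values purely illustrative):

# 1. Lower the context length first
MAX_MODEL_LENGTH=8192
# 2. Then reserve less GPU memory for the KV cache (fewer concurrent requests)
GPU_MEMORY_UTILIZATION=0.85
# 3. Finally disable CUDA graphs
ENFORCE_EAGER=1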
ashleyk
ashleyk5mo ago
There is this, but it doesn't support concurrency https://github.com/ashleykleynhans/runpod-worker-exllamav2
GitHub
GitHub - ashleykleynhans/runpod-worker-exllamav2: RunPod Serverless...
RunPod Serverless worker for ExllamaV2. Contribute to ashleykleynhans/runpod-worker-exllamav2 development by creating an account on GitHub.
Alpay Ariyak
Alpay Ariyak5mo ago
These are the only quantization options supported by vLLM
[image: screenshot of vLLM's supported quantization options]
octopus
octopus5mo ago
Cool! Yeah, the casperhansen/mixtral-instruct-awq model worked with your settings.
dudicious
dudicious5mo ago
If you are using ENFORCE_EAGER, you should be able to increase GPU_MEMORY_UTILIZATION and MAX_MODEL_LENGTH on a 48GB endpoint.
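For example (purely illustrative numbers for a 48GB endpoint, not tested values):

ENFORCE_EAGER=1
GPU_MEMORY_UTILIZATION=0.95
MAX_MODEL_LENGTH=16384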
octopus
octopus5mo ago
awesome! thanks!