octopus · 5mo ago

Can we add the minimum GPU configs required for running popular models like Mistral and Mixtral?

I'm trying to find out which serverless GPU configs are required to run Mixtral 8x7B-Instruct, either quantized (https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ) or the main model from Mistral. It would be good to have this info in the README of the vLLM worker repo. I run into OutOfMemory errors when trying it on a 48GB GPU.
11 Replies
ashleyk · 5mo ago
Configs where? I assume this is for the vLLM worker? For the main, non-quantized Mixtral you need at least 2 x A100.
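For rough context, a back-of-envelope weight-memory estimate lines up with that answer. The sketch below assumes the commonly quoted ~46.7B total parameters for Mixtral 8x7B; KV cache, activations and runtime overhead come on top of the weights.

```python
# Rough weight-only VRAM estimate for Mixtral-8x7B (~46.7B total parameters).
# KV cache, activations and CUDA/vLLM overhead are extra on top of this.
PARAMS = 46.7e9

def weight_gib(bits_per_param: float) -> float:
    """GiB needed for the raw weights at a given precision."""
    return PARAMS * bits_per_param / 8 / 2**30

print(f"fp16 weights : ~{weight_gib(16):.0f} GiB")  # ~87 GiB -> needs 2 x 80 GB cards
print(f"4-bit (GPTQ) : ~{weight_gib(4):.0f} GiB")   # ~22 GiB, before KV cache etc.
```

So the fp16 weights alone overflow a single 80 GB card, and even a 4-bit quant plus its KV cache can crowd out 48 GB.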
octopus · 5mo ago
Right, using the vLLM worker. And do you know the requirements for GPTQ-quantized Mixtral? Thanks!
ashleyk · 5mo ago
No, I don't know, but here is how much you need for the main model: https://www.youtube.com/watch?v=WjiX3lCnwUI
YouTube: Matthew Berman — Mixtral 8x7B DESTROYS Other Models (MoE = AGI?)
ashleyk · 5mo ago
@Alpay Ariyak may be able to advise since he maintains the vLLM worker for RunPod.
octopus · 5mo ago
@Alpay Ariyak I tried an A100 80GB GPU on serverless for quantized Mixtral but still get out-of-memory errors:
...
2024-02-26T14:33:29.975180031Z File "/vllm-installation/vllm/model_executor/layers/quantization/gptq.py", line 205, in apply_weights
2024-02-26T14:33:29.975202231Z output = ops.gptq_gemm(reshaped_x, weights["qweight"],
2024-02-26T14:33:29.975208532Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 94.12 MiB is free. Process 3657454 has 79.04 GiB memory in use. Of the allocated memory 70.74 GiB is allocated by PyTorch, and 168.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
...
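For what it's worth, this kind of failure usually means the KV cache was sized to nearly the whole card, leaving no headroom for the GPTQ matmul workspace. Below is a minimal sketch with the standalone vLLM Python API that caps context length and memory utilization; the values are illustrative guesses, not verified settings for the RunPod worker.

```python
# Sketch: loading the GPTQ quant with the standalone vLLM Python API while
# capping context length and GPU memory use. The KV cache is sized from
# max_model_len and gpu_memory_utilization, so lowering them leaves headroom
# for the GPTQ kernels' temporary buffers. Values are illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    dtype="half",                 # GPTQ kernels run with fp16 activations
    max_model_len=4096,           # smaller context -> smaller KV cache
    gpu_memory_utilization=0.85,  # leave headroom outside the KV cache
)

outputs = llm.generate(
    ["[INST] Hello, who are you? [/INST]"],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```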
ashleyk · 5mo ago
GitHub issue: Cannot run Mixtral 8x7B Instruct AWQ · Issue #49 · runpod-workers/w...
"I have successfully been able to run mistral/Mistral-7b-Instruct in both original and quantized (awq) format on runpod serverless using this repo. However, when I try to run Mixtral AWQ, I simply g..."
octopus · 5mo ago
Yes, I tried both the GPTQ and the AWQ versions, even the one mentioned by the poster there, but neither seems to work on serverless.
Concept · 5mo ago
I would recommend using ExLlamaV2 for loading Mixtral.
ashleyk · 5mo ago
In serverless?
Concept · 5mo ago
Yes. vLLM is still super buggy with quantizations, and there's no cost-effective way of running the full Mixtral model with it; the 5-bit variant runs in about 33 GB of VRAM.
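For reference, this is roughly how an EXL2-quantized Mixtral gets loaded with ExLlamaV2, following the pattern from the library's own example scripts; the model path is a placeholder and API details may differ between versions.

```python
# Sketch: loading an EXL2-quantized Mixtral with ExLlamaV2, based on the
# library's example scripts. The model directory is a placeholder.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Mixtral-8x7B-Instruct-5.0bpw-exl2"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocate the cache as layers load
model.load_autosplit(cache)               # split weights across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

print(generator.generate_simple("[INST] Hello, who are you? [/INST]", settings, 128))
```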
ashleyk · 5mo ago
I have this, but haven't managed to get streaming working: https://github.com/ashleykleynhans/runpod-worker-exllamav2
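On the streaming point, the RunPod Python SDK accepts a generator handler: each value the handler yields is sent as a chunk on the job's /stream endpoint, and return_aggregate_stream makes the aggregated output available to normal /run callers as well. A rough sketch, with the token loop as a placeholder for the real ExLlamaV2 streaming generator:

```python
# Sketch: streaming from a RunPod serverless worker by yielding partial
# results from a generator handler. The token source is a placeholder for
# the actual ExLlamaV2 streaming generator.
import runpod

def fake_token_stream(prompt):
    # Placeholder for a real streaming generator producing tokens one by one.
    for piece in ["Hello", " from", " a", " streamed", " reply"]:
        yield piece

def handler(job):
    prompt = job["input"]["prompt"]
    for token in fake_token_stream(prompt):
        yield {"text": token}  # each yield becomes one chunk on /stream

runpod.serverless.start({
    "handler": handler,
    "return_aggregate_stream": True,  # aggregate the chunks for /run callers too
})
```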