octopus · 5mo ago

Can we add the minimum GPU configs required for running popular models like Mistral and Mixtral?

I'm trying to find out which serverless GPU configs are required to run Mixtral 8x7B-Instruct, either quantized (https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ) or the main model from Mistral. It would be good to have this info in the README of the vLLM worker repo. I run into OutOfMemory errors when trying it on a 48GB GPU.
11 Replies
ashleyk · 5mo ago
Configs where? I assume this is for the vLLM worker? For the main, non-quantized Mixtral you need at least 2 x A100.
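For rough context, a back-of-envelope weight-memory estimate lines up with that answer. The sketch below assumes the commonly quoted ~46.7B total parameters for Mixtral 8x7B; KV cache, activations and runtime overhead come on top of the weights.

```python
# Rough weight-only VRAM estimate for Mixtral-8x7B (~46.7B total parameters).
# KV cache, activations and CUDA/vLLM overhead are extra on top of this.
PARAMS = 46.7e9

def weight_gib(bits_per_param: float) -> float:
    """GiB needed for the raw weights at a given precision."""
    return PARAMS * bits_per_param / 8 / 2**30

print(f"fp16 weights : ~{weight_gib(16):.0f} GiB")  # ~87 GiB -> needs 2 x 80 GB cards
print(f"4-bit (GPTQ) : ~{weight_gib(4):.0f} GiB")   # ~22 GiB, before KV cache etc.
```

So the fp16 weights alone overflow a single 80 GB card, and even a 4-bit quant plus its KV cache can crowd out 48 GB.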
octopus · 5mo ago
Right, using the vLLM worker. And do you know the requirements for GPTQ-quantized Mixtral? Thanks!
ashleyk · 5mo ago
No, I don't know, but here is how much you need for the main model: https://www.youtube.com/watch?v=WjiX3lCnwUI
YouTube: Matthew Berman — Mixtral 8x7B DESTROYS Other Models (MoE = AGI?)
ashleyk · 5mo ago
@Alpay Ariyak may be able to advise since he maintains the vLLM worker for RunPod.
octopus · 5mo ago
@Alpay Ariyak I tried an A100 80GB GPU on serverless for quantized Mixtral but still get out-of-memory errors:
...
2024-02-26T14:33:29.975180031Z File "/vllm-installation/vllm/model_executor/layers/quantization/gptq.py", line 205, in apply_weights
2024-02-26T14:33:29.975202231Z output = ops.gptq_gemm(reshaped_x, weights["qweight"],
2024-02-26T14:33:29.975208532Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 94.12 MiB is free. Process 3657454 has 79.04 GiB memory in use. Of the allocated memory 70.74 GiB is allocated by PyTorch, and 168.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
...
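For what it's worth, this kind of failure usually means the KV cache was sized to nearly the whole card, leaving no headroom for the GPTQ matmul workspace. Below is a minimal sketch with the standalone vLLM Python API that caps context length and memory utilization; the values are illustrative guesses, not verified settings for the RunPod worker.

```python
# Sketch: loading the GPTQ quant with the standalone vLLM Python API while
# capping context length and GPU memory use. The KV cache is sized from
# max_model_len and gpu_memory_utilization, so lowering them leaves headroom
# for the GPTQ kernels' temporary buffers. Values are illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    dtype="half",                 # GPTQ kernels run with fp16 activations
    max_model_len=4096,           # smaller context -> smaller KV cache
    gpu_memory_utilization=0.85,  # leave headroom outside the KV cache
)

outputs = llm.generate(
    ["[INST] Hello, who are you? [/INST]"],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```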
ashleyk · 5mo ago
GitHub issue: Cannot run Mixtral 8x7B Instruct AWQ · Issue #49 · runpod-workers/w...
"I have successfully been able to run mistral/Mistral-7b-Instruct in both original and quantized (awq) format on runpod serverless using this repo. However, when I try to run Mixtral AWQ, I simply g..."
octopus · 5mo ago
Yes, I tried both the GPTQ and the AWQ versions, even the one mentioned by the poster there, but neither seems to work on serverless.
Concept · 5mo ago
I would recommend using ExLlamaV2 for loading Mixtral.
ashleyk · 5mo ago
In serverless?
Concept · 5mo ago
Yes. vLLM is still super buggy with quantizations, and there's no cost-effective way of running the full Mixtral model with it; the 5-bit variant runs in about 33 GB of VRAM.
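For reference, this is roughly how an EXL2-quantized Mixtral gets loaded with ExLlamaV2, following the pattern from the library's own example scripts; the model path is a placeholder and API details may differ between versions.

```python
# Sketch: loading an EXL2-quantized Mixtral with ExLlamaV2, based on the
# library's example scripts. The model directory is a placeholder.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Mixtral-8x7B-Instruct-5.0bpw-exl2"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocate the cache as layers load
model.load_autosplit(cache)               # split weights across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

print(generator.generate_simple("[INST] Hello, who are you? [/INST]", settings, 128))
```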
ashleyk · 5mo ago
I have this, but haven't managed to get streaming working: https://github.com/ashleykleynhans/runpod-worker-exllamav2
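On the streaming point, the RunPod Python SDK accepts a generator handler: each value the handler yields is sent as a chunk on the job's /stream endpoint, and return_aggregate_stream makes the aggregated output available to normal /run callers as well. A rough sketch, with the token loop as a placeholder for the real ExLlamaV2 streaming generator:

```python
# Sketch: streaming from a RunPod serverless worker by yielding partial
# results from a generator handler. The token source is a placeholder for
# the actual ExLlamaV2 streaming generator.
import runpod

def fake_token_stream(prompt):
    # Placeholder for a real streaming generator producing tokens one by one.
    for piece in ["Hello", " from", " a", " streamed", " reply"]:
        yield piece

def handler(job):
    prompt = job["input"]["prompt"]
    for token in fake_token_stream(prompt):
        yield {"text": token}  # each yield becomes one chunk on /stream

runpod.serverless.start({
    "handler": handler,
    "return_aggregate_stream": True,  # aggregate the chunks for /run callers too
})
```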