Can I run Mixtral 8x22B (q3_k_m) on NVIDIA RTX 3090 Ti?

Verdict: Fail/OOM
Reason: This GPU doesn't have enough VRAM
GPU VRAM: 24.0GB
Required: 56.4GB
Headroom: -32.4GB

VRAM Usage: 100% of 24.0GB used (model does not fit)

Technical Analysis

The NVIDIA RTX 3090 Ti, while a powerful GPU with 24GB of GDDR6X VRAM and 1.01 TB/s of memory bandwidth, falls short of the VRAM required to run Mixtral 8x22B (141B parameters), even with quantization. Quantizing to q3_k_m shrinks the model's footprint considerably, but it still needs roughly 56.4GB of VRAM, leaving a 32.4GB deficit against the card's 24GB. The full set of weights therefore cannot be loaded onto the GPU, so inference cannot complete. The RTX 3090 Ti's high memory bandwidth would help if the model fit, but it is moot given the VRAM limitation.
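
As a rough sanity check, the 56.4GB figure can be reproduced from the parameter count and the effective bits per weight of the quantization. The sketch below assumes q3_k_m averages about 3.2 bits per weight, a value chosen here to match the tool's estimate; real GGUF files add metadata, KV-cache, and runtime overhead on top of the raw weights.

  # Back-of-envelope estimate of the VRAM needed just for the quantized weights.
  params = 141e9            # Mixtral 8x22B parameter count
  bits_per_weight = 3.2     # assumed effective rate for q3_k_m (illustrative)
  weight_gb = params * bits_per_weight / 8 / 1e9
  print(f"Approximate weight size: {weight_gb:.1f} GB")          # ~56.4 GB
  print(f"Deficit vs. 24.0 GB VRAM: {weight_gb - 24.0:.1f} GB")  # ~32.4 GB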

Even with workarounds such as CPU offloading or NVMe swapping, performance would be severely degraded, rendering the model practically unusable. CPU offloading keeps some layers in system RAM, which has far lower bandwidth than VRAM; NVMe swapping treats an SSD as overflow space and is slower still. The Ampere architecture's Tensor Cores would accelerate the matrix multiplications if the weights fit in VRAM, but that potential goes unrealized under this memory constraint. In practice, no usable tokens/sec or batch size can be achieved on this card.
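
For illustration, partial GPU offload is usually configured by capping how many transformer layers are placed in VRAM, with the remainder held in system RAM. A minimal sketch using the llama-cpp-python bindings follows; the model filename and layer count are placeholders, and on a single 24GB card with most of a 141B model in system RAM, throughput would be far too low for interactive use.

  from llama_cpp import Llama

  # Keep only a small fraction of layers on the GPU; the rest run from system RAM.
  llm = Llama(
      model_path="mixtral-8x22b-v0.1.Q3_K_M.gguf",  # placeholder filename
      n_gpu_layers=16,  # well below the full layer count; 24GB cannot hold them all
      n_ctx=4096,       # modest context to limit KV-cache growth
  )

  out = llm("Explain mixture-of-experts routing in one sentence.", max_tokens=64)
  print(out["choices"][0]["text"])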

Recommendation

Given the VRAM limitations of the RTX 3090 Ti, directly running Mixtral 8x22B (141B) is not feasible. Consider a smaller model that fits within 24GB, such as a quantized 7B or 13B parameter model. Alternatively, explore cloud-based options like NelsaHost or other services that provide access to GPUs with 80GB or more of VRAM, which can run Mixtral 8x22B effectively.

If you are determined to run Mixtral 8x22B locally, investigate model parallelism across multiple GPUs, if your system supports it: the model's layers or tensor shards are split across several cards, each holding a portion of the weights. This requires supporting software and substantial system resources. CPU offloading or disk swapping remain options, but expect drastically reduced performance that makes interactive use impractical. For practical use, prioritize cloud solutions or smaller models.
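
As a sketch of what tensor parallelism looks like in practice, the vLLM snippet below shards one model across several GPUs. The model identifier and GPU count are illustrative; this path loads the weights at their native precision rather than q3_k_m, so it assumes a multi-GPU machine with far more aggregate VRAM than a single RTX 3090 Ti provides.

  from vllm import LLM, SamplingParams

  # Shard the model's weight matrices across multiple GPUs (tensor parallelism).
  llm = LLM(
      model="mistralai/Mixtral-8x22B-Instruct-v0.1",  # illustrative model id
      tensor_parallel_size=4,                         # number of GPUs to shard across
  )

  sampling = SamplingParams(max_tokens=64)
  outputs = llm.generate(["Explain mixture-of-experts routing in one sentence."], sampling)
  print(outputs[0].outputs[0].text)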

Recommended Settings

Batch Size: N/A (cannot run effectively)
Context Length: N/A (cannot run effectively)
Other Settings:
- Enable CPU offloading (expect very slow performance)
- Explore model parallelism if multiple GPUs are available
Inference Framework: llama.cpp (for CPU offloading experiments), vLLM …
Quantization Suggested: No change needed (q3_k_m is already aggressive)

Frequently Asked Questions

Is Mixtral 8x22B (141B) compatible with NVIDIA RTX 3090 Ti?
No, the RTX 3090 Ti's 24GB VRAM is insufficient to run Mixtral 8x22B, even with quantization.
What VRAM is needed for Mixtral 8x22B (141B)?
Mixtral 8x22B quantized to q3_k_m requires approximately 56.4GB of VRAM.
How fast will Mixtral 8x22B (141B) run on NVIDIA RTX 3090 Ti?
Mixtral 8x22B will not run on the RTX 3090 Ti due to insufficient VRAM. Performance will be effectively zero tokens/sec.