Can I run Mixtral 8x22B (Q4_K_M, GGUF 4-bit) on an NVIDIA RTX 3090 Ti?

Result: Fail/OOM (this GPU does not have enough VRAM)

GPU VRAM: 24.0 GB
Required: 70.5 GB
Headroom: -46.5 GB

VRAM Usage: 100% of the 24.0 GB consumed (model does not fit)

Technical Analysis

The NVIDIA RTX 3090 Ti, with its 24GB of GDDR6X VRAM, falls short of the VRAM requirements for running the Mixtral 8x22B (141B) model, even in its Q4_K_M (4-bit) quantized form. The quantized model requires approximately 70.5GB of VRAM, resulting in a significant VRAM deficit of 46.5GB. This means the entire model cannot be loaded onto the GPU for inference. While the RTX 3090 Ti boasts a high memory bandwidth of 1.01 TB/s and a substantial number of CUDA and Tensor cores, these resources become irrelevant if the model cannot fit into the available VRAM.
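
The 70.5 GB figure corresponds to roughly 4 bits per weight across the 141B parameters. A minimal sketch of that estimate (the flat 4-bits-per-weight assumption is ours; real Q4_K_M files carry extra overhead for quantization scales and metadata):

```python
# Rough weight-memory estimate for a 4-bit quantized model.
# Assumption: a flat 4 bits per weight; actual Q4_K_M GGUF files are somewhat
# larger because of per-block scales and metadata.
def estimate_weight_vram_gb(n_params: float, bits_per_weight: float = 4.0) -> float:
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

required = estimate_weight_vram_gb(141e9)   # ~70.5 GB
headroom = 24.0 - required                  # ~-46.5 GB on a 24 GB RTX 3090 Ti
print(f"Required: {required:.1f} GB, headroom: {headroom:.1f} GB")
```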

Attempting to load the model without sufficient VRAM will produce out-of-memory errors. Offloading some layers to system RAM is possible, but performance would be severely degraded by the much slower transfers between the GPU and system memory. The limited VRAM also restricts the achievable batch size and context length, further hindering throughput. The Ampere architecture of the RTX 3090 Ti is well suited to AI workloads, but its VRAM capacity is the limiting factor in this scenario.
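
To make the batch-size and context-length point concrete, the KV cache consumes memory on top of the weights. A rough estimator is sketched below; the layer count, KV-head count, and head dimension in the example call are illustrative assumptions, not confirmed Mixtral 8x22B specifications:

```python
# Rough KV-cache size: K and V tensors for every layer, stored per token.
# The architecture numbers passed in below are illustrative assumptions only.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, batch: int, bytes_per_elt: int = 2) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt  # K + V
    return per_token * ctx_len * batch / 1e9

# Hypothetical example: 56 layers, 8 KV heads, head dim 128, fp16 cache.
print(kv_cache_gb(56, 8, 128, ctx_len=8192, batch=1))  # ~1.9 GB on top of the weights
```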

Recommendation

Given the VRAM constraints, running Mixtral 8x22B (141B) on a single RTX 3090 Ti is not feasible. Consider exploring distributed inference solutions that utilize multiple GPUs to pool their VRAM resources. Alternatively, explore smaller language models that fit within the RTX 3090 Ti's VRAM capacity. Another option is to use cloud-based inference services that offer access to GPUs with larger VRAM configurations. Fine-tuning a smaller, more efficient model for your specific use case could also be a viable path forward.
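
As a back-of-the-envelope check on the multi-GPU route (the 2 GB per-GPU reserve for KV cache and activations is an assumption for illustration):

```python
import math

# Minimum number of 24 GB cards needed just to hold ~70.5 GB of quantized weights,
# assuming roughly 2 GB per GPU is reserved for KV cache and activations (illustrative).
required_gb = 70.5
usable_per_gpu_gb = 24.0 - 2.0
print(math.ceil(required_gb / usable_per_gpu_gb))  # 4 GPUs
```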

If you are set on running Mixtral 8x22B locally, consider a CPU-based inference setup using llama.cpp: system RAM replaces VRAM as the constraint, but throughput will be far lower than GPU inference. Exploring more aggressive quantization can shrink the memory footprint further, though aggressive quantization can impact model accuracy.
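
A minimal sketch of that CPU-first setup via the llama-cpp-python bindings, assuming you have a local Q4_K_M GGUF of the model (the file path is hypothetical); n_gpu_layers can be raised above zero to offload only as many layers as the 24 GB card actually holds:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical local path to a Q4_K_M GGUF of Mixtral 8x22B; you will need
# well over 70 GB of free system RAM just to hold the weights.
llm = Llama(
    model_path="./mixtral-8x22b-instruct.Q4_K_M.gguf",
    n_ctx=4096,       # modest context to limit memory use
    n_gpu_layers=0,   # 0 = pure CPU; raise cautiously for partial offload to the 3090 Ti
    n_threads=16,     # tune to your CPU core count
)

out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Even with a few layers offloaded to the GPU, expect throughput far below full GPU inference, consistent with the FAQ answer below.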

Recommended Settings

Batch Size: N/A (model does not fit)
Context Length: N/A (model does not fit)
Other Settings: offload to CPU (extremely slow); consider smaller models
Inference Framework: llama.cpp (for CPU inference)
Suggested Quantization: no change makes the model feasible on the RTX 3090 Ti

Frequently Asked Questions

Is Mixtral 8x22B (141B) compatible with the NVIDIA RTX 3090 Ti?
No, the Mixtral 8x22B model requires more VRAM than the NVIDIA RTX 3090 Ti provides.
What VRAM is needed for Mixtral 8x22B (141B)?
The Q4_K_M quantized version of Mixtral 8x22B requires approximately 70.5GB of VRAM.
How fast will Mixtral 8x22B (141B) run on the NVIDIA RTX 3090 Ti?
It will not run on the RTX 3090 Ti due to insufficient VRAM. Offloading to CPU is possible but will result in very slow performance.