Can I run Mixtral 8x22B (INT8, 8-bit integer) on an NVIDIA RTX 4090?

Verdict: Fail / OOM (this GPU does not have enough VRAM)

GPU VRAM: 24.0 GB
Required: 141.0 GB
Headroom: -117.0 GB

VRAM Usage: 24.0 GB of 24.0 GB (100% used)

Technical Analysis

The NVIDIA RTX 4090, while a powerful GPU, falls well short of the VRAM needed to run the Mixtral 8x22B (141B) model, even with INT8 quantization. Mixtral 8x22B requires approximately 141 GB of VRAM when quantized to INT8, while the RTX 4090 offers only 24 GB, leaving a deficit of 117 GB. The model therefore cannot fit in the GPU's memory, which leads to out-of-memory errors and prevents inference from running at all.
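As a rough sanity check, the required memory can be estimated directly from the parameter count and the bytes stored per weight. The sketch below uses weights-only figures (real usage adds KV cache and runtime buffers on top) and shows why even far more aggressive quantization than INT8 still leaves the RTX 4090 short:

```python
# Weights-only VRAM estimate for Mixtral 8x22B (~141B parameters).
# KV cache and runtime buffers would add several more GB on top of these figures.
PARAMS_B = 141        # total parameters, in billions
GPU_VRAM_GB = 24.0    # NVIDIA RTX 4090

bytes_per_weight = {"FP16": 2.0, "INT8": 1.0, "4-bit": 0.5, "2-bit": 0.25}

for precision, bpw in bytes_per_weight.items():
    required_gb = PARAMS_B * bpw
    verdict = "fits" if required_gb <= GPU_VRAM_GB else f"short by {required_gb - GPU_VRAM_GB:.0f} GB"
    print(f"{precision:>5}: ~{required_gb:.0f} GB of weights -> {verdict}")
```

Even the 2-bit figure (~35 GB) exceeds the card's 24 GB, which is why some form of CPU offloading is unavoidable on this GPU.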

Even if techniques like CPU offloading were attempted, the limited bandwidth between the GPU and system RAM over PCIe would severely bottleneck performance. The RTX 4090's 1.01 TB/s memory bandwidth is excellent for models that fit within its VRAM, but it cannot compensate for the massive per-token transfers required when most of the model sits in slower system memory. Its 16,384 CUDA cores and 512 Tensor Cores are rendered largely ineffective in this scenario by the VRAM constraint, so achieving reasonable inference speeds, or running the model at all, is highly improbable.
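A back-of-envelope calculation makes the offloading bottleneck concrete: during generation, every offloaded weight a token needs must cross the PCIe bus, so link bandwidth caps throughput no matter how fast the GPU is. The numbers below are assumptions for illustration (roughly 39B active parameters per token, since Mixtral routes each token through 2 of its 8 experts, and about 25 GB/s of practical PCIe 4.0 x16 throughput):

```python
# Back-of-envelope throughput ceiling when most weights live in system RAM.
# Assumed figures: ~39B active parameters per token at INT8 (1 byte each),
# ~25 GB/s practical PCIe 4.0 x16 throughput, 1.01 TB/s on-card bandwidth.
ACTIVE_PARAMS_B = 39      # parameters touched per token, in billions
BYTES_PER_WEIGHT = 1.0    # INT8
PCIE_GBPS = 25.0          # practical PCIe 4.0 x16 transfer rate, GB/s
GPU_BW_GBPS = 1010.0      # RTX 4090 on-card memory bandwidth, GB/s

traffic_per_token_gb = ACTIVE_PARAMS_B * BYTES_PER_WEIGHT  # ~39 GB per token
print(f"Ceiling with weights streamed over PCIe: {PCIE_GBPS / traffic_per_token_gb:.2f} tokens/s")
print(f"Ceiling with weights held in VRAM:       {GPU_BW_GBPS / traffic_per_token_gb:.1f} tokens/s")
```

Even under these optimistic assumptions, the PCIe path caps generation well below one token per second, roughly 40x slower than reading the same weights from VRAM.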

Recommendation

Due to the substantial VRAM requirements of Mixtral 8x22B, running it directly on an RTX 4090 is not feasible. Consider cloud-based inference services that offer GPUs with sufficient VRAM, such as those on Google Cloud, AWS, or Azure. Alternatively, explore model parallelism, where the model is split across multiple GPUs that together provide enough VRAM to hold it. This requires a multi-GPU setup, which is more complex but potentially viable.
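For reference, a multi-GPU deployment with a framework such as vLLM typically shards the model with tensor parallelism. The sketch below is illustrative only: the Hugging Face model ID, the GPU count (eight 80 GB cards), and the context cap are assumptions, and the exact memory needed depends on precision and KV-cache settings.

```python
# Illustrative multi-GPU serving sketch using vLLM tensor parallelism.
# Assumes a node with eight 80 GB GPUs; adjust tensor_parallel_size to your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",  # assumed checkpoint
    tensor_parallel_size=8,   # shard the weights across 8 GPUs
    max_model_len=8192,       # cap context length to bound KV-cache memory
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain mixture-of-experts routing in one paragraph."], params)
print(outputs[0].outputs[0].text)
```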

If cloud solutions or multi-GPU setups are not an option, consider smaller models that fit within the RTX 4090's VRAM, such as quantized versions of other LLMs. Fine-tuning a smaller model on a relevant dataset can be a more practical use of the available hardware. You can also explore extreme quantization, such as 4-bit or even 2-bit weights, but note that even these (roughly 70 GB and 35 GB respectively for 141B parameters) still exceed 24 GB of VRAM, so any experimentation would still rely on CPU offloading and would come with a significant loss of accuracy.

Recommended Settings

Batch size: 1
Context length: reduce to the smallest usable context length
Other settings: enable CPU offloading as a last resort (expect very slow performance), or use a smaller model instead
Inference framework: llama.cpp (with extreme quantization) or potentia…
Suggested quantization: q4_K_M or even smaller (q2_K)
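If you still want to experiment on a single RTX 4090, these settings map roughly onto a llama-cpp-python call like the one below. The GGUF file name is hypothetical and the layer-offload count is only a starting guess to tune against the 24 GB limit; expect generation to be dominated by CPU work and PCIe transfers.

```python
# Experimental single-GPU setup applying the recommended settings above.
# The GGUF file name is hypothetical; lower n_gpu_layers if CUDA reports
# out-of-memory on the 24 GB card.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x22b-instruct.Q4_K_M.gguf",  # q4_K_M (or smaller q2_K) quant
    n_gpu_layers=12,   # offload only as many layers as fit within 24 GB of VRAM
    n_ctx=2048,        # keep the context window small, as recommended above
)

# One prompt at a time (batch size 1); expect very low tokens/s because most
# layers run on the CPU and their weights never reach the GPU.
out = llm("Summarize why this model cannot fit on a 24 GB GPU.", max_tokens=64)
print(out["choices"][0]["text"])
```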

Frequently Asked Questions

Is Mixtral 8x22B (141.00B) compatible with NVIDIA RTX 4090?
No, the RTX 4090 does not have enough VRAM to run Mixtral 8x22B, even with INT8 quantization.
What VRAM is needed for Mixtral 8x22B (141.00B)?
Mixtral 8x22B requires approximately 141GB of VRAM when quantized to INT8.
How fast will Mixtral 8x22B (141.00B) run on NVIDIA RTX 4090?
It is unlikely to run at all due to insufficient VRAM. Even with extreme quantization and CPU offloading, performance will be severely limited and likely unusable.