Can I run Mixtral 8x22B (INT8, 8-bit integer) on an NVIDIA RTX 3090?

Verdict: Fail/OOM. This GPU doesn't have enough VRAM.

GPU VRAM: 24.0 GB
Required: 141.0 GB
Headroom: -117.0 GB

VRAM Usage: 100% used (24.0 GB of 24.0 GB)

Technical Analysis

The NVIDIA RTX 3090, with its 24GB of GDDR6X VRAM, cannot accommodate the Mixtral 8x22B (141.00B) model, even in its INT8 quantized form. The model requires approximately 141GB of VRAM in INT8, far exceeding the 3090's 24GB capacity. This 117GB deficit means the model cannot reside on the GPU, so direct inference is impossible without offloading parts of the model to system RAM or to other GPUs.
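
These figures follow directly from the parameter count: at one byte per weight, 141 billion INT8 parameters need roughly 141GB before any KV-cache or runtime overhead is counted. A back-of-envelope sketch of that arithmetic, using only the numbers quoted on this page (real requirements are somewhat higher once activations and the KV cache are included):

```python
# Back-of-envelope VRAM estimate: parameter count x bytes per weight.
# Ignores the KV cache, activations, and framework overhead, so actual
# requirements are somewhat higher than these figures.
PARAMS_BILLION = 141.0   # Mixtral 8x22B total parameters
GPU_VRAM_GB = 24.0       # NVIDIA RTX 3090

BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "INT8": 1.0,
    "INT4/NF4": 0.5,
}

for precision, nbytes in BYTES_PER_WEIGHT.items():
    required_gb = PARAMS_BILLION * nbytes        # 1B params * 1 byte ~= 1 GB
    headroom_gb = GPU_VRAM_GB - required_gb
    print(f"{precision:8s} ~{required_gb:6.1f} GB required, "
          f"headroom on a 24 GB card: {headroom_gb:+7.1f} GB")
```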

Even if the VRAM limitation were somehow bypassed, the RTX 3090's memory bandwidth of 0.94 TB/s would become the bottleneck: large language models like Mixtral 8x22B demand rapid data transfer between memory and compute units, and offloading model layers to system RAM, which has far lower bandwidth than GDDR6X, drastically reduces inference speed. The 328 Tensor Cores on the RTX 3090 can accelerate the matrix multiplications, but they cannot be kept busy when weights must be fetched over slower links. Because the model does not fit, no meaningful tokens-per-second or batch-size estimate can be given; it will either fail to load or run far too slowly to be usable.
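
To put the bandwidth point in numbers: single-stream decoding is usually memory-bound, so a rough upper bound on tokens per second is memory bandwidth divided by the bytes streamed per generated token. The sketch below assumes all 141GB of INT8 weights are read once per token, which is pessimistic for a mixture-of-experts model, and the system-RAM bandwidth figure is an assumed typical desktop value; the point is the ratio, not the exact numbers:

```python
# Crude upper bound for memory-bound decoding:
#   tokens/sec <= memory_bandwidth / bytes_streamed_per_token
# Assumes the full 141 GB of INT8 weights are read once per token
# (pessimistic for a mixture-of-experts model, but illustrative).
MODEL_SIZE_GB = 141.0  # INT8 Mixtral 8x22B, from this page

BANDWIDTH_GB_PER_S = {
    "RTX 3090 GDDR6X": 936.0,                     # ~0.94 TB/s, from this page
    "Typical desktop DDR4/DDR5 (assumed)": 60.0,  # rough system-RAM figure
}

for memory, bandwidth in BANDWIDTH_GB_PER_S.items():
    ceiling = bandwidth / MODEL_SIZE_GB
    print(f"{memory:38s} -> at most ~{ceiling:5.2f} tokens/sec")
```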

Recommendation

Given the substantial VRAM shortfall, running Mixtral 8x22B on a single RTX 3090 is impractical without significant compromises. Model parallelism across multiple GPUs, where the model is split and distributed, is the most viable option. Alternatively, consider using cloud-based GPU instances with sufficient VRAM, such as those offered by NelsaHost, or exploring smaller language models that fit within the RTX 3090's memory capacity.
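
To make "multiple GPUs" concrete: at the INT8 figure above, the weights alone need 141 / 24 ≈ 6 cards, and more like eight once the KV cache and runtime overhead are counted (roughly double that again for unquantized FP16). Below is a hedged sketch of tensor parallelism with vLLM; the model id and GPU count are assumptions, and an unquantized checkpoint would load in 16-bit and need even more aggregate VRAM:

```python
# Tensor parallelism with vLLM: the model's layers are sharded across
# several GPUs so weights and KV cache are pooled over their combined VRAM.
# The GPU count is illustrative: 8 x 24 GB = 192 GB covers the ~141 GB of
# INT8 weights but not the ~282 GB an FP16 checkpoint would need.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",  # assumed model id
    tensor_parallel_size=8,  # shard across 8 GPUs
)

params = SamplingParams(max_tokens=64)
outputs = llm.generate(["Summarize what a mixture-of-experts model is."], params)
print(outputs[0].outputs[0].text)
```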

If you must attempt to run Mixtral 8x22B on the RTX 3090, investigate extreme quantization techniques like 4-bit quantization (INT4 or NF4) or even 2-bit quantization if available. However, be aware that aggressive quantization can noticeably degrade model accuracy. Also, explore CPU offloading or disk offloading, but expect a severe performance penalty.
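
As an illustration of what such an attempt might look like, the sketch below uses the Hugging Face transformers and bitsandbytes stack to request NF4 quantization with the remainder offloaded to system RAM. The model id, memory caps, and the assumption that 4-bit loading with CPU offload works at this scale are unverified here; expect it to need tens of gigabytes of system RAM and to generate tokens very slowly even if it loads:

```python
# Hypothetical attempt: NF4 (4-bit) quantization with CPU offload via the
# Hugging Face transformers + bitsandbytes stack. Even at ~0.5 bytes per
# weight the model (~70 GB) far exceeds 24 GB of VRAM, so most layers end
# up in system RAM and generation is extremely slow, if it loads at all.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mixtral-8x22B-Instruct-v0.1"  # assumed model id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4, as suggested above
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",                          # let accelerate place layers
    max_memory={0: "22GiB", "cpu": "128GiB"},   # cap GPU use, spill to RAM
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tokenizer("Hello, world.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```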

Recommended Settings

Batch Size: 1 (or as low as possible)
Context Length: reduce to the minimum acceptable value
Other Settings: enable CPU offloading (expect very slow performance); use a swap file on a fast SSD; try model parallelism across multiple machines (if available)
Inference Framework: llama.cpp for CPU offloading (see the sketch after this list), or vLLM for multi-GPU setups
Quantization Suggested: INT4 or NF4 (if supported and accuracy is acceptable)
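
For the llama.cpp route, partial offload is controlled by how many layers are pushed to the GPU; everything else runs on the CPU from system RAM. A minimal sketch using the llama-cpp-python binding (the GGUF filename and layer count are placeholders, and with a model this size throughput will be very low):

```python
# Partial GPU offload with llama.cpp via the llama-cpp-python binding.
# n_gpu_layers sets how many transformer layers live in VRAM; the rest run
# on the CPU from system RAM, which dominates the (very slow) speed.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x22b-instruct-q4_k_m.gguf",  # placeholder GGUF file
    n_gpu_layers=8,   # only as many layers as fit in 24 GB next to the KV cache
    n_ctx=2048,       # keep the context short, per the settings above
)

out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```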

Frequently Asked Questions

Is Mixtral 8x22B (141.00B) compatible with NVIDIA RTX 3090?
No, the Mixtral 8x22B model is not directly compatible with the NVIDIA RTX 3090 due to insufficient VRAM.
What VRAM is needed for Mixtral 8x22B (141.00B)?
Mixtral 8x22B (141.00B) requires approximately 141GB of VRAM when quantized to INT8. FP16 would require 282GB of VRAM.
How fast will Mixtral 8x22B (141.00B) run on NVIDIA RTX 3090?
Due to the VRAM limitation, the model will either fail to load or, with CPU offloading, run extremely slowly, making it impractical for most use cases.