The NVIDIA RTX 3090, with its 24GB of GDDR6X VRAM, faces a significant challenge when running the Mixtral 8x22B model (141B parameters), even in its INT8 quantized form. At one byte per parameter, the weights alone require roughly 141GB, far exceeding the 3090's capacity and leaving a VRAM deficit of about 117GB. The entire model therefore cannot reside on the GPU at once, and direct inference is impossible without techniques that offload parts of the model to system RAM or to additional GPUs.
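To make the gap concrete, here is a minimal back-of-the-envelope sketch of the weight-memory arithmetic; it counts weights only and ignores activations, KV cache, and runtime overhead, which all add further pressure.

```python
# Rough weight-memory estimate for Mixtral 8x22B (~141B parameters).
# Weights only; activations, KV cache, and framework overhead are extra.
PARAMS = 141e9          # total parameter count
VRAM_3090_GB = 24       # RTX 3090 capacity

bytes_per_param = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    weights_gb = PARAMS * nbytes / 1e9
    deficit_gb = weights_gb - VRAM_3090_GB
    print(f"{precision}: ~{weights_gb:.0f} GB of weights, "
          f"{deficit_gb:+.0f} GB vs a single 24 GB card")
# INT8 prints: ~141 GB of weights, +117 GB vs a single 24 GB card
```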
Even if the capacity limit were somehow bypassed, the RTX 3090's memory bandwidth of 0.94 TB/s would become the bottleneck: autoregressive decoding in large language models like Mixtral 8x22B is bound by how quickly weights can be streamed from memory to the compute units. Offloading layers to system RAM, whose bandwidth is typically an order of magnitude lower than GDDR6X, reduces inference speed drastically. The 3090's 328 Tensor Cores can accelerate the matrix multiplications, but the VRAM constraint keeps them starved for data. Without enough VRAM to hold the weights, realistic tokens-per-second and batch-size figures cannot be estimated; the model will either fail to load or run far too slowly to be useful.
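For intuition, the sketch below estimates a memory-bound throughput ceiling from bandwidth alone: each decoded token has to stream the weights of the experts it activates, so tokens per second is bounded by bandwidth divided by bytes read per token. The system-RAM bandwidth figure is an illustrative assumption, and the ~39B active-parameter count is approximate.

```python
# Memory-bound ceiling for single-stream decoding:
#   tokens/s <= bandwidth / bytes_read_per_token
# Mixtral 8x22B is a mixture-of-experts model, so each token touches only
# the active subset of parameters (~39B of the 141B total).
GPU_BW_GBPS = 936      # RTX 3090 GDDR6X, ~0.94 TB/s
RAM_BW_GBPS = 50       # assumed dual-channel DDR4 system RAM (illustrative)

ACTIVE_PARAMS = 39e9   # approximate active parameters per token
BYTES_PER_PARAM = 1.0  # INT8

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM

for label, bw_gbps in [("weights fully in VRAM", GPU_BW_GBPS),
                       ("weights offloaded to system RAM", RAM_BW_GBPS)]:
    ceiling = bw_gbps * 1e9 / bytes_per_token
    print(f"{label}: at most ~{ceiling:.1f} tokens/s")
```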
Given the substantial VRAM shortfall, running Mixtral 8x22B on a single RTX 3090 is impractical without significant compromises. Model parallelism across multiple GPUs, where the model's layers are split and distributed across cards, is the most viable local option, though even the INT8 weights would need six or more 24GB cards. Alternatively, consider cloud-based GPU instances with sufficient VRAM, such as those offered by NelsaHost, or smaller language models that fit within the RTX 3090's memory capacity.
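As a sketch of what multi-GPU model parallelism can look like in practice, the snippet below uses Hugging Face Transformers with device_map="auto", which shards layers across all visible GPUs and spills any remainder to CPU RAM. The checkpoint name and per-device memory caps are assumptions for illustration.

```python
# Hypothetical multi-GPU sharding sketch (Transformers + Accelerate + bitsandbytes).
# Even in INT8, ~141 GB of weights needs roughly six or more 24 GB cards;
# anything that does not fit is offloaded to CPU RAM at a large speed cost.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x22B-Instruct-v0.1"  # assumed checkpoint name

max_memory = {i: "22GiB" for i in range(torch.cuda.device_count())}
max_memory["cpu"] = "96GiB"  # allow spillover into system RAM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=True,  # needed when some layers land on CPU
    ),
    device_map="auto",   # shard layers across every visible GPU, then CPU
    max_memory=max_memory,
)
```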
If you must attempt to run Mixtral 8x22B on the RTX 3090, investigate extreme quantization techniques like 4-bit quantization (INT4 or NF4) or even 2-bit quantization if available. However, be aware that aggressive quantization can noticeably degrade model accuracy. Also, explore CPU offloading or disk offloading, but expect a severe performance penalty.
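If you go down that road, a minimal 4-bit (NF4) sketch with bitsandbytes might look like the following; at roughly half a byte per parameter the weights still total around 70GB, so most layers spill to system RAM. The checkpoint name and memory caps are assumptions, and whether CPU offload of quantized layers works depends on your library versions.

```python
# Hedged sketch of 4-bit NF4 loading with CPU offload for the overflow.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x22B-Instruct-v0.1"  # assumed checkpoint name

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                        # fill the GPU first, then CPU
    max_memory={0: "22GiB", "cpu": "96GiB"},  # illustrative caps
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Explain mixture-of-experts in one sentence.",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```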