The NVIDIA RTX 3090 Ti, with its 24GB of GDDR6X VRAM, falls well short of what is needed to run the Mixtral 8x22B (141B) model, even in its Q4_K_M (4-bit) quantized form. At roughly 4 bits per parameter, the 141B weights alone come to approximately 70.5GB, leaving a VRAM deficit of about 46.5GB, so the model cannot be loaded onto the GPU for inference. While the RTX 3090 Ti offers high memory bandwidth of 1.01 TB/s and a substantial complement of CUDA and Tensor cores, those resources cannot be brought to bear if the weights do not fit in VRAM.
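The 70.5GB figure follows from simple arithmetic: 141 billion parameters at roughly 4 bits each, plus some working margin for the KV cache and activations. A minimal sketch of that estimate (the 10% overhead fraction is an assumption, and Q4_K_M in practice uses slightly more than 4 bits per weight):

```python
# Back-of-the-envelope VRAM estimate for a quantized model:
# weights = parameters * bits_per_weight / 8, plus a working margin
# (KV cache, activations, CUDA context) assumed here to be ~10%.
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead_fraction: float = 0.10) -> float:
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * (1 + overhead_fraction)

if __name__ == "__main__":
    weights_only = 141 * 4 / 8                       # ~70.5 GB for Mixtral 8x22B at 4-bit
    print(f"Weights only:    {weights_only:.1f} GB")
    print(f"With overhead:   {estimate_vram_gb(141, 4):.1f} GB")
    print(f"3090 Ti deficit: {weights_only - 24:.1f} GB")  # ~46.5 GB more than 24 GB
```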
Attempting to load the full model into 24GB will simply produce out-of-memory errors. Offloading some layers to system RAM is possible, but performance suffers badly: the CPU-resident layers compute far more slowly, and activations must cross the comparatively slow PCIe link between GPU and system memory. The limited VRAM also restricts the achievable batch size and context length, since the KV cache has to share whatever memory remains after the weights. The Ampere architecture of the RTX 3090 Ti is well suited to AI workloads, but its VRAM capacity is the limiting factor in this scenario.
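To make the offload trade-off concrete, here is a rough split under stated assumptions: the quantized weights are distributed evenly across the transformer layers, Mixtral 8x22B has 56 of them, and about 2GB of VRAM is reserved for the KV cache and CUDA context. Roughly a third of the layers would fit on the card:

```python
# Rough estimate of how many layers of a ~70.5 GB Q4 Mixtral 8x22B could be
# kept on a 24 GB GPU for hybrid inference. The layer count and the even
# per-layer weight distribution are simplifying assumptions.
TOTAL_WEIGHT_GB = 70.5   # 4-bit quantized weights (see estimate above)
NUM_LAYERS = 56          # assumed transformer layer count for Mixtral 8x22B
VRAM_GB = 24.0           # RTX 3090 Ti
RESERVED_GB = 2.0        # KV cache, CUDA context, activations

per_layer_gb = TOTAL_WEIGHT_GB / NUM_LAYERS
gpu_layers = int((VRAM_GB - RESERVED_GB) / per_layer_gb)

print(f"~{per_layer_gb:.2f} GB per layer")
print(f"~{gpu_layers} of {NUM_LAYERS} layers fit on the GPU")
print(f"~{TOTAL_WEIGHT_GB - gpu_layers * per_layer_gb:.1f} GB of weights stay in system RAM")
```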
Given these constraints, GPU-only inference of Mixtral 8x22B (141B) on a single RTX 3090 Ti is not feasible. Consider distributed inference across multiple GPUs that pool their VRAM (sketched below). Alternatively, choose a smaller language model that fits within the RTX 3090 Ti's 24GB, or use a cloud-based inference service that offers GPUs with larger memory configurations. Fine-tuning a smaller, more efficient model for your specific use case could also be a viable path forward.
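As a sketch of the multi-GPU route, the following assumes vLLM on a node with eight 80GB-class GPUs, which is enough to hold the model in 16-bit precision; the model ID and GPU count are illustrative assumptions, not something a single RTX 3090 Ti can do:

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Tensor parallelism shards each layer's weights across the GPUs, pooling
# their VRAM. Eight 80 GB cards (~640 GB total) hold the 16-bit weights with
# room for the KV cache; both the model ID and GPU count are assumptions.
llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",
    tensor_parallel_size=8,
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(
    ["Summarize mixture-of-experts inference in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```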
If you are set on running Mixtral 8x22B locally, consider a CPU-based or hybrid setup with llama.cpp, which sidesteps the VRAM limit at the cost of much lower throughput and requires enough system RAM to hold the quantized weights (on the order of 80GB for a 4-bit build). Extreme quantization (2- or 3-bit variants) shrinks the files further but noticeably degrades accuracy, and even at 2 bits per parameter a 141B model still needs well over 24GB, so no quantization level makes it fit entirely on this card.
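A minimal llama-cpp-python sketch of that hybrid setup, assuming a locally downloaded Q4_K_M GGUF (the path is hypothetical; models this large usually ship as split GGUF shards, in which case point at the first shard):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical path to a 4-bit Mixtral 8x22B GGUF on local disk.
MODEL_PATH = "models/mixtral-8x22b-instruct-q4_k_m.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_ctx=2048,       # modest context keeps the KV cache small
    n_threads=16,     # set to your physical core count
    n_gpu_layers=17,  # offload what fits in 24 GB; use 0 for pure CPU inference
)

out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```

Expect output on the order of a few tokens per second at best, since most of the model runs from system RAM rather than the GPU.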