The NVIDIA H100 SXM, with its 80GB of HBM3 memory and 3.35 TB/s bandwidth, is a powerhouse for AI workloads. However, the Mixtral 8x7B model in FP16 precision requires approximately 93.4GB of VRAM, so the H100 SXM falls short by about 13.4GB. The model's size stems from its architecture: a Mixture of Experts (MoE) model with eight expert feed-forward blocks per layer and shared attention layers, for a total of roughly 46.7 billion parameters, which at two bytes per parameter in FP16 comes to about 93.4GB. While the H100's 528 Tensor Cores would readily accelerate the matrix multiplications at the heart of LLM inference, the VRAM limitation prevents the model from loading entirely onto the GPU.
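A quick back-of-the-envelope calculation makes the shortfall concrete. The sketch below uses the published ~46.7B total parameter count and ignores KV cache and activation overhead, which only widen the gap:

```python
# Estimate the FP16 weight footprint of Mixtral 8x7B versus a single H100 SXM.
total_params = 46.7e9        # shared attention + 8 expert FFNs per layer (published figure)
bytes_per_param = 2          # FP16 = 2 bytes per parameter
h100_vram_gb = 80            # H100 SXM HBM3 capacity

weights_gb = total_params * bytes_per_param / 1e9
print(f"FP16 weights: {weights_gb:.1f} GB")                      # ~93.4 GB
print(f"Shortfall on one H100 SXM: {weights_gb - h100_vram_gb:.1f} GB")  # ~13.4 GB
```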
Unfortunately, running Mixtral 8x7B in FP16 precision on a single H100 SXM is not feasible due to insufficient VRAM. Your options are to shard the model across multiple GPUs, quantize it, or offload part of it to the CPU. Quantization is the most practical single-GPU route: at 8-bit the weights drop to roughly 47GB and at 4-bit to roughly 24GB, both of which fit comfortably on the 80GB card alongside the KV cache, at the cost of some accuracy. CPU offloading keeps part of the model in system RAM and streams layers to the GPU as needed, which drastically reduces inference speed because every forward pass is bottlenecked by host-to-device transfers. If the task allows, a smaller model or a distilled variant of Mixtral is another alternative. A sketch of a quantized single-GPU load is shown below.
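The following is a minimal sketch of the 4-bit option, assuming the Hugging Face transformers, accelerate, and bitsandbytes packages are installed and the model weights are accessible; exact configuration flags may vary across library versions:

```python
# Sketch: loading Mixtral 8x7B with 4-bit quantization so it fits on one H100.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # ~0.5 bytes/param -> roughly 24 GB of weights
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in FP16 on the Tensor Cores
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the GPU; spill to CPU RAM only if needed
)

prompt = "Explain mixture-of-experts routing in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With `device_map="auto"`, the same script also covers the CPU-offloading fallback: if the quantized weights did not fit, accelerate would place the overflow layers in system RAM automatically, at the inference-speed cost described above.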