The NVIDIA H100 SXM, while a powerful GPU, falls short of the VRAM requirements for running the Mixtral 8x22B (141B-parameter) model, even with INT8 quantization. At INT8 precision, weights occupy roughly one byte per parameter, so the model's 141 billion parameters require about 141GB of VRAM. The H100 SXM provides 80GB of HBM3, leaving a deficit of roughly 61GB. Because the full set of weights cannot reside on the GPU, loading the model triggers out-of-memory errors and direct single-GPU inference is impossible.
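As a back-of-envelope check, weight memory at a given precision is simply parameters times bytes per parameter. The sketch below (a hypothetical helper, not part of any library, and deliberately ignoring KV-cache and activation overhead) reproduces the 141GB requirement and 61GB deficit cited above:

```python
def estimated_weight_vram_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Rough weight-memory estimate: parameters x bytes per parameter.

    Ignores KV cache, activations, and framework overhead, which add more on top.
    """
    return num_params_billion * bytes_per_param


H100_SXM_VRAM_GB = 80  # HBM3 capacity of a single H100 SXM

# Mixtral 8x22B: ~141B parameters, INT8 = 1 byte per parameter
required_gb = estimated_weight_vram_gb(141, 1.0)   # ~141 GB
deficit_gb = required_gb - H100_SXM_VRAM_GB        # ~61 GB short
print(f"Required: {required_gb:.0f} GB, deficit: {deficit_gb:.0f} GB")
```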
While the H100's memory bandwidth of 3.35 TB/s and substantial compute resources (16,896 CUDA cores and 528 Tensor Cores) would normally enable fast inference, the VRAM bottleneck overrides these advantages in this scenario. Since the model cannot be loaded onto the GPU at all, its architectural strengths never come into play, and no tokens/sec or batch-size estimates are available for this hardware configuration.
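For context on why the bandwidth figure would matter if the weights did fit, a common first-order ceiling for single-stream decode throughput is memory bandwidth divided by the bytes of weights read per token. The snippet below is an illustration of that reasoning, not a benchmark; the ~39B active-parameters-per-token figure for Mixtral's mixture-of-experts routing is an approximation:

```python
# First-order decode ceiling: tokens/sec <= bandwidth / bytes read per token.
# Mixtral 8x22B routes each token through 2 of 8 experts, activating roughly
# ~39B of its 141B parameters, so at INT8 about ~39GB of weights are streamed
# per decoded token.
bandwidth_bytes_per_s = 3.35e12   # H100 SXM HBM3 bandwidth, ~3.35 TB/s
active_params = 39e9              # approximate active parameters per token
bytes_per_param = 1               # INT8

tokens_per_sec_ceiling = bandwidth_bytes_per_s / (active_params * bytes_per_param)
print(f"~{tokens_per_sec_ceiling:.0f} tokens/sec upper bound (if VRAM were sufficient)")
```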
Given these VRAM constraints, directly running Mixtral 8x22B on a single H100 SXM is not feasible. Consider a multi-GPU setup with tensor parallelism to shard the model across several GPUs, effectively pooling their VRAM. Alternatively, explore more aggressive quantization, such as 4-bit or lower precision, with the caveat that accuracy can degrade. Another option is CPU offloading, where parts of the model are held and processed in system memory, though this significantly reduces inference speed. Finally, consider distillation to produce a smaller model that fits within the H100's 80GB; a sketch of the first two workarounds follows.
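The sketch below shows what the multi-GPU and quantization/offloading routes might look like in practice. The model IDs, GPU count, and memory figures are assumptions for illustration, not tested configurations:

```python
# Option 1: tensor parallelism across multiple GPUs (e.g. 4x H100) with vLLM.
# Each GPU holds ~1/4 of the weights, so the pooled 320GB comfortably fits
# the ~141GB INT8 (or larger FP16) footprint.
from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",  # assumed Hugging Face model ID
    tensor_parallel_size=4,
)

# Option 2: 4-bit quantization plus CPU offload with Hugging Face Transformers.
# At ~0.5 bytes per parameter the weights shrink to roughly 70GB, which fits an
# 80GB card only narrowly once KV cache and runtime overhead are counted;
# device_map="auto" places as many layers as possible on the GPU and spills the
# remainder to CPU RAM, trading throughput for feasibility.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```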