The NVIDIA H100 PCIe, while a powerful GPU, falls short of the VRAM requirements for running Mixtral 8x22B (141.00B) even with INT8 quantization. Mixtral 8x22B is a large mixture-of-experts language model with 141 billion parameters, and although only a subset of experts is active per token, all of the weights must be resident in memory for inference. At INT8 precision (roughly one byte per parameter), the weights alone require about 141 GB of VRAM. The H100 PCIe provides only 80 GB, leaving a 61 GB shortfall, so the model cannot be loaded onto the GPU at all. The H100's 2.0 TB/s of memory bandwidth would enable rapid movement of weights and activations during inference if the model fit, but it is moot when the weights do not fit in VRAM.
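As a rough rule of thumb, the weight-only footprint is simply the parameter count multiplied by the bytes per parameter. The illustrative Python sketch below reproduces the 141 GB and 61 GB figures quoted above; it deliberately ignores activations, KV cache, and framework overhead, all of which add further headroom requirements.

```python
# Illustrative estimate of the weight-only VRAM footprint at different precisions.
# Excludes activations, KV cache, and framework overhead.

def weight_vram_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Approximate VRAM (in GB) needed just to hold the model weights."""
    return num_params_billion * 1e9 * bytes_per_param / 1e9

MIXTRAL_8X22B_PARAMS_B = 141.0   # total parameters, in billions
H100_PCIE_VRAM_GB = 80.0

for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    need = weight_vram_gb(MIXTRAL_8X22B_PARAMS_B, bytes_per_param)
    deficit = need - H100_PCIE_VRAM_GB
    status = (f"short by {deficit:.1f} GB" if deficit > 0
              else f"{-deficit:.1f} GB headroom")
    print(f"{label}: ~{need:.1f} GB for weights -> {status} on a single H100 PCIe")
```

Running this prints a 61 GB deficit at INT8 and shows that only a 4-bit representation brings the weights under the 80 GB limit, and even then with little margin.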
Even if techniques like CPU offloading were attempted, performance would be severely degraded: offloaded layers must be streamed over the PCIe link at tens of GB/s, a small fraction of the H100's 2.0 TB/s on-package HBM bandwidth. The H100's Tensor Cores would also sit largely idle while waiting on those transfers. Although the H100 is designed for high-throughput, low-latency inference, the VRAM limitation is the primary bottleneck here and negates the benefits of its other features. Estimated tokens/sec and batch size are therefore unavailable, as the model cannot be run on a single card.
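For completeness, a minimal sketch of what CPU offloading would look like is shown below, assuming the Hugging Face Transformers + Accelerate + bitsandbytes stack. The checkpoint name and memory budgets are illustrative, and exact flag behavior depends on the installed versions; the point is that every offloaded layer crosses the PCIe bus on each forward pass, which is why throughput collapses.

```python
# Sketch of INT8 loading with CPU offload via Transformers + Accelerate + bitsandbytes.
# Assumption: budgets below are placeholders, not tuned values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x22B-Instruct-v0.1"  # example checkpoint name

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,                       # ~1 byte per parameter on GPU
        llm_int8_enable_fp32_cpu_offload=True,   # allow layers that overflow to live on CPU
    ),
    device_map="auto",                           # place what fits on GPU, spill the rest
    max_memory={0: "75GiB", "cpu": "200GiB"},    # placeholder memory budgets
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

Expect generation in this configuration to be slower by a large factor than a fully resident model, since the offloaded experts are re-transferred over PCIe for every token.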
Given the VRAM limitations, running Mixtral 8x22B (141.00B) on a single NVIDIA H100 PCIe is not feasible. Consider using multiple GPUs whose combined VRAM can accommodate the model. Another option is more aggressive quantization, such as 4-bit quantization (for example the NF4 format used by QLoRA), which shrinks the weights to roughly 70 GB; this can fit within 80 GB but leaves limited headroom for activations and KV cache and may reduce accuracy. Alternatively, consider a cloud-based solution that offers instances with larger or more numerous GPUs, and use model parallelism across them where possible.
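A minimal sketch of 4-bit loading is shown below, again assuming the Transformers + bitsandbytes stack; the checkpoint name is an example, and whether it ultimately fits depends on sequence length and batch size.

```python
# Sketch of 4-bit NF4 quantized loading via bitsandbytes.
# At ~0.5 bytes/parameter the weights drop to roughly 70 GB.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4, the data type popularized by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # de-quantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x22B-Instruct-v0.1",  # example checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)
```

The trade-off is that 4-bit weights introduce some quality loss relative to INT8 or FP16, so the result should be validated against the accuracy requirements of the target workload.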
If you are set on using the H100, explore smaller models, or fine-tune a smaller model to approximate the results you need. Another approach is to shard the model across multiple H100 GPUs with a library such as DeepSpeed or FairScale, if you have access to more than one; this requires significant technical expertise and careful configuration to keep inter-GPU communication from becoming the bottleneck.
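For reference, the sketch below shows the general shape of DeepSpeed tensor-parallel inference across four H100s (enough to hold the roughly 282 GB of FP16 weights). It is a minimal illustration, not a tuned deployment: the checkpoint and script names are examples, and argument names vary somewhat across DeepSpeed versions.

```python
# run_mixtral.py (illustrative name)
# Launch with: deepspeed --num_gpus 4 run_mixtral.py
# Sketch of sharding Mixtral 8x22B across four GPUs with DeepSpeed tensor parallelism.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x22B-Instruct-v0.1",  # example checkpoint name
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Split the weights across GPUs; each rank holds roughly 1/mp_size of the tensors.
ds_engine = deepspeed.init_inference(
    model,
    mp_size=4,             # number of GPUs to shard across
    dtype=torch.float16,
)
model = ds_engine.module
```

With the weights distributed this way, inter-GPU traffic on every layer becomes the new cost to manage, which is why interconnect bandwidth and careful configuration matter as much as raw VRAM.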