Can I run Mixtral 8x22B (INT8, 8-bit integer) on an NVIDIA H100 PCIe?

Fail/OOM: This GPU doesn't have enough VRAM

GPU VRAM: 80.0GB
Required: 141.0GB
Headroom: -61.0GB

VRAM Usage: 100% of 80.0GB used (model does not fit)

Technical Analysis

The NVIDIA H100 PCIe, while a powerful GPU, falls short of the VRAM required to run Mixtral 8x22B (141.00B) even with INT8 quantization. At roughly one byte per parameter, the model's 141 billion parameters need about 141GB of VRAM for the weights alone, before KV-cache and activation overhead. The H100 PCIe provides only 80GB, leaving a 61GB deficit, so the model cannot even be loaded onto the GPU for inference. The H100's 2.0 TB/s of memory bandwidth would enable rapid weight streaming during decoding if the model fit, but it is irrelevant here because of the VRAM shortfall.
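
As a sanity check on these numbers, here is a minimal back-of-the-envelope sketch (plain Python, no specific framework assumed) that reproduces the weight-only VRAM estimates at different precisions. It deliberately ignores KV cache and runtime overhead, which only add to the requirement.

```python
def estimate_weight_vram_gb(num_params_billion: float, bits_per_param: int) -> float:
    """Rough weight-only VRAM estimate: parameters x bytes per parameter.

    Ignores KV cache, activations, and framework overhead, which add more on top.
    """
    bytes_per_param = bits_per_param / 8
    return num_params_billion * 1e9 * bytes_per_param / 1e9  # decimal GB

MIXTRAL_8X22B_PARAMS_B = 141.0  # total parameters, in billions
H100_PCIE_VRAM_GB = 80.0

for bits in (16, 8, 4):
    need = estimate_weight_vram_gb(MIXTRAL_8X22B_PARAMS_B, bits)
    print(f"{bits:>2}-bit weights: ~{need:.1f} GB "
          f"(H100 PCIe headroom: {H100_PCIE_VRAM_GB - need:+.1f} GB)")
```

This reproduces the figures used throughout this page: ~282GB at FP16, ~141GB at INT8, and ~70.5GB at 4-bit.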

Even if CPU offloading were attempted, performance would be severely degraded: PCIe Gen5 transfers between host memory and the GPU top out at roughly 64 GB/s per direction, a small fraction of the H100's ~2 TB/s HBM bandwidth, and offloaded layers must be streamed across that link for every generated token. The Tensor Cores would also sit underutilized while waiting on those transfers. Although the H100 is designed for high-throughput, low-latency inference, the VRAM limitation is the primary bottleneck and negates its other advantages, so no tokens/sec or batch-size estimate is given for this configuration.
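
For illustration only, an offloading attempt with Hugging Face Transformers and Accelerate might look like the sketch below; the checkpoint name and memory budgets are assumptions, and throughput would be dominated by PCIe transfers rather than by the GPU itself.

```python
# Hypothetical INT8 load with CPU offload via Transformers + Accelerate + bitsandbytes.
# Expect very low throughput: offloaded layers are streamed over PCIe for every token.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x22B-Instruct-v0.1"  # assumed checkpoint name

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,                      # INT8 weights via bitsandbytes
    llm_int8_enable_fp32_cpu_offload=True,  # allow offloaded modules to live on the CPU
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                         # let Accelerate split layers across GPU/CPU
    max_memory={0: "75GiB", "cpu": "200GiB"},  # cap GPU usage, spill the rest to host RAM
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```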

Recommendation

Given the VRAM limitation, running Mixtral 8x22B (141.00B) on a single NVIDIA H100 PCIe at INT8 is not feasible. Consider using multiple GPUs whose combined VRAM covers the model, or moving to more aggressive 4-bit quantization (e.g., AWQ, GPTQ, or the NF4 format used by QLoRA), which would cut the weight footprint to roughly 70.5GB at some cost in accuracy and leave only limited room for KV cache on an 80GB card. Alternatively, use a cloud instance with larger GPU memory, or apply model parallelism across multiple GPUs.
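
A minimal sketch of the 4-bit route with bitsandbytes NF4 (the data type popularized by QLoRA) is shown below; the checkpoint name is an assumption, and whether the remaining headroom covers the KV cache depends on context length and batch size.

```python
# Sketch: 4-bit NF4 load, aiming to fit ~70.5GB of weights inside 80GB of VRAM.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x22B-Instruct-v0.1"  # assumed checkpoint name

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # QLoRA-style NF4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 on the H100
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=nf4_config,
    device_map={"": 0},  # keep everything on the single H100
)
```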

If you are set on using the H100, explore smaller models or fine-tune a smaller model to achieve similar results. Another approach would be to use a model sharding library like DeepSpeed or FairScale to distribute the model across multiple H100 GPUs if you have access to more than one. This requires significant technical expertise and careful configuration to optimize inter-GPU communication.
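
If several H100s are available, vLLM's built-in tensor parallelism is one alternative to DeepSpeed/FairScale-style sharding. The sketch below assumes four H100 PCIe cards, FP8 weight quantization, and the checkpoint name shown, all of which should be adjusted to your setup.

```python
# Sketch: tensor-parallel inference with vLLM across multiple H100 PCIe cards.
# Two cards are borderline for ~141GB of 8-bit weights; four give real headroom.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",  # assumed checkpoint name
    tensor_parallel_size=4,       # shard weights across 4 x 80GB
    quantization="fp8",           # 8-bit weights; H100 has native FP8 support
    max_model_len=8192,           # cap context to bound KV-cache memory
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Explain mixture-of-experts routing in one paragraph."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```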

Recommended Settings

Batch size
1 (if 4-bit quantization is used and the model fits)
Context length
Adjust based on available VRAM after quantization
Other settings
Enable CUDA graphs; use paged attention; tune tensor parallelism if using multiple GPUs
Inference framework
vLLM
Suggested quantization
4-bit (e.g., AWQ, GPTQ, or NF4)
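
To make the "adjust context length" advice concrete, the rough KV-cache arithmetic below estimates how much memory each sequence consumes at a given context length. The layer and head counts are assumptions based on Mixtral 8x22B's published configuration and should be verified against the checkpoint you actually load.

```python
# Rough per-sequence KV-cache sizing for Mixtral 8x22B (fp16/bf16 cache).
NUM_LAYERS   = 56    # decoder layers (assumed from the published config)
NUM_KV_HEADS = 8     # grouped-query attention KV heads (assumed)
HEAD_DIM     = 128   # per-head dimension (assumed)
KV_BYTES     = 2     # bytes per cached value in fp16/bf16

bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES  # K and V
gb_per_token = bytes_per_token / 1e9

for context in (4096, 8192, 32768):
    print(f"context {context:>6}: ~{context * gb_per_token:.2f} GB KV cache per sequence")
```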

Frequently Asked Questions

Is Mixtral 8x22B (141.00B) compatible with NVIDIA H100 PCIe?
No, Mixtral 8x22B (141.00B) is not directly compatible with a single NVIDIA H100 PCIe due to insufficient VRAM.
What VRAM is needed for Mixtral 8x22B (141.00B)?
Mixtral 8x22B (141.00B) requires approximately 141GB of VRAM when quantized to INT8. FP16 requires 282GB.
How fast will Mixtral 8x22B (141.00B) run on NVIDIA H100 PCIe?
Mixtral 8x22B (141.00B) will not run on a single NVIDIA H100 PCIe without significant modifications like aggressive quantization or sharding across multiple GPUs due to the VRAM limitation. Therefore, an estimate of tokens/sec is not available for this configuration.