Can I run Mixtral 8x22B (INT8, 8-bit integer) on an NVIDIA H100 SXM?

Result: Fail/OOM (this GPU doesn't have enough VRAM)
GPU VRAM: 80.0 GB
Required: 141.0 GB
Headroom: -61.0 GB

VRAM Usage: 100% of the 80.0 GB available

Technical Analysis

The NVIDIA H100 SXM, while a powerful GPU, falls short of the VRAM requirements for running Mixtral 8x22B even with INT8 quantization. With 141 billion parameters, the model needs roughly 141 GB of VRAM for its weights alone at INT8 precision (about 1 byte per parameter), before any KV cache or activation overhead. The H100 SXM offers 80 GB of HBM3 memory, leaving a deficit of 61 GB. This shortfall prevents the full model from residing on the GPU, leading to out-of-memory errors and making direct inference impossible.
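
A quick way to sanity-check these numbers is to multiply the parameter count by the bytes per parameter for each precision. The sketch below does exactly that; the bytes-per-parameter figures are rough averages and the result excludes KV cache and runtime overhead, so treat it as a lower bound.

    # Approximate weight-only VRAM requirement: parameters x bytes per parameter.
    # KV cache, activations and CUDA context add to this, so real usage is higher.
    BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}  # rough averages

    def weight_vram_gb(params_billion: float, precision: str) -> float:
        """VRAM needed just to hold the weights, in GB."""
        return params_billion * BYTES_PER_PARAM[precision]

    if __name__ == "__main__":
        for precision in ("fp16", "int8", "q4"):
            need = weight_vram_gb(141.0, precision)
            print(f"{precision}: ~{need:.0f} GB needed, 80 GB available, "
                  f"headroom {80 - need:+.0f} GB")

For INT8 this reproduces the -61 GB headroom shown above. Even a Q4 build sits near or above the 80 GB limit once quantization metadata and the KV cache are counted, which is why the recommendations below still involve multiple GPUs or CPU offload.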

While the H100's impressive memory bandwidth of 3.35 TB/s and substantial compute resources (16,896 CUDA cores and 528 Tensor Cores) would normally enable fast inference, the VRAM bottleneck overrides these advantages in this scenario. Without sufficient VRAM the model cannot be loaded, let alone processed efficiently, negating the potential performance benefits of the H100's architecture. Estimated tokens/sec and batch size are therefore unavailable, because the model cannot run on this hardware configuration.
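
To give a sense of what the bandwidth figure would mean if capacity were not the constraint: single-stream decoding of large models is usually memory-bandwidth bound, so a rough ceiling is bandwidth divided by the bytes read per generated token. The sketch below assumes Mixtral 8x22B's roughly 39B active parameters per token at INT8; it is a back-of-the-envelope upper bound, not a benchmark.

    # Roofline-style ceiling for single-stream decode speed on a bandwidth-bound
    # GPU: tokens/sec <= HBM bandwidth / bytes read per generated token.
    # Mixtral 8x22B is a mixture-of-experts model, so only the active parameters
    # (~39B of the 141B total) are read for each token.
    H100_SXM_BW_BYTES_PER_S = 3.35e12   # ~3.35 TB/s HBM3
    ACTIVE_PARAMS = 39e9                # approx. active parameters per token
    BYTES_PER_PARAM_INT8 = 1.0

    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM_INT8
    ceiling = H100_SXM_BW_BYTES_PER_S / bytes_per_token
    print(f"Decode ceiling if the weights fit on one GPU: ~{ceiling:.0f} tokens/sec")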

Recommendation

Due to the VRAM constraints, directly running Mixtral 8x22B on a single H100 SXM is not feasible. Consider using a multi-GPU setup with tensor parallelism to distribute the model across multiple GPUs, effectively increasing the available VRAM. Alternatively, explore more aggressive quantization techniques, such as Q4 or even lower precisions, but be aware that this can impact model accuracy. Another option is to use CPU offloading, where parts of the model are processed on the CPU, but this will significantly reduce inference speed. Finally, investigate distillation techniques to create a smaller, more manageable model that fits within the H100's VRAM.
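
For the tensor-parallel route, a minimal sketch with vLLM is shown below. The checkpoint name, GPU count, and quantization choice are illustrative assumptions; whether a given quantization scheme is supported for Mixtral depends on the vLLM version you run.

    # Minimal vLLM sketch: shard Mixtral 8x22B across several H100s with tensor
    # parallelism. Four 80 GB GPUs give ~320 GB of pooled VRAM, enough for an
    # 8-bit-class build of the weights plus KV cache. Adjust to your setup.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="mistralai/Mixtral-8x22B-Instruct-v0.1",  # illustrative checkpoint
        tensor_parallel_size=4,     # shard the weights across 4 GPUs
        quantization="fp8",         # or whichever scheme your vLLM build supports
        max_model_len=8192,         # cap context length to limit KV-cache growth
    )

    sampling = SamplingParams(max_tokens=128, temperature=0.7)
    outputs = llm.generate(["Explain tensor parallelism in one paragraph."], sampling)
    print(outputs[0].outputs[0].text)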

Recommended Settings

Batch Size: 1 (if using CPU offloading or extremely aggressive quantization)
Context Length: reduce where possible to minimize VRAM usage
Other Settings: enable tensor parallelism across multiple GPUs; use CPU offloading only as a last resort; optimize attention mechanisms for a reduced memory footprint
Inference Framework: vLLM (for multi-GPU tensor parallelism) or llama.cpp (for CPU offloading; see the sketch after this list)
Suggested Quantization: Q4_K_M or lower (with caution for accuracy loss)
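
If only the single H100 is available, the llama.cpp path means running a Q4_K_M GGUF build with partial GPU offload. The sketch below uses the llama-cpp-python bindings; the model path and layer count are placeholders, and throughput will be far below a fully GPU-resident deployment.

    # Minimal llama-cpp-python sketch: load a Q4_K_M GGUF build and offload as
    # many transformer layers as fit onto the H100; the remainder stays on the
    # CPU. Works, but expect a large slowdown versus an all-GPU configuration.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mixtral-8x22b-instruct.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=40,   # tune down until the model loads without OOM
        n_ctx=4096,        # reduced context keeps the KV cache small
        n_batch=256,       # prompt-processing batch size
    )

    out = llm("Summarize the trade-offs of CPU offloading in two sentences.",
              max_tokens=96)
    print(out["choices"][0]["text"])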

Frequently Asked Questions

Is Mixtral 8x22B (141B) compatible with the NVIDIA H100 SXM?
No, the H100 SXM's 80GB VRAM is insufficient to load the 141GB INT8 quantized Mixtral 8x22B model.
How much VRAM does Mixtral 8x22B (141B) need?
Mixtral 8x22B requires at least 282GB VRAM for FP16, 141GB for INT8, and lower amounts for more aggressive quantization methods like Q4 or Q2.
How fast will Mixtral 8x22B (141B) run on the NVIDIA H100 SXM?
Due to the VRAM limitation, Mixtral 8x22B cannot run directly on the H100 SXM without employing techniques like multi-GPU parallelism, aggressive quantization, or CPU offloading, which will impact performance. Without these, it will not run at all.