Can I run Mistral Large 2 on NVIDIA H100 SXM?

Fail/OOM
This GPU doesn't have enough VRAM
GPU VRAM: 80.0GB
Required: 246.0GB
Headroom: -166.0GB

VRAM Usage: 100% used (80.0GB of 80.0GB)

Technical Analysis

The NVIDIA H100 SXM, while a powerful GPU, falls well short of the VRAM required to run Mistral Large 2 in FP16 precision. With 123 billion parameters at 2 bytes per parameter, the weights alone occupy 246GB, before any KV cache or activation overhead. The H100 SXM provides only 80GB of HBM3, leaving a 166GB deficit: the model cannot reside on the GPU, so inference either fails with out-of-memory errors or must offload weights to system RAM, which drastically reduces performance. The H100's substantial 3.35 TB/s memory bandwidth is of little help when data must constantly be swapped between GPU and system memory over a far slower host link.
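
As a quick sanity check, the figures above follow from simple arithmetic; here is a minimal sketch in Python using the parameter count and capacity stated in this analysis:

```python
# Back-of-envelope check: FP16 weight footprint vs. H100 SXM capacity.
PARAMS = 123e9        # Mistral Large 2 parameter count
BYTES_PER_PARAM = 2   # FP16 stores each weight in 2 bytes
GPU_VRAM_GB = 80.0    # H100 SXM HBM3 capacity

required_gb = PARAMS * BYTES_PER_PARAM / 1e9
headroom_gb = GPU_VRAM_GB - required_gb
print(f"Required: {required_gb:.1f}GB")   # Required: 246.0GB
print(f"Headroom: {headroom_gb:.1f}GB")   # Headroom: -166.0GB
```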

Even with the H100's 16,896 CUDA cores and 528 fourth-generation Tensor Cores, the VRAM bottleneck prevents efficient use of that compute. With insufficient memory, batch sizes are severely limited and the context length must be cut sharply just to get the model running, since the KV cache grows with both. The result is very low tokens-per-second throughput, making real-time or interactive applications impractical; the Hopper architecture's advanced features cannot compensate for a fundamental lack of memory capacity.
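
To make the context-length pressure concrete, here is a rough KV-cache estimator. The layer count, KV-head count, and head dimension below are illustrative placeholders for a large grouped-query-attention model, not confirmed Mistral Large 2 values:

```python
# Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes.
# All hyperparameters below are illustrative placeholders, not confirmed values.
LAYERS = 88      # assumed transformer depth
KV_HEADS = 8     # grouped-query attention keeps the KV width small
HEAD_DIM = 128   # assumed per-head dimension
BYTES = 2        # FP16 cache entries

def kv_cache_gb(context_len: int, batch_size: int = 1) -> float:
    entries = 2 * LAYERS * KV_HEADS * HEAD_DIM * context_len * batch_size
    return entries * BYTES / 1e9

for ctx in (2048, 8192, 32768):
    print(f"context {ctx:>6}: {kv_cache_gb(ctx):.2f}GB per sequence")
```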

Recommendation

To run Mistral Large 2 on this card, you'll need to significantly reduce its memory footprint, and quantization is the primary method. At 4 bits per weight the parameters occupy roughly 62GB, which fits in 80GB with room left for the KV cache; at 8 bits they occupy roughly 123GB, which still exceeds the H100's capacity and forces partial CPU offloading (see the estimate below). Frameworks like `llama.cpp` or `text-generation-inference` offer excellent quantization support and CPU offloading capabilities. Even with quantization, performance will be limited compared to running the model on GPUs with sufficient VRAM.
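
A minimal sketch of the weight-only footprint at each suggested precision (real GGUF quantizations such as Q4_K_M store per-block scales, so actual files run somewhat larger):

```python
# Weight-only footprint at each suggested precision vs. the H100's 80GB.
PARAMS = 123e9
GPU_VRAM_GB = 80.0

for name, bits in (("FP16", 16), ("8-bit (Q8_0)", 8), ("4-bit (Q4_K_M)", 4)):
    gb = PARAMS * bits / 8 / 1e9
    verdict = "fits" if gb < GPU_VRAM_GB else "does NOT fit"
    print(f"{name:>15}: {gb:6.1f}GB -> {verdict} in {GPU_VRAM_GB:.0f}GB")
```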

Alternatively, explore distributed inference, which splits the model across multiple GPUs using tensor or pipeline parallelism (a sketch follows below); this requires significant technical expertise and infrastructure. If neither option is feasible, consider using a smaller model or accessing Mistral Large 2 via a cloud-based API, which handles the hardware and optimization complexities for you.
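
As one illustration, a serving engine such as vLLM can shard the model across several H100s via tensor parallelism. The Hugging Face model ID and GPU count below are assumptions for the sketch, not tested guidance:

```python
# Sketch: tensor-parallel inference across 4 x H100 SXM (4 x 80GB = 320GB > 246GB).
# The model ID and tensor_parallel_size are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Large-Instruct-2407",  # assumed Hugging Face repo name
    tensor_parallel_size=4,  # shard the FP16 weights, ~61.5GB per GPU
)
sampling = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```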

Recommended Settings

Batch Size: Start with 1 and increase gradually until VRAM is exhausted.
Context Length: Reduce to 2048 or lower; experiment to find a balance between memory use and task requirements.
Other Settings: Enable CPU offloading if necessary; use attention optimizations such as FlashAttention.
Inference Framework: `llama.cpp` or `text-generation-inference`
Suggested Quantization: 4-bit or 8-bit (Q4_K_M or Q8_0)
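
Putting these settings together, a minimal `llama-cpp-python` sketch might look like the following; the GGUF path is a hypothetical placeholder, and `flash_attn` takes effect only if your build supports it:

```python
# Sketch: loading a Q4_K_M-quantized Mistral Large 2 GGUF with the settings above.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-large-2-q4_k_m.gguf",  # hypothetical local file
    n_ctx=2048,        # reduced context length, per the recommendation
    n_gpu_layers=-1,   # offload all layers that fit; lower this value if you OOM
    flash_attn=True,   # FlashAttention, if compiled into your build
)

# A single-prompt call corresponds to batch size 1, the suggested starting point.
out = llm("Summarize the tradeoffs of 4-bit quantization.", max_tokens=128)
print(out["choices"][0]["text"])
```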

Frequently Asked Questions

Is Mistral Large 2 compatible with NVIDIA H100 SXM?
No, not without significant quantization and optimization due to VRAM limitations.
What VRAM is needed for Mistral Large 2?
A minimum of 246GB of VRAM is required for FP16 precision. Quantization can reduce this requirement.
How fast will Mistral Large 2 run on NVIDIA H100 SXM?
Expect significantly reduced performance due to VRAM constraints. Tokens/second will be low, and batch sizes will be limited. Quantization and CPU offloading can improve performance, but it will still be far from optimal.