Can I run Qwen 2.5 72B (q3_k_m) on NVIDIA H100 PCIe?

Perfect
Yes, you can run this model!
GPU VRAM: 80.0GB
Required: 28.8GB
Headroom: +51.2GB

VRAM Usage: 28.8GB of 80.0GB (36% used)

Performance Estimate

Tokens/sec: ~31.0
Batch size: 3
Context: 131,072 tokens (128K)

Technical Analysis

The NVIDIA H100 PCIe, with its 80GB of HBM2e VRAM and 2.0 TB/s of memory bandwidth, is exceptionally well suited to running Qwen 2.5 72B, especially with quantization. In its q3_k_m form, the model requires only 28.8GB of VRAM, leaving a substantial 51.2GB of headroom for larger batch sizes or concurrent model serving. The H100's Hopper architecture, with 14,592 CUDA cores and 456 Tensor Cores, provides ample compute for efficient inference.
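As a rough sanity check on these numbers, quantized weight memory scales with parameter count times bits per weight. The sketch below is a back-of-envelope estimate, not a measurement; the ~3.2 bits/weight is simply the effective rate implied by the 28.8GB figure for 72B parameters, and real GGUF builds vary slightly.

```python
# Back-of-envelope VRAM estimate for quantized weights.
# Assumption: ~3.2 effective bits/weight, the rate implied by the
# 28.8GB figure for this q3_k_m build; actual GGUF files vary.

PARAMS = 72e9          # Qwen 2.5 72B parameter count
BITS_PER_WEIGHT = 3.2  # effective rate implied by the 28.8GB figure
GPU_VRAM_GB = 80.0     # H100 PCIe

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
headroom_gb = GPU_VRAM_GB - weights_gb

print(f"weights ~{weights_gb:.1f}GB, headroom ~{headroom_gb:.1f}GB")
# -> weights ~28.8GB, headroom ~51.2GB
```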

The H100's high memory bandwidth is crucial for minimizing latency when loading model weights and processing long contexts such as the 131,072-token window supported by Qwen 2.5. The estimated ~31 tokens/second indicates respectable inference speed, further helped by the ability to run a batch size of 3. While higher-precision inference (e.g., FP16) would significantly increase VRAM requirements, q3_k_m quantization strikes a good balance between accuracy and memory footprint, making it well suited to deployment on the H100.
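For intuition on where ~31 tokens/second sits, single-stream decode is typically memory-bandwidth-bound: every generated token streams the full set of quantized weights from HBM. A rough ceiling is bandwidth divided by weight bytes; the sketch below is that idealized estimate (it ignores KV-cache traffic, kernel overhead, and compute limits), not a benchmark.

```python
# Idealized bandwidth-bound decode ceiling (single stream).
# Ignores KV-cache reads, launch overhead, and compute limits,
# so real throughput lands well below this number.

BANDWIDTH_GBPS = 2000.0  # H100 PCIe HBM2e, ~2.0 TB/s
WEIGHTS_GB = 28.8        # q3_k_m weight footprint from above

ceiling_tps = BANDWIDTH_GBPS / WEIGHTS_GB
print(f"theoretical ceiling ~{ceiling_tps:.0f} tokens/s per stream")
# -> ~69 tokens/s; the ~31 tokens/s estimate is a plausible
#    fraction of this once real-world overheads are included.
```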

It's also important to note that the H100's TDP of 350W suggests that it will require adequate cooling and power infrastructure to maintain optimal performance during sustained inference workloads. This power draw is manageable, especially considering the performance gains achieved with this powerful GPU.
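If you want to verify power draw and thermals during a sustained run, NVML exposes both through the nvidia-ml-py (pynvml) bindings. A minimal polling sketch, assuming the H100 is device index 0:

```python
# Minimal NVML polling loop for power draw, temperature, and utilization.
# Assumes the H100 is GPU index 0; requires the nvidia-ml-py package.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(10):  # sample once per second during inference
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # mW -> W
    temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    print(f"power={power_w:.0f}W (TDP 350W) temp={temp_c}C util={util}%")
    time.sleep(1)

pynvml.nvmlShutdown()
```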

Recommendation

For optimal performance with Qwen 2.5 72B on the H100, use an inference framework like `llama.cpp` or `vLLM` that supports both quantization and efficient memory management. Start with the suggested batch size of 3 and experiment with slightly larger values to maximize throughput without exceeding the available VRAM. Monitor GPU utilization and temperature to ensure the H100 is operating within its thermal limits.
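A minimal loading sketch using the llama-cpp-python bindings is shown below; the GGUF filename is a placeholder, and the batch and context values mirror the recommended settings. Treat it as a starting point, not a tuned deployment.

```python
# Minimal llama-cpp-python setup for Qwen 2.5 72B q3_k_m on a single H100.
# The GGUF path is a placeholder; point it at your local file.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-72b-instruct-q3_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer; the 28.8GB of weights fit easily
    n_ctx=131072,      # full context window (the KV cache uses extra VRAM)
    n_batch=512,       # prompt-processing batch size; tune for throughput
)

out = llm("Explain the Hopper architecture in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Note that a full 131,072-token KV cache for a 72B model consumes tens of gigabytes on top of the weights, so the headroom shrinks quickly at long contexts; reduce n_ctx if you don't need the full window.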

Consider exploring different quantization methods (e.g., q4_k_m or q5_k_m) if you need slightly better accuracy, but be mindful of the increased VRAM requirements. If you encounter performance bottlenecks, profile the inference process to identify the specific areas that need optimization, such as kernel execution or data transfer.
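The bits-per-weight figures below are assumptions: the q3_k_m rate matches the 28.8GB figure above, while the q4_k_m and q5_k_m rates are typical ballpark values that vary between builds. A quick fit check against the 80GB card:

```python
# Rough fit check for larger k-quants on an 80GB H100.
# Bits/weight values are approximate and vary by build.
PARAMS = 72e9
GPU_VRAM_GB = 80.0
QUANT_BPW = {"q3_k_m": 3.2, "q4_k_m": 4.8, "q5_k_m": 5.7}  # assumptions

for name, bpw in QUANT_BPW.items():
    weights_gb = PARAMS * bpw / 8 / 1e9
    print(f"{name}: ~{weights_gb:.1f}GB weights, "
          f"~{GPU_VRAM_GB - weights_gb:.1f}GB left for KV cache/overhead")
```

All three variants leave room to spare on paper, but remember that the KV cache and runtime overhead claim a share of the remainder, especially at long contexts.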

Recommended Settings

Batch size: 3 (experiment with slightly larger values)
Context length: 131,072 tokens
Other settings: enable CUDA graph capture, use asynchronous data loading, optimize kernel fusion
Inference framework: llama.cpp or vLLM
Suggested quantization: q3_k_m (or experiment with q4_k_m/q5_k_m)

Frequently Asked Questions

Is Qwen 2.5 72B (72B) compatible with NVIDIA H100 PCIe?
Yes, Qwen 2.5 72B is fully compatible with the NVIDIA H100 PCIe, especially with q3_k_m quantization.
What VRAM is needed for Qwen 2.5 72B (72B)?
With q3_k_m quantization, Qwen 2.5 72B requires approximately 28.8GB of VRAM.
How fast will Qwen 2.5 72B (72B) run on NVIDIA H100 PCIe?
You can expect an estimated throughput of around 31 tokens/second on the NVIDIA H100 PCIe with q3_k_m quantization and a batch size of 3.