Can I run Qwen 2.5 14B (INT8 (8-bit Integer)) on NVIDIA H100 PCIe?

Perfect fit: yes, you can run this model!
GPU VRAM: 80.0GB
Required: 14.0GB
Headroom: +66.0GB

VRAM Usage

14.0GB of 80.0GB used (~18%)

Performance Estimate

Tokens/sec: ~78.0
Batch size: 23
Context: 131,072 tokens (128K)

Technical Analysis

The NVIDIA H100 PCIe, with 80GB of HBM2e VRAM and 2.0 TB/s of memory bandwidth, is exceptionally well suited to running Qwen 2.5 14B, especially when quantized to INT8. INT8 quantization cuts the model's weight footprint to roughly 14GB, leaving about 66GB of VRAM headroom for the KV cache, larger batch sizes, longer context lengths, or even multiple model instances running concurrently. The H100's Hopper architecture, with 14,592 CUDA cores and 456 Tensor Cores, is built for the dense matrix multiplications at the heart of transformer inference, so throughput should be very high.
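
As a quick sanity check on those numbers, here is a back-of-the-envelope sketch. The weights-only rule of thumb is an assumption; the KV cache and activations consume additional VRAM out of the headroom:

```python
# Weights-only VRAM estimate for a quantized model.
# Rule of thumb: VRAM (GB) ~= parameters (billions) * bytes per parameter.
# The KV cache and activations use extra memory out of the headroom.

def estimate_weights_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate VRAM needed to hold the model weights, in GB."""
    return params_billions * bytes_per_param

weights_gb = estimate_weights_vram_gb(14.0, 1.0)  # INT8 = 1 byte per parameter
headroom_gb = 80.0 - weights_gb                   # H100 PCIe: 80GB total

print(f"Weights: ~{weights_gb:.1f} GB, headroom: ~{headroom_gb:.1f} GB")
# -> Weights: ~14.0 GB, headroom: ~66.0 GB
```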

Recommendation

Given the H100's capabilities and the model's size, focus on maximizing throughput by experimenting with larger batch sizes. Start with the estimated batch size of 23 and increase it incrementally until tokens/sec stops improving. A context length of 131,072 tokens is feasible, but monitor latency closely, since long contexts grow the KV cache and slow down attention. For best performance, use a framework such as vLLM or NVIDIA TensorRT-LLM, both of which are designed to exploit the H100's architecture. Further quantization to INT4 or NF4 can raise batch size and throughput, but be mindful of the accuracy trade-offs.
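
A minimal sketch of the vLLM suggestion above follows. The model ID is the public Qwen/Qwen2.5-14B-Instruct checkpoint; the memory-utilization value and prompt count are illustrative assumptions, and the exact quantization argument depends on how your INT8 checkpoint was produced, so it is omitted here:

```python
# Minimal vLLM sketch, assuming vLLM is installed and the model fits in
# VRAM as analyzed above. PagedAttention and CUDA graph capture are
# vLLM defaults, so no extra flags are needed for them.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",  # public Hugging Face model ID
    max_model_len=131072,               # full 128K context window
    gpu_memory_utilization=0.90,        # assumed value; leaves VRAM headroom
)

sampling = SamplingParams(max_tokens=256, temperature=0.7)

# vLLM batches requests internally, so submitting ~23 prompts at once
# approximates the estimated batch size from above.
prompts = ["Explain PagedAttention in one sentence."] * 23
outputs = llm.generate(prompts, sampling)
print(outputs[0].outputs[0].text)
```

To run the batch-size experiment, raise the number of concurrent prompts step by step and watch aggregate tokens/sec until it plateaus.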

Recommended Settings

Batch size: 23 (initial); experiment with higher values
Context length: 131,072 tokens
Inference framework: vLLM
Suggested quantization: INT4 or NF4 (optional, for higher throughput; see the sketch below)
Other settings: enable CUDA graph capture, use PagedAttention, experiment with attention kernels (e.g., FlashAttention)
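
If you want to try the NF4 option, one common way to experiment is Hugging Face transformers with bitsandbytes 4-bit loading, sketched below. This is an accuracy/VRAM experiment rather than a production serving setup, and the model ID is again the public instruct checkpoint:

```python
# Hedged NF4 loading sketch via transformers + bitsandbytes.
# Assumes both libraries are installed; serving frameworks may use a
# different 4-bit path, so treat this as an exploratory setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 on Hopper
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Hello from an H100!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```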

Frequently Asked Questions

Is Qwen 2.5 14B (14B parameters) compatible with NVIDIA H100 PCIe?
Yes, Qwen 2.5 14B is fully compatible with the NVIDIA H100 PCIe. The H100 has ample VRAM and compute power to run the model efficiently.
What VRAM is needed for Qwen 2.5 14B (14B parameters)?
With INT8 quantization, Qwen 2.5 14B requires approximately 14GB of VRAM (14 billion parameters at 1 byte per parameter), before accounting for the KV cache.
How fast will Qwen 2.5 14B (14B parameters) run on NVIDIA H100 PCIe?
You can expect around 78 tokens/sec. Performance can be further optimized by adjusting batch size, context length, and using optimized inference frameworks like vLLM or TensorRT.
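
For intuition on where that ~78 tokens/sec figure sits, here is a rough memory-bandwidth roofline sketch for single-stream decode. The assumption that each generated token streams all weights exactly once ignores KV-cache and attention reads, so this is an upper bound, not a prediction:

```python
# Back-of-the-envelope, bandwidth-bound ceiling for decode throughput.
bandwidth_gb_s = 2000.0   # H100 PCIe HBM2e, ~2.0 TB/s (from the analysis above)
weights_gb = 14.0         # INT8-quantized 14B model

# Assumption: each generated token must stream all weights once.
upper_bound_tok_s = bandwidth_gb_s / weights_gb
print(f"Theoretical ceiling: ~{upper_bound_tok_s:.0f} tok/s per stream")
# -> ~143 tok/s; the ~78 tok/s estimate is ~55% of that ceiling, which is
#    plausible once KV-cache traffic and attention reads are included.
```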