Can I run Qwen 2.5 14B (q3_k_m) on NVIDIA H100 PCIe?

Perfect
Yes, you can run this model!
GPU VRAM
80.0GB
Required
5.6GB
Headroom
+74.4GB

VRAM Usage

5.6GB used of 80.0GB (7%)

Performance Estimate

Tokens/sec ~78.0
Batch size 26
Context 131072 tokens (128K)
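The throughput estimate can be sanity-checked against a simple memory-bandwidth roofline: single-stream decode is usually bandwidth-bound, since each generated token must stream the full weight set from VRAM once. A minimal sketch using the 2.0 TB/s and 5.6GB figures from this report (a theoretical ceiling, not a benchmark):

```python
# Roofline upper bound for memory-bandwidth-bound decode:
# tokens/sec <= memory bandwidth / bytes read per token.
mem_bw_gb_s = 2000.0   # H100 PCIe HBM2e bandwidth, GB/s
weights_gb = 5.6       # Qwen 2.5 14B at q3_k_m

ceiling = mem_bw_gb_s / weights_gb  # theoretical tokens/sec, single stream
print(round(ceiling))               # ~357 tokens/sec
```

The quoted ~78 tokens/sec sits comfortably under this ceiling, which is expected: dequantization overhead, attention over the KV cache, and kernel launch costs all eat into the theoretical maximum.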

Technical Analysis

The NVIDIA H100 PCIe, with 80GB of HBM2e VRAM and 2.0 TB/s of memory bandwidth, is exceptionally well suited to running Qwen 2.5 14B, especially in quantized form. At q3_k_m quantization the model's weights require only about 5.6GB of VRAM, leaving 74.4GB of headroom, so even long contexts or large batch sizes are unlikely to hit memory limits. The Hopper architecture's 14592 CUDA cores and 456 Tensor Cores supply ample compute for high-throughput, low-latency inference.
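The headroom arithmetic above is easy to reproduce. A minimal sketch: the ~3.2 bits-per-weight figure is back-derived from the 5.6GB requirement quoted in this report, not an official spec, and the estimate covers weights only (KV cache and activations consume additional VRAM):

```python
def quantized_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough VRAM needed for the model weights alone, in decimal GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

weights = quantized_vram_gb(14.0, 3.2)  # ~5.6 GB for Qwen 2.5 14B q3_k_m
headroom = 80.0 - weights               # ~74.4 GB free on an 80GB H100
print(f"weights: {weights:.1f} GB, headroom: {headroom:.1f} GB")
```

The same helper lets you check whether a higher-precision quant fits: q4_k_m at roughly 4.5 bits per weight would still need under 8GB.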

Recommendation

For optimal performance, serve the model with an inference framework optimized for NVIDIA GPUs, such as `vLLM` or `text-generation-inference`, which support continuous batching and tensor parallelism. While q3_k_m is memory-efficient, higher-precision quantization (e.g., q4_k_m, or even FP16 given the available VRAM) may improve output quality at some cost in throughput. With this much headroom, you can also raise the context length toward the model's full 131,072-token (128K) window and increase the batch size to maximize GPU utilization.
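As one concrete starting point: q3_k_m is a GGUF quantization, so llama.cpp's built-in server runs it natively (vLLM's GGUF support is more experimental). A sketch, not a verified recipe; the model filename is hypothetical and should match your downloaded GGUF file:

```shell
# Serve Qwen 2.5 14B q3_k_m with all layers on the GPU,
# the full 128K context, and 26 parallel sequences.
llama-server -m ./qwen2.5-14b-instruct-q3_k_m.gguf \
    -c 131072 -ngl 99 --parallel 26
```

`-ngl 99` offloads every layer to the GPU; note that a 128K context with many parallel sequences grows the KV cache substantially, so start lower and scale up while watching VRAM.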

Recommended Settings

Batch size
26
Context length
131072 tokens
Other settings
Enable CUDA graph capture; use PagedAttention; optimize Tensor Core usage
Inference framework
vLLM or text-generation-inference
Suggested quantization
q3_k_m (or higher precision if VRAM allows)

Frequently Asked Questions

Is Qwen 2.5 14B (14.00B) compatible with NVIDIA H100 PCIe?
Yes, Qwen 2.5 14B is perfectly compatible with the NVIDIA H100 PCIe.
What VRAM is needed for Qwen 2.5 14B (14.00B)?
In its q3_k_m quantized form, Qwen 2.5 14B requires approximately 5.6GB of VRAM.
How fast will Qwen 2.5 14B (14.00B) run on NVIDIA H100 PCIe?
You can expect approximately 78 tokens/sec with the specified configuration. Performance may vary depending on the specific inference framework and optimization techniques used.