Can I run Qwen 2.5 14B on NVIDIA H100 PCIe?

Perfect
Yes, you can run this model!

GPU VRAM: 80.0 GB
Required: 28.0 GB
Headroom: +52.0 GB

VRAM Usage

28.0 GB of 80.0 GB used (35%)

Performance Estimate

Tokens/sec: ~78.0
Batch size: 18
Context: 131,072 tokens

Technical Analysis

The NVIDIA H100 PCIe, with 80GB of HBM2e memory and 2.0 TB/s of memory bandwidth, is well suited to running Qwen 2.5 14B. The model needs roughly 28GB of VRAM in FP16, leaving about 52GB of headroom on the H100. That headroom allows larger batch sizes, longer context lengths, or even multiple model instances running concurrently. The Hopper architecture's 14592 CUDA cores and 456 Tensor Cores provide ample compute for low-latency, high-throughput inference, and the high memory bandwidth keeps weights and activations moving quickly during token generation, which is typically the memory-bound phase.
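As a rough sanity check on the 28GB figure, FP16 stores two bytes per parameter, so weight memory for a 14B-parameter model is about 28 GB before the KV cache, activations, and framework overhead are added. The snippet below is an illustrative back-of-the-envelope calculation only, not output from this tool:

```python
# Rough FP16 weight-memory estimate for a 14B-parameter model.
# Illustrative only: real usage also includes the KV cache, activations,
# and framework overhead, so treat this as the floor for weights alone.

params = 14e9          # parameter count (14B)
bytes_per_param = 2    # FP16 = 2 bytes per parameter

weight_gb = params * bytes_per_param / 1e9   # decimal GB, matching the 28 GB figure above
print(f"FP16 weights: ~{weight_gb:.0f} GB")  # -> ~28 GB
```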

Recommendation

Given the significant VRAM headroom, experiment with larger batch sizes to maximize throughput: start at the estimated batch size of 18 and increase it incrementally until throughput stops improving or you hit memory limits. You can also push the context length toward the model's 131,072-token maximum, keeping in mind that the KV cache grows with both context length and batch size, so the two trade off against each other. Quantization (e.g., int8 or lower precision) can further reduce the memory footprint and speed up inference, though with this much headroom it is optional for a single model instance. For best performance, use an inference framework such as vLLM or NVIDIA TensorRT-LLM, which are built to exploit the H100's hardware acceleration.
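As a concrete starting point, here is a minimal offline-inference sketch using vLLM. The model ID `Qwen/Qwen2.5-14B-Instruct`, the reduced `max_model_len`, and the sampling settings are illustrative assumptions to adjust for your own workload, not values verified by this analysis:

```python
# Minimal vLLM sketch for Qwen 2.5 14B on a single H100 PCIe.
# Assumptions: Hugging Face model ID, context/batch/sampling values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",  # assumed model ID
    dtype="float16",                    # matches the 28 GB FP16 estimate
    max_model_len=32768,                # raise toward 131072 once KV-cache headroom is confirmed
    max_num_seqs=18,                    # start at the estimated batch size, then tune
    gpu_memory_utilization=0.90,        # leave a little slack on the 80 GB card
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain the Hopper architecture in two sentences."], sampling)
print(outputs[0].outputs[0].text)
```

The same knobs carry over to vLLM's OpenAI-compatible server if you prefer to serve the model over HTTP instead of running it in-process.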

Recommended Settings

Batch size: 18 (start here, then increase gradually)
Context length: up to 131,072 tokens
Inference framework: vLLM or NVIDIA TensorRT-LLM
Quantization: int8 (optional, for further speedup)
Other settings:
- Enable CUDA graph capture
- Use PyTorch FSDP for multi-GPU (if scaling)
- Profile performance with NVIDIA Nsight Systems
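If you do want to try the optional int8 path, one common route is 8-bit weight loading via bitsandbytes through Hugging Face Transformers. This is a sketch of one option rather than a required setup, and the model ID is again an assumption:

```python
# Sketch of optional int8 weight loading via bitsandbytes + Transformers.
# With 80 GB of VRAM this is optional; it mainly frees memory for larger
# batches or longer contexts rather than being needed to fit the model.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-14B-Instruct"          # assumed model ID
quant_cfg = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_cfg,  # int8 weights, roughly halving weight memory vs FP16
    device_map="auto",              # place layers on the H100 automatically
)

inputs = tokenizer("Hello from the H100!", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```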

Frequently Asked Questions

Is Qwen 2.5 14B (14.00B) compatible with NVIDIA H100 PCIe?
Yes, the NVIDIA H100 PCIe is perfectly compatible with Qwen 2.5 14B, offering ample VRAM and computational resources.
What VRAM is needed for Qwen 2.5 14B (14.00B)?
Qwen 2.5 14B requires approximately 28GB of VRAM when using FP16 precision.
How fast will Qwen 2.5 14B (14.00B) run on NVIDIA H100 PCIe?
You can expect an estimated throughput of around 78 tokens per second on the NVIDIA H100 PCIe. Actual performance may vary depending on batch size, context length, and inference framework used.