Can I run Phi-3 Medium 14B (q3_k_m) on NVIDIA H100 SXM?

Perfect
Yes, you can run this model!
GPU VRAM: 80.0GB
Required: 5.6GB
Headroom: +74.4GB

VRAM Usage

5.6GB of 80.0GB used (7%)

Performance Estimate

Tokens/sec ~90.0
Batch size 26
Context 128K (128,000 tokens)

Technical Analysis

The NVIDIA H100 SXM, with its substantial 80GB of HBM3 VRAM and 3.35 TB/s memory bandwidth, is exceptionally well-suited for running the Phi-3 Medium 14B model, especially when using quantization. Phi-3 Medium 14B in its q3_k_m quantized form requires only 5.6GB of VRAM, leaving a significant 74.4GB of headroom on the H100. This ample VRAM allows for large batch sizes and extended context lengths without encountering memory constraints. The H100's Hopper architecture, featuring 16896 CUDA cores and 528 Tensor cores, is designed for accelerating deep learning workloads, ensuring efficient computation for inference tasks.
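As a sanity check on the 5.6GB figure, the weights-only footprint of a quantized model is roughly the parameter count times the average bits per weight. The sketch below is a back-of-envelope estimate, not the calculator's exact formula; the bits-per-weight values are assumptions chosen to illustrate the arithmetic, and real GGUF files often come out somewhat larger.

```python
# Rough weights-only VRAM estimate for a quantized model.
# bits_per_weight values are assumptions for illustration; ~3.2 reproduces
# the 5.6GB figure quoted above for q3_k_m on a 14B-parameter model.
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"Phi-3 Medium 14B @ q3_k_m: ~{weight_vram_gb(14.0, 3.2):.1f} GB")  # ~5.6 GB
print(f"Phi-3 Medium 14B @ q4_k_m: ~{weight_vram_gb(14.0, 4.8):.1f} GB")  # ~8.4 GB
```

Note that the KV cache, which grows with context length and batch size, is not included in this estimate; the 74.4GB of headroom leaves ample room for it.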

Given the model size and the GPU's capabilities, the primary performance bottleneck will likely be memory bandwidth rather than compute: during token-by-token decoding, the model weights are streamed from HBM for every generated token, so effective use of the H100's 3.35 TB/s bandwidth determines throughput. Quantization significantly reduces both the memory footprint and the bytes read per token, enabling faster inference. The estimated 90 tokens/sec is a reasonable expectation, but actual performance will depend on the specific inference framework used and the degree of optimization applied.
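To make the bandwidth argument concrete: dividing peak bandwidth by the weight footprint gives an upper bound on single-stream decode speed. The sketch below is illustrative only; the efficiency factor is an assumption, and the ~90 tokens/sec estimate above sits well below this ceiling, leaving room for kernel, dequantization, and framework overhead.

```python
# Bandwidth-bound upper limit on single-stream decode speed:
# each new token requires streaming (roughly) all weights from HBM.
# The efficiency factor is an assumption; real stacks reach a fraction of peak.
def decode_tokens_per_sec(bandwidth_gb_s: float, weight_gb: float, efficiency: float) -> float:
    return bandwidth_gb_s * efficiency / weight_gb

print(f"Theoretical ceiling: {decode_tokens_per_sec(3350, 5.6, 1.0):.0f} tok/s")   # ~598
print(f"With 50% efficiency: {decode_tokens_per_sec(3350, 5.6, 0.5):.0f} tok/s")   # ~299
```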

Recommendation

For optimal performance, use an inference framework such as vLLM or NVIDIA's TensorRT-LLM, which are designed to exploit the H100's architecture and support efficient quantized inference. Note that q3_k_m is a llama.cpp/GGUF quantization, so either run it through llama.cpp directly or check your framework's GGUF support. Experiment with different batch sizes to find the best trade-off between throughput and latency; a batch size of 26 is a reasonable starting point. Keep the data pipeline optimized to minimize CPU-GPU transfers, consider CUDA graphs to reduce launch overhead, and if you hit performance limits, profile the application to identify bottlenecks and adjust accordingly.
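As a concrete starting point, a minimal vLLM offline-inference script might look like the sketch below. Because vLLM's GGUF support is still limited, the sketch assumes the standard Hugging Face checkpoint (microsoft/Phi-3-medium-128k-instruct) rather than the q3_k_m file, and the context length and memory-utilization values are illustrative rather than tuned.

```python
# Minimal vLLM offline-inference sketch (assumes vLLM is installed and the
# unquantized Hugging Face checkpoint is used; the q3_k_m GGUF file would
# instead go through llama.cpp).
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-medium-128k-instruct",
    max_model_len=128000,          # full 128K context fits easily in 80GB
    gpu_memory_utilization=0.90,   # leave some headroom for the runtime
    trust_remote_code=True,        # may be required for Phi-3 checkpoints
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV-cache paging in two sentences."], params)
print(outputs[0].outputs[0].text)
```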

While q3_k_m quantization is effective for reducing VRAM usage, consider other quantization levels (e.g., q4_k_m) that can improve output quality for a modest increase in memory footprint. Always validate the quantized model against a representative dataset to confirm that quantization has not significantly degraded output quality. Monitor GPU utilization and temperature to ensure the H100 stays within its thermal limits, especially given its 700W TDP.
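A quick, informal way to compare quantization levels is to run the same prompts through both GGUF files and inspect the outputs side by side; a rigorous check would use a perplexity or task-accuracy evaluation on a held-out dataset. The sketch below uses llama-cpp-python, and the model file names are placeholders.

```python
# Quick qualitative A/B check of two quantization levels with llama-cpp-python.
# File paths are placeholders; a proper evaluation would score a held-out
# dataset rather than eyeballing completions.
from llama_cpp import Llama

PROMPTS = ["Summarize the causes of the French Revolution in three bullet points."]

for path in ["phi-3-medium-q3_k_m.gguf", "phi-3-medium-q4_k_m.gguf"]:
    llm = Llama(model_path=path, n_gpu_layers=-1, n_ctx=8192, verbose=False)
    for prompt in PROMPTS:
        out = llm(prompt, max_tokens=200, temperature=0.0)
        print(f"--- {path} ---\n{out['choices'][0]['text']}\n")
    del llm  # free VRAM before loading the next quant
```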

Recommended Settings

Batch size: 26
Context length: 128000
Other settings: enable CUDA graphs; optimize the data pipeline for minimal CPU-GPU transfer; use Tensor Cores for FP16/BF16 acceleration; profile the application to identify bottlenecks
Inference framework: vLLM or NVIDIA TensorRT-LLM
Suggested quantization: q3_k_m (or experiment with q4_k_m)

Frequently Asked Questions

Is Phi-3 Medium 14B compatible with NVIDIA H100 SXM?
Yes, Phi-3 Medium 14B is fully compatible with the NVIDIA H100 SXM, with significant VRAM headroom to spare.
What VRAM is needed for Phi-3 Medium 14B?
With q3_k_m quantization, Phi-3 Medium 14B requires approximately 5.6GB of VRAM.
How fast will Phi-3 Medium 14B run on NVIDIA H100 SXM?
You can expect around 90 tokens/sec, but actual performance depends on the inference framework and optimization level.