Can I run Phi-3 Medium 14B on NVIDIA H100 PCIe?

Perfect
Yes, you can run this model!
GPU VRAM: 80.0GB
Required: 28.0GB
Headroom: +52.0GB

VRAM Usage

28.0GB of 80.0GB (35% used)

Performance Estimate

Tokens/sec: ~78.0
Batch size: 18
Context: 128K tokens

Technical Analysis

The NVIDIA H100 PCIe, with its substantial 80GB of HBM2e VRAM and 2.0 TB/s memory bandwidth, is exceptionally well-suited for running the Phi-3 Medium 14B model. Phi-3 Medium 14B, requiring 28GB of VRAM in FP16 precision, leaves a significant 52GB of VRAM headroom on the H100. This ample headroom not only ensures smooth operation but also allows for larger batch sizes and longer context lengths, maximizing throughput. The H100's 14,592 CUDA cores and 456 Tensor Cores further contribute to efficient computation, accelerating both inference and training tasks.
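
The 28GB figure follows directly from the parameter count. The sketch below is a back-of-the-envelope check only (parameter count and bytes per parameter are the only inputs); activations, KV cache, and framework overhead are not included and are paid for out of the remaining headroom:

```python
# Rough FP16 weight-memory estimate for Phi-3 Medium 14B on an 80 GB H100 PCIe.
# Activations, KV cache, and framework overhead are not counted here.
params = 14e9                  # ~14 billion parameters
bytes_per_param = 2            # FP16 = 2 bytes per parameter
weight_vram_gb = params * bytes_per_param / 1e9    # ~28 GB
headroom_gb = 80.0 - weight_vram_gb                # ~52 GB left on the H100
print(f"weights: {weight_vram_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```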

The high memory bandwidth of the H100 is crucial for feeding the GPU cores with the necessary data, preventing bottlenecks and ensuring optimal utilization of the available compute resources. This is particularly important for large language models like Phi-3 Medium 14B, which are memory-intensive. The combination of abundant VRAM and high memory bandwidth enables the H100 to handle the model's parameters and activations with ease, leading to faster inference times and improved overall performance. The Hopper architecture provides additional optimizations for transformer models, further enhancing the efficiency of the setup.
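
The bandwidth argument can be made concrete with a rough estimate: if each generated token has to stream all FP16 weights from HBM once, single-sequence decode speed is bounded by bandwidth divided by model size. This is a hedged upper bound, not a benchmark, but it lands in the same ballpark as the ~78 tokens/sec figure above:

```python
# Memory-bandwidth-bound decode estimate (upper bound per sequence).
# Assumes every generated token reads the full FP16 weight set once from HBM.
bandwidth_gb_per_s = 2000.0    # H100 PCIe memory bandwidth, ~2.0 TB/s
weights_gb = 28.0              # Phi-3 Medium 14B in FP16
single_stream_tok_per_s = bandwidth_gb_per_s / weights_gb
print(f"~{single_stream_tok_per_s:.0f} tokens/sec per sequence")   # ~71
# Batched serving amortizes the same weight reads across sequences,
# so aggregate throughput can be substantially higher than this bound.
```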

Recommendation

Given the H100's capabilities, users should aim to maximize batch size and context length to fully utilize the available resources. Experiment with different inference frameworks like vLLM or text-generation-inference to find the best balance between latency and throughput. Quantization to INT8 or even lower precisions could further improve performance without significant loss in accuracy, allowing for even larger batch sizes. However, FP16 should provide excellent performance and quality to start with. Monitor GPU utilization to ensure the H100 is being fully utilized; if not, increase batch size or context length until utilization plateaus.
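
One way to do that monitoring from Python is NVML via the nvidia-ml-py (pynvml) package; this is a minimal sketch, assuming that package is installed and that the H100 is GPU index 0:

```python
# Minimal GPU utilization / VRAM monitor using NVML (pip install nvidia-ml-py).
# Run alongside the inference server while tuning batch size and context
# length; stop increasing them once utilization stops climbing.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # assumes the H100 is device 0

try:
    for _ in range(30):                          # sample for ~30 seconds
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU util: {util.gpu:3d}%  VRAM used: {mem.used / 1e9:5.1f} GB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```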

Recommended Settings

Batch size: 18 (starting point; increase until GPU utilization plateaus)
Context length: 128000
Other settings: enable CUDA graph capture; use TensorRT for further optimization; experiment with different attention mechanisms
Inference framework: vLLM or text-generation-inference
Quantization suggested: INT8 (optional, for further performance gains)
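
As a concrete starting point, these settings map onto vLLM's Python API roughly as follows. This is a minimal sketch rather than a tuned configuration: the Hugging Face repo id, the sampling parameters, and the 0.90 memory-utilization fraction are assumptions, and INT8 quantization is omitted because the appropriate vLLM quantization backend depends on how the checkpoint was quantized.

```python
from vllm import LLM, SamplingParams

# Sketch of a vLLM setup mirroring the recommended settings above.
llm = LLM(
    model="microsoft/Phi-3-medium-128k-instruct",  # assumed HF repo id
    dtype="float16",              # FP16 weights, ~28 GB on the H100
    max_model_len=128_000,        # recommended context length
    max_num_seqs=18,              # cap on concurrent sequences (~batch size 18)
    gpu_memory_utilization=0.90,  # leave a little VRAM slack for CUDA graphs
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Summarize the Hopper architecture in two sentences."], sampling
)
print(outputs[0].outputs[0].text)
```

The same knobs are available as flags on the vLLM and text-generation-inference servers; whichever framework you pick, raise the batch and context limits together while watching the utilization and VRAM readings from the monitor above.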

Frequently Asked Questions

Is Phi-3 Medium 14B (14.00B) compatible with NVIDIA H100 PCIe?
Yes, Phi-3 Medium 14B is fully compatible with the NVIDIA H100 PCIe, which provides ample VRAM and compute headroom for it.
What VRAM is needed for Phi-3 Medium 14B (14.00B)?
Phi-3 Medium 14B requires approximately 28GB of VRAM when using FP16 precision.
How fast will Phi-3 Medium 14B (14.00B) run on NVIDIA H100 PCIe?
You can expect approximately 78 tokens/sec with a batch size of 18. Performance may vary depending on the inference framework and specific settings used.