The NVIDIA H100 SXM, with its 80GB of HBM3 memory and 3.35 TB/s of memory bandwidth, offers ample resources for running the Phi-3 Medium 14B model, especially when quantized to INT8. At INT8, the weights of Phi-3 Medium 14B occupy roughly 14GB of VRAM, leaving around 66GB of headroom for the KV cache, activations, and batching. That headroom allows for larger batch sizes and longer context lengths, maximizing GPU utilization. The H100's 16,896 CUDA cores and 528 Tensor Cores comfortably cover the model's compute demands, so excellent inference speeds are well within reach.
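As a rough sanity check on those figures, the weight footprint scales linearly with bytes per parameter; the sketch below reproduces the 14GB/66GB split and shows how FP16/BF16 and FP32 compare (weight-only estimates that ignore KV cache and activation memory):

```python
# Back-of-the-envelope weight-memory estimate for Phi-3 Medium 14B on an
# 80GB H100 SXM. Weight-only figures; the KV cache, activations, and
# framework overhead consume additional VRAM.
params = 14e9
total_vram_gb = 80
bytes_per_param = {"INT8": 1, "FP16/BF16": 2, "FP32": 4}

for precision, nbytes in bytes_per_param.items():
    weights_gb = params * nbytes / 1e9
    print(f"{precision:>9}: weights ~{weights_gb:.0f} GB, "
          f"headroom ~{total_vram_gb - weights_gb:.0f} GB")
```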
Furthermore, the H100's high memory bandwidth keeps the per-token cost of streaming weights from HBM low, which is the dominant cost in autoregressive decoding. The Hopper architecture's Tensor Cores accelerate the matrix multiplications at the heart of transformer models like Phi-3. Together, the ample VRAM, high memory bandwidth, and specialized hardware make fast, efficient inference possible, and the estimated 90 tokens/sec reflects what this configuration can realistically deliver.
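To put the 90 tokens/sec estimate in context, a simple bandwidth-roofline calculation gives an upper bound on single-stream decode speed: each generated token requires reading roughly the full weight set from HBM once. This ignores KV-cache traffic and kernel overhead, so it is an optimistic ceiling rather than a prediction:

```python
# Bandwidth-bound ceiling on single-stream decode throughput.
# Assumes every generated token streams the full INT8 weight set from HBM
# once; KV-cache reads and kernel overhead lower this in practice.
hbm_bandwidth_gb_s = 3350   # H100 SXM HBM3, ~3.35 TB/s
weights_gb = 14             # Phi-3 Medium 14B at INT8

ceiling = hbm_bandwidth_gb_s / weights_gb
print(f"Theoretical ceiling: ~{ceiling:.0f} tokens/sec")  # ~239; 90 tok/s sits well below it
```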
Given the H100's capabilities, prioritize maximizing batch size to improve throughput. Experiment with different batch sizes, starting from the estimated 23, and monitor GPU utilization; if the GPU isn't fully utilized, increase the batch size further. A serving framework such as vLLM or NVIDIA's TensorRT-LLM provides further optimization and can raise tokens/sec, as in the sketch below. Techniques like speculative decoding can boost inference speed further still.
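Here is a minimal vLLM offline-inference sketch, assuming an INT8-quantized Phi-3 Medium checkpoint is available; the Hugging Face repo name and the quantization backend are placeholders and should be swapped for whatever export you actually serve. Note that under vLLM's continuous batching, max_num_seqs caps the number of concurrently scheduled sequences rather than fixing a static batch size:

```python
# Minimal vLLM sketch (model ID and quantization backend are illustrative
# assumptions; point these at your own INT8 checkpoint, e.g. a GPTQ- or
# AWQ-quantized export of Phi-3 Medium).
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-medium-4k-instruct",  # assumed HF repo; swap for your INT8 checkpoint
    # quantization="gptq",                       # set to match the checkpoint's quantization format
    max_num_seqs=23,              # starting point from the batch-size estimate above
    gpu_memory_utilization=0.90,  # leave some VRAM for CUDA graphs and fragmentation
    max_model_len=4096,
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain INT8 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

From there, raise max_num_seqs gradually while watching GPU utilization and latency to find the throughput sweet spot for your workload.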
While INT8 quantization offers a good balance of performance and memory usage, you can also run the model in FP16 or BF16 if higher precision is required, at the cost of roughly doubling the weight footprint from about 14GB to about 28GB. For most applications, however, INT8 should provide sufficient accuracy with significant performance benefits.
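For a quick side-by-side of the two precisions, the sketch below loads the model in BF16 and in bitsandbytes INT8 via Hugging Face Transformers; the model ID is an assumption, and bitsandbytes' LLM.int8() is only one of several INT8 schemes, so it may differ from the quantization used by your serving stack:

```python
# Sketch: loading Phi-3 Medium in BF16 vs. bitsandbytes INT8 with Transformers.
# Model ID is assumed; adjust to the checkpoint you actually use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-medium-4k-instruct"  # assumed Hugging Face repo
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Higher precision: ~2 bytes/parameter, roughly 28GB of weights.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# INT8: ~1 byte/parameter, roughly 14GB of weights.
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```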