Can I run Phi-3 Mini 3.8B (INT8, 8-bit integer) on an NVIDIA H100 PCIe?

Perfect: Yes, you can run this model!

GPU VRAM: 80.0GB
Required: 3.8GB
Headroom: +76.2GB

VRAM Usage: 3.8GB of 80.0GB (~5% used)

Performance Estimate

Tokens/sec: ~117.0
Batch size: 32
Context: 128K (128,000 tokens)

Technical Analysis

The NVIDIA H100 PCIe, with its substantial 80GB of HBM2e VRAM and 2.0 TB/s of memory bandwidth, is exceptionally well suited to running the Phi-3 Mini 3.8B model. Quantized to INT8, the model requires a mere 3.8GB of VRAM, leaving a significant 76.2GB of headroom. This ample VRAM allows for large batch sizes and extended context lengths without encountering memory limitations. The H100's Hopper architecture, featuring 14,592 CUDA cores and 456 Tensor Cores, provides substantial computational power for accelerating inference, and the high memory bandwidth keeps weights and KV-cache data flowing to the compute units fast enough to avoid bottlenecks during model execution.
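As a sanity check on the 3.8GB figure, a rough weights-only estimate (ignoring KV cache, activations, and framework overhead, which add a little on top) is simply parameters × bytes per parameter:

```python
# Back-of-the-envelope VRAM estimate for Phi-3 Mini at INT8.
# Weights only; KV cache, activations, and runtime overhead are ignored,
# so real usage will be somewhat higher than this lower bound.
params_billion = 3.8        # Phi-3 Mini parameter count
bytes_per_param = 1         # INT8 = 1 byte per weight
gpu_vram_gb = 80.0          # NVIDIA H100 PCIe

weights_gb = params_billion * bytes_per_param   # ~3.8 GB
headroom_gb = gpu_vram_gb - weights_gb          # ~76.2 GB

print(f"Estimated weights: {weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```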

The INT8 quantization further enhances performance by reducing the memory footprint and computational demands of the model. This allows for faster processing and increased throughput. The estimated 117 tokens/sec indicates a very responsive inference speed, making the H100 an excellent choice for real-time applications. The large VRAM headroom also allows for experimentation with larger models or more complex inference pipelines without exceeding the GPU's capabilities. Given the H100's specifications, the Phi-3 Mini will operate with a high degree of efficiency, ensuring low latency and high throughput.
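For a quick local experiment outside a dedicated serving stack, one common way to load the model with INT8 weights is Hugging Face transformers with bitsandbytes 8-bit quantization. A minimal sketch follows; the model ID is the 128K-context Phi-3 Mini checkpoint on the Hugging Face Hub, and older transformers releases may additionally need trust_remote_code=True:

```python
# Minimal sketch: load Phi-3 Mini with 8-bit weights via bitsandbytes.
# Assumes transformers, accelerate, and bitsandbytes are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-128k-instruct"   # 128K-context variant

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # INT8 weights
    device_map="auto",   # place the quantized weights on the H100
)

prompt = "Explain INT8 quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```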

Recommendation

Given the H100's capabilities, users should prioritize maximizing batch size to fully utilize the available resources. Experiment with batch sizes up to the estimated 32, and potentially higher, to optimize throughput. Using the full 128K (128,000-token) context length is feasible, but monitor performance to ensure responsiveness. Consider an inference framework optimized for NVIDIA GPUs, such as vLLM or NVIDIA TensorRT-LLM, to further enhance performance; these frameworks take advantage of the H100's Tensor Cores and other architectural features.
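As an illustration of the vLLM route, a minimal offline-inference sketch might look like the following. The memory-utilization fraction and sampling settings are assumptions to adapt to your setup, and the INT8 path itself (a pre-quantized checkpoint or a quantization flag) depends on the vLLM version you run:

```python
# Minimal vLLM sketch: batched generation with the full 128K context window.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",
    max_model_len=128000,          # full 128K context window
    gpu_memory_utilization=0.90,   # fraction of the 80GB vLLM may claim for weights + KV cache
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)

# A batch of 32 prompts to exercise the suggested batch size.
prompts = [f"Summarize topic {i} in one sentence." for i in range(32)]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```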

If the estimated performance is not achieved, ensure that the NVIDIA drivers are up to date and that the system is properly configured for GPU acceleration. Profiling tools can help identify bottlenecks in the inference pipeline. While INT8 quantization is sufficient here, running at FP16 or BF16 instead would trade a larger weight footprint (roughly 7.6GB at 2 bytes per parameter) for full-precision quality, which the substantial headroom easily absorbs. Always monitor GPU utilization and memory consumption to fine-tune settings for optimal performance.
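For the monitoring step, a quick way to confirm the GPU is visible to the driver stack and to read memory headroom from Python is torch's CUDA utilities (a simple sketch; nvidia-smi or a dedicated profiler gives finer detail):

```python
# Quick check of GPU visibility and VRAM headroom before and after loading the model.
import torch

assert torch.cuda.is_available(), "CUDA not visible: check driver and CUDA installation"

device = torch.cuda.current_device()
print("GPU:", torch.cuda.get_device_name(device))

free_bytes, total_bytes = torch.cuda.mem_get_info(device)
print(f"Free VRAM: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")

# Memory tracked by PyTorch's caching allocator (meaningful once the model is loaded).
print(f"Allocated: {torch.cuda.memory_allocated(device) / 1e9:.1f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved(device) / 1e9:.1f} GB")
```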

Recommended Settings

Batch size: 32 (experiment with higher values)
Context length: 128,000 tokens
Other settings: enable CUDA graph capture; use persistent memory allocation; optimize the data loading pipeline (see the sketch below)
Inference framework: vLLM or NVIDIA TensorRT-LLM
Suggested quantization: INT8 (already optimal)
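Two of these settings map directly onto vLLM engine arguments: CUDA graph capture is enabled by default and only disabled by enforce_eager, and the concurrent-sequence cap roughly corresponds to the suggested batch size. A hedged sketch (parameter names as of recent vLLM releases):

```python
# Sketch of engine options reflecting the suggested settings (recent vLLM releases).
from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",
    enforce_eager=False,    # keep CUDA graph capture enabled (this is the default)
    max_model_len=128000,   # suggested context length
    max_num_seqs=32,        # cap concurrent sequences near the suggested batch size
)
```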

Frequently Asked Questions

Is Phi-3 Mini 3.8B (3.80B) compatible with NVIDIA H100 PCIe?
Yes, Phi-3 Mini 3.8B is perfectly compatible with the NVIDIA H100 PCIe.
What VRAM is needed for Phi-3 Mini 3.8B (3.80B)?
When quantized to INT8, Phi-3 Mini 3.8B requires approximately 3.8GB of VRAM.
How fast will Phi-3 Mini 3.8B (3.80B) run on NVIDIA H100 PCIe?
You can expect an estimated inference speed of around 117 tokens/sec on the NVIDIA H100 PCIe.