Can I run Phi-3 Medium 14B (INT8, 8-bit integer) on an NVIDIA A100 40GB?

Perfect
Yes, you can run this model!
GPU VRAM: 40.0GB
Required: 14.0GB
Headroom: +26.0GB

VRAM Usage

14.0GB of 40.0GB used (35%)

Performance Estimate

Tokens/sec: ~78.0
Batch size: 9
Context: 128K tokens

Technical Analysis

The NVIDIA A100 40GB is exceptionally well suited to running Phi-3 Medium 14B in its INT8-quantized form. At INT8, the model's weights occupy approximately 14GB of VRAM, leaving roughly 26GB of headroom on the A100's 40GB of HBM2. This surplus not only ensures the model fits comfortably in GPU memory but also leaves room for larger batch sizes, longer context lengths, and concurrent execution of other tasks or models. The A100's 1.56 TB/s of memory bandwidth is more than sufficient to keep the Tensor Cores fed, so memory bandwidth should not become a bottleneck during inference.
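
As a sanity check on the bandwidth claim, autoregressive decode is usually memory-bound: every generated token must stream the full weight set from HBM at least once, so single-stream throughput is bounded by bandwidth divided by weight bytes. The sketch below is a back-of-the-envelope estimate, not a benchmark; the 14GB and 1,555 GB/s figures come from the analysis above.

```python
# Back-of-the-envelope ceiling for memory-bound decode: each generated
# token streams the full weight set from HBM at least once, so
# single-stream tokens/sec <= bandwidth / weight_bytes.
weights_gb = 14.0        # Phi-3 Medium 14B at INT8 (~1 byte per parameter)
bandwidth_gbps = 1555.0  # A100 40GB HBM2 bandwidth, ~1.56 TB/s

ceiling_tps = bandwidth_gbps / weights_gb
print(f"Single-stream ceiling: ~{ceiling_tps:.0f} tokens/sec")
# ~111 tokens/sec; the ~78 tokens/sec estimate above sits below this
# ceiling, consistent with bandwidth not being the limiting factor.
```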

Recommendation

For optimal performance with Phi-3 Medium 14B on the A100, use an inference framework such as vLLM or NVIDIA TensorRT. Experiment with batch sizes up to the estimated value of 9 to maximize throughput, and, given the ample VRAM, consider raising the context length toward the model's 128,000-token maximum if your application needs it. Monitor GPU utilization and memory usage to fine-tune these parameters. If you are not already using INT8 quantization, applying it is highly recommended: it halves the weight footprint relative to FP16 and typically improves throughput.
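
For the vLLM path, here is a minimal launch sketch. It assumes the Hugging Face model ID microsoft/Phi-3-medium-128k-instruct and a recent vLLM release; how you enable INT8 depends on which quantization backends your vLLM version supports, so that choice is left as a comment rather than a flag.

```python
# Minimal vLLM sketch for Phi-3 Medium on an A100 40GB (assumptions:
# model ID below, recent vLLM; verify against your installed version).
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-medium-128k-instruct",  # assumed HF model ID
    max_model_len=32768,         # start below the 128K maximum, raise as VRAM allows
    max_num_seqs=9,              # estimated batch size from above
    gpu_memory_utilization=0.90,
    # quantization=...           # INT8 backend support varies by vLLM version
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain INT8 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```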

Recommended Settings

Batch size: 9
Context length: 128,000 tokens
Inference framework: vLLM or TensorRT
Quantization: INT8 (if not already applied)
Other settings:
- Enable CUDA graph capture for reduced latency
- Utilize TensorRT for optimized kernel execution
- Experiment with different attention mechanisms for further speedups
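
One caveat when combining these settings: KV-cache memory grows with both batch size and context length, so batch size 9 and the full 128K context cannot both be maxed within 26GB of headroom. The sketch below illustrates the trade-off; the layer, KV-head, and head-dimension values are assumptions for illustration only, so substitute the numbers from the model's own config.

```python
# Rough KV-cache budget for the ~26GB of headroom.
# Architecture numbers below are assumptions for illustration;
# read the real values from the model config before relying on them.
n_layers, n_kv_heads, head_dim = 40, 10, 128  # assumed Phi-3 Medium shape
bytes_per_elem = 2                            # FP16 KV cache

# Per cached token: K and V, per layer, per KV head, per head dim.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
headroom_bytes = 26 * 1024**3

total_tokens = headroom_bytes // kv_bytes_per_token
print(f"{kv_bytes_per_token / 1024:.0f} KiB per cached token")
print(f"~{total_tokens:,} cached tokens fit in the headroom")
# ~136K tokens shared across all in-flight sequences: e.g. batch 9 at
# ~15K tokens each, or a single sequence near the 128K maximum.
```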

Frequently Asked Questions

Is Phi-3 Medium 14B compatible with the NVIDIA A100 40GB?
Yes, Phi-3 Medium 14B is fully compatible with the NVIDIA A100 40GB, offering excellent performance and sufficient VRAM headroom, especially when using INT8 quantization.
What VRAM is needed for Phi-3 Medium 14B?
When quantized to INT8, Phi-3 Medium 14B requires approximately 14GB of VRAM for its weights (roughly 1 byte per parameter); in FP16, at 2 bytes per parameter, it needs about 28GB.
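
As a quick back-of-the-envelope check, weight memory scales linearly with bytes per parameter:

```python
# Weight-only footprint at different precisions (overhead excluded).
params = 14e9  # 14B parameters
for name, bytes_per_param in [("INT8", 1), ("FP16", 2)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.0f} GB of weights")
# INT8: 14 GB, FP16: 28 GB; activations and KV cache come on top.
```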
How fast will Phi-3 Medium 14B run on the NVIDIA A100 40GB?
With INT8 quantization, expect approximately 78 tokens per second on the NVIDIA A100 40GB. Performance will vary depending on the inference framework, batch size, and context length.