Can I run Phi-3 Small 7B (q3_k_m) on NVIDIA H100 PCIe?

Compatibility: Perfect
Yes, you can run this model!
GPU VRAM: 80.0GB
Required: 2.8GB
Headroom: +77.2GB

VRAM Usage: 2.8GB of 80.0GB (~3.5% used)

Performance Estimate

Tokens/sec: ~117.0
Batch size: 32
Context: 128K (128,000 tokens)

Technical Analysis

The NVIDIA H100 PCIe, with its 80GB of HBM2e VRAM and 2.0 TB/s of memory bandwidth, is exceptionally well suited to running Phi-3 Small 7B. Even at unquantized FP16 precision, the model needs only about 14GB of VRAM, and with q3_k_m quantization the footprint shrinks to roughly 2.8GB. That leaves 77.2GB of headroom, ample space for larger batch sizes, extended context lengths, and other memory-intensive operations. The H100's 14,592 CUDA cores and 456 Tensor Cores handle the model's computation efficiently, yielding high throughput.
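
As a sanity check, the VRAM figures above follow from simple per-weight arithmetic. The sketch below is a back-of-the-envelope estimate only: the 3.2 bits/weight value is what the calculator's 2.8GB result implies for q3_k_m (published estimates for this quant are often closer to ~3.9 bpw), and real usage adds KV cache and activation overhead on top of the weights.

```python
# Back-of-the-envelope VRAM estimate for model weights only.
# Bits-per-weight values are approximations; 3.2 bpw matches the
# calculator's 2.8GB q3_k_m figure. Excludes KV cache and activations.

PARAMS = 7.0e9  # Phi-3 Small parameter count

def weight_vram_gb(params: float, bits_per_weight: float) -> float:
    """Weights-only VRAM in GB."""
    return params * bits_per_weight / 8 / 1e9

print(f"FP16:   {weight_vram_gb(PARAMS, 16):.1f} GB")   # ~14.0 GB
print(f"q3_k_m: {weight_vram_gb(PARAMS, 3.2):.1f} GB")  # ~2.8 GB
```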

Recommendation

Given the H100's capabilities and the model's small size (especially after quantization), you can experiment with larger batch sizes to maximize throughput. Start with a batch size of 32, as indicated, and increase it incrementally until you see diminishing returns or hit memory limits (unlikely with this setup). Explore inference frameworks such as `vLLM` or `text-generation-inference`, which are optimized for high throughput and low latency, and consider speculative decoding to push the tokens/sec rate further. Finally, confirm that you have the latest NVIDIA drivers installed to ensure optimal performance and compatibility.
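
If you go the vLLM route, a minimal offline-inference sketch might look like the following. The model ID (`microsoft/Phi-3-small-128k-instruct`) and all parameter values are illustrative assumptions to adapt, not a verified configuration; note that vLLM typically serves Phi-3 Small in FP16/BF16 rather than a GGUF q3_k_m build, which the 77GB of headroom easily accommodates.

```python
# Minimal vLLM offline-inference sketch; model ID and parameter
# values are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-small-128k-instruct",  # 128K-context variant
    max_model_len=131072,         # full 128K context fits easily in 80GB
    gpu_memory_utilization=0.90,  # leave a little headroom
    trust_remote_code=True,       # Phi-3 Small ships custom model code
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV-cache paging in one paragraph."], params)
print(outputs[0].outputs[0].text)
```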

Recommended Settings

Batch size: 32-64 (experiment to find optimal)
Context length: 128,000 tokens
Other settings: enable CUDA graph capture, use PagedAttention, experiment with speculative decoding
Inference framework: vLLM
Suggested quantization: q3_k_m
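
To run the q3_k_m build itself, a GGUF-aware runtime such as llama.cpp is the more common host (vLLM's GGUF support is still experimental). Below is a minimal sketch using recent llama-cpp-python bindings; the model path is a hypothetical placeholder, and the settings mirror the recommendations above.

```python
# Minimal llama-cpp-python sketch for a q3_k_m GGUF build; the model
# path is a placeholder, not a real file name.
from llama_cpp import Llama

llm = Llama(
    model_path="phi-3-small-7b.Q3_K_M.gguf",  # hypothetical local file
    n_ctx=131072,     # 128K context; the KV cache still fits in 80GB
    n_gpu_layers=-1,  # offload every layer to the H100
    n_batch=512,      # prompt-processing batch size; worth tuning
    flash_attn=True,  # flash attention (recent builds)
)

out = llm("Summarize the Phi-3 Small architecture.", max_tokens=128)
print(out["choices"][0]["text"])
```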

Frequently Asked Questions

Is Phi-3 Small 7B (7.00B) compatible with NVIDIA H100 PCIe?
Yes, Phi-3 Small 7B is perfectly compatible with the NVIDIA H100 PCIe. The H100 provides significant VRAM and processing power to run the model efficiently.
What VRAM is needed for Phi-3 Small 7B (7.00B)?
The VRAM needed for Phi-3 Small 7B depends on the precision and quantization level. In FP16, it requires approximately 14GB. With q3_k_m quantization, the VRAM requirement drops to around 2.8GB.
How fast will Phi-3 Small 7B (7.00B) run on NVIDIA H100 PCIe?
With q3_k_m quantization, you can expect approximately 117 tokens/sec. This can vary depending on batch size, context length, and the specific inference framework used. Experimentation with different settings is encouraged to optimize performance.
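
Rather than taking the ~117 tokens/sec estimate on faith, you can measure single-stream throughput directly on your own hardware. A rough sketch, again with a placeholder model path:

```python
# Rough single-stream tokens/sec measurement; real numbers depend on
# batch size, context length, and framework, as noted above.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="phi-3-small-7b.Q3_K_M.gguf",  # hypothetical placeholder
    n_ctx=8192,
    n_gpu_layers=-1,
    verbose=False,
)

start = time.perf_counter()
out = llm("Write a detailed essay about GPU memory hierarchies.", max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]  # tokens actually produced
print(f"~{generated / elapsed:.1f} tokens/sec (single stream)")
```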