Can I run Phi-3 Medium 14B (INT8, 8-bit integer) on an NVIDIA RTX 3090?

Perfect
Yes, you can run this model!
GPU VRAM: 24.0GB
Required: 14.0GB
Headroom: +10.0GB

VRAM Usage

14.0GB of 24.0GB used (58%)

Performance Estimate

Tokens/sec: ~60.0
Batch size: 3
Context: 128K tokens

Technical Analysis

The NVIDIA RTX 3090, with its 24GB of GDDR6X VRAM, is well-suited for running the Phi-3 Medium 14B model, especially when using INT8 quantization. Quantization reduces the model's memory footprint, bringing the VRAM requirement down to a manageable 14GB. This leaves a significant 10GB VRAM headroom, which is beneficial for handling larger batch sizes and longer context lengths without encountering out-of-memory errors. The RTX 3090's substantial memory bandwidth of 0.94 TB/s ensures rapid data transfer between the GPU and memory, crucial for maintaining high inference speeds. Furthermore, the 10496 CUDA cores and 328 Tensor Cores provide ample computational power to accelerate the matrix multiplications and other operations inherent in transformer-based models like Phi-3.
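
As a rough sanity check on these numbers, the 14GB figure corresponds to the weights alone at roughly one byte per parameter under INT8; the short sketch below (illustrative constants, not measured values) shows how the weight footprint and the remaining headroom on a 24GB card work out.

```python
# Back-of-the-envelope VRAM check for Phi-3 Medium 14B under INT8.
# Assumption: INT8 stores roughly one byte per weight; KV cache and
# activations then have to fit inside whatever headroom remains.

GPU_VRAM_GB = 24.0      # RTX 3090
PARAMS_B = 14.0         # model size in billions of parameters
BYTES_PER_PARAM = 1.0   # INT8 ~ 1 byte per weight

weights_gb = PARAMS_B * BYTES_PER_PARAM   # ~14 GB of weights
headroom_gb = GPU_VRAM_GB - weights_gb    # ~10 GB for KV cache, activations, framework overhead
print(f"Weights: {weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB "
      f"({weights_gb / GPU_VRAM_GB:.0%} of VRAM used by weights)")
```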

Recommendation

For optimal performance with Phi-3 Medium 14B on the RTX 3090, prioritize an efficient inference framework such as `llama.cpp` or `vLLM`. Experiment with different batch sizes to find a balance between throughput and latency: a batch size of 3 is a good starting point, but increasing it can significantly improve tokens/sec if your application is less sensitive to latency. Also consider using a context length shorter than the 128K maximum if you don't need the full window, since shorter contexts reduce KV-cache pressure and generally process faster. Monitor GPU utilization and VRAM usage to fine-tune these parameters for your workload, and profile your application; optimized attention implementations such as FlashAttention or vLLM's PagedAttention can further improve performance.
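
As a concrete starting point, a minimal vLLM offline-inference sketch along these lines can be used to compare batch sizes. The Hugging Face model id and the quantized checkpoint are assumptions here: vLLM typically loads a pre-quantized checkpoint (e.g. GPTQ/AWQ) or a supported quantization backend rather than applying INT8 on the fly, so substitute whatever INT8 build you actually run.

```python
# Minimal vLLM batching sketch (assumed model id; swap in the INT8
# checkpoint you actually use). Measures rough tokens/sec per batch size.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-medium-128k-instruct",  # assumed HF id; use your quantized build
    # quantization="gptq",         # uncomment if loading a GPTQ-quantized checkpoint
    max_model_len=64_000,          # shorter than the 128K maximum -> less KV cache
    gpu_memory_utilization=0.90,   # leave a little VRAM slack on the 3090
)
params = SamplingParams(max_tokens=256, temperature=0.7)

for batch_size in (1, 3, 8):
    prompts = ["Explain KV caching in one paragraph."] * batch_size
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch_size}: {generated / elapsed:.1f} tokens/sec")
```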

Recommended Settings

Batch size: 3
Context length: 64,000 tokens
Other settings: enable CUDA graph capture, use PagedAttention, experiment with different scheduling algorithms in vLLM
Inference framework: vLLM
Suggested quantization: INT8
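
If you adopt these settings with vLLM, they map onto engine arguments roughly as sketched below: `max_num_seqs` caps the number of in-flight sequences (the effective batch size), PagedAttention is vLLM's default KV-cache management, and CUDA graph capture stays enabled unless `enforce_eager` is set. The model id is again an assumption.

```python
# Sketch: recommended settings expressed as vLLM engine arguments.
# Model id and quantized checkpoint are assumptions; adjust to your setup.
from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3-medium-128k-instruct",  # replace with your INT8 checkpoint
    max_model_len=64_000,   # recommended context length (below the 128K maximum)
    max_num_seqs=3,         # caps concurrent sequences, i.e. the effective batch size
    enforce_eager=False,    # keep CUDA graph capture enabled (the default)
    gpu_memory_utilization=0.90,
)
# PagedAttention is built into vLLM's KV-cache handling, so no extra flag is needed.
```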

Frequently Asked Questions

Is Phi-3 Medium 14B (14.00B) compatible with NVIDIA RTX 3090?
Yes, Phi-3 Medium 14B is fully compatible with the NVIDIA RTX 3090, especially when using INT8 quantization.
What VRAM is needed for Phi-3 Medium 14B (14.00B)?
With INT8 quantization, Phi-3 Medium 14B requires approximately 14GB of VRAM.
How fast will Phi-3 Medium 14B (14.00B) run on NVIDIA RTX 3090?
You can expect an estimated throughput of around 60 tokens/sec on the RTX 3090 with INT8 quantization and a reasonable batch size. Actual performance may vary depending on the specific implementation, context length, and other system factors.