Can I run Phi-3 Medium 14B (INT8, 8-bit integer) on NVIDIA RTX 3090 Ti?

Perfect
Yes, you can run this model!
GPU VRAM
24.0GB
Required
14.0GB
Headroom
+10.0GB

VRAM Usage

14.0GB of 24.0GB used (58%)

Performance Estimate

Tokens/sec ~60.0
Batch size 3
Context 128K tokens

Technical Analysis

The NVIDIA RTX 3090 Ti, with its 24GB of GDDR6X VRAM, provides ample memory for running the Phi-3 Medium 14B model with INT8 quantization. At roughly one byte per parameter, the INT8 weights alone occupy about 14GB, leaving around 10GB of headroom for the KV cache, activations, larger batch sizes, or other applications running concurrently. The 3090 Ti's 1.01 TB/s of memory bandwidth also matters: token-by-token generation is memory-bound, with each decoded token requiring the model weights to be streamed from VRAM, so bandwidth largely sets the ceiling on inference speed.
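As a sanity check on those numbers, the sketch below estimates weight memory and KV-cache growth at INT8. The layer count, KV-head count, and head dimension are assumptions taken from the published Phi-3 Medium configuration rather than values reported by this tool, so treat the output as an order-of-magnitude guide.

```python
# Back-of-envelope VRAM estimate for Phi-3 Medium 14B at INT8 on a 24GB card.
# Layer/head figures below are assumptions based on the published Phi-3 Medium
# config; adjust them if your checkpoint differs.

PARAMS = 14e9            # parameter count
BYTES_PER_WEIGHT = 1     # INT8 = 1 byte per weight
TOTAL_VRAM_GB = 24.0

N_LAYERS = 40            # assumed
N_KV_HEADS = 10          # assumed (grouped-query attention)
HEAD_DIM = 128           # assumed
KV_BYTES = 2             # FP16 K/V cache entries

weights_gb = PARAMS * BYTES_PER_WEIGHT / 1e9
headroom_gb = TOTAL_VRAM_GB - weights_gb

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
kv_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES
tokens_in_headroom = int(headroom_gb * 1e9 / kv_per_token)

print(f"weights  : ~{weights_gb:.1f} GB")
print(f"headroom : ~{headroom_gb:.1f} GB")
print(f"KV cache : ~{kv_per_token / 1e6:.2f} MB/token "
      f"-> ~{tokens_in_headroom:,} cached tokens fit in the headroom")
```

Under those assumptions only part of the full 128K window fits in the headroom at FP16 KV precision, which is why the context-length setting below is worth adjusting to what your application actually needs.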

Furthermore, the RTX 3090 Ti's Ampere architecture, equipped with 10752 CUDA cores and 336 Tensor cores, is well-suited for accelerating the matrix multiplications and other computations inherent in large language models. Tensor cores, specifically designed for deep learning workloads, significantly boost performance compared to using only CUDA cores. The estimated 60 tokens/sec performance is a reasonable expectation given the model size and GPU capabilities. However, actual performance can vary based on the specific inference framework used and the level of optimization applied.
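A quick way to see why roughly 60 tokens/sec is plausible is a memory-bandwidth roofline: during single-stream decoding essentially all of the INT8 weights must be read from VRAM for every generated token, so bandwidth divided by weight size bounds the decode rate. The efficiency factor in the sketch below is an assumption standing in for framework overhead, not a measured value.

```python
# Roofline-style decode estimate: single-stream generation is memory-bound,
# so tokens/sec cannot exceed bandwidth divided by bytes read per token.

BANDWIDTH_GB_S = 1008    # RTX 3090 Ti memory bandwidth (GB/s)
WEIGHTS_GB = 14.0        # INT8 weights for a 14B-parameter model
EFFICIENCY = 0.8         # assumed fraction of peak bandwidth actually achieved

ceiling = BANDWIDTH_GB_S / WEIGHTS_GB    # ~72 tok/s theoretical upper bound
realistic = ceiling * EFFICIENCY         # ~58 tok/s after framework overhead

print(f"theoretical ceiling : ~{ceiling:.0f} tok/s")
print(f"at {EFFICIENCY:.0%} efficiency : ~{realistic:.0f} tok/s")
```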

Recommendation

To maximize performance, use an optimized inference framework such as `llama.cpp` or `vLLM`, both of which are designed to exploit the RTX 3090 Ti's hardware effectively. Experiment with batch size to find the right balance between throughput and latency: a batch size of 3 is a sensible starting point, and raising it can improve aggregate tokens/sec but will also increase per-request latency. If your chosen framework supports speculative decoding, it is worth trying for a further boost.
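For example, a minimal `llama-cpp-python` setup along those lines might look like the sketch below, assuming you have a Q8_0 GGUF build of Phi-3 Medium (llama.cpp's closest analogue to INT8). The file name is a placeholder, and the context and batch values are starting points to tune rather than recommendations from this tool.

```python
# Minimal llama-cpp-python sketch for running a Q8_0 (8-bit) GGUF build of
# Phi-3 Medium fully on the GPU. The model filename is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="phi-3-medium-128k-instruct-Q8_0.gguf",  # hypothetical path
    n_gpu_layers=-1,   # offload every layer to the RTX 3090 Ti
    n_ctx=8192,        # start modest; grow toward 128K only as the KV cache allows
    n_batch=512,       # prompt-processing batch size, worth tuning
)

out = llm(
    "Explain KV caching in one paragraph.",
    max_tokens=256,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```

The same model served through vLLM exposes different knobs, but context length, batch size, and GPU memory budget play the same roles there.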

Also, keep the NVIDIA drivers up to date to benefit from the latest performance improvements and bug fixes, and monitor GPU utilization and memory usage during inference to identify potential bottlenecks. If you later need more headroom (for very long contexts or larger batches) or throughput is insufficient, consider quantizing further to INT4 or offloading some layers to the CPU, though CPU offload will noticeably reduce speed.
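One lightweight way to do that monitoring is with the NVML Python bindings; the sketch below polls VRAM usage and utilization once per second and assumes the 3090 Ti is GPU index 0.

```python
# Poll VRAM usage and GPU utilization once per second via the NVML bindings
# (pip install nvidia-ml-py). Assumes the RTX 3090 Ti is GPU index 0.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"VRAM {mem.used / 1e9:5.1f} / {mem.total / 1e9:.1f} GB | "
              f"GPU {util.gpu:3d}% | memory bus {util.memory:3d}%")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```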

Recommended Settings

Batch_Size
3 (experiment with higher values)
Context_Length
128000 (adjust based on application needs and performance)
Other_Settings
Enable CUDA graph capture if supported; use paged-attention mechanisms; update to the latest NVIDIA drivers; experiment with speculative decoding
Inference_Framework
llama.cpp or vLLM
Quantization_Suggested
INT8 (currently used; consider INT4 for lower VRAM usage)

Frequently Asked Questions

Is Phi-3 Medium 14B (14.00B) compatible with NVIDIA RTX 3090 Ti?
Yes, Phi-3 Medium 14B is fully compatible with the NVIDIA RTX 3090 Ti, especially when using INT8 quantization.
What VRAM is needed for Phi-3 Medium 14B (14.00B)?
Phi-3 Medium 14B requires approximately 14GB of VRAM when quantized to INT8.
How fast will Phi-3 Medium 14B (14.00B) run on NVIDIA RTX 3090 Ti?
You can expect an estimated performance of around 60 tokens/sec on the RTX 3090 Ti, but this can vary based on the inference framework and settings used.