Can I run Phi-3 Medium 14B on NVIDIA RTX 3090 Ti?

Fail/OOM
This GPU doesn't have enough VRAM
GPU VRAM: 24.0 GB
Required: 28.0 GB
Headroom: -4.0 GB

VRAM Usage: 100% used (24.0 GB of 24.0 GB)

Technical Analysis

The NVIDIA RTX 3090 Ti, with its 24GB of GDDR6X VRAM, falls short of the roughly 28GB needed to run Phi-3 Medium 14B in FP16 (half precision). The full set of FP16 weights simply cannot be loaded into GPU memory, so inference will fail with an out-of-memory error unless the model is optimized first. While the 3090 Ti offers substantial memory bandwidth (about 1.01 TB/s) and a large complement of CUDA and Tensor cores, the primary bottleneck here is insufficient VRAM capacity, not compute.
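
For intuition, the 28GB figure comes straight from the parameter count: roughly 14 billion parameters at 2 bytes each in FP16. A quick sketch of that arithmetic (weights only; the KV cache and activation buffers add more on top):

```python
# Back-of-envelope FP16 memory estimate for Phi-3 Medium 14B.
# Assumes 14.0B parameters at 2 bytes each; actual usage is higher
# once the KV cache and activation buffers are included.
params = 14.0e9
bytes_per_param_fp16 = 2

weights_gb = params * bytes_per_param_fp16 / 1e9
print(f"FP16 weights alone: {weights_gb:.1f} GB")                       # ~28.0 GB
print(f"RTX 3090 Ti VRAM:   24.0 GB -> shortfall {weights_gb - 24.0:.1f} GB")
```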

Even if most of the model were squeezed into the available VRAM, the negative headroom would force constant swapping of weights between the GPU and system memory over PCIe, which is far slower than accessing the card's own VRAM. That translates into lower tokens per second and a smaller effective batch size, making real-time or interactive use impractical. High memory bandwidth cannot compensate for the fundamental shortage of VRAM needed to hold the entire model.

Recommendation

To run Phi-3 Medium 14B on your RTX 3090 Ti, you'll need to use quantization. Quantization shrinks the model's memory footprint by representing its weights (and, in some schemes, activations) with fewer bits. A level such as Q4_K_M, or lower (e.g., Q2_K) if necessary, reduces the VRAM requirement dramatically and brings the model comfortably within the 3090 Ti's 24GB capacity.
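
To get a feel for how much quantization helps, here is a rough estimate of the weight footprint at a few common GGUF quantization levels. The bits-per-weight values are approximations (real GGUF files mix block formats, and the KV cache is not included), so treat these as ballpark numbers rather than guarantees:

```python
# Approximate weight-only footprint of a 14B model at common GGUF
# quantization levels. Bits-per-weight values are rough averages;
# actual file sizes vary and the KV cache/activations are excluded.
params = 14.0e9
approx_bits_per_weight = {
    "FP16":   16.0,
    "Q5_K_M":  5.7,   # approximation
    "Q4_K_M":  4.8,   # approximation
    "Q3_K_M":  3.9,   # approximation
    "Q2_K":    2.6,   # approximation
}

for level, bpw in approx_bits_per_weight.items():
    gb = params * bpw / 8 / 1e9
    fits = "fits" if gb < 24.0 else "does NOT fit"
    print(f"{level:>7}: ~{gb:5.1f} GB  ({fits} in 24 GB, before KV cache)")
```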

Alternatively, you can offload some of the model's layers to system RAM; llama.cpp supports this, but it noticeably reduces inference speed, so exhaust the quantization options first. If performance is critical and quantization alone is insufficient, consider a GPU with more than 24GB of VRAM, such as a 48GB RTX A6000 or a data-center card like the A100; note that an RTX 4090 also has 24GB and would hit the same FP16 limit.
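
As a rough illustration, a quantized build could be loaded with full or partial GPU offload through the llama-cpp-python bindings. This is a minimal sketch: the GGUF filename is hypothetical, and n_gpu_layers should be lowered if the chosen quantization still exceeds 24GB so the remaining layers run from system RAM as described above:

```python
# Sketch: loading a quantized Phi-3 Medium GGUF with llama-cpp-python.
# The model filename is hypothetical; download a real Q4_K_M GGUF build first.
from llama_cpp import Llama

llm = Llama(
    model_path="phi-3-medium-14b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,  # -1 offloads every layer to the GPU; lower it if VRAM runs out
    n_ctx=4096,       # matches the recommended context length
)

output = llm(
    "Summarize what GGUF quantization does in one sentence.",
    max_tokens=64,
)
print(output["choices"][0]["text"])
```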

Recommended Settings

Batch Size: 1
Context Length: 4096
Other Settings:
- Experiment with different quantization levels (Q5_K_M, Q3_K_M, Q2_K) to find the best balance between VRAM usage and performance.
- Enable GPU acceleration in llama.cpp.
- Reduce context length to minimize VRAM usage, if possible.
- Monitor VRAM usage closely during inference to avoid out-of-memory errors (see the monitoring sketch after this settings block).
Inference Framework: llama.cpp
Suggested Quantization: Q4_K_M
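
For the VRAM-monitoring suggestion above, here is a minimal sketch using the nvidia-ml-py (pynvml) bindings; it assumes the RTX 3090 Ti is GPU index 0:

```python
# Minimal VRAM usage check via NVML (pip install nvidia-ml-py).
# Run this periodically (or in a background thread) while inference is active.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the 3090 Ti is device 0
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM used: {mem.used / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB")
pynvml.nvmlShutdown()
```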

Frequently Asked Questions

Is Phi-3 Medium 14B (14.00B) compatible with NVIDIA RTX 3090 Ti?
Not directly. The RTX 3090 Ti's 24GB VRAM is insufficient for the model's 28GB FP16 requirement. Quantization or layer offloading is needed.
What VRAM is needed for Phi-3 Medium 14B (14.00B)?
The Phi-3 Medium 14B model requires approximately 28GB of VRAM in FP16 precision. Quantization can significantly reduce this requirement.
How fast will Phi-3 Medium 14B (14.00B) run on NVIDIA RTX 3090 Ti?
Performance will depend heavily on the quantization level and other optimization techniques used. Expect reduced tokens per second compared to running the model in FP16 on a GPU with sufficient VRAM. Experimentation is needed to find optimal settings.