Can I run Phi-3 Medium 14B (q3_k_m) on NVIDIA RTX 3090?

Perfect
Yes, you can run this model!
GPU VRAM: 24.0 GB
Required: 5.6 GB
Headroom: +18.4 GB

VRAM Usage

5.6 GB of 24.0 GB used (23%)

Performance Estimate

Tokens/sec: ~60.0
Batch size: 6
Context: 128,000 tokens

Technical Analysis

The NVIDIA RTX 3090, with its 24GB of GDDR6X VRAM, is exceptionally well suited to running the Phi-3 Medium 14B model, especially when employing quantization. The q3_k_m quantization reduces the model's VRAM footprint to approximately 5.6GB, leaving a substantial 18.4GB of headroom. This surplus allows for comfortable operation, accommodating larger batch sizes and extended context lengths without encountering memory constraints. The RTX 3090's high memory bandwidth (0.94 TB/s) ensures rapid data transfer between the GPU cores and VRAM, preventing bottlenecks during inference. Furthermore, the abundance of CUDA cores (10,496) and Tensor Cores (328) accelerates the matrix multiplications and other computationally intensive operations inherent in LLM inference.
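
To make the arithmetic behind the card above concrete, here is a minimal Python sketch. The bits-per-weight figures are rough assumptions for GGUF quantization levels, not the exact values this page uses, so they will not precisely reproduce the 5.6GB estimate, which also depends on the calculator's overhead model.

```python
# Rough VRAM arithmetic for a quantized model (a sketch; the bits-per-weight
# values are approximate assumptions, not the figures this page uses).
APPROX_BITS_PER_WEIGHT = {"q3_k_m": 3.9, "q4_k_m": 4.8, "q8_0": 8.5}

def weight_size_gb(params_billions: float, quant: str) -> float:
    """Size of the quantized weights alone, in decimal GB."""
    return params_billions * 1e9 * APPROX_BITS_PER_WEIGHT[quant] / 8 / 1e9

def headroom_gb(gpu_vram_gb: float, required_gb: float) -> float:
    """Free VRAM after loading the model, as reported in the card above."""
    return gpu_vram_gb - required_gb

print(f"q3_k_m weights for a 14B model: ~{weight_size_gb(14.0, 'q3_k_m'):.1f} GB")
print(f"Headroom on a 24 GB card with 5.6 GB required: +{headroom_gb(24.0, 5.6):.1f} GB")
# Note: the KV cache and activations add to this as context length and batch size grow.
```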

Recommendation

Given the ample VRAM and computational power of the RTX 3090, users should prioritize maximizing throughput and response quality. Experiment with larger batch sizes (up to 6) to improve tokens/sec. While the full 128,000-token context length is supported, consider the specific use case: for tasks that do not require such extensive context, reducing the context length shrinks the KV cache and can further improve inference speed. Additionally, explore different inference frameworks; llama.cpp is a solid starting point for its flexibility and broad compatibility, but vLLM or TensorRT-LLM may offer further speed improvements.
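
As a minimal sketch of the llama.cpp route via the llama-cpp-python bindings: the model filename, prompt, and reduced context size below are assumptions rather than values from this page, and n_batch here is the prompt-processing batch in tokens, a different knob from the batch size of 6 quoted above.

```python
from llama_cpp import Llama

# Minimal llama-cpp-python sketch; the model path is hypothetical and the
# context size is reduced from the 128K maximum to cut latency and KV-cache use.
llm = Llama(
    model_path="Phi-3-medium-128k-instruct-Q3_K_M.gguf",  # assumed filename
    n_gpu_layers=-1,   # offload all layers to the RTX 3090
    n_ctx=8192,        # shorter context than the 128K maximum
    n_batch=512,       # prompt-processing batch size in tokens (tune while watching VRAM)
)

out = llm("Explain quantization in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])
```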

Recommended Settings

Batch size: 6
Context length: 128,000 tokens
Inference framework: llama.cpp
Suggested quantization: q3_k_m
Other settings:
- Enable CUDA acceleration
- Experiment with different quantization levels if needed
- Monitor GPU utilization to fine-tune batch size (see the monitoring sketch after this list)
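
For the "monitor GPU utilization" suggestion, a small pynvml polling loop is one way to watch VRAM and GPU load while tuning batch size; the device index and sampling window below are placeholders.

```python
import time
import pynvml  # NVIDIA Management Library bindings (pip install nvidia-ml-py)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the RTX 3090 is GPU 0

try:
    for _ in range(10):  # sample for ~10 seconds while inference runs elsewhere
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"VRAM {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB, GPU util {util.gpu}%")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```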

Frequently Asked Questions

Is Phi-3 Medium 14B compatible with the NVIDIA RTX 3090?
Yes, Phi-3 Medium 14B is fully compatible with the NVIDIA RTX 3090, especially when using quantization.
What VRAM is needed for Phi-3 Medium 14B?
With q3_k_m quantization, Phi-3 Medium 14B requires approximately 5.6GB of VRAM.
How fast will Phi-3 Medium 14B run on the NVIDIA RTX 3090?
You can expect an estimated performance of around 60 tokens/sec with a batch size of 6, though this can vary with the inference framework and settings used.
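
If you want to verify the estimate on your own hardware, a rough tokens-per-second check with llama-cpp-python might look like the sketch below; the model path and prompt are placeholders, and real throughput depends on context length, batch size, and the framework used.

```python
import time
from llama_cpp import Llama

# Rough throughput check (a sketch; model path and prompt are placeholders).
llm = Llama(model_path="Phi-3-medium-128k-instruct-Q3_K_M.gguf",
            n_gpu_layers=-1, n_ctx=4096)

start = time.time()
out = llm("Write a short story about a robot.", max_tokens=256)
elapsed = time.time() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/sec")
```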