Can I run Qwen 2.5 72B (q3_k_m) on NVIDIA RTX 3090 Ti?

Fail/OOM
This GPU doesn't have enough VRAM
GPU VRAM: 24.0GB
Required: 28.8GB
Headroom: -4.8GB

VRAM Usage: 100% of 24.0GB used (requirement exceeds capacity)

Technical Analysis

The NVIDIA RTX 3090 Ti, with its 24GB of GDDR6X VRAM, faces a challenge when running the Qwen 2.5 72B model, even with quantization. While the model's original FP16 precision demands a hefty 144GB of VRAM, quantizing to q3_k_m reduces this requirement to approximately 28.8GB. However, this still exceeds the RTX 3090 Ti's available VRAM by 4.8GB. This VRAM shortfall will prevent the model from loading and running directly on the GPU. The RTX 3090 Ti's 1.01 TB/s memory bandwidth and substantial CUDA and Tensor core counts would otherwise contribute to decent inference speeds if the model fit within the available memory.
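The figures above follow from simple arithmetic on the parameter count. The sketch below is a rough back-of-the-envelope estimate, assuming q3_k_m averages about 3.2 bits per weight and ignoring KV-cache and runtime overhead, so real usage will be somewhat higher.

```python
# Rough VRAM estimate for a 72B-parameter model at different precisions.
# Assumption: q3_k_m averages ~3.2 bits per weight; KV-cache and runtime
# overhead are ignored, so actual usage is slightly higher.

PARAMS = 72e9  # parameter count
GB = 1e9       # the page reports decimal gigabytes

def weights_vram_gb(bits_per_weight: float) -> float:
    """VRAM needed for the model weights alone, in GB."""
    return PARAMS * bits_per_weight / 8 / GB

print(f"FP16:   {weights_vram_gb(16):.1f}GB")   # ~144.0GB
print(f"q3_k_m: {weights_vram_gb(3.2):.1f}GB")  # ~28.8GB
print(f"Headroom on a 24GB card: {24 - weights_vram_gb(3.2):.1f}GB")  # ~-4.8GB
```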

Recommendation

Due to the VRAM limitation, running Qwen 2.5 72B (q3_k_m) entirely on the RTX 3090 Ti is not feasible. Consider offloading some layers to system RAM (CPU) using llama.cpp, although this will significantly reduce inference speed; a configuration sketch follows the Recommended Settings below. Alternatively, use a smaller Qwen variant or another model with similar capabilities and a lower VRAM footprint. A further option is a cloud-based GPU service offering instances with enough VRAM to hold the model.

Recommended Settings

Batch Size: 1
Context Length: Reduce context length to minimize KV-cache memory usage
Other Settings:
- Use --threads to maximize CPU utilization
- Use --mlock to keep the loaded weights resident in RAM (prevents swapping)
- Experiment with how many layers are offloaded to the GPU via --n-gpu-layers
Inference Framework: llama.cpp (with CPU offloading)
Quantization Suggested: q2_k or another quantization more aggressive than q3_k_m, if available and the quality loss is acceptable
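The settings above can be applied through the llama-cpp-python bindings to llama.cpp. This is a minimal sketch, not a tested configuration: the model filename is hypothetical, and the n_gpu_layers value is an illustrative starting point that should be tuned downward until the GPU-resident layers fit within 24GB.

```python
# Sketch: partial GPU offload with the llama-cpp-python bindings.
# The model path and n_gpu_layers value are illustrative assumptions;
# lower n_gpu_layers until the offloaded layers fit in 24GB of VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-72b-instruct-q3_k_m.gguf",  # hypothetical local filename
    n_gpu_layers=55,   # offload only as many layers as fit in VRAM; the rest run on CPU
    n_ctx=4096,        # reduced context length to limit KV-cache memory
    n_threads=16,      # match your physical CPU core count
    use_mlock=True,    # lock CPU-resident weights in RAM to avoid swapping
)

out = llm("Explain KV-cache memory usage in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Expect throughput to be bound by system RAM bandwidth for the CPU-resident layers, so generation will be far slower than on a GPU that holds the full model.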

Frequently Asked Questions

Is Qwen 2.5 72B compatible with the NVIDIA RTX 3090 Ti?
No, Qwen 2.5 72B, even when quantized to q3_k_m, requires more VRAM (28.8GB) than the RTX 3090 Ti provides (24GB).
What VRAM is needed for Qwen 2.5 72B?
The VRAM required depends on the precision and quantization level. In FP16, it needs 144GB. Quantized to q3_k_m, it requires approximately 28.8GB.
How fast will Qwen 2.5 72B run on the NVIDIA RTX 3090 Ti?
Due to insufficient VRAM, the model cannot run entirely on the RTX 3090 Ti. With layers offloaded to the CPU it will run, but significantly slower than on a GPU with enough VRAM to hold the whole model.