Can I run Qwen 2.5 72B (q3_k_m) on NVIDIA RTX 3090?

Result: Fail / OOM. This GPU doesn't have enough VRAM.

GPU VRAM: 24.0 GB
Required: 28.8 GB
Headroom: -4.8 GB

VRAM Usage: 24.0 GB of 24.0 GB (100% used)

Technical Analysis

The NVIDIA RTX 3090, with its 24GB of GDDR6X VRAM, cannot accommodate the Qwen 2.5 72B model, even in its quantized q3_k_m form. Quantization brings the model's VRAM footprint down to roughly 28.8GB, which still exceeds the RTX 3090's capacity by 4.8GB. Representing weights with fewer bits shrinks the memory footprint, but it does not remove the need to hold the entire model in VRAM for efficient inference. The RTX 3090's high memory bandwidth of 0.94 TB/s helps once a model fits, but insufficient VRAM remains the primary bottleneck here.
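
As a sanity check, the 28.8GB figure can be reproduced with a back-of-the-envelope, weights-only estimate (parameters × bits per weight ÷ 8). The sketch below assumes an effective ~3.2 bits per weight for q3_k_m, chosen to match the figure above; a real GGUF file also carries overhead for embeddings, the KV cache, and runtime buffers.

    # Rough, weights-only VRAM estimate; bits_per_weight is an assumed effective value.
    def estimate_vram_gb(n_params: float, bits_per_weight: float) -> float:
        return n_params * bits_per_weight / 8 / 1e9

    required_gb = estimate_vram_gb(72e9, 3.2)  # ~28.8 GB for the quantized weights alone
    print(f"estimated: {required_gb:.1f} GB, fits in 24 GB: {required_gb <= 24.0}")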

Due to the VRAM limitation, direct single-GPU inference is not possible: the model's parameters cannot fit entirely on the RTX 3090. Attempting to load the full model will produce out-of-memory errors unless layers are offloaded to system RAM, which significantly degrades performance. The RTX 3090's 10496 CUDA cores and 328 Tensor cores are powerful, but they cannot be kept fully utilized when the model does not reside in VRAM.
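
Before attempting a load, it is also worth confirming how much VRAM is actually free, since the desktop environment and other processes typically reserve part of the 24GB. A minimal check with PyTorch (assuming a CUDA-enabled build is installed) might look like this:

    # Minimal free-VRAM check; the 28.8 GB requirement is taken from the analysis above.
    import torch

    free_bytes, total_bytes = torch.cuda.mem_get_info()  # (free, total) on the current device
    print(f"free: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")
    if free_bytes / 1e9 < 28.8:
        print("Full-GPU inference will OOM; plan for CPU offloading or a smaller model.")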

Recommendation

Given the VRAM constraints, running the Qwen 2.5 72B model on a single RTX 3090 is not feasible. Consider these options:

1) Offload some layers to system RAM, accepting a substantial performance slowdown.
2) Use a multi-GPU setup if possible, distributing the model across GPUs with sufficient combined VRAM.
3) Explore more aggressive quantization, such as Q2 or lower, although this will impact model accuracy.
4) Use a smaller model variant, such as Qwen 2.5 14B, which fits within the RTX 3090's VRAM.
5) Leverage cloud-based inference services that offer sufficient GPU resources.

If offloading layers to system RAM is the only option, experiment with different layer configurations to minimize the performance impact. Use an inference framework that supports efficient CPU offloading, and prioritize offloading the less compute-intensive layers to limit the penalty. Expect significantly lower tokens per second than running the entire model on the GPU.
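
One way to run that experiment is to sweep the number of GPU-resident layers and time a short generation at each setting. The sketch below uses llama-cpp-python's n_gpu_layers parameter; the model filename and the candidate layer counts are illustrative assumptions, and the try/except simply reports the out-of-memory failures that occur when too many layers are kept on the GPU.

    # Hedged sketch: sweep the GPU/CPU layer split and report rough tokens/second.
    import time
    from llama_cpp import Llama

    MODEL_PATH = "qwen2.5-72b-instruct-q3_k_m.gguf"  # hypothetical local filename

    for n_gpu_layers in (20, 30, 40):  # candidate numbers of layers to keep on the GPU
        try:
            llm = Llama(model_path=MODEL_PATH, n_gpu_layers=n_gpu_layers,
                        n_ctx=2048, verbose=False)
            start = time.time()
            out = llm("Briefly explain quantization.", max_tokens=64)
            tok_s = out["usage"]["completion_tokens"] / (time.time() - start)
            print(f"n_gpu_layers={n_gpu_layers}: {tok_s:.2f} tok/s")
            del llm  # release the model before trying the next split
        except Exception as exc:  # e.g. CUDA out-of-memory when too many layers stay on the GPU
            print(f"n_gpu_layers={n_gpu_layers}: failed ({exc})")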

Recommended Settings

Batch size: 1 (to minimize VRAM usage)
Context length: reduce to the minimum acceptable for your use case
Other settings: enable CPU offloading (if using llama.cpp); experiment with different numbers of layers offloaded to CPU; monitor VRAM usage closely to avoid out-of-memory errors
Inference framework: llama.cpp (for CPU offloading) or potentially vLLM
Quantization suggested: Q2_K or lower (be mindful of accuracy trade-offs)
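
Put together, a llama-cpp-python configuration following these settings might look like the sketch below; the filename and the specific n_gpu_layers value are assumptions to be tuned against observed VRAM usage.

    # Illustrative configuration only; tune n_gpu_layers so VRAM stays below 24 GB.
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen2.5-72b-instruct-q3_k_m.gguf",  # hypothetical filename
        n_gpu_layers=35,  # partial offload; the remaining layers run on the CPU
        n_ctx=1024,       # reduced context length to save VRAM
        n_batch=1,        # minimal batch size, as recommended above
    )
    print(llm("Hello", max_tokens=16)["choices"][0]["text"])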

Frequently Asked Questions

Is Qwen 2.5 72B (72B) compatible with NVIDIA RTX 3090?
No, the Qwen 2.5 72B model, even in q3_k_m quantized form, requires more VRAM than the RTX 3090 offers.
What VRAM is needed for Qwen 2.5 72B (72B)?
The q3_k_m quantized version of Qwen 2.5 72B requires approximately 28.8GB of VRAM.
How fast will Qwen 2.5 72B (72B) run on NVIDIA RTX 3090?
It will likely not run at all without modifications due to insufficient VRAM. If you offload layers to CPU, expect significantly reduced performance, potentially single-digit tokens per second.