Can I run Qwen 2.5 72B (q3_k_m) on NVIDIA RTX 4090?

Result: Fail/OOM. This GPU doesn't have enough VRAM.

GPU VRAM: 24.0GB
Required: 28.8GB
Headroom: -4.8GB

VRAM Usage: 24.0GB of 24.0GB (100% used)

Technical Analysis

The Qwen 2.5 72B model, even when quantized to q3_k_m, requires 28.8GB of VRAM to operate. The NVIDIA RTX 4090, while a powerful GPU, has 24GB of VRAM, leaving a shortfall of 4.8GB and making it impossible to load the entire model onto the GPU for inference. The model's 72 billion parameters demand substantial memory for weights, the KV cache, and activations during computation. The q3_k_m quantization (roughly 3.2 bits per weight on average, versus 16 bits for FP16, which would require 144GB) greatly reduces the footprint, but the result is still larger than the RTX 4090's VRAM capacity.
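
As a rough sanity check, the requirement can be estimated from the parameter count and the average bits per weight. The sketch below uses approximate bits-per-weight figures (exact sizes vary by model and GGUF build), and KV-cache and runtime overhead come on top of the weights.

```python
# Rough VRAM estimate: parameters * (bits per weight) / 8, weights only.
# The bits-per-weight values are approximations, not exact GGUF figures.
PARAMS_B = 72.0  # Qwen 2.5 72B

BITS_PER_WEIGHT = {
    "fp16":   16.0,
    "q4_k_m":  4.8,  # approximate
    "q3_k_m":  3.2,  # approximate; matches the 28.8GB figure above
    "q2_k":    2.6,  # approximate
}

def estimate_weight_gb(params_billions: float, quant: str) -> float:
    """Estimate weight memory in GB for a given quantization level."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    needed = estimate_weight_gb(PARAMS_B, quant)
    verdict = "fits" if needed <= 24.0 else "does not fit"
    print(f"{quant:>7}: ~{needed:5.1f} GB weights -> {verdict} in 24 GB (before KV cache)")
```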

Furthermore, even if the model did fit into the available VRAM (which it does not), inference would be constrained by the RTX 4090's 1.01 TB/s memory bandwidth: large language models like Qwen 2.5 72B move essentially all of their weights between memory and compute units for every generated token. The VRAM shortfall makes this worse, because the only way to run the model at all is to offload part of it to system RAM over PCIe, which is far slower and degrades performance severely. The 16384 CUDA cores and 512 Tensor Cores of the RTX 4090 would sit largely idle waiting on memory.
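
To see why offloading is so slow, note that token-by-token decoding is largely memory-bandwidth-bound: each generated token requires reading roughly all of the weights once. The back-of-the-envelope sketch below assumes the weights are streamed entirely from a single memory pool, and the PCIe and DDR5 bandwidth figures are typical approximations, so treat the results as order-of-magnitude ceilings rather than measured throughput.

```python
# Upper-bound decode speed if every token requires one full pass over the weights.
# Real throughput is lower; these are order-of-magnitude estimates only.
MODEL_GB = 28.8  # q3_k_m weights, from the analysis above

BANDWIDTH_GBPS = {
    "RTX 4090 GDDR6X":   1010.0,  # ~1.01 TB/s on-card bandwidth
    "Dual-channel DDR5":   80.0,  # approximate system RAM bandwidth
    "PCIe 4.0 x16":        32.0,  # the path used when weights live in system RAM
}

for pool, bandwidth in BANDWIDTH_GBPS.items():
    ceiling = bandwidth / MODEL_GB  # tokens/s if fully bound by this pool
    print(f"{pool:>18}: <= {ceiling:5.1f} tokens/s (bandwidth-bound ceiling)")
```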

Recommendation

Unfortunately, running the q3_k_m quantized Qwen 2.5 72B model on a single RTX 4090 is not feasible due to the VRAM limitation. To run this model you need at least 28.8GB of VRAM, either on one larger GPU or spread across several. The main alternatives are: a more aggressive quantization such as q2_k or lower, which shrinks the footprint further at the cost of accuracy; model parallelism, where the model is split across multiple GPUs that each handle a portion of the layers; CPU offloading, which trades most of the speed for the ability to load the model at all; or simply a smaller model. For local use, consider Qwen 2.5 14B or a quantized Qwen 2.5 32B, which fit within the RTX 4090's 24GB of VRAM.
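
If two or more GPUs are available, the model-parallel route with vLLM could look roughly like the sketch below. The checkpoint name, memory settings, and the assumption of two 24GB cards with a 4-bit AWQ build are illustrative; even then a 72B model is tight, so a short context length is assumed.

```python
from vllm import LLM, SamplingParams

# Illustrative only: assumes two 24GB GPUs and a 4-bit AWQ build of the model.
# Even then, 72B is tight; a short max_model_len keeps the KV cache small.
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",  # assumed quantized checkpoint
    tensor_parallel_size=2,                 # split the layers across 2 GPUs
    max_model_len=4096,
    gpu_memory_utilization=0.95,
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain KV-cache memory in one paragraph."], params)
print(outputs[0].outputs[0].text)
```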

Recommended Settings

Batch Size: 1 (if CPU offloading is used; otherwise irrelevant, since the model cannot be fully loaded)
Context Length: reduce to the minimum your use case requires (a shorter context keeps the KV cache small)
Other Settings: enable CPU offloading in llama.cpp to move some layers to system RAM (very slow); use model parallelism with vLLM if you have multiple GPUs available
Inference Framework: llama.cpp (for CPU offloading; see the sketch after this list) or vLLM (if using multiple GPUs)
Quantization Suggested: q2_k or lower (if the accuracy loss is acceptable)
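
For the single-GPU CPU-offload route, a minimal sketch using the llama-cpp-python bindings is shown below. The GGUF file name and the number of offloaded layers are assumptions to be tuned for your system; expect low single-digit tokens per second once a large share of the layers lives in system RAM.

```python
from llama_cpp import Llama

# Illustrative partial-offload config: the file name and layer count are assumptions.
# n_gpu_layers controls how many transformer layers go to VRAM; the rest stay in RAM.
llm = Llama(
    model_path="qwen2.5-72b-instruct-q3_k_m.gguf",  # assumed local GGUF file
    n_gpu_layers=40,   # tune down until the model loads without an OOM error
    n_ctx=2048,        # keep the context small to limit KV-cache memory
    n_batch=256,
)

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```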

Frequently Asked Questions

Is Qwen 2.5 72B (72B) compatible with NVIDIA RTX 4090?
No, the Qwen 2.5 72B model, even when quantized to q3_k_m, is not compatible with the NVIDIA RTX 4090 due to insufficient VRAM.
What VRAM is needed for Qwen 2.5 72B (72B)?
The Qwen 2.5 72B model requires at least 28.8GB of VRAM when quantized to q3_k_m. FP16 precision would require 144GB.
How fast will Qwen 2.5 72B (72B) run on NVIDIA RTX 4090?
It will not run fully on the NVIDIA RTX 4090: loading the whole model results in an out-of-memory error due to insufficient VRAM. With CPU offloading it may run, but only at a few tokens per second at best.