Can I run Qwen 2.5 72B (Q4_K_M, GGUF 4-bit) on an NVIDIA RTX 4090?

Result: Fail / OOM (this GPU does not have enough VRAM)
GPU VRAM: 24.0 GB
Required: 36.0 GB
Headroom: -12.0 GB

Technical Analysis

The NVIDIA RTX 4090, with its 24 GB of GDDR6X VRAM, falls 12 GB short of the 36 GB required to run Qwen 2.5 72B (72.00B) quantized to Q4_K_M. The full set of model weights cannot be loaded onto the GPU, so inference cannot proceed. While the RTX 4090 offers high memory bandwidth (about 1.01 TB/s) and a large number of CUDA and Tensor cores, those specifications do not matter once the model exceeds the available VRAM: attempting to load it in this configuration will fail with out-of-memory errors as the runtime tries to allocate memory for the model's weights and activations.
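The 36 GB figure follows directly from parameter count times bits per weight. The minimal Python sketch below reproduces that arithmetic; it counts weights only (no KV cache or activation overhead) and uses the nominal 4 bits per weight, whereas real Q4_K_M GGUF files land closer to 4.5-5 bits per weight, so actual usage would be somewhat higher.

# Rough weight-only VRAM estimate, matching the 36.0 GB / -12.0 GB figures above.
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

required = weight_vram_gb(72.0, 4.0)    # nominal 4-bit quantization -> 36.0 GB
available = 24.0                        # RTX 4090 VRAM
print(f"required ~{required:.1f} GB, headroom {available - required:+.1f} GB")
# prints: required ~36.0 GB, headroom -12.0 GB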

Recommendation

Because of the VRAM shortfall, running Qwen 2.5 72B (72.00B) on a single RTX 4090 is not feasible. If more GPUs are available, consider model parallelism, splitting the model's layers across cards to spread the VRAM load; two 24 GB cards give 48 GB, comfortably above the 36 GB estimate. Alternatively, a more aggressive quantization such as Q2_K reduces the footprint at the cost of accuracy, although published Q2_K builds of 72B-class models still come to roughly 30 GB, so partial CPU offload would likely remain necessary. For a single RTX 4090, the practical option is a smaller model whose quantized weights fit within the 24 GB limit (for example, a 32B-class model at Q4_K_M comes to roughly 20 GB).
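As a rough way to weigh these options, the sketch below checks which nominal quantization widths could fit on a single 24 GB card and how many cards a naive split would need. The bits-per-weight values are nominal and the 10% overhead allowance is an assumption; real K-quant GGUF files carry extra scale metadata and come out larger than these optimistic estimates.

# Feasibility check under nominal (optimistic) bits-per-weight assumptions.
PARAMS_B = 72.0          # parameter count, in billions
GPU_VRAM_GB = 24.0       # one RTX 4090
OVERHEAD = 0.10          # assumed slack for KV cache, activations, fragmentation

def weights_gb(bits_per_weight: float) -> float:
    return PARAMS_B * bits_per_weight / 8

for name, bits in [("Q4 (nominal)", 4.0), ("Q3 (nominal)", 3.0), ("Q2 (nominal)", 2.0)]:
    need = weights_gb(bits) * (1 + OVERHEAD)
    cards = int(-(-need // GPU_VRAM_GB))          # ceiling division
    fits = "fits" if need <= GPU_VRAM_GB else "does not fit"
    print(f"{name}: ~{need:.1f} GB -> {fits} on one card, ~{cards} card(s) if split")

Even under these optimistic numbers only the nominal 2-bit case squeezes onto a single card, which is why real Q2_K builds, being larger, tend to land at or above the 24 GB line.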

Recommended Settings

Batch Size: 1 (only relevant if a lower quantization or offload setup loads at all)
Context Length: Reduce the context length to shrink the KV cache and free up VRAM
Other Settings: Offload some layers to CPU (very slow; see the sketch after this list for picking a layer count); enable memory optimizations in llama.cpp
Inference Framework: llama.cpp
Quantization Suggested: Q2_K or lower (if available and the accuracy loss is acceptable)
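For the CPU-offload setting above, a starting value for llama.cpp's n-gpu-layers (-ngl) option can be estimated from the per-layer weight size. In the sketch below, the 80-layer count for Qwen 2.5 72B and the 4 GB reserve for KV cache and runtime buffers are assumptions to refine against real runs.

# Back-of-the-envelope starting point for llama.cpp's -ngl (n-gpu-layers).
WEIGHTS_GB = 36.0        # estimated Q4_K_M weight footprint from above
N_LAYERS = 80            # Qwen 2.5 72B transformer layer count (assumed here)
GPU_VRAM_GB = 24.0       # RTX 4090
RESERVE_GB = 4.0         # KV cache, scratch buffers, embeddings; tune empirically

per_layer_gb = WEIGHTS_GB / N_LAYERS                         # ~0.45 GB per layer
gpu_layers = min(int((GPU_VRAM_GB - RESERVE_GB) / per_layer_gb), N_LAYERS)
print(f"~{per_layer_gb:.2f} GB/layer -> try -ngl {gpu_layers}")
# prints: ~0.45 GB/layer -> try -ngl 44

Whatever does not fit on the GPU (here roughly 36 layers) streams from system RAM on every token, which is where the "very slow" warning comes from.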

Frequently Asked Questions

Is Qwen 2.5 72B (72.00B) compatible with NVIDIA RTX 4090?
No, Qwen 2.5 72B (72.00B) is not directly compatible with the NVIDIA RTX 4090 due to insufficient VRAM.
What VRAM is needed for Qwen 2.5 72B (72.00B)?
Qwen 2.5 72B (72.00B) requires at least 36GB of VRAM when quantized to Q4_K_M. FP16 requires 144GB.
How fast will Qwen 2.5 72B (72.00B) run on NVIDIA RTX 4090?
It will not run on the RTX 4090 in the tested configuration. Even with aggressive quantization or offloading, performance will likely be very slow if it runs at all.
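To put a rough number on "very slow": with partial offload, every generated token has to read the CPU-resident weights from system RAM, so memory bandwidth caps throughput. The 16 GB CPU-resident share and 60 GB/s of usable bandwidth below are illustrative assumptions, not measurements.

# Roofline-style upper bound on generation speed with partial CPU offload:
# each token reads the CPU-resident weights once, so RAM bandwidth is the cap.
cpu_weights_gb = 16.0     # portion of the quantized weights left in system RAM (assumed)
ram_bandwidth_gbs = 60.0  # usable system memory bandwidth, DDR5 (assumed)

max_tokens_per_s = ram_bandwidth_gbs / cpu_weights_gb
print(f"CPU-offloaded share caps generation at ~{max_tokens_per_s:.1f} tokens/s")
# prints: CPU-offloaded share caps generation at ~3.8 tokens/s

Even this bound ignores the GPU-side work and PCIe synchronization, so observed speed would typically be lower.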