Can I run Qwen 2.5 72B on NVIDIA RTX 3090 Ti?

Result: Fail/OOM. This GPU doesn't have enough VRAM.

GPU VRAM: 24.0GB
Required: 144.0GB
Headroom: -120.0GB

VRAM Usage: 100% of 24.0GB used

Technical Analysis

The NVIDIA RTX 3090 Ti, equipped with 24GB of GDDR6X VRAM and a memory bandwidth of 1.01 TB/s, faces a significant challenge when running the Qwen 2.5 72B model. At FP16 precision (2 bytes per parameter), the model's weights alone require approximately 144GB of VRAM, before accounting for the KV cache and activations. The resulting 120GB deficit between the model's requirement and the GPU's capacity means the model cannot be loaded and executed directly on the RTX 3090 Ti without techniques that reduce its memory footprint. The card's 10752 CUDA cores and 336 Tensor cores would provide ample compute if the model could fit in memory; the memory limitation is the primary bottleneck.
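For reference, the 144GB figure follows directly from parameter count times bytes per parameter. The short sketch below is a rough, weights-only estimate (the helper name and the decimal-GB convention are illustrative); KV cache, activations, and framework overhead add further memory on top.

```python
# Rough, weights-only memory estimate: parameters x bytes per parameter.
# KV cache, activations, and framework overhead are not included.
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("4-bit", 0.5)]:
    print(f"{label}: ~{weight_memory_gb(72, bytes_per_param):.0f} GB")
# FP16: ~144 GB, INT8: ~72 GB, 4-bit: ~36 GB -- all larger than the card's 24 GB
```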

Recommendation

To run Qwen 2.5 72B on the RTX 3090 Ti, aggressive quantization techniques are essential. Consider using 4-bit or even 3-bit quantization methods offered by libraries like `llama.cpp` or `AutoGPTQ`. Furthermore, offloading some layers to system RAM is an option, although this will significantly reduce inference speed. Explore distributed inference across multiple GPUs if feasible, or consider using cloud-based GPU resources with sufficient VRAM to avoid these limitations. Without these optimizations, running the model locally on the RTX 3090 Ti is not practically viable.
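As a concrete illustration of partial offload, here is a minimal sketch using the `llama-cpp-python` bindings for `llama.cpp`. It assumes a 4-bit GGUF build of the model is available locally (the filename is hypothetical), and the number of GPU layers and the context size would need to be tuned experimentally to fit within 24GB:

```python
# Minimal sketch: run a 4-bit GGUF build of Qwen 2.5 72B with only part of the
# model placed on the RTX 3090 Ti; the remaining layers run from system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-72b-instruct-q4_k_m.gguf",  # hypothetical local file
    n_gpu_layers=20,   # how many transformer layers go to the GPU; tune to fit 24 GB
    n_ctx=2048,        # short context to keep the KV cache small
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```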

Recommended Settings

Batch Size: 1
Context Length: Consider truncating to fit within memory after quantization
Other Settings: Offload layers to CPU if necessary; enable memory optimizations in the inference framework (see the sketch below)
Inference Framework: llama.cpp / AutoGPTQ
Quantization Suggested: 4-bit / 3-bit
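These settings can be expressed, for example, with Hugging Face `transformers` and `bitsandbytes` 4-bit loading plus automatic CPU offload. This is a sketch under assumptions: the model ID and memory caps are illustrative, whether quantized layers can be offloaded cleanly depends on the library versions in use, and most of the model will still land in system RAM on a 24GB card.

```python
# Sketch: 4-bit (NF4) weights with automatic GPU/CPU placement. Even at ~36 GB
# of quantized weights, a large share of the model spills to system RAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-72B-Instruct"  # assumed Hugging Face model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_enable_fp32_cpu_offload=True,  # allow layers that do not fit on the GPU to stay on CPU
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                        # let accelerate split layers across GPU and CPU
    max_memory={0: "22GiB", "cpu": "96GiB"},  # assumed caps; leave headroom on the 24 GB card
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```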

Frequently Asked Questions

Is Qwen 2.5 72B compatible with NVIDIA RTX 3090 Ti?
Not directly. The RTX 3090 Ti has insufficient VRAM (24GB) to load the Qwen 2.5 72B model (144GB required for FP16) without significant optimization.
What VRAM is needed for Qwen 2.5 72B?
The Qwen 2.5 72B model requires approximately 144GB of VRAM in FP16 precision. Quantization can reduce this requirement significantly.
How fast will Qwen 2.5 72B run on NVIDIA RTX 3090 Ti?
Performance will be limited due to VRAM constraints. Expect very slow inference speeds, potentially several seconds per token, even after aggressive quantization and offloading. Exact speed depends heavily on the chosen quantization method and other optimization settings.