The NVIDIA RTX 3090 Ti, with 24GB of GDDR6X VRAM and 1.01 TB/s of memory bandwidth, faces a hard constraint when running the Qwen 2.5 72B model. At FP16 precision, the model's weights alone require roughly 144GB of VRAM (72 billion parameters × 2 bytes per parameter), leaving a 120GB shortfall against the GPU's 24GB capacity. The model therefore cannot be loaded and executed directly on the RTX 3090 Ti without techniques that reduce its memory footprint. The card's 10752 CUDA cores and 336 Tensor cores would be more than adequate for the compute side of inference; memory capacity, not throughput, is the bottleneck.
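A quick back-of-the-envelope calculation makes the gap concrete. The sketch below is a rough estimate only: it counts weights and ignores the KV cache, activations, and framework overhead, which add several more gigabytes in practice.

```python
# Rough VRAM estimate for Qwen 2.5 72B weights at different precisions.
PARAMS = 72e9          # approximate parameter count
GPU_VRAM_GB = 24       # RTX 3090 Ti capacity

BYTES_PER_PARAM = {
    "FP16": 2.0,
    "INT8": 1.0,
    "4-bit": 0.5,
    "3-bit": 0.375,
}

for precision, bytes_per_param in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * bytes_per_param / 1e9
    if weights_gb <= GPU_VRAM_GB:
        verdict = "fits"
    else:
        verdict = f"short by {weights_gb - GPU_VRAM_GB:.0f} GB"
    print(f"{precision:>6}: ~{weights_gb:.0f} GB of weights -> {verdict}")
```

Even at 3-bit precision the weights alone exceed 24GB, which is why quantization on its own is not enough on this card.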
To run Qwen 2.5 72B on the RTX 3090 Ti, aggressive quantization is essential. Four-bit or even 3-bit quantization, as offered by libraries like `llama.cpp` or `AutoGPTQ`, cuts the weight footprint dramatically, but even at 4 bits per parameter a 72B model occupies roughly 40GB or more, so part of the model must still be offloaded to system RAM, which significantly reduces inference speed because offloaded layers are bound by PCIe and CPU throughput. Distributing inference across multiple GPUs, or using cloud GPU resources with sufficient VRAM, avoids these trade-offs entirely. Without one of these approaches, running the model locally on the RTX 3090 Ti is not practically viable.
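As a minimal sketch of the quantization-plus-offload approach, the snippet below uses `llama-cpp-python` with partial GPU offload. It assumes a 4-bit GGUF quantization of Qwen 2.5 72B has already been downloaded; the file name and the number of offloaded layers are illustrative, and the layer count that actually fits in 24GB depends on the specific quant and the context size.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-72b-instruct-q4_k_m.gguf",  # hypothetical local GGUF file
    n_gpu_layers=35,   # keep as many layers in VRAM as fit; the rest run on CPU
    n_ctx=4096,        # context window; larger values enlarge the KV cache
)

output = llm(
    "Explain the difference between FP16 and 4-bit quantization.",
    max_tokens=128,
)
print(output["choices"][0]["text"])
```

With a large fraction of the layers running on the CPU, generation speed is typically limited to a few tokens per second, which is the practical cost of fitting a 72B model alongside a 24GB GPU.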