Can I run Qwen 2.5 72B (INT8, 8-bit integer) on an NVIDIA RTX 3090 Ti?

**Fail/OOM:** This GPU doesn't have enough VRAM.

GPU VRAM: 24.0GB
Required: 72.0GB
Headroom: -48.0GB

VRAM Usage: 24.0GB of 24.0GB (100% used)

Technical Analysis

The NVIDIA RTX 3090 Ti, while a powerful GPU, falls short of the VRAM requirements for running the Qwen 2.5 72B model, even with INT8 quantization. Qwen 2.5 72B is a large language model with 72 billion parameters, and substantial memory is needed to hold the model weights and intermediate activations during inference. INT8 stores one byte per parameter, so the weights alone occupy roughly 72 billion bytes, or about 72GB; that is half the FP16 footprint (~144GB) but still far more than a single consumer card provides. The RTX 3090 Ti offers only 24GB of VRAM, leaving a shortfall of 48GB. The full model therefore cannot be loaded onto the GPU at once, which leads to out-of-memory errors and prevents successful inference.
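The 72GB figure follows directly from the parameter count. A minimal back-of-the-envelope sketch of the arithmetic (weights only; the KV cache, activations, and framework overhead add several more GB in practice):

```python
# Rough weight-memory estimate for Qwen 2.5 72B at different precisions.
# Illustrative only: real usage adds KV cache, activations, and overhead.

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB taken as 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

PARAMS_B = 72.0      # Qwen 2.5 72B
GPU_VRAM_GB = 24.0   # RTX 3090 Ti

print(f"FP16 weights: ~{weight_memory_gb(PARAMS_B, 2.0):.0f} GB")   # ~144 GB
print(f"INT8 weights: ~{weight_memory_gb(PARAMS_B, 1.0):.0f} GB")   # ~72 GB
print(f"INT4 weights: ~{weight_memory_gb(PARAMS_B, 0.5):.0f} GB")   # ~36 GB
print(f"Headroom at INT8: {GPU_VRAM_GB - weight_memory_gb(PARAMS_B, 1.0):.1f} GB")  # -48.0 GB
```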

Recommendation

Due to the VRAM limitations, running Qwen 2.5 72B on a single RTX 3090 Ti is not feasible. Consider these options:

1) **GPU Clustering/Multi-GPU setup:** Use multiple GPUs whose combined VRAM can hold the model. This requires software that supports model parallelism and some expertise.
2) **CPU Offloading:** Offload some layers of the model to the CPU, which uses system RAM. This will slow inference down dramatically; a minimal offloading sketch follows this list.
3) **Smaller Model:** Choose a model that fits within the RTX 3090 Ti's 24GB, such as Qwen 2.5 14B or 7B, which run comfortably on this card when quantized.
4) **Cloud-based Inference:** Use cloud services that provide GPUs with enough VRAM for the full model.
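As a concrete illustration of option 2, here is a hedged sketch of partial GPU offload using llama-cpp-python. The GGUF filename is a placeholder, and `n_gpu_layers` would need tuning so the on-GPU layers plus KV cache stay under 24GB; even then, throughput is limited by the layers running from system RAM:

```python
# Sketch: partial GPU offload with llama-cpp-python.
# The model path is a placeholder; a 4-bit GGUF of a 72B model is still
# roughly 40GB+ on disk and needs ample system RAM for CPU-resident layers.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-72b-instruct-q4_k_m.gguf",  # placeholder filename
    n_gpu_layers=20,   # keep only some layers on the RTX 3090 Ti
    n_ctx=2048,        # small context to limit KV-cache memory
)

output = llm("Explain INT8 quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```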

Recommended Settings

**Batch Size:** 1 (extremely limited by VRAM)
**Context Length:** reduce significantly to minimize memory usage
**Other Settings:** enable CPU offloading (sketched below); experiment with different quantization methods (GPTQ, bitsandbytes); monitor VRAM usage closely; reduce the number of layers loaded onto the GPU
**Inference Framework:** llama.cpp (with CPU offloading) or another framework that supports partial GPU offload
**Quantization Suggested:** GPTQ or similar aggressive quantization may be necessary, though even 4-bit weights (~36GB) exceed this GPU's 24GB
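Below is a hedged sketch of the bitsandbytes INT8 path with CPU offload mentioned in the settings above, using Hugging Face transformers. Exact flags vary by library version, and loading still requires enough combined GPU VRAM and system RAM to hold the ~72GB of INT8 weights:

```python
# Sketch only: INT8 loading with CPU offload via transformers + bitsandbytes.
# device_map="auto" places what fits on the 24GB GPU and the rest in system RAM;
# expect very slow generation for a 72B model split this way.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-72B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # let layers that don't fit on the GPU stay on CPU
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```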

Frequently Asked Questions

Is Qwen 2.5 72B (72.00B) compatible with NVIDIA RTX 3090 Ti?
No, Qwen 2.5 72B is not directly compatible with the NVIDIA RTX 3090 Ti due to insufficient VRAM.
What VRAM is needed for Qwen 2.5 72B (72.00B)?
Qwen 2.5 72B requires approximately 72GB of VRAM for the weights alone when quantized to INT8, plus additional memory for the KV cache and activations.
How fast will Qwen 2.5 72B (72.00B) run on NVIDIA RTX 3090 Ti?
Qwen 2.5 72B will likely not run on the NVIDIA RTX 3090 Ti without significant modifications like CPU offloading or extreme quantization, which will drastically reduce inference speed. Expect very slow performance, potentially unusable for real-time applications.