Can I run Qwen 2.5 72B (INT8, 8-bit integer) on an NVIDIA RTX 4090?

Fail/OOM
This GPU doesn't have enough VRAM
GPU VRAM
24.0GB
Required
72.0GB
Headroom
-48.0GB

VRAM Usage

100% used (24.0GB of 24.0GB)

Technical Analysis

The primary bottleneck in running Qwen 2.5 72B on an RTX 4090 is VRAM. At INT8, each of the model's 72 billion parameters occupies one byte, so the weights alone require approximately 72GB of VRAM, before accounting for the KV cache and activations. The RTX 4090's 24GB falls far short, leaving a 48GB deficit, so the model cannot be loaded onto the GPU at all. While the RTX 4090 offers high memory bandwidth (1.01 TB/s) and powerful CUDA and Tensor cores, those capabilities are irrelevant without enough VRAM to hold the model.
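The arithmetic behind these numbers can be sketched in a few lines. This is a simplified weights-only estimate (one byte per parameter at INT8); it deliberately ignores the KV cache, activations, and framework overhead, so real usage is somewhat higher:

```python
# Approximate bytes per parameter for common quantization levels.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def required_vram_gb(params_billion: float, quant: str) -> float:
    """Approximate VRAM (GB) needed just to hold the model weights."""
    return params_billion * BYTES_PER_PARAM[quant]

gpu_vram_gb = 24.0                       # NVIDIA RTX 4090
required = required_vram_gb(72, "INT8")  # Qwen 2.5 72B at INT8
headroom = gpu_vram_gb - required

print(f"Required: {required:.1f}GB, Headroom: {headroom:.1f}GB")
# Required: 72.0GB, Headroom: -48.0GB
```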

Recommendation

Due to the VRAM limitations, directly running Qwen 2.5 72B (INT8) on a single RTX 4090 is not feasible. Consider using CPU offloading or splitting the model across multiple GPUs if possible. Another option is to explore more aggressive quantization techniques such as INT4 or even lower precision methods, which can significantly reduce the VRAM footprint, although this may impact the model's accuracy. As a last resort, consider using a smaller model, such as Qwen 2.5 7B, which would fit within the RTX 4090's VRAM.
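The trade-offs above can be quantified with the same weights-only approximation. Note that even INT4 weights for the 72B model come to roughly 36GB, still over a single 4090's 24GB, which is why the suggestion is "INT4 or lower" combined with offloading, or a smaller model:

```python
BYTES_PER_PARAM = {"INT8": 1.0, "INT4": 0.5}
GPU_VRAM_GB = 24.0  # RTX 4090

# (model, parameters in billions, quantization) for each option discussed.
options = [
    ("Qwen 2.5 72B", 72, "INT8"),
    ("Qwen 2.5 72B", 72, "INT4"),
    ("Qwen 2.5 7B",  7,  "INT8"),
]

for name, params_b, quant in options:
    weights_gb = params_b * BYTES_PER_PARAM[quant]
    verdict = "fits" if weights_gb <= GPU_VRAM_GB else "does not fit"
    print(f"{name} ({quant}): ~{weights_gb:.1f}GB -> {verdict}")
```

Only the 7B model's weights fit entirely in the 4090's VRAM; both 72B variants would still require CPU offloading or multiple GPUs.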

Recommended Settings

Batch Size
1 (if CPU offloading is used)
Context Length
Reduce context length to minimize VRAM usage.
Other Settings
Enable CPU offloading; explore multi-GPU parallelism if available; optimize for minimal memory footprint.
Inference Framework
llama.cpp (with appropriate flags for CPU offload…
Quantization Suggested
INT4 or lower (if accuracy loss is acceptable)

Frequently Asked Questions

Is Qwen 2.5 72B compatible with the NVIDIA RTX 4090?
No, Qwen 2.5 72B is not directly compatible with a single NVIDIA RTX 4090 due to insufficient VRAM.
What VRAM is needed for Qwen 2.5 72B?
Qwen 2.5 72B (INT8) requires approximately 72GB of VRAM for the model weights alone.
How fast will Qwen 2.5 72B run on an NVIDIA RTX 4090?
Qwen 2.5 72B will likely not run on a single RTX 4090 without significant modifications such as CPU offloading or extreme quantization, which will drastically reduce performance. Expect very low tokens/second if CPU offloading is used.