Can I run Qwen 2.5 32B (INT8, 8-bit integer) on an NVIDIA RTX 3090?

Result: Fail / OOM (this GPU doesn't have enough VRAM)

GPU VRAM: 24.0 GB
Required: 32.0 GB
Headroom: -8.0 GB

VRAM Usage: the requirement exceeds the full 24.0 GB available (100%+ used).

Technical Analysis

The NVIDIA RTX 3090, with its 24 GB of GDDR6X VRAM, falls short of the roughly 32 GB needed to run the INT8-quantized Qwen 2.5 32B model. Even at INT8 precision (one byte per parameter), the weights alone exceed the GPU's capacity, preventing the model from loading for inference. The RTX 3090's 0.94 TB/s memory bandwidth and 10,496 CUDA cores would otherwise deliver reasonable inference speeds if sufficient VRAM were available. Without it, the system will hit out-of-memory errors, making real-time and even batch processing impossible.

While the RTX 3090's Ampere architecture is well suited to AI workloads, the VRAM limitation is the primary bottleneck in this scenario. Its 328 Tensor Cores accelerate the matrix multiplications that dominate LLM inference, but they are of no help if the model's parameters cannot be loaded into GPU memory in the first place. The model's 131,072-token context length further inflates the VRAM demand, since longer contexts require more memory for the attention KV cache and intermediate activations.
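As a rough back-of-the-envelope check (a minimal sketch; the byte-per-parameter figures cover weights only and ignore activation, KV-cache, and framework overhead), the weight footprint can be estimated directly from the parameter count:

```python
# Rough weight-memory estimate for a 32B-parameter model at different precisions.
# Activations, KV cache, and framework buffers are NOT included here.
PARAMS = 32e9          # Qwen 2.5 32B
GPU_VRAM_GB = 24.0     # RTX 3090

BYTES_PER_PARAM = {
    "FP16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
}

for precision, bytes_per_param in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * bytes_per_param / 1e9
    verdict = "fits" if weights_gb < GPU_VRAM_GB else "does not fit"
    print(f"{precision}: ~{weights_gb:.0f} GB of weights -> {verdict} in {GPU_VRAM_GB:.0f} GB")

# FP16: ~64 GB -> does not fit; INT8: ~32 GB -> does not fit; INT4: ~16 GB -> fits
```

The INT8 row reproduces the 32 GB requirement and the -8 GB headroom shown above; only a 4-bit build leaves room for the KV cache and activations.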

Recommendation

Given the VRAM deficit, running Qwen 2.5 32B on the RTX 3090 requires either offloading layers to system RAM (which significantly reduces performance) or more aggressive quantization. Consider a framework like `llama.cpp`, which can offload layers to the CPU, though this will drastically reduce inference speed. Alternatively, a 4-bit quantization (for example a GGUF Q4 build, or GPTQ/AWQ for fully GPU-resident inference) brings the weight footprint to roughly 16-20 GB, within the RTX 3090's 24 GB limit. If performance is critical, use a GPU with at least 32 GB of VRAM or distribute the model across multiple GPUs using model parallelism.
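As a minimal sketch of the offloading approach through the `llama-cpp-python` bindings (the GGUF filename, layer count, and context size below are illustrative assumptions, not tested values):

```python
from llama_cpp import Llama

# Load a 4-bit GGUF build, keeping as many layers on the GPU as fit in 24 GB;
# the remaining layers run on the CPU from system RAM.
llm = Llama(
    model_path="qwen2.5-32b-instruct-q4_k_m.gguf",  # hypothetical local file
    n_gpu_layers=50,   # lower this value if you still hit out-of-memory errors
    n_ctx=8192,        # reduced context keeps the KV cache small
    n_batch=512,
)

out = llm("Summarize the trade-offs of 4-bit quantization.", max_tokens=128)
print(out["choices"][0]["text"])
```

With a Q4_K_M build the weights are roughly 19-20 GB, so most or all layers should fit on the GPU; an INT8 build forces a large share of the model onto the CPU, and throughput drops accordingly.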

Recommended Settings

Batch Size: 1
Context Length: Reduce the context length where possible to minimize VRAM usage (see the sketch after this list).
Other Settings: Offload layers to CPU; enable memory optimizations in the inference framework; use smaller data types where possible.
Inference Framework: llama.cpp
Suggested Quantization: 4-bit (e.g., a GGUF Q4_K_M build)
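To see why a shorter context helps, here is a rough KV-cache estimate. The layer count, KV-head count, and head dimension are taken from the published Qwen 2.5 32B configuration as best understood and should be treated as assumptions; the cache is assumed to be stored in FP16.

```python
# Approximate FP16 KV-cache size for Qwen 2.5 32B at different context lengths.
# Assumed architecture: 64 layers, 8 KV heads (grouped-query attention), head dim 128.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 64, 8, 128, 2

def kv_cache_gb(context_tokens: int) -> float:
    # 2x for keys + values, per layer, per KV head, per head dimension
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * context_tokens / 1e9

for ctx in (4096, 8192, 32768, 131072):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")

#   4096 tokens -> ~1.1 GB
#   8192 tokens -> ~2.1 GB
#  32768 tokens -> ~8.6 GB
# 131072 tokens -> ~34.4 GB
```

At the full 131,072-token context the KV cache alone would exceed the card's 24 GB, independent of the weights, which is why capping the context length is listed above.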

Frequently Asked Questions

Is Qwen 2.5 32B (32.00B) compatible with NVIDIA RTX 3090?
No, the RTX 3090's 24GB VRAM is insufficient for the 32GB required by the INT8 quantized Qwen 2.5 32B model.
What VRAM is needed for Qwen 2.5 32B (32.00B)?
The INT8 quantized version of Qwen 2.5 32B requires approximately 32GB of VRAM.
How fast will Qwen 2.5 32B (32.00B) run on NVIDIA RTX 3090?
It will likely not run due to insufficient VRAM. If forced to run by offloading to system RAM or aggressive quantization, the performance will be significantly degraded and likely too slow for practical use.
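For a rough sense of the speed ceiling, decoding is usually memory-bandwidth-bound: each generated token must stream roughly all active weights through memory once. The sketch below uses ballpark bandwidth figures that are assumptions, not measurements.

```python
# Upper-bound tokens/s from (memory bandwidth) / (bytes read per token).
# Bandwidth numbers are rough assumptions; real throughput will be lower.
MODEL_GB = {"INT8 weights": 32.0, "4-bit GGUF weights": 20.0}
BANDWIDTH_GBPS = {
    "RTX 3090 VRAM": 936.0,                # layers resident on the GPU
    "dual-channel DDR4 system RAM": 50.0,  # layers offloaded to the CPU
}

for mem, bw in BANDWIDTH_GBPS.items():
    for quant, size_gb in MODEL_GB.items():
        print(f"{quant} via {mem}: ~{bw / size_gb:.1f} tokens/s upper bound")
```

Even in the best case, the portion of the model offloaded to system RAM is limited to a few tokens per second, which is why CPU offload is only practical for occasional, latency-tolerant use.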