Even quantized to q3_k_m, the Qwen 2.5 72B model requires roughly 28.8GB of VRAM, while the NVIDIA RTX 4090, powerful as it is, offers only 24GB. That leaves a shortfall of about 4.8GB, so the full model cannot be loaded onto the GPU for inference. With 72 billion parameters, the model needs substantial memory for its weights alone, plus additional room for activations during computation. The q3_k_m quantization cuts the footprint dramatically compared to FP16 (which would need about 144GB), but the result still exceeds the RTX 4090's VRAM capacity.
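As a quick sanity check on these numbers, the weight-only footprint can be estimated from the parameter count and the average bits per weight. The ~3.2 bits/weight figure below is simply back-solved from the 28.8GB quoted above, and the estimate ignores runtime overheads such as the KV cache, so treat it as an illustrative sketch rather than an exact measurement.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough weight-only memory estimate in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 72e9   # Qwen 2.5 72B
VRAM_GB = 24.0    # RTX 4090

fp16_gb = weight_footprint_gb(N_PARAMS, 16)    # ~144 GB at 2 bytes per parameter
q3_gb   = weight_footprint_gb(N_PARAMS, 3.2)   # ~28.8 GB (effective rate implied by the figure above)

print(f"FP16 weights:   {fp16_gb:6.1f} GB")
print(f"q3_k_m weights: {q3_gb:6.1f} GB")
print(f"Shortfall vs. {VRAM_GB:.0f} GB VRAM: {q3_gb - VRAM_GB:.1f} GB")
```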
Furthermore, even if the weights somehow fit into the available VRAM (they do not), the RTX 4090's 1.01 TB/s memory bandwidth would set the ceiling on throughput: autoregressive decoding in large language models like Qwen 2.5 72B is memory-bound, since essentially the full set of weights must be streamed from memory to the compute units for every generated token. The VRAM shortfall makes this worse, because the portion of the model that does not fit must be served from system RAM over PCIe, which is far slower than VRAM and drags overall performance down. As a result, the RTX 4090's 16,384 CUDA cores and 512 Tensor Cores would sit largely idle, waiting on memory.
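As a rough illustration of why this is bandwidth-bound rather than compute-bound: in single-batch decoding, approximately the entire set of weights must be read for each generated token, so the best-case token rate is the memory bandwidth divided by the model size. The sketch below uses the figures above plus an assumed ~32 GB/s for a PCIe 4.0 x16 link; it is a simplified model (real offloading stacks usually run the non-offloaded layers on the CPU rather than streaming them over PCIe every token), intended only to show the scale of the penalty.

```python
MODEL_GB = 28.8    # q3_k_m weights, from above
VRAM_BW  = 1010.0  # RTX 4090 memory bandwidth, GB/s (~1.01 TB/s)
PCIE_BW  = 32.0    # assumed PCIe 4.0 x16 bandwidth, GB/s
VRAM_GB  = 24.0    # RTX 4090 VRAM capacity

# Best case: all weights resident in VRAM (not possible here) -> bandwidth-bound ceiling.
all_vram_tps = VRAM_BW / MODEL_GB

# Split case: 24 GB served from VRAM, the remaining ~4.8 GB streamed over PCIe each token.
in_vram  = min(MODEL_GB, VRAM_GB)
spilled  = MODEL_GB - in_vram
per_token_s = in_vram / VRAM_BW + spilled / PCIE_BW
split_tps = 1.0 / per_token_s

print(f"Ceiling if fully in VRAM: ~{all_vram_tps:.0f} tokens/s")
print(f"With {spilled:.1f} GB spilled over PCIe: ~{split_tps:.1f} tokens/s")
```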
Unfortunately, running the q3_k_m quantized Qwen 2.5 72B model on a single RTX 4090 is not feasible because of this VRAM limitation. Running it entirely on-GPU would require a card with at least 28.8GB of VRAM; otherwise, consider the alternatives. A more aggressive quantization such as q2_k (or lower) further reduces VRAM usage, though at a cost in accuracy. Model parallelism splits the model across multiple GPUs, each handling a portion of the layers. For local use on a single RTX 4090, the most practical route is a smaller model, such as Qwen 2.5 14B or a comparable model that fits within the card's 24GB; a minimal sketch of this option follows.
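For the smaller-model route, here is a minimal sketch using the llama-cpp-python bindings. The GGUF filename, context size, and prompt are placeholders; the point is that a 14B-class model at a 4-bit quantization (on the order of 9GB of weights) fits comfortably in 24GB, so every layer can be offloaded to the GPU.

```python
from llama_cpp import Llama

# Placeholder path: any GGUF of a model small enough to fit in 24 GB of VRAM.
llm = Llama(
    model_path="./models/qwen2.5-14b-instruct-q4_k_m.gguf",
    n_gpu_layers=-1,   # offload every layer to the GPU; the whole model fits in VRAM
    n_ctx=4096,        # context window; larger values increase KV-cache memory use
)

out = llm(
    "Explain, in one sentence, why VRAM capacity matters for LLM inference.",
    max_tokens=64,
)
print(out["choices"][0]["text"].strip())
```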