The primary bottleneck in running Qwen 2.5 72B on an RTX 4090 is VRAM. Even quantized to INT8, the model's 72 billion parameters occupy roughly 72GB for the weights alone, before the KV cache and activations are counted. The RTX 4090 provides 24GB of VRAM, leaving a deficit of at least 48GB, so the model cannot be loaded and executed directly on the GPU: its parameters simply do not fit in the available memory. The card's other strengths, including 1.01 TB/s of memory bandwidth and capable CUDA and Tensor cores, do not help here, because none of that compute can be used without enough VRAM to hold the model.
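The sizing argument above is simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. A minimal sketch of that estimate (weights only, ignoring KV cache and framework overhead, which add more on top):

```python
# Back-of-the-envelope VRAM estimate: weight bytes = parameter count x bytes per parameter.
# Real usage is higher once the KV cache, activations, and runtime overhead are added.

def estimate_weight_vram_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate VRAM needed just to hold the weights, in GB."""
    return num_params * (bits_per_param / 8) / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = estimate_weight_vram_gb(72e9, bits)
    print(f"Qwen 2.5 72B @ {label}: ~{gb:.0f} GB of weights vs. 24 GB on an RTX 4090")
```

Running this prints roughly 144 GB at FP16, 72 GB at INT8, and 36 GB at INT4, all of which exceed the 4090's 24GB.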
Because of this VRAM shortfall, running Qwen 2.5 72B (INT8) directly on a single RTX 4090 is not feasible. The practical options are to offload part of the model to CPU memory, to split it across multiple GPUs, or to apply more aggressive quantization such as INT4 or lower, which shrinks the weight footprint at some cost in accuracy. Note that even INT4 weights for a 72B model come to roughly 36GB, still more than the 4090's 24GB, so quantization alone does not remove the need for offloading or additional GPUs. If none of these trade-offs are acceptable, the last resort is a smaller model such as Qwen 2.5 7B, which fits within the RTX 4090's VRAM.
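As a concrete illustration of the CPU-offload route, the sketch below uses Hugging Face transformers with accelerate's device_map="auto", which keeps as many layers as fit on the GPU and spills the rest to system RAM (and disk via offload_folder). The model id and memory budgets are assumptions, not a tested recipe, and throughput will be very low because offloaded layers are streamed over PCIe on every forward pass; swapping in an INT4-quantized checkpoint would shrink the offloaded portion but not eliminate it.

```python
# Minimal sketch, assuming the Hugging Face model id "Qwen/Qwen2.5-72B-Instruct",
# ample system RAM, and free disk space for the offload folder. Expect very slow
# generation; this only demonstrates that offloading makes loading possible at all.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-72B-Instruct"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                 # ~144 GB of weights in FP16
    device_map="auto",                         # GPU first, then CPU, then disk
    max_memory={0: "22GiB", "cpu": "96GiB"},   # leave GPU headroom for the KV cache
    offload_folder="offload",                  # spill whatever RAM cannot hold
)

prompt = "Explain KV-cache memory usage in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The max_memory budget of 22GiB (rather than the full 24GB) is deliberate: the KV cache and activations grow with context length and must share the GPU with the resident layers.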