The primary limiting factor in running large language models (LLMs) like DeepSeek-V2.5 is the available VRAM on the GPU. At FP16 (half-precision floating point), each parameter occupies 2 bytes, so DeepSeek-V2.5's 236 billion parameters require roughly 472GB of VRAM just to store the model weights. The NVIDIA RTX 4080 SUPER, while a powerful card, provides only 16GB of VRAM. That leaves a shortfall of roughly 456GB, making it impossible to load the entire model into the GPU's memory at once using standard methods.
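As a quick sanity check, the arithmetic behind these figures can be reproduced in a few lines of Python. This is a simplified sketch that counts weight bytes only; it ignores the KV cache, activations, and framework overhead, which add further memory on top.

```python
# Back-of-envelope estimate of weight storage; ignores KV cache, activations, and overhead.
def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * bytes_per_param  # billions of params x bytes each = GB

fp16_gb = weight_footprint_gb(236, 2.0)   # ~472 GB for DeepSeek-V2.5 at FP16
print(fp16_gb, fp16_gb - 16)              # shortfall versus a 16 GB card: ~456 GB
```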
Furthermore, even if techniques like CPU offloading or aggressive quantization were employed, the RTX 4080 SUPER's memory bandwidth of roughly 0.74 TB/s would severely bottleneck performance, and streaming model weights from system RAM or slower storage over PCIe would cut throughput further, drastically reducing the tokens generated per second. The 10240 CUDA cores and 320 Tensor Cores contribute computational throughput, but they cannot compensate for the memory bottleneck in this scenario. This incompatibility rules out real-time, or even practical, inference speeds.
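To see why bandwidth dominates, a common back-of-envelope model treats single-stream decoding as memory-bound: each generated token has to read roughly every weight byte once, so the token rate is capped by bandwidth divided by weight size. The sketch below uses that simplification (it ignores DeepSeek-V2.5's mixture-of-experts routing, which reduces the bytes read per token but does not change the conclusion), and the ~32 GB/s PCIe figure is an assumed value for a Gen4 x16 link.

```python
# Rough upper bound on single-stream decode speed when weight reads dominate.
def max_tokens_per_second(weight_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / weight_gb  # each token reads (roughly) every weight byte once

weights_fp16_gb = 472        # DeepSeek-V2.5 weights at FP16
vram_bw_gb_s = 736           # RTX 4080 SUPER memory bandwidth (~0.74 TB/s)
pcie_bw_gb_s = 32            # assumed PCIe 4.0 x16 transfer rate for offloaded weights

print(max_tokens_per_second(weights_fp16_gb, vram_bw_gb_s))  # ~1.6 tok/s even if it fit in VRAM
print(max_tokens_per_second(weights_fp16_gb, pcie_bw_gb_s))  # ~0.07 tok/s when streamed over PCIe
```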
Due to the massive VRAM requirement of DeepSeek-V2.5, running it directly on an RTX 4080 SUPER is not feasible. Consider exploring alternative, smaller models that fit within the 16GB VRAM limit. Models with parameter counts in the single-digit billions are much more likely to run successfully; a 7B-parameter model, for example, needs only about 14GB at FP16 and far less when quantized. If you absolutely need to use DeepSeek-V2.5, you would need to explore distributed inference across multiple GPUs with NVLink, or use cloud-based GPU instances with significantly more VRAM, such as the A100- or H100-equipped instances offered by NelsaHost.
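As a quick way to screen candidate models, the illustrative helper below (reusing the same footprint arithmetic as above) checks whether a given parameter count fits in 16GB at a given precision, with a small assumed headroom for the KV cache and activations.

```python
# Quick screen: does a model's weight footprint fit in 16 GB with headroom for the KV cache?
def fits_in_16gb(params_billions: float, bytes_per_param: float, headroom_gb: float = 2.0) -> bool:
    return params_billions * bytes_per_param + headroom_gb <= 16.0

print(fits_in_16gb(7, 2.0))    # 7B at FP16  -> ~14 GB of weights, True (tight)
print(fits_in_16gb(13, 2.0))   # 13B at FP16 -> ~26 GB, False
print(fits_in_16gb(13, 0.5))   # 13B at 4-bit -> ~6.5 GB, True
```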
Another approach, albeit with significant performance trade-offs, is CPU offloading, where parts of the model are kept in system RAM and swapped in and out of the GPU as needed; this leads to a substantial reduction in inference speed. Quantization can also help, but even aggressive 4-bit quantization would leave roughly 118GB of weights, still far beyond the available VRAM, and pushing much below 4 bits typically causes unacceptable accuracy loss.
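For completeness, here is a minimal sketch of what CPU/disk offloading looks like with the Hugging Face transformers and accelerate stack. The repository id and memory caps are assumptions for illustration, and on this hardware the result would still be far too slow for interactive use.

```python
# Minimal CPU/disk offloading sketch with transformers + accelerate.
# This illustrates the mechanism only; a 236B-parameter model offloaded this way
# would still generate at well under one token per second on a 16GB GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2.5"  # assumed Hugging Face repo id

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                        # accelerate places layers on GPU, then CPU, then disk
    max_memory={0: "15GiB", "cpu": "96GiB"},  # assumed caps: leave headroom on the 16GB card
    offload_folder="offload",                 # weights that exceed RAM are memory-mapped from disk
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```

Quantized runtimes such as llama.cpp take a similar layered approach, letting you choose how many layers stay on the GPU (for example via its --n-gpu-layers option), but the arithmetic above still applies: even a 4-bit build of a 236B-parameter model is several times larger than 16GB.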