The primary limiting factor for running LLaVA 1.6 13B on an RTX 4080 SUPER is VRAM. In FP16 (half-precision floating point), the model's weights alone occupy roughly 26GB, before accounting for the vision encoder, activations, and KV cache. The RTX 4080 SUPER provides only 16GB of VRAM. This deficit of at least 10GB means the model cannot be loaded entirely onto the GPU, leading to out-of-memory errors or forcing offloading to system RAM, which severely degrades performance.
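For a rough sense of where the 26GB figure comes from, the sketch below estimates the FP16 footprint from the parameter count alone. The parameter count is treated as a round 13 billion and the calculation ignores the vision tower, activations, and KV cache, so treat it as a lower bound rather than a measurement.

```python
# Back-of-envelope FP16 memory estimate for a ~13B-parameter model.
# Parameter count is a rough assumption; real usage adds the vision
# encoder, activations, and KV cache on top of the weights.

PARAMS = 13e9             # ~13 billion parameters
BYTES_PER_PARAM_FP16 = 2  # FP16 stores each parameter in 2 bytes

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9
vram_gb = 16  # RTX 4080 SUPER

print(f"FP16 weights alone:  ~{weights_gb:.0f} GB")            # ~26 GB
print(f"Deficit vs. 16 GB:   ~{weights_gb - vram_gb:.0f} GB")  # ~10 GB
```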
While the RTX 4080 SUPER boasts 736 GB/s (0.74 TB/s) of memory bandwidth and 10240 CUDA cores, these specifications matter little when the model cannot reside entirely in VRAM. Offloading layers or parameters to system RAM introduces significant latency, because data then moves between the GPU and system RAM over PCIe at a small fraction of the VRAM bandwidth. Even with the 320 fourth-generation Tensor Cores of the Ada Lovelace architecture, performance will be bottlenecked by the VRAM limitation. Consequently, the achievable tokens per second and maximum batch size will be far lower than on a GPU with sufficient VRAM.
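A crude upper-bound model helps quantify the offloading penalty: single-stream decoding is roughly memory-bandwidth bound, so tokens per second is at best the available bandwidth divided by the bytes that must be read per token. The PCIe figure below is an assumed effective transfer rate, and the model ignores compute time, KV-cache traffic, and transfer overlap, so the numbers are order-of-magnitude estimates only.

```python
# Rough, bandwidth-bound decode ceiling: each generated token requires
# streaming (approximately) all model weights through the GPU once.
# Ignores compute, KV-cache reads, and any overlap of transfers.

weights_gb = 26.0    # FP16 weights of a ~13B model
vram_bw_gbs = 736.0  # RTX 4080 SUPER memory bandwidth (GB/s)
pcie_bw_gbs = 25.0   # assumed effective PCIe 4.0 x16 rate (GB/s)

print(f"All-in-VRAM ceiling:    ~{vram_bw_gbs / weights_gb:.0f} tok/s")  # ~28
print(f"Fully offloaded ceiling: ~{pcie_bw_gbs / weights_gb:.1f} tok/s") # ~1.0
```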
Due to the VRAM limitation, running LLaVA 1.6 13B in FP16 on the RTX 4080 SUPER is not feasible without significant performance degradation. To make it work, consider quantization: 8-bit quantization shrinks the weights to roughly 13GB, which is tight but workable, and 4-bit to roughly 7GB, leaving room within the 16GB of VRAM for activations and the KV cache. The trade-off is a small loss of accuracy that grows as the bit width drops.
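As one way to fit the model in 16GB, a 4-bit load via Hugging Face transformers and bitsandbytes might look like the sketch below. The model identifier and quantization settings are illustrative assumptions; check the model card and your library versions before relying on them.

```python
# Sketch: loading LLaVA 1.6 13B with 4-bit (NF4) quantization so the
# weights occupy ~7 GB instead of ~26 GB. Model id and settings are
# illustrative assumptions, not a verified recipe.
import torch
from transformers import (BitsAndBytesConfig,
                          LlavaNextForConditionalGeneration,
                          LlavaNextProcessor)

model_id = "llava-hf/llava-v1.6-vicuna-13b-hf"  # assumed HF repo name

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # compute still runs in FP16
)

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place what fits on the GPU, spill the rest to CPU
)
```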
Alternatively, use a smaller model that fits within the available VRAM. If running LLaVA 1.6 13B specifically is essential, consider upgrading to a GPU with more VRAM, such as an RTX 6000 Ada Generation (48GB) or a similar professional-grade card, or split the model across multiple GPUs with model parallelism.
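If two or more GPUs are available, transformers and accelerate can shard the FP16 weights across them with an automatic device map, as in the hedged sketch below. The per-device memory caps and the model id are illustrative assumptions.

```python
# Sketch: sharding FP16 LLaVA 1.6 13B across two GPUs instead of
# quantizing. Memory caps and model id are illustrative assumptions.
import torch
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-vicuna-13b-hf",  # assumed HF repo name
    torch_dtype=torch.float16,
    device_map="auto",                    # let accelerate place layers
    max_memory={0: "15GiB", 1: "15GiB"},  # leave per-GPU headroom
)
```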