The primary limiting factor for running LLaVA 1.6 13B on an RTX 4070 SUPER is VRAM. In FP16 precision, LLaVA 1.6 13B needs roughly 26GB of VRAM for the model weights alone (about 13 billion parameters at 2 bytes each), before accounting for the KV cache and intermediate activations during inference. The RTX 4070 SUPER provides only 12GB of VRAM, a deficit of at least 14GB, so the model simply cannot be loaded in its full FP16 form. The card's ~0.5 TB/s (504 GB/s) of memory bandwidth would be adequate for a model of this class, but bandwidth is irrelevant when the weights do not fit on the device in the first place.
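As a quick back-of-the-envelope check (assuming roughly 13 billion parameters in the language model and ignoring the vision tower, KV cache, and activations), the weight-only footprint at common precisions works out as follows:

```python
# Approximate weight-only memory footprint of a 13B-parameter model.
# Assumes ~13e9 parameters; KV cache, activations, and the vision encoder add more.
PARAMS = 13e9

for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("4-bit", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{name:>5}: ~{gb:.1f} GB of weights")

#  FP16: ~26.0 GB of weights  -> far beyond the 12 GB on an RTX 4070 SUPER
#  INT8: ~13.0 GB of weights  -> weights alone still exceed 12 GB
# 4-bit: ~6.5 GB of weights   -> fits, with headroom for the KV cache
```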
Even if the model could be forced to load, the shortfall would cause constant swapping of weights between system RAM and the GPU, and performance would degrade to the point of being unusable for anything interactive. The 7168 CUDA cores and 224 Tensor cores would sit largely idle behind the memory bottleneck; the efficiency of the Ada Lovelace architecture cannot compensate for a fundamental lack of memory.
To run LLaVA 1.6 13B or similar large models on an RTX 4070 SUPER, you must significantly reduce the VRAM footprint, and the most effective method is quantization. Quantization stores the model weights at lower precision, shrinking the memory requirement roughly in proportion to the bit width: 4-bit quantization brings the 13B weights down to about 6.5 to 7GB, which fits in 12GB with room left for the KV cache, while 8-bit needs about 13GB for the weights alone and therefore still requires some offloading on this card.
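A minimal sketch of a 4-bit load with Hugging Face transformers and bitsandbytes is shown below; the checkpoint name llava-hf/llava-v1.6-vicuna-13b-hf and the exact quantization settings are illustrative assumptions and should be adapted to your setup.

```python
# Sketch: loading LLaVA 1.6 13B with 4-bit (NF4) quantization via bitsandbytes.
# Assumes transformers, accelerate, and bitsandbytes are installed.
import torch
from transformers import (
    LlavaNextProcessor,
    LlavaNextForConditionalGeneration,
    BitsAndBytesConfig,
)

model_id = "llava-hf/llava-v1.6-vicuna-13b-hf"  # assumed checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to FP16 for the matmuls
)

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the GPU as long as they fit
)
```

With a configuration along these lines the quantized weights occupy roughly 7GB, so the whole model can stay resident on the GPU; a 4-bit GGUF build run through llama.cpp reaches a similar footprint and is a common alternative.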
Alternatively, use a smaller model variant where one exists (LLaVA 1.6 also ships 7B checkpoints), or offload some layers to the CPU; CPU offloading keeps the model runnable but drastically reduces inference speed. Another option is cloud-based inference, or a GPU with more VRAM such as an RTX 3090, RTX 4080, or RTX 4090, though even the 24GB cards cannot hold the 13B model in FP16 and still benefit from 8-bit or lighter quantization. For local use, multiple GPUs can also be combined if the inference framework supports splitting the model across them.
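For the CPU-offload route, a hedged sketch using the same assumed checkpoint: accelerate's device_map together with a max_memory cap keeps as many layers as possible on the GPU and spills the rest to system RAM. The memory limits below are illustrative.

```python
# Sketch: 8-bit load with explicit CPU offload on a 12GB card.
# max_memory values are illustrative; leave VRAM headroom for the KV cache.
from transformers import LlavaNextForConditionalGeneration, BitsAndBytesConfig

model_id = "llava-hf/llava-v1.6-vicuna-13b-hf"  # assumed checkpoint, as above

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # allow offloaded layers to run on the CPU
)

model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "24GiB"},  # ~10GiB on GPU 0, remainder in system RAM
)
```

Expect a substantial slowdown with this split; it is a fallback for fitting the model at all, not a substitute for keeping the quantized weights fully on the GPU.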