The primary limiting factor for running LLaVA 1.6 13B on an NVIDIA RTX 4070 Ti is the GPU's VRAM capacity. In FP16 precision, the model's 13 billion parameters alone occupy roughly 26GB (13B parameters × 2 bytes each), before accounting for the vision encoder, activations, and KV cache. The RTX 4070 Ti, with 12GB of GDDR6X memory, falls well short of this requirement, so the model cannot be loaded entirely onto the GPU and inference will fail with out-of-memory errors. Memory bandwidth, while important for performance, is secondary here because the model cannot even fit in the available memory, and CUDA or Tensor core counts are likewise irrelevant if the model can't be loaded.
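A rough back-of-envelope sketch illustrates the gap (weights only; the exact figures vary with the checkpoint and runtime overhead):

```python
# Approximate weight memory for a 13B-parameter model at different precisions.
# Ignores the vision tower, activations, KV cache, and framework overhead.
params = 13e9

for name, bytes_per_param in [("FP16", 2.0), ("INT8 / Q8", 1.0), ("4-bit / Q4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.1f} GB")

# FP16:       ~26.0 GB -> far beyond the 12 GB on an RTX 4070 Ti
# INT8 / Q8:  ~13.0 GB -> still over 12 GB before any overhead is added
# 4-bit / Q4:  ~6.5 GB -> fits, with headroom left for the KV cache
```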
To run LLaVA 1.6 13B on this card, use quantization to shrink the model's memory footprint. 4-bit quantization (Q4) brings the weights down to roughly 6.5GB, which fits comfortably within the 12GB limit and leaves room for the KV cache; 8-bit (Q8) is borderline, since the 13B language model's weights alone come to about 13GB. Alternatively, offload some layers to system RAM and run them on the CPU, although this will severely degrade inference speed. As a last resort, use a cloud-based GPU service or upgrade to a GPU with more VRAM, such as an RTX 3090 (24GB), RTX 4080 (16GB), or one of NVIDIA's professional A-series or H-series cards.
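As a minimal sketch of the 4-bit route, assuming the Hugging Face `llava-hf/llava-v1.6-vicuna-13b-hf` checkpoint, a `transformers` version with LLaVA-NeXT support, and `bitsandbytes` installed (the exact model ID and prompt template are assumptions, not taken from the text above):

```python
import torch
from transformers import (
    BitsAndBytesConfig,
    LlavaNextForConditionalGeneration,
    LlavaNextProcessor,
)

# 4-bit NF4 quantization keeps the 13B weights around 6.5 GB,
# so the model can fit on a 12 GB RTX 4070 Ti.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "llava-hf/llava-v1.6-vicuna-13b-hf"  # assumed Hub checkpoint
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the GPU, spill any remainder to CPU RAM
)

# Example query using the Vicuna-style prompt format these checkpoints expect.
from PIL import Image

image = Image.open("example.jpg")  # hypothetical local image
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

With `device_map="auto"`, any layers that still do not fit are placed in system RAM automatically, which trades speed for the ability to run at all; keeping everything on the GPU via 4-bit quantization is the faster option on a 12GB card.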