The primary bottleneck for running LLaVA 1.6 13B on an RTX 3070 Ti is VRAM capacity. In FP16 precision, the 13B parameters alone occupy roughly 26GB (about 2 bytes per parameter), before counting the KV cache, activations, and the vision encoder. The RTX 3070 Ti offers only 8GB of VRAM, an 18GB shortfall on the weights alone, so the model cannot be loaded entirely onto the GPU; attempting it produces out-of-memory errors or forces offloading to system RAM, which drastically reduces performance. Memory bandwidth, while substantial on the 3070 Ti (~608 GB/s), matters less when VRAM capacity is the limiting factor, because data transfer between system RAM and the GPU over PCIe becomes the bottleneck instead. CUDA and Tensor core counts, while important for compute, are likewise secondary to the VRAM constraint in this scenario. Performance will be severely impacted, likely rendering interactive use impossible without significant optimization.
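A back-of-the-envelope sketch of that arithmetic is below. The bytes-per-parameter figures for the quantized formats are approximations, and the estimate covers weights only (KV cache, activations, and the vision tower add further overhead):

```python
# Rough VRAM estimate for model weights alone (excludes KV cache,
# activations, and the vision encoder, which add further overhead).
PARAMS = 13e9      # LLaVA 1.6 13B language-model parameters
GPU_VRAM_GB = 8    # RTX 3070 Ti

def weight_footprint_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes."""
    return params * bytes_per_param / 1e9

for label, bytes_per_param in [
    ("FP16", 2.0),   # full half precision
    ("8-bit", 1.0),  # e.g. Q8_0 (approximate)
    ("4-bit", 0.6),  # e.g. Q4_K_M, ~4.8 bits/weight (approximate)
]:
    gb = weight_footprint_gb(PARAMS, bytes_per_param)
    fits = "fits" if gb < GPU_VRAM_GB else "does NOT fit"
    print(f"{label:>5}: ~{gb:.1f} GB of weights -> {fits} in {GPU_VRAM_GB} GB VRAM")
```

This prints roughly 26GB for FP16, 13GB for 8-bit, and 8GB for 4-bit, which is why quantization is the first lever to pull.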
Because of this VRAM shortfall, running LLaVA 1.6 13B in full FP16 precision on the RTX 3070 Ti is not feasible; you'll need to shrink the model's memory footprint through quantization. The most practical route is a GGUF build of the model run through `llama.cpp` (or its Python bindings) with aggressive quantization such as Q4_K_M, which brings the 13B weights down to roughly 8GB at the cost of some accuracy; server stacks like `text-generation-inference` offer their own quantized loading options for HF-format weights. Even at 4 bits, the 13B model is a tight fit in 8GB once the vision projector and KV cache are included, so you will likely still need to offload some layers to CPU RAM, which slows inference considerably. If acceptable performance is still not achievable, consider a smaller variant (e.g., the 7B version of LLaVA 1.6) or upgrading to a GPU with significantly more VRAM.
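As a minimal sketch of the `llama.cpp` route via its Python bindings (`llama-cpp-python` built with CUDA support): the file paths, the offloaded layer count, and the chat handler choice are assumptions to adjust for your own GGUF files and installed version (fall back to `Llava15ChatHandler` if `Llava16ChatHandler` isn't available in yours):

```python
# Sketch: running a quantized LLaVA 1.6 13B GGUF with llama-cpp-python,
# offloading only part of the model onto the 8 GB GPU. Paths, layer count,
# and handler are placeholders -- adjust to your files and version.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava16ChatHandler  # assumed available in your version

chat_handler = Llava16ChatHandler(clip_model_path="mmproj-llava-1.6-13b.gguf")

llm = Llama(
    model_path="llava-1.6-13b.Q4_K_M.gguf",  # ~8 GB of weights at 4-bit
    chat_handler=chat_handler,
    n_ctx=2048,        # keep the context modest to limit KV-cache VRAM
    n_gpu_layers=32,   # partial offload; tune up/down until it fits in 8 GB
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }]
)
print(response["choices"][0]["message"]["content"])
```

The key knob is `n_gpu_layers`: every layer kept on the GPU speeds up generation, and every layer left on the CPU saves VRAM, so you tune it to the largest value that doesn't trigger an out-of-memory error.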