The primary limiting factor for running LLaVA 1.6 13B on an NVIDIA RTX 3080 12GB is VRAM capacity. In FP16 (half-precision floating point), the model's 13 billion parameters alone occupy roughly 26GB (2 bytes per parameter), before accounting for the KV cache, activations, and the vision encoder. The RTX 3080 12GB provides only 12GB of VRAM, a shortfall of at least 14GB. The model in its native FP16 format therefore cannot be loaded entirely onto the GPU, leading to out-of-memory errors or forcing offloading to system RAM, which severely degrades performance.
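The arithmetic is simple enough to sanity-check directly. A minimal sketch (weights only, in decimal gigabytes, ignoring the KV cache and vision tower):

```python
# Back-of-the-envelope VRAM needed for the weights alone. Real inference
# adds the KV cache, activations, and LLaVA's vision encoder on top.
PARAMS = 13e9  # 13B language-model parameters

for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("4-bit", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9  # decimal gigabytes
    print(f"{name:>5}: ~{gb:.1f} GB of weights")
```

This prints ~26GB for FP16, ~13GB for INT8, and ~6.5GB for 4-bit, which foreshadows why quantization is the way out below.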
While the RTX 3080 12GB offers a respectable 0.91 TB/s of memory bandwidth and a substantial complement of CUDA and Tensor cores, those specifications matter little when the model cannot fully reside in VRAM. High memory bandwidth only pays off when the weights are actually on the GPU; once layers spill to system RAM, every generated token must cross the PCIe bus, and the CUDA and Tensor cores sit idle waiting on transfers. The Ampere architecture is powerful, but VRAM is the bottleneck in this scenario. Without enough VRAM for the weights, quoting a meaningful tokens-per-second figure or batch size is not feasible, though a back-of-the-envelope ceiling (sketched below) shows just how costly offloading is.
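To make the offloading penalty concrete, here is a rough, assumption-laden upper bound. Autoregressive decoding is memory-bound: each generated token streams the full weight set once, so throughput is capped at bandwidth divided by weight bytes. The PCIe figure assumes an idealized 4.0 x16 link; real numbers will be lower.

```python
# Crude decode-throughput ceilings: tokens/s <= bandwidth / bytes streamed
# per token. Ignores compute time, KV-cache traffic, and overlap tricks.
WEIGHTS_GB = 26.0   # FP16 LLaVA 1.6 13B weights
VRAM_BW = 912.0     # GB/s, RTX 3080 12GB memory bandwidth
PCIE_BW = 32.0      # GB/s, idealized PCIe 4.0 x16 one-way (assumption)

print(f"Weights resident in VRAM:   ~{VRAM_BW / WEIGHTS_GB:.0f} tok/s ceiling")
print(f"Weights streamed over PCIe: ~{PCIE_BW / WEIGHTS_GB:.1f} tok/s ceiling")
```

Roughly 35 tok/s versus 1.2 tok/s: even in the best case, streaming weights over PCIe is more than an order of magnitude slower than keeping them in VRAM.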
To run LLaVA 1.6 13B on an RTX 3080 12GB, you must use quantization to shrink the model's memory footprint. Quantization reduces the precision of the model's weights, cutting VRAM requirements roughly in proportion: 8-bit halves the FP16 footprint to about 13GB, which is still borderline on this card, while 4-bit brings the weights to roughly 7GB and leaves headroom for the KV cache and vision encoder. Frameworks like `llama.cpp` and `vLLM` offer solid support for quantized inference. Be aware that quantization can reduce accuracy, particularly at 4-bit, but it is a necessary trade-off on a VRAM-limited GPU.
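As one possible route (a minimal sketch, not the only option), the Hugging Face `transformers` + `bitsandbytes` stack can load LLaVA 1.6 (released upstream as LLaVA-NeXT) in 4-bit. The repo id below is the community `llava-hf` conversion, and the prompt template shown is for the Vicuna-based variant; verify both against the model card:

```python
import torch
from PIL import Image
from transformers import (
    BitsAndBytesConfig,
    LlavaNextForConditionalGeneration,
    LlavaNextProcessor,
)

# 4-bit NF4 weights: ~13B params * ~0.5 bytes ≈ 7 GB, leaving room on a
# 12 GB card for the vision tower, activations, and a modest KV cache.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "llava-hf/llava-v1.6-vicuna-13b-hf"  # community HF conversion
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # keeps what fits on the GPU, spills the rest to CPU
)

# Vicuna-style prompt template; check the model card for the exact format.
image = Image.open("photo.jpg")
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

`llama.cpp` offers an equivalent path using a pre-quantized GGUF checkpoint plus the separate multimodal projector (mmproj) file, which avoids the Python stack entirely.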
Alternatively, explore cloud-based inference services or consider a GPU with more VRAM. Cloud services give on-demand access to larger GPUs, while upgrading your card provides a more seamless and performant local experience. If neither option is feasible, investigate techniques like model parallelism, where the model is split across multiple GPUs (see the sketch below). However, this approach adds significant setup complexity and is generally not recommended for beginners.
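If you do have a second GPU available, a hedged sketch of the simplest form of splitting uses the `device_map`/`max_memory` mechanism that `transformers` exposes via Accelerate; the two-GPU setup and the memory caps below are illustrative assumptions, not tuned values:

```python
import torch
from transformers import LlavaNextForConditionalGeneration

# Naive pipeline-style split across two hypothetical 12 GB GPUs with CPU
# spillover; layers are assigned greedily until each device's cap is hit.
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-vicuna-13b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "11GiB", 1: "11GiB", "cpu": "24GiB"},
)
```

Note that this placement is sequential: each token flows through GPU 0, then GPU 1, so the cards take turns rather than working in parallel. Frameworks like `vLLM` implement true tensor parallelism, but that again raises the setup complexity.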