The primary limiting factor in running large language models (LLMs) like LLaVA 1.6 34B is VRAM. In FP16 precision, the model's roughly 34 billion parameters occupy about 68GB at 2 bytes per parameter, and that is for the weights alone, before the KV cache, activations, and the vision encoder are counted. The NVIDIA RTX 3080 12GB, while a capable card for gaming and some AI workloads, provides only 12GB of VRAM. This leaves a shortfall of roughly 56GB, preventing the model from being loaded in its entirety onto the GPU. Without sufficient VRAM, the system will either fail to load the model or run so slowly, due to constant swapping between system RAM and GPU VRAM, that inference becomes impractical. Memory bandwidth, while important, is secondary to capacity in this scenario: the RTX 3080's ~0.91 TB/s of bandwidth is substantial, but irrelevant if the model cannot fit within the available memory.
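For concreteness, here is a back-of-the-envelope estimate of the FP16 weight footprint against the card's VRAM. The figures are approximations only; real usage adds KV cache, activations, and framework overhead on top of the weights.

```python
# Rough VRAM estimate for LLaVA 1.6 34B in FP16 (weights only).
PARAMS = 34e9              # ~34 billion parameters
BYTES_PER_PARAM_FP16 = 2   # FP16 stores each weight in 2 bytes
GPU_VRAM_GB = 12           # RTX 3080 12GB

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9
shortfall_gb = weights_gb - GPU_VRAM_GB

print(f"FP16 weights:   ~{weights_gb:.0f} GB")   # ~68 GB
print(f"Available VRAM:  {GPU_VRAM_GB} GB")
print(f"Shortfall:      ~{shortfall_gb:.0f} GB") # ~56 GB
```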
Due to this substantial VRAM deficit, running LLaVA 1.6 34B on an RTX 3080 12GB in FP16 is not feasible. Making the model runnable at all requires aggressive quantization, such as 4-bit (Q4) or lower, and even then a 34B model needs on the order of 20GB for its weights, still well beyond 12GB. Frameworks such as llama.cpp can offload the layers that do not fit onto the CPU and system RAM, but this severely impacts performance. A more practical approach is to use a smaller variant, such as a 7B or 13B parameter model, which fits within 12GB of VRAM once quantized. Alternatively, cloud-based inference services or GPUs with higher VRAM capacity (e.g., an RTX 4090 or professional GPUs) are better suited to running models of this size.
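As a rough illustration of the partial-offload approach, the sketch below loads a 4-bit GGUF of the model with llama-cpp-python and sends only some layers to the GPU. The file names, layer count, context size, and image URL are placeholders, and the exact multimodal chat handler class may vary between library versions; treat this as a template rather than a tested recipe.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# The multimodal projector (mmproj) GGUF ships alongside the main model GGUF.
# File names below are placeholders for whatever quantized files you download.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-llava-1.6-34b-f16.gguf")

llm = Llama(
    model_path="llava-1.6-34b.Q4_K_M.gguf",  # ~20GB of weights at ~4-bit
    chat_handler=chat_handler,
    n_gpu_layers=24,   # offload only as many layers as fit in ~12GB of VRAM
    n_ctx=4096,        # larger context windows increase KV-cache memory use
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ]},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```

Expect to tune n_gpu_layers downward until the GPU-resident layers plus KV cache fit in 12GB; the layers left on the CPU are what make generation markedly slower than an all-GPU setup.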