Can I run LLaVA 1.6 34B on NVIDIA RTX 3060 Ti?

Fail/OOM: This GPU doesn't have enough VRAM.
GPU VRAM: 8.0 GB
Required: 68.0 GB
Headroom: -60.0 GB

VRAM Usage: 100% of the 8.0 GB used

Technical Analysis

The primary limiting factor in running a large multimodal model like LLaVA 1.6 34B is the GPU's VRAM. In FP16 precision each of the model's roughly 34 billion parameters occupies 2 bytes, so the weights alone require approximately 68GB of VRAM before accounting for the KV cache, activations, or the vision encoder. The NVIDIA RTX 3060 Ti, with its 8GB of VRAM, falls far short of this requirement: the model cannot be loaded onto the GPU at all, and any attempt results in out-of-memory errors. While the RTX 3060 Ti's Ampere architecture, 4864 CUDA cores, and 152 Tensor Cores handle smaller models well, the sheer size of LLaVA 1.6 34B overwhelms the available memory. The card's 448 GB/s of memory bandwidth, while respectable, is secondary to the VRAM constraint in this scenario.
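To make the arithmetic behind the 68GB figure concrete, here is a minimal back-of-the-envelope sketch. The ~34 billion parameter count comes from the model name; the per-parameter byte sizes are standard for FP16 and INT8, and the calculation deliberately ignores KV cache, activations, and the vision tower, which only add to the total.

```python
# Back-of-the-envelope VRAM estimate for the model weights alone.
# KV cache, activations, and the vision encoder add further overhead on top.

PARAMS = 34e9          # LLaVA 1.6 34B: roughly 34 billion weights
BYTES_PER_PARAM = {    # common inference precisions
    "fp16": 2.0,
    "int8": 1.0,
}
GPU_VRAM_GB = 8.0      # NVIDIA RTX 3060 Ti

for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    print(f"{precision}: ~{weights_gb:.0f} GB for weights vs {GPU_VRAM_GB} GB VRAM "
          f"(shortfall ~{weights_gb - GPU_VRAM_GB:.0f} GB)")
```

At FP16 this prints a ~68 GB requirement and a ~60 GB shortfall, matching the headroom figure above; even a full INT8 conversion still needs roughly 34 GB.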

Attempting to run the model without sufficient VRAM will simply produce out-of-memory errors. Even if techniques like CPU offloading are employed, performance will be severely degraded by the slow PCIe transfers between system RAM and the GPU. The model's parameters cannot fit within the GPU's memory, making real-time or even near-real-time inference impossible. The RTX 3060 Ti's Tensor Cores would accelerate the model's matrix multiplications if enough VRAM were available, but the memory limitation renders them moot.

Recommendation

Given the substantial VRAM deficit, running LLaVA 1.6 34B directly on an RTX 3060 Ti is not feasible without significant compromises. Consider a smaller model that fits within 8GB of VRAM, such as a 4-bit-quantized 7B variant (e.g., LLaVA 1.6 7B). Alternatively, explore cloud-based options such as Google Colab Pro or GPU instances from AWS, Azure, or GCP that offer cards with sufficient VRAM (e.g., an 80GB A100, an H100, or similar).

If using the RTX 3060 Ti is a must, aggressive quantization via llama.cpp (4-bit or even 3-bit) reduces the memory footprint at the cost of accuracy, but a 34B model still occupies roughly 20GB at q4_k_m and about 17GB at q3_k_m, well beyond the card's 8GB. CPU offloading can make up the difference by keeping most of the layers in system RAM, but the slower CPU-GPU transfers mean generation will likely be very slow and potentially unstable.
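For a sense of how far even heavy quantization falls short, the sketch below estimates the quantized footprint. The effective bits-per-weight values for llama.cpp's q4_k_m and q3_k_m formats and the ~1.5GB reserved for runtime overhead are rough assumptions, not measured values.

```python
# Rough footprint of a 34B-parameter model under llama.cpp K-quants.
# Bits-per-weight values are approximate averages (assumption); the GPU budget
# leaves headroom for the CUDA context, KV cache, and the vision encoder.

PARAMS = 34e9
GPU_BUDGET_GB = 8.0 - 1.5          # assume ~1.5 GB reserved for runtime overhead
APPROX_BITS_PER_WEIGHT = {
    "q4_k_m": 4.85,
    "q3_k_m": 3.9,
}

for quant, bpw in APPROX_BITS_PER_WEIGHT.items():
    total_gb = PARAMS * bpw / 8 / 1e9
    frac_on_gpu = min(1.0, GPU_BUDGET_GB / total_gb)
    print(f"{quant}: ~{total_gb:.1f} GB total; only ~{frac_on_gpu:.0%} of the "
          f"weights fit in the remaining {GPU_BUDGET_GB:.1f} GB of VRAM")
```

Under these assumptions only around a third of the q4_k_m weights fit on the card, so the bulk of every forward pass runs from system RAM.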

Recommended Settings

Batch Size: 1
Context Length: 512
Inference Framework: llama.cpp
Suggested Quantization: q4_k_m or lower
Other Settings: CPU offloading (expect very slow performance); reduce image resolution for LLaVA; use a smaller model
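As a concrete illustration of the settings above, here is a minimal sketch using the llama-cpp-python bindings. The GGUF filename and the n_gpu_layers value are placeholders (assumptions), and LLaVA's image input, which additionally requires the multimodal projector file and a vision chat handler, is omitted from this text-only sketch.

```python
# Minimal sketch: partial GPU offload of a q4_k_m GGUF with llama-cpp-python.
# Tune n_gpu_layers down until the model loads without an out-of-memory error;
# batch size 1 is implicit for a single-prompt completion call.
from llama_cpp import Llama

llm = Llama(
    model_path="llava-v1.6-34b.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,   # offload only as many layers as fit in ~8 GB of VRAM
    n_ctx=512,         # short context to limit KV-cache memory
    verbose=False,
)

out = llm("Describe the scene in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Even when a configuration like this loads successfully, most layers run on the CPU, so expect throughput closer to seconds per token than tokens per second.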

Frequently Asked Questions

Is LLaVA 1.6 34B compatible with NVIDIA RTX 3060 Ti?
No, LLaVA 1.6 34B is not directly compatible with the NVIDIA RTX 3060 Ti due to insufficient VRAM.
What VRAM is needed for LLaVA 1.6 34B?
LLaVA 1.6 34B requires approximately 68GB of VRAM in FP16 precision.
How fast will LLaVA 1.6 34B run on NVIDIA RTX 3060 Ti?
LLaVA 1.6 34B is unlikely to run at a usable speed on an RTX 3060 Ti. Even with aggressive quantization and CPU offloading, performance will likely be very slow (potentially several seconds per token) and may be unstable.