Can I run LLaVA 1.6 34B on NVIDIA RTX 3080 12GB?

Result: Fail/OOM
This GPU doesn't have enough VRAM.

GPU VRAM: 12.0 GB
Required (FP16): 68.0 GB
Headroom: -56.0 GB

VRAM Usage: 12.0 GB of 12.0 GB (100% used)

Technical Analysis

The primary limiting factor in running large models such as LLaVA 1.6 34B is VRAM capacity. In FP16 precision, each of the model's roughly 34 billion parameters occupies 2 bytes, so the weights alone require approximately 68GB of VRAM. The NVIDIA RTX 3080 12GB, while a powerful card for gaming and many AI tasks, provides only 12GB of VRAM, leaving a shortfall of 56GB and preventing the model from being loaded onto the GPU in its entirety. Without sufficient VRAM, the system will either fail to load the model or suffer extremely slow performance due to constant swapping between system RAM and GPU VRAM, making inference impractical. Memory bandwidth, while important, is secondary to capacity in this scenario: the RTX 3080's 0.91 TB/s of bandwidth is substantial, but irrelevant if the model cannot fit within the available memory.
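
The 68GB figure follows directly from the parameter count. A minimal arithmetic sketch, using only the numbers quoted above (weights only; KV cache, activations, and the vision tower add further overhead):

```python
# Weight-only VRAM estimate for a 34B-parameter model in FP16.
PARAMS = 34e9            # ~34 billion parameters
BYTES_PER_PARAM = 2      # FP16 stores each weight in 2 bytes
GPU_VRAM_GB = 12.0       # NVIDIA RTX 3080 12GB

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"FP16 weights: {weights_gb:.1f} GB")                  # ~68.0 GB
print(f"Headroom:     {GPU_VRAM_GB - weights_gb:+.1f} GB")   # ~-56.0 GB
```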

Recommendation

Due to the substantial VRAM deficit, running LLaVA 1.6 34B on an RTX 3080 12GB in FP16 is not feasible. To make the model runnable at all, you would need aggressive quantization, such as 4-bit (Q4) or even lower bit widths. Frameworks like llama.cpp also allow CPU offloading, but offloaded layers run far more slowly than GPU-resident ones, so performance suffers badly. A more practical approach is to use a smaller variant, such as a 7B or 13B parameter model, which fits within 12GB of VRAM. Alternatively, cloud-based inference services or GPUs with more VRAM (e.g., an RTX 4090 or professional cards) are better suited to running a model of this size.
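
To gauge how far quantization gets you, here is a rough weight-only size estimate at common llama.cpp quantization levels; the bits-per-weight figures are approximate averages (assumptions, not exact GGUF sizes) and exclude KV cache and runtime overhead:

```python
# Approximate weight-only sizes for a 34B model at common llama.cpp quant levels.
# Bits-per-weight values are rough averages; actual GGUF files vary slightly.
PARAMS = 34e9
GPU_VRAM_GB = 12.0

for name, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_S", 4.5), ("Q2_K", 2.6)]:
    size_gb = PARAMS * bpw / 8 / 1e9
    verdict = "fits (barely)" if size_gb <= GPU_VRAM_GB else "needs CPU offload"
    print(f"{name:7s} ~{size_gb:5.1f} GB -> {verdict}")
```

Even at Q4, the weights alone exceed 12GB, which is why partial CPU offloading is unavoidable on this card unless you drop to the most aggressive (and lossiest) quantization levels.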

Recommended Settings

Batch Size: 1
Context Length: Use a lower context length (e.g., 512 or 1024) to reduce memory usage
Quantization Suggested: Q4_K_S or lower (e.g., Q2_K) with llama.cpp
Inference Framework: llama.cpp (for CPU offloading and quantization) or a similar framework
Other Settings:
- Offload only part of the model to the GPU (using `n_gpu_layers` in llama.cpp) to balance GPU and CPU usage
- Use `--mlock` to prevent swapping to disk (if sufficient RAM is available)
- Experiment with different quantization schemes to find the best balance between performance and accuracy
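
Putting these settings together, a minimal llama-cpp-python sketch for partial GPU offload might look like the following. The GGUF filename and layer count are illustrative assumptions, and you would tune `n_gpu_layers` down until the offloaded portion fits in 12GB; a full LLaVA multimodal setup additionally needs the vision projector and a chat handler, omitted here for brevity.

```python
# Sketch: load a quantized 34B GGUF with partial GPU offload via llama-cpp-python.
# Filename and layer count are hypothetical; adjust n_gpu_layers to what fits in 12 GB.
from llama_cpp import Llama

llm = Llama(
    model_path="llava-v1.6-34b.Q4_K_S.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,    # offload only as many layers as fit; the rest run on the CPU
    n_ctx=1024,         # shorter context keeps the KV cache small
    use_mlock=True,     # pin model memory in RAM, analogous to --mlock in llama.cpp
)

out = llm("Describe what a vision-language model does in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

With most layers resident on the CPU, expect very slow generation, consistent with the FAQ estimate below.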

Frequently Asked Questions

Is LLaVA 1.6 34B compatible with NVIDIA RTX 3080 12GB?
No, LLaVA 1.6 34B is not directly compatible with an NVIDIA RTX 3080 12GB due to insufficient VRAM.
What VRAM is needed for LLaVA 1.6 34B?
LLaVA 1.6 34B requires approximately 68GB of VRAM when using FP16 precision.
How fast will LLaVA 1.6 34B run on NVIDIA RTX 3080 12GB?
Without significant quantization and CPU offloading, LLaVA 1.6 34B will likely not run at all on an RTX 3080 12GB. Even with aggressive optimizations, expect very slow performance, potentially several seconds per token.