Can I run Llama 3 70B (Q4_K_M, GGUF 4-bit) on an NVIDIA RTX 3090?

Result: Fail/OOM. This GPU doesn't have enough VRAM.

GPU VRAM: 24.0 GB
Required: 35.0 GB
Headroom: -11.0 GB

VRAM Usage: 100% of 24.0 GB (requirement exceeds capacity)

Technical Analysis

The NVIDIA RTX 3090, with 24GB of GDDR6X VRAM, falls well short of the roughly 35GB needed to run the Q4_K_M quantized version of Llama 3 70B. This 11GB deficit means out-of-memory errors, or severely degraded performance if the overflow is offloaded to system RAM (assuming the runtime supports that at all). While the RTX 3090 offers high memory bandwidth (0.94 TB/s) and a substantial number of CUDA and Tensor cores, the limiting factor is that the model cannot fit entirely in GPU memory. The Ampere architecture is capable; VRAM capacity is simply the bottleneck for large language models like Llama 3 70B.
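The arithmetic behind these figures can be sketched in a few lines of Python: weight memory is roughly parameter count times bits per weight divided by 8, plus some allowance for the KV cache and runtime buffers. The 1.5 GB overhead below is an assumption for illustration, and real Q4_K_M files average slightly more than 4 bits per weight, so treat this as a rough estimate only.

```python
def estimate_vram_gb(n_params: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: quantized weights plus a flat allowance for KV cache/buffers."""
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# Llama 3 70B at the nominal 4 bits/weight behind the ~35 GB figure above
print(f"Q4 (nominal): {estimate_vram_gb(70e9, 4.0):.1f} GB")   # ~36.5 GB, well over the 24 GB available
print(f"FP16:         {estimate_vram_gb(70e9, 16.0):.1f} GB")  # ~141.5 GB, in line with the FAQ below
```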

Recommendation

Given the VRAM limitation, running Llama 3 70B entirely on a single RTX 3090 is not feasible. To run this model locally, consider a GPU with significantly more VRAM (48GB or more), or split the model across multiple GPUs, although that requires a more complex setup and supporting software. As a more immediate workaround, try a more aggressive quantization such as Q2_K, keeping in mind that aggressive quantization degrades output quality. If local hardware upgrades are not an option, cloud-based GPU services or renting time on a more powerful machine are alternatives.
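Before downloading a lower-bit GGUF, a quick screening pass like the sketch below can indicate whether it is even worth trying. The bits-per-weight values are rough assumptions (K-quants are mixed-precision, so actual file sizes vary), as is the overhead allowance; treat the output as an estimate, not a guarantee.

```python
# Screening sketch: compare approximate memory needs of GGUF quantizations
# against the RTX 3090's 24 GB. Bits-per-weight and overhead are assumed values.
GPU_VRAM_GB = 24.0
N_PARAMS = 70e9
OVERHEAD_GB = 1.5  # assumed allowance for KV cache and runtime buffers
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.0, "Q2_K": 2.6}  # rough nominal figures

for name, bits in APPROX_BITS_PER_WEIGHT.items():
    total_gb = N_PARAMS * bits / 8 / 1e9 + OVERHEAD_GB
    print(f"{name}: ~{total_gb:.1f} GB needed vs {GPU_VRAM_GB:.1f} GB available "
          f"(headroom {GPU_VRAM_GB - total_gb:+.1f} GB)")
```

With these rough numbers even Q2_K is borderline on 24 GB, which is why the settings below also keep partial CPU offloading and a short context window on the table.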

Recommended Settings

Batch size: 1 (to minimize VRAM usage)
Context length: As low as possible (e.g., 512 or 1024)
Other settings: Enable CPU offloading (expect extremely slow performance); use smaller models with fewer parameters; monitor VRAM usage closely during inference
Inference framework: llama.cpp (for CPU fallback if needed) or exllama…
Suggested quantization: Q2_K or lower (experiment to find the best balanc…
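A minimal sketch of applying these settings with the llama-cpp-python bindings is shown below. The model filename and the n_gpu_layers value are hypothetical placeholders: n_gpu_layers controls how many transformer layers are offloaded to the 24 GB card, with the remainder kept in system RAM, so generation will be very slow in this configuration.

```python
# Minimal sketch, assuming llama-cpp-python installed with CUDA support
# (pip install llama-cpp-python). Model path and n_gpu_layers are placeholders;
# reduce n_gpu_layers if you still hit out-of-memory errors.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-70b-instruct.Q2_K.gguf",  # hypothetical local file
    n_ctx=1024,       # keep the context window small, as recommended above
    n_batch=1,        # minimize VRAM spent on batch buffers
    n_gpu_layers=40,  # partial offload; remaining layers run on the CPU
)

out = llm("Summarize why VRAM limits model size.", max_tokens=64)
print(out["choices"][0]["text"])
```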

Frequently Asked Questions

Is Llama 3 70B compatible with the NVIDIA RTX 3090?
No, the NVIDIA RTX 3090 does not have enough VRAM to run the Q4_K_M quantized Llama 3 70B model effectively.
What VRAM is needed for Llama 3 70B?
The Q4_K_M quantized version of Llama 3 70B requires approximately 35GB of VRAM. Higher precision (e.g., FP16) requires significantly more, around 140GB.
How fast will Llama 3 70B run on an NVIDIA RTX 3090?
Due to insufficient VRAM, Llama 3 70B cannot run entirely on an RTX 3090; at best it runs with much of the model offloaded to system RAM, which degrades performance severely. Expect extremely slow token generation, potentially several seconds or even minutes per token, if it runs at all.