Can I run Llama 3 70B (q3_k_m) on NVIDIA RTX 3090?

Fail/OOM: This GPU doesn't have enough VRAM.

GPU VRAM: 24.0 GB
Required: 28.0 GB
Headroom: -4.0 GB

VRAM Usage: 100% of 24.0 GB used

Technical Analysis

The NVIDIA RTX 3090, with 24GB of GDDR6X VRAM, falls 4GB short of the roughly 28GB needed to run Llama 3 70B with q3_k_m quantization. Because even the quantized weights cannot fully reside in GPU memory, the model will either fail to load or suffer severe performance degradation from constant swapping between VRAM and system RAM. The RTX 3090's high memory bandwidth (0.94 TB/s) and ample CUDA and Tensor cores do not compensate: once the model spills out of VRAM, data transfer between GPU and host becomes the bottleneck, sharply limiting inference speed and overall usability.
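
As a quick illustration of the numbers above, here is a minimal Python sketch of the fit check. The 24.0 GB and 28.0 GB figures are taken from this report; how the checker derives the "required" figure (weights, KV cache, runtime buffers) is not modelled here.

    # Minimal sketch of the VRAM fit check behind the verdict above.
    GPU_VRAM_GB = 24.0   # RTX 3090 memory, from this report
    REQUIRED_GB = 28.0   # q3_k_m requirement, from this report

    headroom_gb = GPU_VRAM_GB - REQUIRED_GB
    print(f"Headroom: {headroom_gb:+.1f} GB")  # -> Headroom: -4.0 GB

    if headroom_gb >= 0:
        print("Verdict: fits entirely in VRAM")
    else:
        print(f"Verdict: Fail/OOM -- short by {-headroom_gb:.1f} GB; "
              "use a smaller quant or offload layers to the CPU")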

Recommendation

To run Llama 3 70B on the RTX 3090, consider more aggressive quantization to shrink the VRAM footprint: a Q2_K or smaller IQ2-class quant may bring the model close to the 24GB limit, at the cost of output quality. Alternatively, offload some of the layers to the CPU, accepting a significant drop in inference speed (a rough way to size the split is sketched below). If performance remains unsatisfactory, use a smaller model such as Llama 3 8B, which runs comfortably within this much VRAM, or upgrade to a GPU with more memory.
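
For the CPU-offload route, one rough way to size the split is to assume the 28 GB requirement is spread evenly across the model's transformer layers. In the sketch below, the 80-layer count for Llama 3 70B and the ~2 GB VRAM reserve for KV cache and CUDA buffers are assumptions, not measured values.

    # Rough sizing for partial CPU offload with llama.cpp's -ngl flag.
    # Assumes the 28 GB requirement is split evenly across ~80 layers and
    # reserves ~2 GB of VRAM for KV cache and buffers (both assumptions).
    TOTAL_REQUIRED_GB = 28.0
    N_LAYERS = 80
    GPU_BUDGET_GB = 24.0 - 2.0

    per_layer_gb = TOTAL_REQUIRED_GB / N_LAYERS
    gpu_layers = min(int(GPU_BUDGET_GB // per_layer_gb), N_LAYERS)
    print(f"~{per_layer_gb:.2f} GB per layer; start with -ngl {gpu_layers} "
          "and lower it if the model still runs out of memory")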

Recommended Settings

Batch size: 1
Context length: 2048 (start with a smaller context length and increase it once generation is stable)
Other settings: use --threads to maximize CPU utilization during offloading; enable memory mapping (mmap) in llama.cpp to reduce RAM usage
Inference framework: llama.cpp
Quantization suggested: Q2_K or a smaller IQ2 variant (experiment to find the best balance between VRAM use and output quality)
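
Putting these settings together, here is a hedged sketch of launching llama.cpp from Python. The binary name ("llama-cli"; "main" in older builds), the GGUF filename, the thread count, and the -ngl value are placeholders to adjust for your own build and hardware.

    import subprocess

    # Example llama.cpp invocation using the recommended settings above.
    cmd = [
        "./llama-cli",                    # placeholder binary name
        "-m", "llama-3-70b.q3_k_m.gguf",  # placeholder model path
        "-c", "2048",                     # context length: start small, raise if stable
        "-b", "1",                        # batch size 1
        "-t", "16",                       # CPU threads for offloaded layers (tune to your CPU)
        "-ngl", "62",                     # layers kept on the GPU; lower this if you hit OOM
        "-p", "Hello from a partially offloaded Llama 3 70B",
    ]
    # Memory mapping (mmap) is enabled by default in llama.cpp, so no extra flag is needed.
    subprocess.run(cmd, check=True)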

Frequently Asked Questions

Is Llama 3 70B compatible with the NVIDIA RTX 3090?
No. The RTX 3090's 24GB of VRAM is insufficient for the roughly 28GB required by the q3_k_m quantized version of Llama 3 70B.
What VRAM is needed for Llama 3 70B?
The requirement depends on the quantization level: q3_k_m needs about 28GB, while lower-bit quants such as Q2_K or the IQ2 variants reduce this at the cost of model accuracy.
How fast will Llama 3 70B run on the NVIDIA RTX 3090?
With insufficient VRAM, the model will likely fail to load or run very slowly; expect extremely low tokens per second if layers are offloaded to the CPU. A smaller model will perform significantly better on this GPU.