Can I run Llama 3 70B (Q3_K_M) on the NVIDIA RTX 3090 Ti?

Fail/OOM
This GPU doesn't have enough VRAM
GPU VRAM: 24.0GB
Required: 28.0GB
Headroom: -4.0GB

VRAM Usage: 24.0GB of 24.0GB used (100%)

Technical Analysis

The NVIDIA RTX 3090 Ti, with its 24GB of GDDR6X VRAM and Ampere architecture, is a powerful GPU, but it falls short of the VRAM requirement for running Llama 3 70B even with quantization. The card offers high memory bandwidth (1.01 TB/s) and plenty of compute (10752 CUDA cores, 336 Tensor cores), but the sheer size of the 70-billion-parameter model demands more memory: Q3_K_M quantization brings the footprint down to about 28GB, which is still 4GB over the 3090 Ti's capacity. That deficit will prevent the model from loading at all, producing out-of-memory errors. The architecture itself would be capable of handling the compute load if the model could fit in memory.
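As a rough sanity check on these numbers, here is a minimal back-of-the-envelope sketch; the ~3 effective bits per weight and the flat 2GB overhead for KV cache and runtime state are illustrative assumptions, not values reported by this tool.

```python
# Back-of-the-envelope VRAM estimate for a dense LLM.
# Assumption: weights dominate the footprint, with a flat overhead term
# standing in for the KV cache, activations, and CUDA context.

def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float,
                     overhead_gb: float = 2.0) -> float:
    # billions of params * bits per weight / 8 bits per byte -> gigabytes
    return params_billion * bits_per_weight / 8 + overhead_gb

if __name__ == "__main__":
    print(f"FP16 (~16 bits/weight): ~{estimate_vram_gb(70, 16, 0):.0f} GB")
    print(f"Q3_K_M (~3 bits/weight assumed): ~{estimate_vram_gb(70, 3, 2):.1f} GB "
          f"vs. 24 GB available on the 3090 Ti")
```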

Recommendation

Given the VRAM limitation, running Llama 3 70B on a single RTX 3090 Ti is not feasible. Several options exist. First, use a smaller Llama 3 variant such as the 8B model, which fits comfortably in far less memory; you give up some capability and accuracy, but it runs without workarounds. Second, investigate model parallelism across multiple GPUs, splitting the model so each card holds a portion of the parameters; this requires a more involved software setup. Finally, offload some layers to system RAM and run them on the CPU, which lets the larger model load but dramatically reduces inference speed (a minimal sketch of this approach follows below).
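For the CPU-offloading route, a minimal llama-cpp-python sketch is shown below; the GGUF filename and the layer split are placeholders to tune for your own setup, not settings validated on this card.

```python
# Partial GPU offload with llama-cpp-python: keep as many layers as fit in
# 24GB on the GPU and let llama.cpp run the rest from system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3-70B-Instruct.Q3_K_M.gguf",  # placeholder path
    n_gpu_layers=60,   # Llama 3 70B has 80 layers; tune this down until it fits
    n_ctx=4096,        # smaller context keeps the KV cache manageable
)

out = llm("Summarize why VRAM limits model size.", max_tokens=64)
print(out["choices"][0]["text"])
```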

Recommended Settings

Batch size: 1 (if offloading to CPU), otherwise experiment with small values.
Context length: reduce to 2048 or 4096 tokens to potentially lower memory use.
Other settings: enable CUDA graph capture to reduce CPU overhead; use paged attention if supported by the inference framework to improve memory efficiency; monitor VRAM usage closely during inference and adjust settings accordingly (see the sketch after this list).
Inference framework: llama.cpp (for CPU offloading) or vLLM (for multi-GPU setups).
Quantization suggested: Q4_K_M, or even Q5_K_M if using CPU offloading to preserve more output quality.
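For the VRAM-monitoring suggestion, one option is a small NVML check like the sketch below (device index 0 is an assumption; nvidia-smi on the command line reports the same numbers).

```python
# Check how much VRAM is actually in use (pip install nvidia-ml-py).
from pynvml import (nvmlInit, nvmlShutdown,
                    nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo)

nvmlInit()
try:
    handle = nvmlDeviceGetHandleByIndex(0)   # assumes the 3090 Ti is GPU 0
    mem = nvmlDeviceGetMemoryInfo(handle)    # values are reported in bytes
    print(f"VRAM used: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GB")
finally:
    nvmlShutdown()
```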

Frequently Asked Questions

Is Llama 3 70B compatible with the NVIDIA RTX 3090 Ti?
No, the RTX 3090 Ti's 24GB of VRAM is insufficient to run Llama 3 70B, even with Q3_K_M quantization, which still requires about 28GB.
What VRAM is needed for Llama 3 70B?
Llama 3 70B requires roughly 140GB of VRAM in FP16 precision (about 2 bytes per parameter). Quantization reduces this substantially, but even with Q3_K_M it still needs around 28GB of VRAM.
How fast will Llama 3 70B run on the NVIDIA RTX 3090 Ti?
Llama 3 70B will likely not run on the RTX 3090 Ti due to insufficient VRAM. If you manage to run it with extreme quantization and CPU offloading, expect very slow performance (potentially less than 1 token/second).