Can I run Llama 3.1 405B on NVIDIA RTX 3090?

Result: Fail/OOM (this GPU doesn't have enough VRAM)
GPU VRAM: 24.0GB
Required: 810.0GB
Headroom: -786.0GB

VRAM Usage: 100% used (24.0GB of 24.0GB available)

Technical Analysis

The NVIDIA RTX 3090, with its 24GB of GDDR6X VRAM, falls significantly short of the 810GB VRAM required to load the Llama 3.1 405B model in FP16 precision. This vast discrepancy means the entire model cannot reside on the GPU's memory simultaneously. While the RTX 3090's memory bandwidth of 0.94 TB/s is substantial, it's irrelevant in this scenario because the model's size necessitates offloading significant portions to system RAM or even disk, which are orders of magnitude slower. The RTX 3090's 10496 CUDA cores and 328 Tensor Cores would theoretically provide good compute performance *if* the model fit into memory, but they are bottlenecked by the memory limitations. The Ampere architecture is powerful, but memory capacity is the limiting factor here.
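As a rough check of these figures, the sketch below computes a weights-only memory estimate for a 405B-parameter model at a few precisions (parameter count × bytes per parameter). It deliberately ignores KV cache, activations, and framework overhead, all of which add further to the real requirement.

```python
# Rough, weights-only VRAM estimate: parameters x bytes per parameter.
# Ignores KV cache, activations, and runtime overhead (all add more on top).
PARAMS = 405e9  # Llama 3.1 405B

BYTES_PER_PARAM = {
    "FP16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
}

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9  # 1 GB = 1e9 bytes, matching the figures above
    print(f"{precision}: ~{gb:.1f} GB vs. 24 GB on an RTX 3090")
# FP16: ~810.0 GB, INT8: ~405.0 GB, INT4: ~202.5 GB
```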

Even with aggressive quantization, such as INT4, the model would still require approximately 202.5GB of VRAM, far exceeding the RTX 3090's capacity. This implies that even with extreme optimization techniques like CPU offloading, the performance would be severely hampered by the constant data transfer between the GPU and system memory. The limited VRAM will lead to minimal batch sizes and severely restricted context lengths, resulting in extremely slow token generation speeds, likely making interactive use impractical. The high TDP of the RTX 3090 (350W) will be largely wasted as the GPU spends most of its time waiting for data.
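To see why offloading makes interactive use impractical, a back-of-envelope estimate is to divide the bandwidth of the link the weights must cross by the bytes read per generated token (roughly the full weight size for dense decoding). The bandwidth numbers below are assumed ballpark values, not measurements from this system.

```python
# Back-of-envelope decode speed when the weights don't fit in VRAM:
# each generated token requires streaming (roughly) all weights across the slow link.
WEIGHTS_GB = 202.5  # INT4 weights for 405B parameters

# Assumed ballpark bandwidths in GB/s; actual values depend on the system.
LINKS = {
    "GPU VRAM (if the model fit)": 936.0,   # RTX 3090 GDDR6X
    "PCIe 4.0 x16 from system RAM": 25.0,
    "NVMe SSD swap": 6.0,
}

for link, gbps in LINKS.items():
    tokens_per_sec = gbps / WEIGHTS_GB
    print(f"{link}: ~{tokens_per_sec:.3f} tokens/s upper bound")
# Even the optimistic PCIe case lands around 0.12 tokens/s; NVMe swap is closer to 0.03.
```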

Recommendation

Due to the immense VRAM requirements of the Llama 3.1 405B model, the RTX 3090 is not a suitable GPU for running it directly. Consider using cloud-based inference services that offer access to GPUs with sufficient VRAM, such as NVIDIA A100 or H100 instances. Alternatively, explore smaller Llama 3 models with fewer parameters that can fit within the RTX 3090's VRAM. Fine-tuning a smaller model on a relevant dataset might provide a more practical solution for your specific needs. Finally, if you absolutely must run the 405B model locally, investigate CPU-based inference or distributed inference across multiple GPUs, but be prepared for extremely slow performance.
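To make the "smaller model" suggestion concrete, the same weights-only arithmetic can be used to check which Llama 3 family sizes plausibly fit in 24GB, leaving some margin for KV cache and runtime overhead. The candidate sizes, quantization choices, and headroom value below are illustrative assumptions, not an exhaustive list.

```python
# Which (model size, precision) combinations plausibly fit in an RTX 3090's 24 GB?
# Weights-only estimate; reserve a few GB of headroom for KV cache and overhead.
VRAM_GB = 24.0
HEADROOM_GB = 4.0  # assumed margin for context and runtime overhead

candidates = [
    ("Llama 3.1 8B",   8e9,   "FP16", 2.0),
    ("Llama 3.1 8B",   8e9,   "INT4", 0.5),
    ("Llama 3.1 70B",  70e9,  "INT4", 0.5),
    ("Llama 3.1 405B", 405e9, "INT4", 0.5),
]

for name, params, precision, nbytes in candidates:
    need = params * nbytes / 1e9
    fits = need + HEADROOM_GB <= VRAM_GB
    print(f"{name} @ {precision}: ~{need:.1f} GB -> {'fits' if fits else 'does not fit'}")
# 8B fits comfortably; 70B needs ~35 GB even at INT4, so it still requires offloading.
```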

Recommended Settings

Batch Size: 1
Context Length: Very limited; start with 64 and increase cautiously
Other Settings: CPU offloading; use a fast NVMe SSD for swapping
Inference Framework: llama.cpp (for CPU fallback; see the sketch below)
Suggested Quantization: INT4
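As a rough illustration of how these settings might map onto a llama.cpp run, the sketch below assembles a command line using common llama.cpp flags (-m, -c, -b, -ngl, -t, -p). The binary name and the GGUF file path are assumptions, and flag names vary between llama.cpp releases, so check --help for your build; a Q4 GGUF of the 405B model would itself occupy roughly 200GB on disk.

```python
import subprocess

# Assumed binary and model path; adjust for your llama.cpp build and files.
LLAMA_CLI = "./llama-cli"                      # older builds ship this binary as ./main
MODEL = "llama-3.1-405b-instruct.Q4_K_M.gguf"  # hypothetical INT4 (Q4) GGUF file

cmd = [
    LLAMA_CLI,
    "-m", MODEL,    # model path
    "-c", "64",     # tiny context window, per the settings above
    "-b", "1",      # batch size 1
    "-ngl", "0",    # layers offloaded to the 24GB GPU; raise only if a few layers fit
    "-t", "16",     # CPU threads; set to your physical core count
    "-p", "Hello",  # prompt
]

# Expect extremely slow generation: the weights stream from system RAM and NVMe.
subprocess.run(cmd, check=True)
```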

Frequently Asked Questions

Is Llama 3.1 405B compatible with the NVIDIA RTX 3090?
No, the RTX 3090 does not have enough VRAM to run Llama 3.1 405B effectively.
What VRAM is needed for Llama 3.1 405B?
Llama 3.1 405B requires approximately 810GB of VRAM in FP16. Even with INT4 quantization, it needs around 202.5GB.
How fast will Llama 3.1 405B run on the NVIDIA RTX 3090?
Due to insufficient VRAM, Llama 3.1 405B will run extremely slowly on the RTX 3090, likely making interactive use impractical. Expect very low tokens per second due to constant swapping between GPU and system memory.