Can I run Llama 3 70B (INT8, 8-bit integer) on an NVIDIA RTX 3090?

Result: Fail/OOM (this GPU does not have enough VRAM)
GPU VRAM: 24.0 GB
Required: 70.0 GB
Headroom: -46.0 GB

VRAM Usage: 100% of the 24.0 GB available (requirement exceeds capacity)

Technical Analysis

The NVIDIA RTX 3090, with its 24 GB of GDDR6X VRAM, falls far short of the memory required to run Llama 3 70B, even in its INT8 quantized form. At one byte per parameter, the 70-billion-parameter model needs roughly 70 GB for the weights alone, before the KV cache and activations are counted. The card's Ampere architecture, with 10,496 CUDA cores, 328 Tensor cores, and 0.94 TB/s of memory bandwidth, is more than capable of accelerating the computation, but none of that helps when the weights cannot reside in VRAM. Offloading most of the model to system RAM is technically possible, but throughput then becomes limited by PCIe and system-memory bandwidth rather than the GPU itself, causing a drastic performance drop. Without enough VRAM, the full model cannot be loaded onto the GPU, and efficient inference is impossible.
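To make the 70 GB figure concrete, here is a minimal back-of-envelope sketch (not tied to any particular tool): weight memory is simply parameter count times bytes per parameter, so 70 billion parameters at one byte each (INT8) comes to roughly 70 GB before the KV cache and activations are added. The helper name and model list below are illustrative only.

```python
# Back-of-envelope VRAM estimate: weights only.
# KV cache, activations, and runtime buffers add more on top of this;
# they are omitted here to keep the sketch simple.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params_billions: float, quant: str) -> float:
    """GB needed just to hold the quantized weights (billions of params * bytes/param)."""
    return n_params_billions * BYTES_PER_PARAM[quant]

if __name__ == "__main__":
    gpu_vram_gb = 24.0  # RTX 3090
    for model, size_b in [("Llama 3 8B", 8.0), ("Llama 3 70B", 70.0)]:
        need = weight_memory_gb(size_b, "int8")
        verdict = "fits" if need <= gpu_vram_gb else "does not fit"
        print(f"{model} @ INT8: ~{need:.0f} GB of weights -> {verdict} in {gpu_vram_gb:.0f} GB")
```

The same arithmetic shows why the 8B variant fits comfortably on this card while the 70B variant cannot, regardless of compute capability.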

Recommendation

Due to the VRAM limitations of the RTX 3090, running Llama 3 70B directly on this GPU is not feasible. Consider a smaller variant such as Llama 3 8B, which fits comfortably within 24 GB of VRAM. Alternatively, explore cloud-based inference services or platforms that offer access to GPUs with sufficient memory. Distributed inference across multiple GPUs is another option, but it requires significant technical expertise and infrastructure. If you are committed to running Llama 3 70B locally, consider upgrading to a GPU with significantly more VRAM (48 GB or more).

Recommended Settings

Batch Size: N/A (model will not load)
Context Length: N/A (model will not load)
Other Settings:
- Experiment with CPU offloading via llama.cpp, but expect very slow performance (see the sketch after this list).
- Consider a smaller model such as Llama 3 8B.
Inference Framework: llama.cpp (for CPU offloading experimentation)
Quantization Suggested: No other quantization level will allow the model to fit in the RTX 3090's 24 GB of VRAM.
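For anyone who still wants to experiment with CPU offloading, the sketch below uses the llama-cpp-python bindings (one possible front end to llama.cpp) to load a GGUF file with only part of the model placed on the GPU. The model path and the n_gpu_layers value are placeholders; on a 24 GB card most of a 70B model stays on the CPU, so expect very slow generation.

```python
# Illustrative CPU-offload experiment with llama-cpp-python.
# The model path and n_gpu_layers value are placeholders; lower n_gpu_layers
# until loading no longer runs out of VRAM on the 24 GB RTX 3090.
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/llama-3-70b-instruct.Q8_0.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,   # offload only some layers to the GPU; the rest run on CPU
    n_ctx=2048,        # modest context to limit KV-cache memory
    verbose=False,
)

out = llm("Explain why this model barely fits on a 24 GB GPU.", max_tokens=64)
print(out["choices"][0]["text"])
```

Reducing n_gpu_layers trades generation speed for the VRAM headroom needed to avoid an out-of-memory failure at load time.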

Frequently Asked Questions

Is Llama 3 70B compatible with the NVIDIA RTX 3090?
No, the RTX 3090 does not have enough VRAM to run Llama 3 70B, even with INT8 quantization.
What VRAM is needed for Llama 3 70B?
Llama 3 70B requires approximately 70 GB of VRAM for the weights alone when quantized to INT8, before the KV cache and activations are counted.
How fast will Llama 3 70B run on the NVIDIA RTX 3090?
It will not run in the RTX 3090's VRAM alone; without CPU offloading, no tokens can be generated, and with offloading, generation is very slow.