Can I run Llama 3.1 70B (INT8, 8-bit integer) on the NVIDIA RTX 3090?

Result: Fail/OOM. This GPU doesn't have enough VRAM.

GPU VRAM: 24.0 GB
Required: 70.0 GB
Headroom: -46.0 GB

VRAM Usage: 100% used (24.0 GB of 24.0 GB)

Technical Analysis

The NVIDIA RTX 3090, while a powerful GPU, falls short of running the full Llama 3.1 70B model, even with INT8 quantization. The primary bottleneck is VRAM: at 8 bits per parameter, the 70B weights alone occupy roughly 70GB, before accounting for the KV cache and activations. The RTX 3090 offers only 24GB, leaving a deficit of about 46GB, so the model and its intermediate computations cannot fit in GPU memory and the model will fail to load. Memory bandwidth, while substantial at ~0.94 TB/s, is irrelevant if the data cannot reside on the GPU in the first place, and the Ampere architecture's CUDA and Tensor cores cannot be utilized for weights that never reach the card. Consequently, direct inference is not feasible without significant adjustments.
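
The arithmetic behind these numbers is simple. Below is a minimal back-of-the-envelope sketch: it counts weights only, treats 1 GB as 1e9 bytes, and ignores the KV cache, activations, and framework overhead, so real usage will be somewhat higher.

```python
# Weights-only VRAM estimate for a dense transformer.
# Approximation: excludes KV cache, activations, and runtime overhead.

def weight_vram_gb(params_billions: float, bits_per_param: float) -> float:
    """Return weights-only memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * (bits_per_param / 8) / 1e9

GPU_VRAM_GB = 24.0  # NVIDIA RTX 3090

for label, bits in [("FP16", 16), ("INT8", 8), ("~Q4_K_M", 4.5)]:
    need = weight_vram_gb(70, bits)
    print(f"{label:>8}: {need:6.1f} GB needed, "
          f"headroom {GPU_VRAM_GB - need:+.1f} GB")

# e.g. INT8 ->  70.0 GB needed, headroom -46.0 GB (does not fit)
```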

Recommendation

Given the VRAM constraint, running the full Llama 3.1 70B model directly on the RTX 3090 is impractical. Consider offloading some layers to system RAM (CPU) using libraries like `llama.cpp` with appropriate flags, accepting that this drastically reduces inference speed. Alternatively, use a smaller Llama 3.1 model (e.g., the 8B variant) or a cloud-based GPU service with sufficient VRAM, such as those offered by NelsaHost. Quantizing to a lower bit precision like Q4_K_M reduces VRAM usage further (to roughly 40GB, still more than the card holds) at some cost in accuracy. Another option is to split the model across multiple GPUs if available, but this requires advanced setup and specialized software.
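
To get a feel for how much of the model a partial offload can keep on the GPU, here is a rough sketch. It assumes Llama 3.1 70B's 80 transformer blocks, a uniform split of the ~70GB of INT8 weights across them, and an arbitrary 4 GB reserve for the KV cache; embeddings and the output head are ignored.

```python
# Rough estimate of how many transformer layers fit in 24 GB when the
# remainder is offloaded to system RAM. Approximation only.

TOTAL_MODEL_GB = 70.0   # Llama 3.1 70B weights at INT8
NUM_LAYERS = 80         # transformer blocks in Llama 3.1 70B
GPU_VRAM_GB = 24.0      # RTX 3090
RESERVE_GB = 4.0        # assumed reserve for KV cache / activations

per_layer_gb = TOTAL_MODEL_GB / NUM_LAYERS              # ~0.88 GB per layer
gpu_layers = int((GPU_VRAM_GB - RESERVE_GB) // per_layer_gb)
print(f"~{per_layer_gb:.2f} GB per layer; roughly {gpu_layers} of "
      f"{NUM_LAYERS} layers fit on the GPU, the rest run from system RAM")
```

With only about a quarter of the layers resident on the GPU, most of each forward pass executes on the CPU from system RAM, which is why throughput drops so sharply.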

Recommended Settings

Batch Size: 1 (or very small, depending on available RAM after offloading)
Context Length: Reduce to the smallest acceptable length (e.g., 2048 tokens)
Other Settings: Enable memory mapping (mmap) in llama.cpp; optimize CPU usage for offloaded layers; monitor system RAM usage closely
Inference Framework: llama.cpp (with CPU offloading) or vLLM (for multi-GPU setups)
Suggested Quantization: Q4_K_M (if necessary after CPU offloading)
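
A minimal sketch of how these settings might be applied through the llama-cpp-python bindings. The GGUF path is a placeholder, and `n_gpu_layers=20` is an assumption you would tune to whatever actually fits alongside the KV cache on your system.

```python
# Hedged sketch: applying the recommended settings via llama-cpp-python.
# Assumes a Q4_K_M GGUF file is already downloaded; the path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,        # small context to limit KV-cache memory
    n_gpu_layers=20,   # offload only what fits in 24 GB; the rest stays in RAM
    use_mmap=True,     # memory-map the weights instead of copying them
)

# Single-prompt usage corresponds to the batch size of 1 recommended above.
out = llm("Summarize why a 70B model does not fit in 24 GB of VRAM.",
          max_tokens=128)
print(out["choices"][0]["text"])
```

Even with this setup, expect very low tokens/second, since most layers execute on the CPU.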

Frequently Asked Questions

Is Llama 3.1 70B compatible with the NVIDIA RTX 3090?
No, the full Llama 3.1 70B model is not directly compatible with the NVIDIA RTX 3090 due to insufficient VRAM.
How much VRAM does Llama 3.1 70B need?
Llama 3.1 70B requires approximately 70GB of VRAM when quantized to INT8 (1 byte per parameter) and roughly 140GB at FP16 (2 bytes per parameter), for the weights alone.
How fast will Llama 3.1 70B run on the NVIDIA RTX 3090?
Due to the VRAM limitation, direct inference is not possible. With CPU offloading, performance will be significantly slower than on a GPU with sufficient VRAM, likely yielding very low tokens/second. Using a cloud GPU or a smaller model is highly recommended.