Can I run Llama 3.1 70B (INT8, 8-bit integer) on the NVIDIA RTX 3090 Ti?

Result: Fail/OOM (this GPU doesn't have enough VRAM)

GPU VRAM: 24.0 GB
Required: 70.0 GB
Headroom: -46.0 GB

VRAM usage: 24.0 GB of 24.0 GB (100% used)

Technical Analysis

The NVIDIA RTX 3090 Ti, with 24GB of GDDR6X VRAM and the Ampere architecture, is a powerful GPU, but it falls well short of what Llama 3.1 70B needs even at INT8 quantization. The card offers 1.01 TB/s of memory bandwidth and 10752 CUDA cores, yet the quantized model alone requires roughly 70GB of VRAM, nearly three times the available memory. The full model therefore cannot reside on the GPU, which leads to out-of-memory errors or forces offloading to system RAM, and that offloading drastically reduces performance because transfers between the GPU and system memory are far slower than on-card memory access.

Even with INT8 quantization, which halves the memory footprint relative to FP16, the roughly 70GB VRAM requirement remains far beyond the card's 24GB. The 3090 Ti's 336 Tensor Cores would help accelerate the matrix multiplications at the heart of LLM inference, but they cannot be fully utilized when the model exceeds the available VRAM. Consequently, no tokens-per-second or batch-size estimate is given: without significant modifications or an alternative configuration, the model will simply not run on this GPU.
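As a rough illustration of where the 70GB figure comes from, the sketch below estimates weight memory from the parameter count and bytes per weight; it is back-of-the-envelope only and ignores the KV cache and activation overhead, which add several more gigabytes.

```python
# Back-of-the-envelope VRAM estimate: weights only, no KV cache or activations.
params = 70e9          # Llama 3.1 70B parameter count
gpu_vram_gb = 24.0     # RTX 3090 Ti

for label, bytes_per_weight in [("FP16", 2.0), ("INT8", 1.0)]:
    needed_gb = params * bytes_per_weight / 1e9
    headroom = gpu_vram_gb - needed_gb
    print(f"{label}: ~{needed_gb:.0f} GB needed, headroom {headroom:.0f} GB")

# FP16: ~140 GB needed, headroom -116 GB
# INT8: ~70 GB needed, headroom -46 GB
```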

Recommendation

Given the VRAM limitation, running Llama 3.1 70B directly on the RTX 3090 Ti is not feasible without major compromises. Consider a more aggressive quantization such as Q4_K_S or Q5_K_M (4-bit or 5-bit), if supported by your chosen inference framework, or explore distributed inference across multiple GPUs if that is available to you. If neither option is viable, use a smaller model that fits within the 3090 Ti's VRAM, such as Llama 3.1 8B, or fall back on a cloud-based inference service.
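For a rough sense of whether lower-bit quantization alone closes the gap, the sketch below estimates weight sizes from assumed bits-per-weight figures for the K-quant formats; real GGUF files differ by a few GB, and the KV cache is not included.

```python
# Rough weight sizes for lower-bit quantizations of a 70B model.
# Bits-per-weight values are approximations for Q4_K_S / Q5_K_M.
params = 70e9
for label, bits_per_weight in [("Q4_K_S", 4.5), ("Q5_K_M", 5.5)]:
    gb = params * bits_per_weight / 8 / 1e9
    print(f"{label}: ~{gb:.0f} GB of weights")

# Q4_K_S: ~39 GB, Q5_K_M: ~48 GB -- still larger than 24 GB, which is
# why offloading, multi-GPU, a smaller model, or cloud inference remains relevant.
```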

Another potential avenue is CPU offloading, but be aware that this will significantly reduce inference speed; a fast CPU and ample system RAM are essential if you pursue it. Experiment with inference frameworks such as `llama.cpp`, which provide a range of quantization and offloading options you can tune for your specific setup.
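As a minimal sketch of that setup, assuming the llama-cpp-python bindings and a locally downloaded GGUF file (the file name and layer count below are placeholders to adjust for your own system), partial offloading looks roughly like this:

```python
from llama_cpp import Llama

# Hypothetical local GGUF file; point this at your own download.
llm = Llama(
    model_path="llama-3.1-70b-instruct.Q4_K_S.gguf",
    n_gpu_layers=30,   # layers kept on the 3090 Ti; raise until VRAM is nearly full
    n_ctx=4096,        # reduced context length to shrink the KV cache
)

out = llm("Explain INT8 quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Whatever layers are not assigned to the GPU run on the CPU, so expect throughput far below an all-GPU configuration; this is a way to run the model at all, not to run it fast.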

Recommended Settings

Batch size: 1 (adjust based on available VRAM after quantization)
Context length: reduce to a smaller context length to save memory
Inference framework: llama.cpp
Suggested quantization: Q4_K_S or Q5_K_M
Other settings: enable GPU layers in llama.cpp; experiment with CPU offloading as a last resort

Frequently Asked Questions

Is Llama 3.1 70B compatible with the NVIDIA RTX 3090 Ti?
No, not directly. The 24GB VRAM of the RTX 3090 Ti is insufficient to hold the 70GB INT8 quantized Llama 3.1 70B model.
What VRAM is needed for Llama 3.1 70B?
Llama 3.1 70B requires approximately 140GB VRAM in FP16 or 70GB VRAM in INT8 quantization.
How fast will Llama 3.1 70B run on the NVIDIA RTX 3090 Ti?
Without significant modifications like aggressive quantization or offloading, it will likely not run due to insufficient VRAM. If modifications are applied, performance will be severely limited by the need to swap data between system RAM and GPU memory.