Can I run Llama 3 70B (INT8, 8-bit integer) on an NVIDIA RTX 3090 Ti?

Result: Fail/OOM (this GPU doesn't have enough VRAM)
GPU VRAM: 24.0 GB
Required: 70.0 GB
Headroom: -46.0 GB

VRAM Usage: 100% of 24.0 GB (requirement exceeds capacity)

Technical Analysis

The NVIDIA RTX 3090 Ti, with its 24GB of GDDR6X VRAM, falls far short of what Llama 3 70B requires even in INT8 quantized form. At 8 bits per weight, the 70 billion parameters alone occupy roughly 70GB, before counting the KV cache and activations. That leaves a 46GB shortfall: the model's weights cannot be loaded into GPU memory, so attempting to run it directly on the RTX 3090 Ti will fail with out-of-memory errors. The RTX 3090 Ti's high memory bandwidth of 1.01 TB/s and its large complement of CUDA and Tensor cores do not help here, because those resources only matter once the model fits in VRAM.
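As a rough back-of-the-envelope check, the weights-only estimate below reproduces the numbers above (a minimal sketch; the `weights_vram_gb` helper is illustrative, not the calculator's actual formula, and it ignores KV cache and activation overhead):

```python
# Back-of-the-envelope VRAM estimate: model weights only.
# KV cache and activations add several more GB on top of this.

def weights_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    # params (in billions) * bytes per weight ~= gigabytes of weights
    return params_billions * bits_per_weight / 8

required = weights_vram_gb(70, 8)    # Llama 3 70B at INT8 -> 70.0 GB
available = 24.0                     # RTX 3090 Ti VRAM
print(f"Required ~{required:.1f} GB, available {available:.1f} GB, "
      f"headroom {available - required:.1f} GB")
# Required ~70.0 GB, available 24.0 GB, headroom -46.0 GB
```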

Recommendation

Given the VRAM limitation, running Llama 3 70B on a single RTX 3090 Ti is not feasible. Consider alternatives such as model parallelism across multiple GPUs, CPU offloading (which drastically reduces performance), or cloud GPU instances with sufficient VRAM (e.g., A100, H100). Another option is a smaller model or more aggressive quantization (e.g., 4-bit), which reduces the VRAM footprint at some cost to accuracy and output quality; note that even at 4 bits, the 70B weights still occupy roughly 35GB, so partial CPU offloading would still be needed on a 24GB card. If you proceed with CPU offloading, expect a significant drop in tokens per second.
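To illustrate why a smaller model or more aggressive quantization is the practical path, here is a quick comparison using the same weights-only estimate as above (a sketch; the candidate list and the simple GB-per-parameter arithmetic are illustrative, and real quantized files also carry metadata plus a separate KV cache):

```python
def weights_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

GPU_VRAM_GB = 24.0
candidates = [
    ("Llama 3 70B", 70, 8),   # INT8
    ("Llama 3 70B", 70, 4),   # ~4-bit quantization
    ("Llama 3 8B",  8, 8),
    ("Llama 3 8B",  8, 4),
]
for name, params, bits in candidates:
    need = weights_vram_gb(params, bits)
    verdict = "fits" if need < GPU_VRAM_GB else "does not fit"
    print(f"{name} @ {bits}-bit: ~{need:.0f} GB -> {verdict} in {GPU_VRAM_GB:.0f} GB")
# 70B does not fit at 8-bit (~70 GB) or 4-bit (~35 GB); 8B fits comfortably.
```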

Recommended Settings

Batch Size: 1
Context Length: 2048 (adjust based on available resources after quantization)
Other Settings: enable memory offloading, optimize attention mechanisms, use a smaller model variant
Inference Framework: llama.cpp or ExLlamaV2
Quantization Suggested: 4-bit quantization (if feasible) or consider smaller models
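The settings above could be wired up through the llama-cpp-python bindings roughly as follows (a sketch, not a verified configuration: the GGUF filename and n_gpu_layers value are placeholders, and mapping the "Batch Size: 1" suggestion onto llama.cpp's prompt batch size is an assumption):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-70b-instruct.Q4_K_M.gguf",  # hypothetical 4-bit GGUF file
    n_ctx=2048,        # recommended context length
    n_batch=1,         # smallest prompt batch to minimize memory overhead
    n_gpu_layers=40,   # partial offload: lower this until weights + KV cache fit
                       # in 24 GB; remaining layers run on the CPU (slow)
)

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```

Expect throughput to be dominated by the CPU-resident layers; tuning n_gpu_layers is a trade-off between what fits in VRAM and how much work stays on the GPU.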

Frequently Asked Questions

Is Llama 3 70B (70.00B) compatible with NVIDIA RTX 3090 Ti?
No, the RTX 3090 Ti's 24GB VRAM is insufficient for Llama 3 70B, even with INT8 quantization.
What VRAM is needed for Llama 3 70B (70.00B)?
Llama 3 70B requires approximately 70GB of VRAM when quantized to INT8.
How fast will Llama 3 70B (70.00B) run on NVIDIA RTX 3090 Ti?
It will not run directly due to insufficient VRAM. Workarounds like CPU offloading will severely limit performance.