Can I run Llama 3.1 70B (q3_k_m) on NVIDIA RTX 3090 Ti?

Fail/OOM
This GPU doesn't have enough VRAM.

GPU VRAM: 24.0GB
Required: 28.0GB
Headroom: -4.0GB

VRAM Usage: 100% used (24.0GB of 24.0GB)

Technical Analysis

The NVIDIA RTX 3090 Ti, with its 24GB of GDDR6X VRAM and 1.01 TB/s of memory bandwidth, is a powerful GPU, but it falls short of the VRAM required to run Llama 3.1 70B even in its quantized q3_k_m format. This quantization reduces the model's footprint to approximately 28GB, exceeding the 3090 Ti's available 24GB by 4GB. While the Ampere architecture, with its 10752 CUDA cores and 336 Tensor cores, is well suited to AI inference, the VRAM limitation is a hard constraint. Running out of VRAM leads to errors, crashes, or extremely slow performance as the system resorts to swapping data between the GPU and the much slower system RAM.
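As a rough sanity check, the required figure above can be reproduced with simple arithmetic: multiply the parameter count by the effective bits per weight and divide by eight. The sketch below is a back-of-envelope estimate, not a measurement; the ~3.2 bits per weight is an assumed effective rate chosen to match the ~28GB figure quoted here, and it ignores KV-cache and runtime overhead, which only widen the shortfall.

```python
# Back-of-envelope VRAM estimate for a quantized model's weights.
# Assumption: ~3.2 effective bits per weight reproduces the ~28GB figure
# quoted above for Llama 3.1 70B in q3_k_m; real GGUF sizes vary by
# quantization scheme and exclude KV-cache overhead.

def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM needed just for the model weights, in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

if __name__ == "__main__":
    required = estimate_weight_vram_gb(70.0, 3.2)   # ~28.0 GB
    available = 24.0                                 # RTX 3090 Ti VRAM
    print(f"Required:  {required:.1f} GB")
    print(f"Available: {available:.1f} GB")
    print(f"Headroom:  {available - required:.1f} GB")  # negative -> OOM
```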

Recommendation

Unfortunately, running Llama 3.1 70B in q3_k_m on a single RTX 3090 Ti is not feasible due to insufficient VRAM. Consider a more aggressive quantization, such as Q2_K or one of llama.cpp's IQ2/IQ1 i-quants, although this comes at the cost of reduced accuracy. Alternatively, explore distributed inference solutions that split the model across multiple GPUs or machines. Another option is a smaller model, such as Llama 3.1 8B, which fits comfortably within the 3090 Ti's 24GB. Cloud-based inference services are a further alternative, letting you run the full model without hardware constraints, albeit at a per-use cost.
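To make the smaller-model route concrete, here is a minimal sketch using the llama-cpp-python bindings. The GGUF filename is a placeholder; an 8B model at q4_k_m is roughly 5GB of weights and fits comfortably in 24GB with every layer on the GPU.

```python
# Minimal sketch of the "use a smaller model" option via llama-cpp-python.
# The model_path below is a placeholder, not a real file shipped with this page.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=4096,
)

out = llm("Explain KV-cache memory usage in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```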

Recommended Settings

Batch Size: 1 (test and increase if possible)
Context Length: Reduce to the minimum acceptable length to conserve VRAM
Other Settings: Enable memory mapping (mmap) in llama.cpp; use CPU offloading only if absolutely necessary (very slow)
Inference Framework: llama.cpp
Suggested Quantization: Q2_K or an IQ2/IQ1 i-quant (if the accuracy loss is acceptable)
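These settings map directly onto llama-cpp-python constructor arguments. The sketch below is illustrative rather than a working recipe: the model path is a placeholder, and the n_gpu_layers split is a guess that would need tuning, since only part of a ~28GB model can live in 24GB and the layers left on the CPU will run very slowly.

```python
# The recommended settings expressed as llama-cpp-python arguments.
# Llama 3.1 70B has 80 transformer layers; at q3_k_m only a fraction fit
# in 24GB, so most of the work falls back to the CPU and throughput is low.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-70B-Instruct-Q3_K_M.gguf",  # placeholder path
    n_gpu_layers=40,    # partial offload; tune down if CUDA reports OOM
    n_ctx=2048,         # keep the context as small as your use case allows
    n_batch=1,          # start with batch size 1, increase only if stable
    use_mmap=True,      # memory-map the GGUF instead of copying it into RAM
)
```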

Frequently Asked Questions

Is Llama 3.1 70B compatible with the NVIDIA RTX 3090 Ti?
No, the RTX 3090 Ti does not have enough VRAM to run Llama 3.1 70B even in q3_k_m quantization.
What VRAM is needed for Llama 3.1 70B?
A q3_k_m quantized version of Llama 3.1 70B requires approximately 28GB of VRAM.
How fast will Llama 3.1 70B run on the NVIDIA RTX 3090 Ti?
Due to insufficient VRAM, Llama 3.1 70B is unlikely to run at a usable speed on the RTX 3090 Ti. Expect errors or extremely slow performance due to memory swapping.
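If you do attempt a partially offloaded run, a quick timing check makes "usable speed" measurable. The snippet below assumes an `llm` instance created as in the Recommended Settings sketch above; single-digit tokens per second (or worse) is typical once layers spill over to the CPU.

```python
# Time a short generation and report tokens per second.
# Assumes "llm" is a llama_cpp.Llama instance created as sketched earlier.
import time

start = time.perf_counter()
out = llm("Summarise the plot of Hamlet in two sentences.", max_tokens=128)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tok/s")
```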