Can I run Llama 3 70B (q3_k_m) on AMD RX 7900 XTX?

Fail/OOM: This GPU doesn't have enough VRAM

GPU VRAM: 24.0 GB
Required: 28.0 GB
Headroom: -4.0 GB

VRAM Usage: 100% used (24.0 GB of 24.0 GB)

Technical Analysis

The AMD RX 7900 XTX, while a powerful GPU with 24 GB of GDDR6 VRAM and 0.96 TB/s of memory bandwidth, falls short of the VRAM required to run Llama 3 70B quantized to q3_k_m. This quantization brings the model's weight footprint down to roughly 28 GB, which still exceeds the GPU's capacity by about 4 GB. The RDNA 3 architecture also lacks dedicated matrix units comparable to NVIDIA's Tensor Cores, which limits peak throughput for the matrix multiplications at the heart of LLM inference. The VRAM shortfall means the model cannot be loaded in full, so inference will fail without significant modifications.
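As a rough sanity check on these figures, the weight footprint can be approximated as parameters times effective bits per weight divided by eight. The sketch below is just that arithmetic, not a measurement: real GGUF files mix quantization types per tensor, and inference also needs room for the KV cache and scratch buffers.

# Back-of-the-envelope weight footprint estimate (Python).
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate weight size in GB as parameters * bits per weight / 8."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# The figures quoted on this page correspond to 16 bits (FP16) and
# roughly 3.2 effective bits per weight for the q3_k_m file:
print(weight_footprint_gb(70, 16.0))   # ~140 GB at FP16
print(weight_footprint_gb(70, 3.2))    # ~28 GB, versus 24 GB of VRAM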

Recommendation

Given the VRAM shortfall, running Llama 3 70B on the RX 7900 XTX requires either more aggressive quantization or offloading part of the model to system RAM. Consider a lower quantization level such as Q2_K or one of the smaller IQ2 variants (note that Q4_0 is actually larger than q3_k_m and would not help), accepting a noticeable loss in output quality. Alternatively, split the model between the GPU and system RAM by keeping only some layers in VRAM, though this significantly reduces inference speed because the layers left in system RAM run on the much slower CPU path (see the sketch below). Another option is to use a smaller model, such as Llama 3 8B, which fits comfortably within the RX 7900 XTX's 24 GB of VRAM.
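For the offloading route, here is a minimal sketch using the llama-cpp-python bindings (an assumption; the page only names llama.cpp). The model file name and the layer count are placeholders to tune against the VRAM actually available.

# Partial GPU offload sketch with llama-cpp-python (assumed bindings).
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-70b-instruct.Q2_K.gguf",  # hypothetical local file name
    n_gpu_layers=60,   # layers kept in VRAM; lower this if loading fails or OOMs
    n_ctx=2048,        # a shorter context keeps the KV cache small
)

out = llm("Explain why partial GPU offload is slower than full offload.", max_tokens=64)
print(out["choices"][0]["text"])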

Recommended Settings

Batch Size: 1
Context Length: 2048
Inference Framework: llama.cpp
Suggested Quantization: Q2_K or a smaller IQ2 variant (Q4_0 is larger than q3_k_m and will not fit)
Other Settings:
- Set --gpu-layers as high as possible without exceeding VRAM
- Experiment with different quantization methods to balance performance and accuracy
- Monitor VRAM usage (e.g. with rocm-smi) to confirm the model fits in the available memory
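As a rough illustration of the "--gpu-layers as high as possible" advice, the sketch below (again assuming the llama-cpp-python bindings and a hypothetical model file) starts with a high offload count and backs off until the model loads. Note that a hard out-of-memory condition can abort the process instead of raising a Python exception, so watching rocm-smi while tuning remains worthwhile.

# Back-off loop to find the highest GPU layer count that still loads.
from llama_cpp import Llama

def load_with_max_gpu_layers(path: str, start: int = 80, step: int = 8) -> Llama:
    """Try progressively fewer GPU-resident layers until the model loads."""
    for n_layers in range(start, -1, -step):
        try:
            return Llama(
                model_path=path,
                n_gpu_layers=n_layers,  # layers kept in VRAM; the rest run on the CPU
                n_ctx=2048,             # recommended context length from the table above
            )
        except (ValueError, RuntimeError):
            print(f"Load failed with n_gpu_layers={n_layers}; retrying with fewer")
    raise RuntimeError("Model could not be loaded even with all layers on the CPU")

llm = load_with_max_gpu_layers("llama-3-70b-instruct.Q2_K.gguf")  # hypothetical file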

Frequently Asked Questions

Is Llama 3 70B (70.00B) compatible with AMD RX 7900 XTX?
No, not without more aggressive quantization or CPU offloading. The RX 7900 XTX has insufficient VRAM to load the Llama 3 70B model, even when quantized to q3_k_m.
What VRAM is needed for Llama 3 70B (70.00B)?
The VRAM needed depends on the quantization level. At FP16 the weights alone require about 140 GB; quantized to q3_k_m, roughly 28 GB. Lower quantization levels reduce this further, and the KV cache and runtime overhead add to these figures.
How fast will Llama 3 70B (70.00B) run on AMD RX 7900 XTX?
Due to the VRAM limitations, running the model entirely on the GPU is not feasible. If it does run with aggressive quantization and CPU offloading, expect speeds well below those of a GPU with sufficient VRAM, likely in the range of 1-3 tokens per second.