Can I run Llama 3 70B (INT8, 8-bit integer) on the AMD RX 7900 XTX?

Fail/OOM
This GPU doesn't have enough VRAM
GPU VRAM: 24.0 GB
Required: 70.0 GB
Headroom: -46.0 GB

VRAM Usage: 24.0 GB of 24.0 GB (100% used)

Technical Analysis

The AMD RX 7900 XTX, with its 24GB of GDDR6 VRAM, falls well short of what Llama 3 70B needs even in INT8 quantized form: the INT8 weights alone occupy roughly 70GB, leaving a 46GB deficit. The full model can therefore never reside in GPU memory at once, and no runtime setting on a single card works around that. The card's 0.96 TB/s of memory bandwidth is substantial, but bandwidth cannot compensate for missing on-device capacity. Performance is further limited by the lack of dedicated Tensor Core-style matrix units; the matrix multiplications that dominate LLM inference run on RDNA 3's general-purpose compute units (via WMMA instructions), which is typically slower than on GPUs with dedicated tensor hardware.
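
The figures above follow directly from parameter count times bytes per parameter. Here is a minimal weights-only sketch of that estimate; KV cache, activations, and framework overhead come on top, so treat the results as lower bounds.

```python
# Weights-only VRAM estimate for a dense LLM: parameters x bytes per parameter.
# KV cache, activations, and runtime overhead are NOT included here,
# so actual requirements are somewhat higher than these lower bounds.

def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1e9 params * bytes each = decimal GB

if __name__ == "__main__":
    for label, bpp in [("FP16", 2.0), ("INT8", 1.0)]:
        need = weight_vram_gb(70, bpp)
        print(f"Llama 3 70B @ {label}: ~{need:.0f} GB needed vs 24 GB available")
```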

Recommendation

Because of this large VRAM shortfall, Llama 3 70B cannot run directly on the RX 7900 XTX without workarounds such as offloading layers to system RAM or using far more aggressive quantization (see the llama.cpp sketch under Recommended Settings below). Layer offloading severely degrades performance, since moving data between system RAM and the GPU is much slower than accessing VRAM. Consider a smaller variant such as Llama 3 8B, which has a far lower VRAM requirement and fits comfortably within the RX 7900 XTX's 24GB. Alternatively, if high performance on the 70B model is critical, explore distributed inference solutions that split the model across multiple GPUs or machines.

Recommended Settings

Batch Size: 1
Context Length: 2048
Inference Framework: llama.cpp
Suggested Quantization: Try Q4_K_M or even lower quantization levels, but…
Other Settings:
- Enable memory mapping (mmap) in llama.cpp to reduce RAM usage.
- Experiment with CPU offloading, but be aware of the performance penalty.
- Reduce the number of layers loaded onto the GPU.
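
As a concrete starting point, the settings above map onto llama-cpp-python roughly as follows. This is a sketch, not a verified configuration: the GGUF file name is hypothetical, and n_gpu_layers must be tuned downward until the partially offloaded model fits in 24GB.

```python
from llama_cpp import Llama

# Sketch of the recommended settings (assumptions: the GGUF file name is
# hypothetical and n_gpu_layers needs per-system tuning to fit in 24 GB).
llm = Llama(
    model_path="llama-3-70b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=20,   # load only some layers onto the 24 GB GPU; the rest stay on CPU
    n_ctx=2048,        # recommended context length
    n_batch=512,       # prompt-processing batch; generation is effectively batch size 1
    use_mmap=True,     # memory-map weights instead of copying them all into RAM
)

out = llm("Summarize why a 70B model may not fit in 24 GB of VRAM.", max_tokens=64)
print(out["choices"][0]["text"])
```

Lowering n_gpu_layers trades speed for fit: every layer left on the CPU is computed from system RAM, which is far slower than running it on the GPU.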

Frequently Asked Questions

Is Llama 3 70B compatible with the AMD RX 7900 XTX?
No, the RX 7900 XTX does not have enough VRAM (24GB) to run Llama 3 70B (70GB INT8) effectively.
What VRAM is needed for Llama 3 70B?
Llama 3 70B requires approximately 140GB of VRAM in FP16 or 70GB in INT8 quantization.
How fast will Llama 3 70B run on the AMD RX 7900 XTX?
Due to insufficient VRAM, direct inference is not possible. If workarounds are implemented (CPU offloading, extreme quantization), expect very slow performance, likely rendering it unusable for real-time applications.
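
For intuition on the expected speed, token generation is typically memory-bandwidth bound: each new token has to read roughly the whole weight set from wherever it lives. A rough sketch follows; the DDR5 bandwidth figure is an assumption, and real offloaded setups also pay PCIe and synchronization costs.

```python
# Rule-of-thumb decode speed for bandwidth-bound inference:
# tokens/s ~= memory bandwidth / bytes of weights read per token.
def rough_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_gb

print(f"70 GB model fully in 960 GB/s VRAM:   ~{rough_tokens_per_second(70, 960):.1f} tok/s")
print(f"70 GB model mostly in ~60 GB/s DDR5:  ~{rough_tokens_per_second(70, 60):.1f} tok/s")
```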