The primary limiting factor for running large language models (LLMs) like Llama 3 70B is VRAM capacity. The AMD RX 7900 XTX, with its 24GB of GDDR6 VRAM, falls well short of the 40GB-plus needed to load the Q4_K_M quantized version of the model. Quantization substantially reduces the memory footprint compared to the unquantized FP16 weights (roughly 140GB), but the 4-bit variant of a 70B model still does not fit in 24GB. Memory bandwidth, while substantial at 0.96 TB/s, only becomes the limiting factor once the model is resident in VRAM, so it is not the immediate issue here. The RX 7900 XTX also lacks dedicated matrix units comparable to NVIDIA's Tensor Cores; RDNA 3 accelerates matrix math through WMMA instructions running on the shader cores, which generally yields lower inference throughput than GPUs with dedicated AI acceleration hardware. The architecture offers strong general-purpose compute, but it is not optimized for AI workloads to the same degree.
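A rough back-of-the-envelope calculation makes the gap concrete. The sketch below is a minimal estimate, assuming memory use is dominated by the quantized weights plus a small flat overhead for the KV cache and runtime buffers; the bits-per-weight figures are approximations, not exact values for any particular GGUF file.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: quantized weights plus a flat overhead
    for KV cache, activations, and runtime buffers (an assumption)."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# Approximate bits-per-weight for common formats (assumed values).
formats = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}

for name, bpw in formats.items():
    need = estimate_vram_gb(70, bpw)
    fits = "fits" if need <= 24 else "does not fit"
    print(f"Llama 3 70B @ {name}: ~{need:.0f} GB -> {fits} in 24 GB")
```

Even with aggressive 4-bit quantization, the estimate lands well above the card's 24GB, which is why the conclusions below follow.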
Due to the VRAM limitation, running Llama 3 70B on the RX 7900 XTX is not directly feasible without significant workarounds. Consider a smaller model variant such as Llama 3 8B, which fits comfortably within the available VRAM even at higher-precision quantizations. Alternatively, offload some of the layers to system RAM and run them on the CPU; this keeps the 70B model usable, but inference speed drops sharply because the offloaded layers are bound by system memory bandwidth. Another option is distributed inference across multiple GPUs, although this requires a more complex setup and specialized software. If the 70B model is essential, upgrading to a GPU (or multi-GPU configuration) with more total VRAM is the most straightforward solution.
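As a concrete illustration of the layer-offloading workaround, here is a minimal sketch using llama-cpp-python, assuming a ROCm/HIP build of llama.cpp and a locally downloaded GGUF file; the model path and layer count are placeholders to be tuned for your system.

```python
from llama_cpp import Llama

# Hypothetical local path to a Q4_K_M GGUF of Llama 3 70B (assumption).
MODEL_PATH = "./models/llama-3-70b-instruct.Q4_K_M.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=35,   # offload only as many layers as fit in 24 GB;
                       # the rest stay in system RAM and run on the CPU
    n_ctx=4096,        # context window; larger values grow the KV cache
)

out = llm("Explain VRAM vs. system RAM in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Expect throughput to fall sharply as more layers run on the CPU; in practice, the speed you get scales mostly with how many layers remain resident in VRAM.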