The AMD RX 7900 XTX, while a powerful GPU with 24GB of GDDR6 VRAM and 0.96 TB/s of memory bandwidth, falls short of the VRAM required to run Llama 3 70B quantized to Q3_K_M. That quantization brings the model's footprint down to roughly 28GB, which still exceeds the card's capacity by about 4GB. The RDNA 3 architecture also lacks dedicated matrix-multiply units comparable to NVIDIA's Tensor Cores: its WMMA instructions run on the standard shader ALUs, limiting throughput on the matrix multiplications at the heart of LLM inference. Because the weights cannot fit in VRAM, the model will fail to load completely, and inference will not run without significant modifications.
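As a rough sanity check, the weights-only footprint can be estimated from parameter count and average bits per weight. The sketch below is a back-of-the-envelope calculation under stated assumptions, not a measurement: the 3.2 bits/weight rate for Q3_K_M is inferred from the 28GB figure above, the other rates are nominal block sizes, and a real load adds KV cache and runtime overhead on top of the weights.

```python
# Back-of-the-envelope estimate of quantized weight size vs. available VRAM.
# Bits-per-weight values are rough assumptions, not exact GGUF file sizes:
# real files mix tensor types, so effective rates vary by model.

def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9   # Llama 3 70B
VRAM_GB = 24.0    # RX 7900 XTX

for name, bpw in [("Q2_K", 2.6), ("Q3_K_M", 3.2), ("Q4_0", 4.5)]:
    size = weight_footprint_gb(N_PARAMS, bpw)
    verdict = "fits" if size <= VRAM_GB else f"over by {size - VRAM_GB:.1f} GB"
    print(f"{name:8s} ~{size:.1f} GB -> {verdict}")
```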
Given the VRAM limitation, running Llama 3 70B on the RX 7900 XTX requires either more aggressive quantization or offloading layers to system RAM. A lower quantization level such as Q2_K (or one of llama.cpp's IQ2 variants) shrinks the footprint further, at a real cost in accuracy; Q4_0, by contrast, is a higher-precision format than Q3_K_M and would need more VRAM, not less. Alternatively, split the model between the GPU and system RAM, as sketched below: layers left in system RAM are computed on the CPU, which is far slower than the GPU, so partial offload significantly reduces inference speed. The simplest option is a smaller model such as Llama 3 8B, which fits comfortably within the 24GB of the RX 7900 XTX.
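As one concrete way to attempt the GPU/RAM split, the sketch below uses the llama-cpp-python bindings' n_gpu_layers parameter to keep a subset of layers in VRAM and run the remainder on the CPU. The model path and the layer count of 50 are placeholder assumptions to tune for a specific GGUF file; this is a minimal sketch assuming a ROCm/HIP-enabled llama.cpp build, not a benchmarked configuration.

```python
# Minimal sketch: partial GPU offload with llama-cpp-python.
# Assumes llama.cpp was built with HIP/ROCm support for the RX 7900 XTX
# and that the model path points at a real local GGUF file (placeholder here).
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3-70B-Instruct.Q3_K_M.gguf",  # hypothetical path
    n_gpu_layers=50,  # layers kept in VRAM; reduce if loading runs out of memory
    n_ctx=4096,       # context window; the KV cache also consumes VRAM
)

out = llm("Explain partial GPU offload in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Each layer moved off the GPU frees VRAM but shifts its computation to the much slower CPU path, so in practice n_gpu_layers is raised until the model no longer loads and then backed off.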