The primary limiting factor in running LLaVA 1.6 34B on an AMD RX 7900 XTX is VRAM. In FP16 (half-precision floating point) format, LLaVA 1.6 34B requires approximately 68GB of VRAM just to hold the model weights, while the RX 7900 XTX has only 24GB. The model therefore cannot be loaded onto the GPU in its entirety, yielding a 'FAIL' verdict for direct compatibility: the 44GB VRAM deficit prevents even basic inference.
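The FP16 figures above follow from a simple rule of thumb: two bytes per parameter, plus whatever headroom activations and the KV cache need. A minimal sketch of the check (weights only, ignoring runtime overhead):

```python
# Back-of-envelope VRAM check for FP16 inference: 2 bytes per parameter.
# Weights only; a real run also needs activations and KV cache on top.
PARAMS_B = 34          # LLaVA 1.6 34B parameter count, in billions
BYTES_PER_PARAM = 2    # FP16
GPU_VRAM_GB = 24       # RX 7900 XTX

weights_gb = PARAMS_B * BYTES_PER_PARAM       # 68 GB
deficit_gb = weights_gb - GPU_VRAM_GB         # 44 GB short
fits = weights_gb <= GPU_VRAM_GB

print(f"FP16 weights: {weights_gb} GB, deficit: {deficit_gb} GB, fits: {fits}")
```

Since the weights alone are nearly three times the card's capacity, no amount of runtime tuning closes the gap at FP16.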
While the RX 7900 XTX offers substantial memory bandwidth (roughly 0.96 TB/s), that is moot here because the model cannot be loaded at all. Even if the weights were squeezed into VRAM via aggressive quantization, the RX 7900 XTX lacks NVIDIA-style dedicated Tensor Cores (RDNA 3 instead provides WMMA-based AI Accelerators, with less mature software support in inference frameworks), so throughput would likely trail comparable NVIDIA GPUs. Without optimization techniques such as quantization or offloading layers to system RAM, running this model directly on the RX 7900 XTX is not feasible; given the model's size and the limited VRAM, expect extremely slow or non-functional operation without significant modification.
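To see why bandwidth would matter if the model did fit: single-stream token generation is typically memory-bandwidth bound, since each generated token streams the full weight set from VRAM once. That gives a theoretical throughput ceiling of bandwidth divided by weight size. A rough sketch (ceilings only; real throughput is lower due to compute, KV-cache reads, and overhead):

```python
# Theoretical decode-speed ceiling for a memory-bandwidth-bound workload:
# tokens/s <= (memory bandwidth) / (bytes read per token ~= weight size).
BANDWIDTH_GBPS = 960        # RX 7900 XTX, ~0.96 TB/s
weights_gb_fp16 = 34 * 2    # 68 GB at 2 bytes/param (doesn't fit in 24 GB)
weights_gb_q4 = 34 * 0.5    # ~17 GB at 4 bits/param, ignoring format overhead

tok_s_fp16 = BANDWIDTH_GBPS / weights_gb_fp16   # ~14 tok/s, hypothetical
tok_s_q4 = BANDWIDTH_GBPS / weights_gb_q4       # ~56 tok/s ceiling

print(f"FP16 ceiling: {tok_s_fp16:.1f} tok/s, Q4 ceiling: {tok_s_q4:.1f} tok/s")
```

These are upper bounds under the assumption of perfect bandwidth utilization; observed speeds on any GPU are typically well below them.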
To run LLaVA 1.6 34B on the RX 7900 XTX, aggressive quantization is essential. Consider 4-bit quantization (e.g., a Q4 GGUF build for llama.cpp), which cuts the weight footprint from ~68GB to roughly 17–20GB, small enough to fit in 24GB, at the cost of some accuracy. Even then, once the KV cache, activations, and the vision tower are accounted for, it may be necessary to offload some layers to system RAM (CPU), which drastically slows inference.
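The trade-off between bit-width and CPU offload can be estimated up front. The sketch below is illustrative only: the layer count, reserved headroom, and the assumption of evenly sized layers are all rough placeholders, not measured values for LLaVA 1.6 34B.

```python
# Sketch: estimate quantized weight size and how many transformer layers
# fit on the GPU at a given bit-width. All constants below are assumptions.
PARAMS = 34e9
GPU_VRAM_GB = 24
RESERVED_GB = 4        # assumed headroom for KV cache, activations, vision tower
N_LAYERS = 60          # assumed layer count for a 34B-class model

def quantized_gb(bits):
    """Weight footprint in GB at `bits` per parameter (no format overhead)."""
    return PARAMS * bits / 8 / 1e9

budget = GPU_VRAM_GB - RESERVED_GB
for bits in (8, 5, 4):
    size = quantized_gb(bits)
    # Assume evenly sized layers; the rest would offload to system RAM.
    gpu_layers = min(N_LAYERS, int(N_LAYERS * budget / size))
    print(f"{bits}-bit: ~{size:.0f} GB weights, GPU layers: {gpu_layers}/{N_LAYERS}")
```

Under these assumptions, 4-bit is the first bit-width at which every layer stays on the GPU; 8-bit would force roughly half the layers onto the CPU, with a corresponding drop in speed.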
Alternatively, explore a smaller model variant, such as LLaVA 1.5 7B or 13B, which require significantly less VRAM. If the 34B model is absolutely necessary, consider cloud-based inference services or a GPU with genuinely more VRAM; note that an NVIDIA RTX 4090 also has only 24GB, so unquantized FP16 inference would require an 80GB-class accelerator such as the NVIDIA A100 or an AMD Instinct MI250X (128GB).
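The same two-bytes-per-parameter rule makes the smaller-variant option easy to evaluate against the 24GB budget:

```python
# FP16 weight footprint of smaller LLaVA variants vs. a 24 GB card.
# Weights only; runtime overhead (KV cache, vision tower) comes on top.
GPU_VRAM_GB = 24
variants = {"7B": 7e9, "13B": 13e9, "34B": 34e9}

for name, params in variants.items():
    gb = params * 2 / 1e9   # 2 bytes/param in FP16
    verdict = "fits" if gb <= GPU_VRAM_GB else "needs quantization"
    print(f"{name}: {gb:.0f} GB FP16 -> {verdict}")
```

Note that even the 13B variant slightly exceeds 24GB at FP16 (26GB of weights), so it too benefits from 8-bit or 4-bit quantization on this card; only the 7B variant fits comfortably unquantized.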