The primary limiting factor for running LLaVA 1.6 13B on the AMD RX 7900 XTX is the GPU's VRAM capacity. LLaVA 1.6 13B in FP16 (half-precision floating point) requires approximately 26GB of VRAM to load the model and perform inference: roughly 13 billion parameters at 2 bytes per weight, plus activations and the KV cache. The RX 7900 XTX is equipped with 24GB of GDDR6 VRAM, leaving a shortfall of at least 2GB. Without specific optimization techniques, the model will not fit entirely in GPU memory, causing out-of-memory errors or preventing it from loading at all. The card's memory bandwidth, while substantial at 0.96 TB/s, matters little when the model cannot fully reside in VRAM, because swapping data between system RAM and GPU memory introduces a severe performance bottleneck.
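The arithmetic behind the 26GB figure can be sketched in a few lines. The 10% runtime overhead below is an illustrative assumption for activations, KV cache, and framework buffers, not a published figure:

```python
# Rough VRAM estimate for a 13B-parameter model in FP16.
def fp16_vram_gb(num_params: float, overhead: float = 0.10) -> float:
    """Weights at 2 bytes per parameter, plus a fractional overhead
    (assumed here) for activations, KV cache, and framework buffers."""
    weight_bytes = num_params * 2              # FP16 = 2 bytes/weight
    return weight_bytes * (1 + overhead) / 1e9  # decimal gigabytes

params_13b = 13e9
print(f"weights alone:  {params_13b * 2 / 1e9:.1f} GB")   # 26.0 GB
print(f"with overhead:  {fp16_vram_gb(params_13b):.1f} GB")
```

Either way, the total exceeds the 24GB available on the RX 7900 XTX before inference even begins.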
Furthermore, the RX 7900 XTX lacks dedicated Tensor Cores, the specialized matrix-multiplication accelerators found in NVIDIA GPUs. The RDNA 3 architecture does offer matrix acceleration through its WMMA (Wave Matrix Multiply-Accumulate) instructions, but these are generally less efficient than dedicated Tensor Cores. Consequently, even once the VRAM issue is addressed, inference may be slower than on a similarly priced NVIDIA GPU. And as long as the model does not fit in VRAM, meaningful estimates of tokens per second or optimal batch size are impossible, because the model simply won't run unoptimized.
To run LLaVA 1.6 13B on the RX 7900 XTX, you must reduce the VRAM footprint. The most effective method is quantization: converting the FP16 weights to 8-bit (INT8) or even 4-bit (INT4) representations, which cuts the memory required to store them by 2-4x and brings the model comfortably within the 24GB limit. llama.cpp is highly recommended here, as it supports a range of GGUF quantization formats (Q8_0, Q5_K_M, Q4_K_M, and others) and runs on AMD GPUs via its ROCm/HIP and Vulkan backends. Experiment with different quantization levels to find the right balance between VRAM usage and model accuracy.
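A back-of-envelope comparison makes the headroom at each level concrete. The bits-per-weight values below are approximations (actual GGUF sizes vary slightly with the tensor mix), and the 10% VRAM headroom is an assumed safety margin:

```python
# Approximate file sizes for a 13B model at common llama.cpp
# quantization levels. Bits-per-weight figures are rough estimates.
BITS_PER_WEIGHT = {
    "F16":    16.0,
    "Q8_0":    8.5,   # ~8.5 bpw including per-block scales (approx.)
    "Q5_K_M":  5.7,   # approximate
    "Q4_K_M":  4.85,  # approximate
}

def gguf_size_gb(num_params: float, bpw: float) -> float:
    return num_params * bpw / 8 / 1e9

VRAM_GB = 24.0
for name, bpw in BITS_PER_WEIGHT.items():
    size = gguf_size_gb(13e9, bpw)
    # Leave ~10% of VRAM for KV cache, vision tower, and buffers.
    verdict = "fits" if size < VRAM_GB * 0.9 else "too big"
    print(f"{name:7s} {size:5.1f} GB  ({verdict})")
```

As the sketch suggests, even Q8_0 leaves ample room on a 24GB card, so the choice between 8-bit and 4-bit comes down to speed and accuracy rather than capacity.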
Alternatively, offload some layers of the model to system RAM (in llama.cpp, via the --n-gpu-layers option), keeping in mind that this severely impacts performance because of the slow transfers between system RAM and the GPU. Monitor VRAM usage during inference. If the model still exceeds the limit after quantization, reduce the context length or batch size further, though both trade away quality and throughput. If none of this is acceptable, consider a GPU with more VRAM.
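Choosing how many layers to keep on the GPU is a simple budgeting exercise. A minimal sketch, assuming a 40-layer LLaMA-13B-class backbone, a uniform per-layer weight size, and a fixed 3GB reservation for the KV cache, vision tower, and buffers (all assumptions, not measured values):

```python
# Estimate the largest --n-gpu-layers value that fits in VRAM.
# Assumes layer weights are uniform in size and reserves a fixed
# amount of VRAM for KV cache, vision tower, and buffers.
def gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
               reserved_gb: float = 3.0) -> int:
    per_layer = model_gb / n_layers        # assumed uniform layer size
    budget = vram_gb - reserved_gb         # VRAM left for weights
    return max(0, min(n_layers, int(budget / per_layer)))

# FP16 (~26 GB) vs. a ~8 GB 4-bit quant on a 24 GB card, 40 layers:
print(gpu_layers(26.0, 40, 24.0))  # most, but not all, layers fit
print(gpu_layers(8.0, 40, 24.0))   # all 40 layers fit on the GPU
```

This also shows why quantization is the better first move: at 4-bit the entire model stays on the GPU, while FP16 offloading forces several layers through the slow PCIe path on every token.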