The primary limiting factor for running LLaVA 1.6 34B on an AMD RX 7900 XT is video memory (VRAM). With 34 billion parameters, LLaVA 1.6 34B needs roughly 68GB of VRAM for its weights alone in FP16 (2 bytes per parameter), before activations, the KV cache, and the vision tower are counted. The RX 7900 XT carries 20GB of GDDR6, leaving a shortfall of about 48GB, so the full model cannot be resident on the GPU for inference. The card's roughly 0.8 TB/s of memory bandwidth is substantial, but bandwidth cannot compensate for insufficient capacity to hold the model. This VRAM bottleneck prevents the model from running without significant modifications.
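To make the arithmetic concrete, here is a minimal back-of-envelope sketch (weights only, ignoring activations, KV cache, and the vision tower) showing how the footprint compares to the card's 20GB at several precisions:

```python
# Rough VRAM estimate for the LLaVA 1.6 34B weights at several precisions,
# compared against the RX 7900 XT's 20 GB. Activations, the KV cache, and
# the vision tower would add to these figures.
PARAMS = 34e9   # ~34 billion parameters
VRAM_GB = 20    # RX 7900 XT

for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5), ("INT2", 0.25)]:
    weights_gb = PARAMS * bytes_per_param / 1e9
    verdict = "fits" if weights_gb <= VRAM_GB else f"short by ~{weights_gb - VRAM_GB:.0f} GB"
    print(f"{name}: ~{weights_gb:.0f} GB of weights -> {verdict}")
```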
Even if layers are offloaded to system RAM, performance would suffer severely: transfers between the GPU and system memory run over PCIe at tens of GB/s, far below the ~800 GB/s of on-device VRAM. The RDNA 3 architecture of the RX 7900 XT is capable, but its AI accelerators (WMMA instructions) are not equivalent to the dedicated Tensor Cores in NVIDIA GPUs, which are purpose-built to accelerate the matrix multiplications that dominate deep learning workloads. On top of the VRAM constraint, this further reduces the efficiency of running large multimodal models like LLaVA 1.6. Without substantial optimization, the model cannot run effectively on this GPU.
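The cost of offloading can be sketched with a simplified, bandwidth-bound model: assume each generated token must read every weight once, and that the offloaded portion is limited by an assumed ~32 GB/s PCIe/system-RAM path rather than VRAM bandwidth. Real frameworks behave differently (for example, by computing offloaded layers on the CPU), so this is an order-of-magnitude estimate, not a benchmark:

```python
# Bandwidth-bound decode-speed estimate with partial offload.
# Assumption: each token reads all weights once; offloaded weights are
# bottlenecked at SLOW_BW instead of VRAM bandwidth.
MODEL_GB = 68.0   # FP16 weights of a 34B model
VRAM_GB = 20.0    # portion resident on the RX 7900 XT
VRAM_BW = 800.0   # GB/s, RX 7900 XT memory bandwidth
SLOW_BW = 32.0    # GB/s, assumed PCIe 4.0 x16 / system-RAM bottleneck

on_gpu = min(MODEL_GB, VRAM_GB)
off_gpu = MODEL_GB - on_gpu

# Time per token = VRAM-resident part at full speed + offloaded part at the slow path.
seconds_per_token = on_gpu / VRAM_BW + off_gpu / SLOW_BW
print(f"~{1.0 / seconds_per_token:.1f} tokens/s (bandwidth-bound estimate)")
```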
Given the VRAM deficit, running LLaVA 1.6 34B directly on the RX 7900 XT is not feasible without significant compromises. The most practical approach is aggressive quantization. At 4-bit precision the weights drop to roughly 17-20GB (about 0.5 bytes per parameter plus quantization overhead), which sits right at the edge of the card's 20GB once the KV cache and vision encoder are included; 3-bit or 2-bit variants shrink the footprint further, at the cost of model accuracy and, depending on the kernels, sometimes speed. Consider inference frameworks that support these low-bit formats, such as llama.cpp (which can target AMD GPUs through its HIP/ROCm backend), and experiment with different quantization methods to balance VRAM usage, quality, and performance.
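As a minimal sketch of what this looks like with llama-cpp-python: the snippet below loads a 4-bit GGUF of the 34B language model and offloads a subset of layers to the GPU. The file name and layer count are illustrative, and the vision projector (mmproj) and chat-handler setup required for image input are omitted; tune n_gpu_layers downward if the ROCm/HIP build runs out of VRAM.

```python
# Sketch: load a hypothetical 4-bit GGUF of LLaVA 1.6 34B's language model
# with partial GPU offload, assuming a ROCm/HIP-enabled llama-cpp-python build.
from llama_cpp import Llama

llm = Llama(
    model_path="llava-v1.6-34b.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=45,   # illustrative: offload as many layers as fit in 20 GB
    n_ctx=2048,        # smaller context keeps the KV cache footprint modest
)

out = llm("Describe what quantization trades away.", max_tokens=64)
print(out["choices"][0]["text"])
```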
Alternatively, consider cloud services that provide GPUs with enough VRAM, such as NVIDIA A100 (40GB/80GB) or H100 (80GB) instances. Another option is a smaller multimodal model that fits comfortably within the RX 7900 XT's 20GB, such as the 7B or 13B variants of LLaVA 1.6. If running locally is essential and multiple GPUs are available, model parallelism can split the weights across them, though this is a complex setup. Before attempting any deployment, benchmark a smaller, compatible model to gain experience with the inference framework and the hardware.
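A simple way to run that benchmark is to time a generation with a smaller quantized model that fits entirely in VRAM; the model path below is a placeholder for whatever 7B/13B GGUF you actually download:

```python
# Throughput check with a small model fully resident on the GPU, useful for
# validating the ROCm/HIP build before attempting anything larger.
import time
from llama_cpp import Llama

llm = Llama(model_path="llava-v1.6-mistral-7b.Q4_K_M.gguf",  # placeholder path
            n_gpu_layers=-1,   # -1 offloads every layer to the GPU
            n_ctx=2048)

start = time.time()
out = llm("Summarize why VRAM capacity limits local LLM inference.",
          max_tokens=128)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f} s -> {n_tokens / elapsed:.1f} tok/s")
```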