The primary limiting factor for running LLaVA 1.6 34B on an AMD RX 7800 XT is the GPU's VRAM capacity. At FP16 precision, the model's roughly 34 billion parameters need about 68GB for the weights alone (2 bytes per parameter), before accounting for the KV cache, activations, and the vision encoder. The RX 7800 XT has 16GB of VRAM, a shortfall of at least 52GB, so the model cannot be loaded onto the GPU in its native FP16 format. The card's memory bandwidth of roughly 0.62 TB/s is respectable, but it is moot when the weights do not fit in VRAM. The RX 7800 XT also lacks dedicated matrix-math units comparable to NVIDIA's Tensor Cores; RDNA 3's AI accelerators issue WMMA instructions through the general-purpose shader units, which is less efficient for inference workloads.
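As a sanity check on those figures, here is a minimal sketch of the arithmetic. The bits-per-weight values are approximate averages for the llama.cpp quantization formats (Q4_K_M mixes block types), and the totals deliberately exclude the KV cache, activations, and the vision encoder:

```python
# Rough VRAM estimate for LLaVA 1.6 34B weights at several precisions.
# Bits-per-weight figures are approximate averages; totals exclude the
# KV cache, activations, and the CLIP vision encoder.

PARAMS = 34e9   # ~34 billion parameters in the language model
VRAM_GB = 16    # RX 7800 XT

for name, bits_per_weight in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    weight_gb = PARAMS * bits_per_weight / 8 / 1e9
    verdict = "fits" if weight_gb <= VRAM_GB else f"over by ~{weight_gb - VRAM_GB:.0f} GB"
    print(f"{name:7s} ~{weight_gb:5.1f} GB of weights -> {verdict}")
```

Running this prints roughly 68GB for FP16, 36GB for Q8_0, and 21GB for Q4_K_M, which is why even aggressive quantization does not quite fit the full model into 16GB.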
Given this VRAM shortfall, running LLaVA 1.6 34B directly on the RX 7800 XT is not feasible without significant compromises. Quantization via llama.cpp helps the most: a Q4_K_M GGUF shrinks the weights to roughly 20GB, and Q3/Q2 variants go lower at a noticeable quality cost. Even at Q4_K_M the weights exceed 16GB, so some layers must also be offloaded to system RAM, which slows generation roughly in proportion to how much of the model runs on the CPU; a sketch of this combination follows below. More practical alternatives are smaller vision-language models that fit entirely within 16GB of VRAM, or cloud-based inference services that provide more capable hardware. If local execution of the full 34B model is a must, investigate distributed inference setups that split the model across multiple GPUs.
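If you take the quantization-plus-offload route, a minimal llama-cpp-python sketch might look like the following. The file names, the image URL, the n_gpu_layers value, and the choice of Llava16ChatHandler are assumptions to adjust for your setup and library version, and the wheel must be built with an AMD-capable backend (ROCm/HIP or Vulkan) for the offloaded layers to actually land on the RX 7800 XT:

```python
# Sketch: run a Q4_K_M GGUF of LLaVA 1.6 34B with partial GPU offload.
# Assumes llama-cpp-python was built with ROCm/HIP or Vulkan support;
# file paths, the image URL, and the layer count are placeholders to tune.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava16ChatHandler

chat_handler = Llava16ChatHandler(clip_model_path="mmproj-llava-1.6-34b-f16.gguf")

llm = Llama(
    model_path="llava-1.6-34b.Q4_K_M.gguf",  # ~20 GB of weights
    chat_handler=chat_handler,
    n_ctx=2048,       # keep the context modest; the KV cache also consumes VRAM
    n_gpu_layers=40,  # offload as many layers as fit in 16 GB; the rest stay in system RAM
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])
```

Layers that are not offloaded run on the CPU, so expect tokens-per-second to drop as n_gpu_layers decreases; tuning that value until VRAM is nearly full is usually the best trade-off on a 16GB card.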