The primary limiting factor for running LLaVA 1.6 13B on the AMD RX 7800 XT is VRAM capacity. In FP16 precision, the 13B language model's weights alone take roughly 26 GB (13 billion parameters × 2 bytes), before counting the vision tower, KV cache, and activations needed during inference. The RX 7800 XT ships with 16 GB of GDDR6, leaving a deficit of at least 10 GB. The model in its full FP16 form therefore cannot be loaded entirely onto the GPU; attempts to do so end in out-of-memory errors or fall back on much slower system RAM, which drastically impacts performance.
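To make that arithmetic concrete, here is a minimal back-of-envelope sketch of the weight-memory estimate (weights only; the vision tower, KV cache, and activations add several more GB on top):

```python
# Back-of-envelope VRAM estimate for the language model weights alone.
# Figures are approximate and ignore the vision tower, KV cache, and activations.

PARAMS_BILLION = 13          # LLaVA 1.6 13B language model
BYTES_PER_PARAM_FP16 = 2     # FP16 = 16 bits = 2 bytes per weight
VRAM_GB = 16                 # RX 7800 XT

weights_gb = PARAMS_BILLION * BYTES_PER_PARAM_FP16   # ~26 GB
deficit_gb = weights_gb - VRAM_GB                    # ~10 GB short

print(f"FP16 weights: ~{weights_gb} GB, VRAM: {VRAM_GB} GB, deficit: ~{deficit_gb} GB")
```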
While the RX 7800 XT's memory bandwidth of roughly 624 GB/s (0.62 TB/s) is respectable, the limited VRAM is the bottleneck. Even if layers were swapped between system RAM and GPU memory, the transfers would run over PCIe at around 32 GB/s (PCIe 4.0 x16), an order of magnitude slower than VRAM, negating most of the GPU's compute advantage. The RX 7800 XT also lacks the dedicated Tensor Cores found on NVIDIA GPUs: the matrix multiplications at the heart of transformer models like LLaVA run on its general-purpose stream processors (assisted by RDNA 3's WMMA instructions), which yields lower throughput than dedicated matrix hardware.
To run LLaVA 1.6 13B on the RX 7800 XT, you'll need to shrink the model's memory footprint substantially, and the most effective tool is quantization: reducing the precision of the model's weights lowers the VRAM requirement. A 4-bit quantization (for example, a GGUF Q4_K_M build) brings the weights down to roughly 6.5-8 GB, comfortably within the card's 16 GB even after accounting for the KV cache and vision encoder.
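As a sketch of what that looks like in practice, the following loads a 4-bit GGUF build with llama-cpp-python and offloads every layer to the GPU. The file name is a placeholder for whichever quantized GGUF you actually download, and llama-cpp-python must be compiled with ROCm/HIP (or Vulkan) support for the RX 7800 XT to be used; the exact build flag depends on the llama.cpp version.

```python
# Minimal sketch: loading a 4-bit (Q4_K_M) GGUF build of LLaVA 1.6 13B with
# llama-cpp-python. The model path is a placeholder, not an official file name.
from llama_cpp import Llama

llm = Llama(
    model_path="llava-v1.6-vicuna-13b.Q4_K_M.gguf",  # ~7-8 GB of weights
    n_gpu_layers=-1,   # offload every layer to the GPU; fits in 16 GB at 4-bit
    n_ctx=4096,        # context length; the KV cache grows with this value
    verbose=False,
)

out = llm("Describe what a vision-language model does.", max_tokens=128)
print(out["choices"][0]["text"])

# Note: image input additionally requires the matching mmproj (CLIP) GGUF and a
# LLaVA chat handler from llama_cpp.llama_chat_format; see the llama-cpp-python
# docs for the multimodal setup.
```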
Experiment with inference frameworks such as llama.cpp, which runs well on AMD GPUs via its ROCm/HIP and Vulkan backends and makes efficient use of both CPU and GPU. If VRAM is still a constraint, it can offload some layers to system RAM, but be aware that this severely impacts performance; a sketch of that trade-off follows below. Finally, evaluate the quality impact of quantization on your specific use case to find the right balance between VRAM usage and output quality.
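As a rough way to measure the speed side of that trade-off, the sketch below uses llama-cpp-python's partial offload (setting n_gpu_layers below the model's ~40 transformer layers keeps the remainder in system RAM) and reports tokens per second; the model file name is the same placeholder as above.

```python
# Sketch of partial offloading plus a rough throughput check, assuming the same
# hypothetical quantized GGUF as above. Lowering n_gpu_layers keeps some layers
# in system RAM when VRAM runs out (e.g., with a larger context), at a real
# speed cost.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="llava-v1.6-vicuna-13b.Q4_K_M.gguf",
    n_gpu_layers=32,   # offload ~32 of the ~40 transformer layers; tune to fit
    n_ctx=4096,
    verbose=False,
)

prompt = "List three factors that limit LLM inference speed."
start = time.time()
out = llm(prompt, max_tokens=128)
elapsed = time.time() - start

tokens = out["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```

Running the same prompt at a few different n_gpu_layers values makes the VRAM-versus-throughput trade-off easy to see on your own hardware, alongside a side-by-side check of output quality between quantization levels.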