The primary limiting factor for running LLaVA 1.6 34B on an NVIDIA RTX 4080 SUPER is VRAM. In FP16 (half-precision floating point), the model's weights alone occupy approximately 68GB, before accounting for the KV cache and activations needed during inference. The RTX 4080 SUPER is equipped with 16GB of GDDR6X memory, leaving a deficit of more than 52GB: the model in full FP16 precision cannot fit within the GPU's memory.
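The 68GB figure follows directly from parameter count times bytes per weight. A minimal back-of-envelope sketch in Python, treating LLaVA 1.6 34B as a dense ~34B-parameter model and ignoring the KV cache, activations, and the vision tower:

```python
def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory for a dense model at a given precision."""
    # params_billion * 1e9 parameters * bytes_per_param bytes, expressed in GB
    return params_billion * bytes_per_param

# FP16: two bytes per parameter.
print(weight_footprint_gb(34, 2.0))   # ~68 GB of weights alone
# 4-bit quantization: roughly half a byte per parameter, before format overhead.
print(weight_footprint_gb(34, 0.5))   # ~17 GB before KV cache and overhead
```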
Beyond capacity, memory bandwidth governs inference speed. The RTX 4080 SUPER offers roughly 736 GB/s of memory bandwidth, which is respectable, but single-stream decoding is memory-bound: every generated token requires streaming the resident weights through the GPU, so throughput is capped by bandwidth divided by model size even when the model fits. If layers are offloaded to system RAM, the far slower PCIe link (or the CPU itself) becomes the bottleneck instead, reducing the tokens/second rate dramatically. Insufficient VRAM also severely restricts batch size and usable context length, making interactive or complex multimodal tasks impractical.
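A rough upper bound on decode throughput is memory bandwidth divided by the bytes read per token, which for single-stream generation is approximately the resident weight size. A sketch under those simplifying assumptions (ignoring KV-cache reads, kernel efficiency, and the vision encoder; the PCIe figure is an assumed practical number for a Gen4 x16 link):

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound ceiling for single-stream decoding: each token
    requires streaming the resident weights through the memory system."""
    return bandwidth_gb_s / weights_gb

# Hypothetical fully GPU-resident 4-bit 34B model (~17 GB of weights):
print(decode_tokens_per_second_ceiling(736, 17))  # ~43 tok/s ceiling from VRAM bandwidth
# Worst case with weights shuttled over PCIe 4.0 x16 (~32 GB/s practical):
print(decode_tokens_per_second_ceiling(32, 17))   # ~2 tok/s ceiling if all weights crossed PCIe
```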
Because the model cannot be fully loaded onto the GPU, real-time or even near-real-time performance is not achievable without significant compromises. No meaningful tokens/second or batch-size estimate can be given for the default FP16 configuration, because that configuration simply cannot run on this card. Any usable performance depends on techniques like quantization and offloading, which trade away a large share of speed.
Given the VRAM limitations, direct execution of LLaVA 1.6 34B in FP16 on the RTX 4080 SUPER is not feasible. Running this model requires aggressive quantization. 4-bit quantization (Q4) cuts the weight footprint to roughly 17-20GB, which is close to, but likely still above, the 16GB limit, so expect to pair it with partial offloading or an even lower precision. Frameworks like llama.cpp and ExLlamaV2 are designed for efficient quantized inference.
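As one concrete path (not the only one), Hugging Face Transformers can load a 4-bit NF4 quantization of the model via bitsandbytes and spill whatever does not fit onto system RAM. A minimal sketch, assuming the llava-hf/llava-v1.6-34b-hf checkpoint, a CUDA build of PyTorch, and the bitsandbytes and accelerate packages; the image path and prompt template are placeholders to verify against the model card:

```python
import torch
from PIL import Image
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-34b-hf"  # assumed Hugging Face checkpoint id

# NF4 4-bit quantization; device_map="auto" places layers on the GPU until the
# 16GB card is full and offloads the remainder to system RAM (slower, but it loads).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

image = Image.open("example.jpg")  # placeholder image path
# Prompt formatting differs per checkpoint; check the model card for the exact template.
prompt = "<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|>\n<|im_start|>assistant\n"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```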
If quantization alone does not bring the footprint under 16GB, offload the remaining layers to system RAM. Offloading degrades throughput substantially, because offloaded layers either run on the much slower CPU or have their weights shuttled across the PCIe bus on every decoding step, but it does make experimentation possible. Quantize first and offload only what still does not fit. As an alternative, consider a smaller model such as LLaVA 1.5 7B, which requires far less VRAM and is much more likely to run well on the RTX 4080 SUPER.
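One practical pattern for the quantize-then-offload approach is llama.cpp's Python bindings, which let you keep as many transformer layers on the GPU as fit (n_gpu_layers) and leave the rest in system RAM. A minimal sketch, assuming a 4-bit GGUF conversion of LLaVA 1.6 34B plus its multimodal projector file are already on disk (the file paths are placeholders), and that your llama-cpp-python build has CUDA enabled; the LLaVA chat handler shown targets the 1.5-style format, and 1.6 support varies by version:

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler  # 1.5-style handler; verify 1.6 support for your version

# The mmproj file holds the vision projector that pairs with the language-model GGUF.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-llava-1.6-34b-f16.gguf")  # placeholder path

llm = Llama(
    model_path="llava-1.6-34b.Q4_K_M.gguf",  # placeholder path to a 4-bit GGUF
    chat_handler=chat_handler,
    n_gpu_layers=40,   # tune downward until the model loads within 16GB; remaining layers run on the CPU
    n_ctx=4096,        # context length; larger values enlarge the KV cache and cost more VRAM
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ]},
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```

The key knob is n_gpu_layers: every layer kept on the GPU decodes at VRAM speed, while each offloaded layer drags throughput toward CPU speed, so the goal is the largest value that still fits alongside the KV cache.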