The NVIDIA RTX 3090 Ti, with its 24GB of GDDR6X VRAM and Ampere architecture, is a powerful GPU, but it falls short when running Llama 3.1 70B in INT8 quantization. While the 3090 Ti offers 1.01 TB/s of memory bandwidth and 10,752 CUDA cores, the quantized model needs roughly 70GB of VRAM for its weights alone, nearly three times what the card provides. Because the entire model cannot reside on the GPU, inference either fails with out-of-memory errors or requires offloading layers to system RAM, which drastically reduces performance due to the much slower transfer rates between the GPU and system memory.
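A quick back-of-the-envelope calculation makes the gap concrete. The sketch below counts weights only and ignores KV cache and activation overhead, so the real requirement is somewhat higher:

```python
# Rough VRAM estimate for Llama 3.1 70B weights at different precisions.
# Weight-only figures; KV cache and activations add several more GB on top.

PARAMS = 70e9          # ~70 billion parameters
GPU_VRAM_GB = 24       # RTX 3090 Ti

for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("Q4_K_S (~4.5 bpw)", 4.5 / 8)]:
    weights_gb = PARAMS * bytes_per_param / 1e9
    fits = "fits" if weights_gb <= GPU_VRAM_GB else "does not fit"
    print(f"{name:>18}: ~{weights_gb:5.0f} GB -> {fits} in {GPU_VRAM_GB} GB VRAM")

# Output: FP16 ~140 GB, INT8 ~70 GB, Q4_K_S ~39 GB -- none fit in 24 GB.
```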
Even with INT8 quantization, which roughly halves the memory footprint relative to FP16, the ~70GB requirement remains far beyond the card's capacity. The 3090 Ti's 336 Tensor Cores would accelerate the matrix multiplications that dominate LLM inference, but they cannot be exploited when the model does not fit in VRAM. Consequently, no meaningful tokens-per-second or batch-size estimate can be given: without significant modifications or an alternative configuration, the model simply will not run on this card.
Given the VRAM limitation, running Llama 3.1 70B directly on the RTX 3090 Ti is not feasible without compromises. Consider a more aggressive quantization such as Q4_K_S or Q5_K_M (4-bit or 5-bit GGUF formats) if your inference framework supports them; note that even Q4_K_S leaves roughly 40GB of weights, so partial offloading to system RAM is still required on a single 24GB card. Alternatively, explore distributed inference across multiple GPUs if that is an option. If neither approach is viable, consider a smaller model that fits comfortably within the 3090 Ti's VRAM, such as Llama 3.1 8B (see the sketch below), or use a cloud-based inference service.
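As a minimal sketch of the smaller-model route, the following assumes the Hugging Face `transformers`, `accelerate`, and `bitsandbytes` packages are installed and that you have access to the gated `meta-llama/Llama-3.1-8B-Instruct` checkpoint; loading it in 8-bit keeps the weights around 9-10GB, well within 24GB:

```python
# Minimal sketch: Llama 3.1 8B in 8-bit on a single 24 GB GPU.
# Assumes transformers + bitsandbytes and access to the gated meta-llama checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# 8-bit weights keep the 8B model at roughly 9-10 GB of VRAM.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",   # place all layers on the GPU
)

inputs = tokenizer("Explain INT8 quantization in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```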
Another avenue is CPU offloading, but be aware that it significantly reduces inference speed, since layers held in system RAM are bound by PCIe and CPU throughput rather than GDDR6X bandwidth. A fast CPU and ample system RAM (on the order of 64GB for a 4-bit 70B model) are strongly recommended if you pursue this route. Experiment with inference frameworks such as `llama.cpp`, which expose a range of quantization formats and per-layer GPU offloading options; a sketch using its Python bindings follows below.
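Here is a minimal sketch of partial GPU offload using the `llama-cpp-python` bindings, assuming they were built with CUDA support and that a Q4_K_S GGUF file has already been downloaded; the model path and layer count below are illustrative and should be tuned to your setup:

```python
# Minimal sketch: partial GPU offload of a Q4_K_S 70B GGUF with llama-cpp-python.
# Assumes llama-cpp-python built with CUDA support; the model path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-70b-instruct-Q4_K_S.gguf",  # hypothetical local path
    n_gpu_layers=35,   # offload as many layers as fit in 24 GB; the rest stay in system RAM
    n_ctx=4096,        # context window; larger values increase KV-cache memory use
    n_threads=16,      # CPU threads for the layers left on the host
)

output = llm(
    "Summarize the trade-offs of CPU offloading for large language models.",
    max_tokens=128,
)
print(output["choices"][0]["text"])
```

Raising `n_gpu_layers` until VRAM is nearly full typically gives the best throughput; lowering it trades speed for headroom.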