The NVIDIA RTX 3090 Ti, with its 24GB of GDDR6X VRAM, falls significantly short of the memory required to run Llama 3 70B, even in its INT8-quantized form. At INT8, roughly one byte per parameter, the 70B weights alone demand approximately 70GB of VRAM, leaving a shortfall of about 46GB before the KV cache and activations are even counted. The model therefore cannot be fully loaded into GPU memory, and attempting to run it directly on the RTX 3090 Ti will produce out-of-memory errors rather than successful inference. While the RTX 3090 Ti offers high memory bandwidth (1.01 TB/s) and a substantial number of CUDA and Tensor cores, those resources are of no help when the model does not fit in the available VRAM.
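The shortfall can be reproduced with a back-of-envelope calculation. The sketch below assumes the weights dominate the footprint and uses approximate bytes-per-parameter figures (2.0 for FP16, 1.0 for INT8, 0.5 for 4-bit); KV cache and activation memory would come on top of these estimates.

```python
# Back-of-envelope estimate of the VRAM needed just to hold the model weights
# at different precisions. KV cache and activations add further memory on top,
# so these figures are lower bounds.

def weight_vram_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in decimal GB (to match marketing figures)."""
    return num_params * bytes_per_param / 1e9

if __name__ == "__main__":
    params = 70e9        # Llama 3 70B parameter count
    card_vram_gb = 24    # RTX 3090 Ti
    for label, bpp in [("FP16", 2.0), ("INT8", 1.0), ("INT4 (approx.)", 0.5)]:
        need = weight_vram_gb(params, bpp)
        print(f"{label:>14}: ~{need:.0f} GB weights, "
              f"margin on a 24 GB card: {card_vram_gb - need:+.0f} GB")
```

Running this reproduces the roughly 46GB deficit at INT8 and shows that even 4-bit weights exceed the card's capacity.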
Given the VRAM limitations, running Llama 3 70B on a single RTX 3090 Ti is not feasible. Consider alternatives such as model parallelism across multiple GPUs, CPU offloading (at a drastic cost to throughput), or cloud-based GPU instances with sufficient VRAM (e.g., A100, H100). Another option is to switch to a smaller model or apply more aggressive quantization (e.g., 4-bit), which shrinks the VRAM footprint at some cost to accuracy and output quality; note, however, that even at 4 bits the 70B weights occupy roughly 35GB, so a 24GB card still needs partial CPU offload. If you proceed with CPU offloading, expect a significant drop in tokens/second.
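One concrete way to combine 4-bit quantization with CPU offloading is llama.cpp via its Python bindings, which splits a quantized GGUF model between GPU and CPU layer by layer. This is a minimal sketch, not a definitive setup: the GGUF filename and `n_gpu_layers` value are placeholder assumptions to tune for your hardware, and generation speed drops sharply for every layer left on the CPU.

```python
# Minimal sketch: run a 4-bit GGUF build of Llama 3 70B with partial GPU offload
# using llama-cpp-python. The model path and layer count below are assumptions;
# lower n_gpu_layers if you hit out-of-memory errors on the 24GB card.
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=40,   # offload roughly half the layers to the GPU; the rest run on the CPU
    n_ctx=4096,        # context window; larger values grow the KV cache
)

output = llm("Briefly explain what VRAM is.", max_tokens=64)
print(output["choices"][0]["text"])
```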