The NVIDIA RTX 3090, with its 24 GB of GDDR6X VRAM, falls well short of the memory needed to run Llama 3 70B, even in its INT8 quantized form. At INT8 precision (one byte per parameter), the 70-billion-parameter model requires roughly 70 GB for its weights alone, before counting the KV cache and activations. The card's roughly 936 GB/s of memory bandwidth is substantial, but it does not help here: offloading layers to system RAM forces weights across the much slower PCIe bus on every forward pass, so throughput collapses. The Ampere architecture, with its 10,496 CUDA cores and 328 Tensor Cores, could in principle accelerate the computation, but the limited VRAM prevents the full model from residing on the GPU, which rules out efficient inference. Without offloading, the model simply cannot be loaded into 24 GB; with offloading, it can technically run, but far too slowly for practical use.
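To make the mismatch concrete, here is a rough back-of-the-envelope estimate in Python. The parameter count and bytes-per-weight figures are the only inputs; the result ignores the KV cache and runtime overhead, so real requirements are somewhat higher.

```python
# Rough VRAM estimate for Llama 3 70B weights (ignores KV cache and
# framework overhead, so actual requirements are somewhat higher).
PARAMS = 70e9                          # 70 billion parameters
BYTES_PER_PARAM = {"FP16": 2, "INT8": 1}
GPU_VRAM_GB = 24                       # RTX 3090

for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    verdict = "fits" if weights_gb <= GPU_VRAM_GB else "does not fit"
    print(f"{precision}: ~{weights_gb:.0f} GB of weights -> {verdict} in {GPU_VRAM_GB} GB")

# FP16: ~140 GB of weights -> does not fit in 24 GB
# INT8: ~70 GB of weights -> does not fit in 24 GB
```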
Due to the VRAM limitations of the RTX 3090, running Llama 3 70B directly is not feasible. Consider a smaller variant such as Llama 3 8B, whose FP16 weights (about 16 GB) fit comfortably within 24 GB. Alternatively, explore cloud-based inference services or platforms that provide access to GPUs with sufficient memory. Distributed inference across multiple GPUs is another option, but it requires significant technical expertise and infrastructure. If you are committed to running Llama 3 70B locally, plan on hardware with substantially more VRAM: the roughly 70 GB of INT8 weights calls for an 80 GB-class GPU or a multi-GPU setup.
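As an illustration of the first option, the minimal sketch below loads Llama 3 8B on a single 24 GB GPU with Hugging Face transformers. It assumes `transformers`, `accelerate`, and a CUDA build of PyTorch are installed, and that your Hugging Face account has been granted access to the gated `meta-llama/Meta-Llama-3-8B-Instruct` checkpoint; treat it as a starting point rather than a tuned setup.

```python
# Minimal sketch: run Llama 3 8B on a single 24 GB GPU (e.g. an RTX 3090).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # ~16 GB of FP16 weights

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # FP16 weights fit comfortably in 24 GB
    device_map="auto",          # place the whole model on the GPU
)

inputs = tokenizer("The RTX 3090 has", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```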