The primary limiting factor for running large language models (LLMs) like Llama 3.1 405B is VRAM capacity. Even quantized to q3_k_m, the model needs approximately 162GB just to load its weights, before accounting for the KV cache and activations. The NVIDIA RTX 4090, while a powerful GPU, ships with only 24GB of VRAM, leaving a shortfall of roughly 138GB and making it impossible to load, let alone run, the full model on a single card. The RTX 4090's high memory bandwidth (1.01 TB/s) is irrelevant here because the model cannot fit into the available memory in the first place.
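For a sense of where those numbers come from, here is a rough weight-memory calculation; the 3.2 bits-per-weight figure is an assumption chosen to reproduce the ~162GB estimate, and real GGUF quantizations mix formats per tensor and add runtime overhead on top.

```python
# Back-of-envelope weight-memory estimate (a sketch: the bits-per-weight value
# is an assumption; KV cache and activations would add more on top).
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GB (decimal)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

model_gb = weight_memory_gb(405, 3.2)   # ~3.2 bits/weight assumed for a q3-class quant
rtx_4090_vram_gb = 24.0

print(f"Estimated weights: {model_gb:.0f} GB")                            # ~162 GB
print(f"Shortfall on a 24GB card: {model_gb - rtx_4090_vram_gb:.0f} GB")  # ~138 GB
```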
Even if some layers were offloaded to system RAM, performance would degrade severely: system RAM and the PCIe link are an order of magnitude slower than GPU VRAM, so the RTX 4090's CUDA and Tensor cores would sit largely idle while data shuttles between system memory and the GPU. A single RTX 4090 is therefore insufficient for practical inference with Llama 3.1 405B, even with aggressive quantization.
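To see why offloading hurts so much, a bandwidth-bound ceiling is a useful back-of-envelope: each generated token must touch every active weight once, so whichever link the offloaded portion of the model crosses caps tokens per second. The bandwidth figures below are assumed round numbers (theoretical PCIe 4.0 x16 and a rough dual-channel DDR5 estimate), not measurements.

```python
# Bandwidth-bound throughput ceiling for offloaded inference (a sketch; link
# bandwidths are assumed, and caching, batching, and compute cost are ignored).
def tokens_per_second_ceiling(offloaded_gb: float, link_gb_per_s: float) -> float:
    return link_gb_per_s / offloaded_gb

offloaded_gb = 162 - 24        # weights that cannot stay in VRAM
pcie4_x16_gbps = 32.0          # GB/s, theoretical PCIe 4.0 x16 (assumption)
ddr5_dual_gbps = 80.0          # GB/s, rough dual-channel DDR5 figure (assumption)

print(f"PCIe-bound ceiling: {tokens_per_second_ceiling(offloaded_gb, pcie4_x16_gbps):.2f} tok/s")
print(f"RAM-bound ceiling:  {tokens_per_second_ceiling(offloaded_gb, ddr5_dual_gbps):.2f} tok/s")
```

Either way the ceiling lands well under one token per second, which is why the GPU's compute units sit idle.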
Given these VRAM limitations, running Llama 3.1 405B on a single RTX 4090 is not feasible. Consider alternatives: a cloud instance with multiple high-VRAM GPUs (e.g., NVIDIA A100 or H100 80GB; note that even a single 80GB card cannot hold a 162GB model), or splitting the model across several GPUs with model-parallelism techniques, which requires significant technical expertise and infrastructure. Another option is a smaller model with fewer parameters that fits within the RTX 4090's 24GB, at the cost of reduced capability.
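Those trade-offs can be roughed out numerically; the 20% overhead allowance for KV cache and activations is an assumption, and the smaller-model sizes are weights-only approximations.

```python
import math

# Rough sizing for the alternatives above (the 20% overhead allowance is an
# assumption; model sizes are weights-only approximations).
def gpus_needed(model_gb: float, vram_per_gpu_gb: float, overhead: float = 0.2) -> int:
    return math.ceil(model_gb * (1 + overhead) / vram_per_gpu_gb)

model_gb = 162
print("RTX 4090s (24GB) needed:", gpus_needed(model_gb, 24))   # ~9 cards, model-parallel
print("A100/H100s (80GB) needed:", gpus_needed(model_gb, 80))  # ~3 cards

# Smaller Llama 3.1 variants that can fit on a single 24GB card:
for name, params_b, bits in [("8B @ ~8 bpw", 8, 8.5), ("70B @ ~2.5 bpw", 70, 2.5)]:
    gb = params_b * bits / 8
    print(f"Llama 3.1 {name}: ~{gb:.0f} GB of weights "
          f"({'fits' if gb < 24 else 'does not fit'} in 24GB, context space aside)")
```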
If you are set on running a large model locally, investigate CPU offloading with llama.cpp, with the understanding that inference will be substantially slower. A fast CPU and ample system RAM help mitigate the performance hit. You can also explore extreme quantization at the cost of accuracy, but note that even 2 bits per weight still leaves roughly 100GB of weights for a 405B model, so most of it remains in system RAM and inference stays very slow.
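As one hedged starting point, the llama-cpp-python bindings expose the same layer-offload control as the llama.cpp CLI; the model path, layer count, and thread count below are placeholders to tune for your hardware, and a CUDA-enabled build of the package is assumed.

```python
# Partial GPU offload with llama-cpp-python (assumes a CUDA-enabled build;
# the path and numbers below are placeholders, not recommendations).
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.1-405b-q3_k_m.gguf",  # hypothetical local GGUF path
    n_gpu_layers=15,   # raise/lower so resident layers stay under 24GB; the rest run on CPU
    n_ctx=2048,        # modest context keeps KV-cache memory down
    n_threads=16,      # roughly match your physical core count
)

out = llm("Briefly explain memory bandwidth.", max_tokens=64)
print(out["choices"][0]["text"])
```

At this model scale, expect output measured in seconds per token rather than tokens per second.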