Can I run Llama 3.1 405B (q3_k_m) on NVIDIA RTX 4090?

Result: Fail/OOM. This GPU does not have enough VRAM.

GPU VRAM:   24.0 GB
Required:   162.0 GB
Headroom:   -138.0 GB
VRAM usage: 100% (all 24.0 GB would be consumed)

Technical Analysis

The primary limiting factor for running large language models (LLMs) like Llama 3.1 405B is VRAM capacity. This model, even when quantized to q3_k_m, requires approximately 162GB of VRAM to load and operate. The NVIDIA RTX 4090, while a powerful GPU, is equipped with only 24GB of VRAM. This creates a significant shortfall of 138GB, rendering the direct loading and inference of the entire model impossible on a single RTX 4090. The high memory bandwidth of the RTX 4090 (1.01 TB/s) is irrelevant in this scenario because the model cannot even fit into the available memory.
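
As a sanity check, the shortfall can be reproduced from the figures above. The sketch below uses only the numbers quoted on this page (162GB required, 24GB available, 405B parameters); the implied bits-per-weight value is a derived approximation, not an official specification of the q3_k_m format.

```python
# Reproduce the shortfall from the figures quoted on this page.
PARAMS = 405e9        # Llama 3.1 405B parameter count
REQUIRED_GB = 162.0   # quoted q3_k_m footprint (weights plus overhead)
GPU_VRAM_GB = 24.0    # RTX 4090

implied_bpw = REQUIRED_GB * 8e9 / PARAMS   # bits per weight implied by the quote
headroom_gb = GPU_VRAM_GB - REQUIRED_GB

print(f"Implied bits/weight: {implied_bpw:.2f}")   # ~3.20
print(f"Headroom: {headroom_gb:+.1f} GB")          # -138.0 GB
```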

Even if techniques like offloading some layers to system RAM were attempted, the performance would be severely degraded due to the much slower bandwidth of system RAM compared to GPU VRAM. The CUDA cores and Tensor cores of the RTX 4090 would remain largely underutilized as the bottleneck becomes the constant data transfer between system RAM and the GPU. Therefore, a single RTX 4090 is insufficient for practical inference with Llama 3.1 405B, even with aggressive quantization.
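
A rough way to see why offloading is so slow: autoregressive decoding is memory-bound, so per-token speed is capped by how fast the weights can be streamed. The sketch below compares that ceiling for the RTX 4090's quoted 1.01 TB/s against an assumed ~80 GB/s for dual-channel DDR5 system RAM; the system RAM figure is an assumption and varies by platform.

```python
# Upper bound on decode speed for a memory-bound model: each generated
# token must stream roughly the full weight set once, so
# tokens/s <= bandwidth / model_size.
MODEL_GB = 162.0
VRAM_BW_GBPS = 1010.0    # RTX 4090 memory bandwidth quoted above (1.01 TB/s)
SYS_RAM_BW_GBPS = 80.0   # assumed dual-channel DDR5; platform dependent

print(f"VRAM-bound ceiling:       {VRAM_BW_GBPS / MODEL_GB:.1f} tok/s")    # ~6.2
print(f"System-RAM-bound ceiling: {SYS_RAM_BW_GBPS / MODEL_GB:.2f} tok/s") # ~0.49
```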

Recommendation

Given the VRAM limitations, running Llama 3.1 405B on a single RTX 4090 is not feasible. Consider exploring alternative solutions such as using a cloud-based service that offers access to GPUs with sufficient VRAM (e.g., NVIDIA A100, H100), or splitting the model across multiple GPUs using model parallelism techniques, which requires significant technical expertise and infrastructure. Another option would be to explore smaller models with fewer parameters that can fit within the RTX 4090's VRAM, although this would come at the cost of reduced model capabilities.
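
For a sense of scale, the sketch below estimates how many cards of each class would be needed just to hold the q3_k_m weights. Real multi-GPU deployments need additional headroom for the KV cache, activations, and parallelism overhead, so treat these as lower bounds.

```python
import math

# Minimum card counts needed just to hold the ~162 GB of q3_k_m weights.
# KV cache, activations, and parallelism overhead are ignored, so real
# deployments typically need one or two cards more than this.
REQUIRED_GB = 162.0
for name, vram_gb in [("RTX 4090 (24 GB)", 24), ("A100/H100 80 GB", 80)]:
    print(f"{name}: at least {math.ceil(REQUIRED_GB / vram_gb)} GPUs")
# RTX 4090 (24 GB): at least 7 GPUs
# A100/H100 80 GB: at least 3 GPUs
```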

If you are set on running a large model locally, investigate techniques like CPU offloading with llama.cpp, understanding that inference speed will be substantially slower. Ensure you have a fast CPU and ample system RAM to mitigate the performance hit as much as possible. Furthermore, explore extreme quantization methods, even at the cost of accuracy, to see if a minimally acceptable performance level can be achieved. However, be prepared for very slow inference speeds.
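
A minimal offloading sketch using the llama-cpp-python bindings is shown below. The model filename is a placeholder, and the parameter names reflect recent releases of the bindings, so check them against your installed version; the key knob is n_gpu_layers, which controls how many layers stay in VRAM while the rest run from system RAM.

```python
# CPU-offload sketch with the llama-cpp-python bindings. The filename is
# a placeholder; only the layers counted by n_gpu_layers stay in VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.1-405b-q3_k_m.gguf",  # placeholder path
    n_gpu_layers=16,   # raise or lower until VRAM is nearly full
    n_ctx=2048,        # small context keeps the KV cache manageable
    n_batch=1,         # matches the recommended batch size below
    use_mmap=True,     # map the file rather than copying it into RAM
)

out = llm("Why does a 405B model need offloading on a 24 GB GPU?",
          max_tokens=64)
print(out["choices"][0]["text"])
```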

Recommended Settings

Batch size: 1
Context length: reduce as much as possible to minimize VRAM usage
Other settings: CPU offloading (experiment with different layer counts; see the sketch after this list for a starting estimate), use mmap to reduce RAM usage, optimize system RAM speed
Inference framework: llama.cpp (for CPU offloading)
Suggested quantization: q4_0 or lower (experiment with different levels)
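
To pick a starting value for the offloaded layer count, the sketch below assumes the quoted 162GB is spread roughly evenly across the 126 transformer layers of Llama 3.1 405B and reserves a few GB of VRAM for the KV cache and runtime overhead; both assumptions are approximate, not measured.

```python
# Starting estimate for n_gpu_layers: spread the quoted 162 GB evenly
# over 126 layers and reserve some VRAM for KV cache and runtime buffers.
REQUIRED_GB = 162.0
N_LAYERS = 126
GPU_VRAM_GB = 24.0
RESERVED_GB = 4.0   # assumed overhead for KV cache, CUDA context, buffers

per_layer_gb = REQUIRED_GB / N_LAYERS
layers_on_gpu = int((GPU_VRAM_GB - RESERVED_GB) // per_layer_gb)
print(f"~{per_layer_gb:.2f} GB per layer; roughly {layers_on_gpu} layers fit on the GPU")
# ~1.29 GB per layer; roughly 15 layers fit on the GPU
```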

Frequently Asked Questions

Is Llama 3.1 405B (405B parameters) compatible with the NVIDIA RTX 4090?
No, Llama 3.1 405B is not directly compatible with an NVIDIA RTX 4090 due to insufficient VRAM.
How much VRAM does Llama 3.1 405B need?
Llama 3.1 405B requires approximately 162GB of VRAM when quantized to q3_k_m.
How fast will Llama 3.1 405B run on the NVIDIA RTX 4090?
Llama 3.1 405B will not run on a single RTX 4090 without significant modifications. With CPU offloading, expect very slow generation, on the order of seconds per token at best, and far slower if the model spills from system RAM to disk.