Can I run Llama 3.3 70B on NVIDIA Jetson Orin Nano 8GB?

Verdict: Fail (out of memory). This GPU doesn't have enough VRAM.

GPU VRAM:  8.0 GB
Required:  140.0 GB
Headroom:  -132.0 GB

VRAM usage: 8.0 GB of 8.0 GB (100% used)

Technical Analysis

The NVIDIA Jetson Orin Nano 8GB cannot load Llama 3.3 70B as shipped. In FP16 precision the model needs roughly 140GB of memory just to hold its weights, while the Orin Nano provides 8GB of unified memory shared between the CPU and GPU, leaving a deficit of 132GB. The model therefore cannot be loaded and executed on the device without drastic modifications.
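The headline numbers fall out of simple arithmetic: weight memory is parameter count times bytes per parameter. A minimal sketch, where the parameter count and bits-per-weight figures are approximations:

```python
# Back-of-the-envelope estimate of weight memory for a dense transformer.
# Real usage is higher: KV cache, activations, and framework overhead add on top.
def weight_vram_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1e9

PARAMS = 70e9  # Llama 3.3 70B parameter count (approximate)

print(f"FP16 (2 bytes/param):   {weight_vram_gb(PARAMS, 2.0):.0f} GB")      # ~140 GB
print(f"Q2_K (~2.6 bits/param): {weight_vram_gb(PARAMS, 2.6 / 8):.0f} GB")  # ~23 GB
```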

Even aggressive quantization does not close the gap. A ~2-bit quantization such as Q2_K still needs roughly 23GB for the 70B model's weights, nearly three times the device's total memory. Bandwidth is the second wall: autoregressive decoding is memory-bound, since every generated token must stream the full weight set through memory, and the Orin Nano's roughly 0.07 TB/s (68 GB/s) of bandwidth would cap throughput at a few tokens per second even for a model that did fit. The Ampere GPU, with its 1024 CUDA cores and 32 Tensor Cores, is capable for its class, but the memory constraints dominate.
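To see why bandwidth rather than compute sets the ceiling, peak decode speed is roughly bandwidth divided by model size. A rough sketch using the figures above:

```python
# Rough upper bound on decode speed for a memory-bandwidth-bound workload:
# each generated token streams the full weight set through memory once.
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

ORIN_NANO_BW = 68.0  # GB/s, i.e. ~0.07 TB/s

# Even if a ~23 GB Q2_K build of the 70B model could be held in memory
# (it cannot, on an 8 GB device), ~3 tokens/s would be the ceiling.
print(f"{max_tokens_per_sec(ORIN_NANO_BW, 23.0):.1f} tokens/s")
```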

Recommendation

Due to the substantial VRAM requirements of Llama 3.3 70B, it is not practical to run this model directly on the NVIDIA Jetson Orin Nano 8GB; even with extreme quantization, performance would be unacceptably slow. Instead, consider smaller models that fit within the Orin Nano's memory, such as Llama 3.2 3B or a 4-bit-quantized Llama 3.1 8B, or other open models designed for edge devices.

Alternatively, offloading inference to a more powerful server with sufficient VRAM is a viable option. Frameworks like NVIDIA Triton Inference Server can facilitate this, allowing the Orin Nano to act as a client, sending inference requests to a remote server. This approach leverages the Orin Nano's capabilities for pre-processing and post-processing while relying on a more robust system for the computationally intensive inference task.
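As a sketch of the client side, the snippet below uses Triton's Python HTTP client. The server URL, model name, and tensor names ("text_input"/"text_output") are deployment-specific assumptions, not fixed by Triton itself:

```python
import numpy as np
import tritonclient.http as httpclient

# Hypothetical remote Triton server; substitute your deployment's address.
client = httpclient.InferenceServerClient(url="triton.example.local:8000")

# String tensors are sent as BYTES; the tensor names below depend on how
# the model was configured on the server.
prompt = np.array([b"Summarize the attached sensor log in one sentence."], dtype=np.object_)
inp = httpclient.InferInput("text_input", [1], "BYTES")
inp.set_data_from_numpy(prompt)

result = client.infer(model_name="llama-3-3-70b", inputs=[inp])
print(result.as_numpy("text_output"))
```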

Recommended Settings

Batch size: 1
Context length: potentially reduce to 2048 or lower to save memory
Other settings: enable memory offloading to system RAM (expect a significant slowdown), or use a smaller model entirely
Inference framework: llama.cpp (for experimentation with extreme quantization; see the sketch after this list)
Quantization suggested: Q2_K or even lower (if attempting local execution)
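If you do want something running locally, the more realistic path is a small model under llama.cpp. A minimal llama-cpp-python sketch, where the model path is hypothetical (a 4-bit Llama 3.1 8B GGUF is roughly 5 GB and fits in the 8 GB of unified memory with room for the KV cache):

```python
from llama_cpp import Llama

# Hypothetical path to a small quantized GGUF model that actually fits the device.
llm = Llama(
    model_path="/models/llama-3.1-8b-instruct-q4_k_m.gguf",
    n_ctx=2048,       # reduced context length, per the settings above
    n_gpu_layers=-1,  # offload all layers to the GPU (memory is unified on Jetson)
)

out = llm("Q: Name the planets of the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```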

Frequently Asked Questions

Is Llama 3.3 70B compatible with NVIDIA Jetson Orin Nano 8GB?
No, it is not directly compatible due to insufficient VRAM.
What VRAM is needed for Llama 3.3 70B?
Llama 3.3 70B requires approximately 140GB of VRAM in FP16 precision.
How fast will Llama 3.3 70B run on NVIDIA Jetson Orin Nano 8GB?
It will likely be too slow to be usable for most applications, even with extreme quantization. Offloading inference to a more powerful server is recommended.