The NVIDIA Jetson Orin Nano 8GB faces a significant challenge when attempting to run the Llama 3.3 70B model because of its limited memory. Llama 3.3 70B in FP16 precision requires approximately 140GB just to hold the weights. The Orin Nano provides only 8GB of LPDDR5 memory, and that memory is unified (shared among the GPU, the CPU, and the operating system rather than dedicated VRAM), leaving a deficit of roughly 132GB. This gap means the model cannot be loaded and executed directly on the device without drastic modifications.
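To make the gap concrete, the short sketch below estimates weight-only memory for a 70B-parameter model at a few common precisions. It deliberately ignores KV cache, activations, and runtime overhead, so real usage would be even higher; the byte-per-parameter figures are the standard ones for FP16, INT8, and 4-bit formats.

```python
# Back-of-envelope estimate of weight memory for a 70B-parameter model.
# KV cache, activations, and runtime overhead are excluded, so actual usage is higher.

PARAMS = 70e9          # 70 billion parameters
DEVICE_MEM_GB = 8      # Jetson Orin Nano 8GB (unified memory, shared with the OS)

bytes_per_param = {
    "FP16": 2.0,
    "INT8": 1.0,
    "4-bit": 0.5,
}

for precision, bpp in bytes_per_param.items():
    weights_gb = PARAMS * bpp / 1e9
    deficit_gb = weights_gb - DEVICE_MEM_GB
    fits = "fits" if deficit_gb <= 0 else "does not fit"
    print(f"{precision:>6}: ~{weights_gb:5.0f} GB of weights -> {fits} "
          f"(gap vs. 8 GB: {deficit_gb:+.0f} GB)")
```

Even the most aggressive 4-bit case lands around 35GB of weights, more than four times the device's total memory.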
Furthermore, even aggressive quantization cannot close the gap: at 4-bit precision the weights alone still occupy roughly 35GB, far beyond the 8GB available. And even in an idealized scenario where the quantized model did fit, autoregressive decoding is memory-bandwidth-bound, because essentially all of the weights must be streamed from memory for every generated token. With the Orin Nano's roughly 68 GB/s (about 0.07 TB/s) of memory bandwidth, throughput for a model of this size would be capped at one or two tokens per second at best. The Ampere-based GPU, with its 1024 CUDA cores and 32 Tensor Cores, is capable, but the memory capacity and bandwidth limitations overshadow its compute potential.
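The bandwidth ceiling follows from a common rule of thumb: for memory-bound decoding, tokens per second cannot exceed memory bandwidth divided by the bytes read per token, which for a dense model is roughly the size of the weights. The sketch below applies that bound; it ignores KV-cache traffic and any overlap, so it is an optimistic upper limit, and the model sizes are the same rough figures used above.

```python
# Optimistic upper bound on decode speed for a memory-bandwidth-bound model:
# each generated token requires streaming (roughly) all weights from memory.

BANDWIDTH_GB_S = 68.0   # Jetson Orin Nano 8GB LPDDR5, ~0.07 TB/s

def max_tokens_per_second(weights_gb: float, bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Upper bound: tokens/s <= bandwidth / bytes of weights read per token."""
    return bandwidth_gb_s / weights_gb

cases = [("70B @ FP16", 140.0), ("70B @ 4-bit", 35.0), ("8B @ 4-bit", 4.5)]
for label, weights_gb in cases:
    print(f"{label:>12}: <= {max_tokens_per_second(weights_gb):5.1f} tokens/s "
          f"(and the 70B cases do not fit in 8 GB at all)")
```

Even ignoring the capacity problem entirely, a 70B model on this bandwidth tops out around one to two tokens per second, while a 4-bit 8B model could plausibly reach double-digit rates.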
Due to the substantial memory requirements of Llama 3.3 70B, it is not practical to run this model directly on the NVIDIA Jetson Orin Nano 8GB: even with extreme quantization the weights do not fit, and performance would be unacceptably slow. Instead, consider smaller models that fit within the Orin Nano's 8GB, such as a 4-bit quantized Llama 3.1 8B or Llama 3.2 3B, or other open-source models designed for edge devices, as sketched below.
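As one hedged example of the smaller-model route, the sketch below loads a 4-bit GGUF model with llama-cpp-python, assuming it has been built with CUDA support on the Jetson. The model path, context size, and prompt are placeholders rather than a tested configuration.

```python
# Sketch: run a small quantized model that fits in the Orin Nano's 8GB of unified memory.
# Assumes llama-cpp-python was installed with CUDA support; the GGUF path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.1-8b-instruct-q4_k_m.gguf",  # hypothetical local path
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=2048,        # keep the context modest to limit KV-cache memory
)

out = llm(
    "Summarize why a 70B model cannot run locally on this board.",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```

Keeping the context window modest matters here, because the KV cache competes with the weights for the same 8GB of unified memory.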
Alternatively, offloading inference to a more powerful server with sufficient VRAM is a viable option. Frameworks like NVIDIA Triton Inference Server can facilitate this, allowing the Orin Nano to act as a client, sending inference requests to a remote server. This approach leverages the Orin Nano's capabilities for pre-processing and post-processing while relying on a more robust system for the computationally intensive inference task.
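A minimal client-side sketch using Triton's Python HTTP client is shown below. The server address, model name, and tensor names ("text_input", "text_output") are assumptions that depend entirely on how the remote backend (for example, TensorRT-LLM or a Python backend) is configured, so they must be matched to the actual deployment.

```python
# Sketch: Orin Nano acting as a thin client, sending requests to a remote Triton server.
# Server URL, model name, and tensor names are assumptions; adjust them to your deployment.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="inference-server.local:8000")  # hypothetical host

prompt = np.array([["Explain the Jetson Orin Nano's memory limits."]], dtype=object)

inp = httpclient.InferInput("text_input", prompt.shape, "BYTES")
inp.set_data_from_numpy(prompt)
out = httpclient.InferRequestedOutput("text_output")

result = client.infer(model_name="llama3_70b", inputs=[inp], outputs=[out])
print(result.as_numpy("text_output"))
```

In this arrangement the heavy lifting happens on the server, while the Orin Nano handles local data capture, prompt construction, and any post-processing of the returned text.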