The NVIDIA Jetson Orin Nano 8GB, with its Ampere-architecture GPU, 1024 CUDA cores, and 32 Tensor Cores, offers a capable platform for AI inference, especially considering its 15W power budget. Its primary limitation when running large vision-language models like LLaVA 1.6 7B is memory: the 8GB of LPDDR5 is unified memory shared between the CPU and GPU, so the model competes with the operating system and other processes for the same pool. LLaVA 1.6 7B in FP16 requires roughly 14GB just for its weights (7B parameters at 2 bytes each), plus the vision encoder and KV cache, leaving a shortfall of well over 6GB; the model cannot be loaded in FP16 without out-of-memory errors. The 68 GB/s memory bandwidth, while decent for the Orin Nano's class, further constrains the available workarounds: because memory is unified, there is no separate system RAM to offload layers to, and spilling to NVMe swap degrades throughput severely.
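As a rough back-of-envelope check (a sketch, not a measurement; the bytes-per-weight and vision-encoder figures below are approximations), the arithmetic looks like this:

```python
# Rough memory estimate for LLaVA 1.6 7B weights on an 8GB unified-memory device.
# Bytes-per-weight values are approximate; Q4_K_S averages a bit over 4 bits
# per weight once quantization scales are included.
PARAMS_LLM = 7.0e9      # language model parameters
PARAMS_VISION = 0.3e9   # CLIP ViT-L vision encoder, approximate
GIB = 1024 ** 3

def weights_gib(n_params: float, bytes_per_weight: float) -> float:
    """Weight memory in GiB for a given precision."""
    return n_params * bytes_per_weight / GIB

fp16 = weights_gib(PARAMS_LLM + PARAMS_VISION, 2.0)
# 4-bit LLM weights; the vision projector is typically kept in FP16
q4ks = weights_gib(PARAMS_LLM, 0.56) + weights_gib(PARAMS_VISION, 2.0)

print(f"FP16 weights:   {fp16:.1f} GiB  (does not fit in 8 GiB unified memory)")
print(f"Q4_K_S weights: {q4ks:.1f} GiB  (leaves room for KV cache and the OS)")
```

This yields roughly 13.6 GiB for FP16 versus about 4.2 GiB at Q4_K_S, which is why 4-bit quantization is the practical path on this board.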
Given this memory constraint, running LLaVA 1.6 7B in FP16 on the Jetson Orin Nano 8GB is not feasible. The most viable approach is aggressive quantization: a 4-bit quantization (Q4_K_S or similar) of the GGUF weights, run through llama.cpp or a framework built on it, cuts the weight footprint to roughly 4GB and brings the full pipeline within the 8GB budget. Quantization does reduce accuracy, so test the quantized model on representative prompts and images to confirm the output quality is acceptable for your application. If it is not, consider smaller vision-language models with lower memory requirements.
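As one possible setup, the sketch below uses llama-cpp-python to load a 4-bit LLaVA GGUF alongside its multimodal projector. The file names and image URL are placeholders, and n_ctx is an assumption to tune for your workload; Llava15ChatHandler is shown here, though newer llama-cpp-python releases also provide a LLaVA 1.6-specific handler, so check the version you install.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Placeholder paths: point these at your quantized LLaVA GGUF and its
# matching multimodal projector (mmproj) file.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-llava-1.6-7b-f16.gguf")

llm = Llama(
    model_path="llava-1.6-7b.Q4_K_S.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,        # must be large enough to hold the image embedding tokens
    n_gpu_layers=-1,   # offload all layers to the GPU (memory is unified anyway)
    logits_all=True,   # some llama-cpp-python versions require this for LLaVA
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/test.jpg"}},  # placeholder
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```

A smaller n_ctx keeps the KV cache modest, which matters here since the cache, the CLIP activations, and the OS all draw from the same 8GB pool as the weights.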