The NVIDIA Jetson Orin Nano 8GB falls far short of the roughly 26GB needed just to hold LLaVA 1.6 13B's weights in FP16 precision. Its 8GB of LPDDR5 is unified memory shared between the GPU, the CPU, and the operating system, so the GPU effectively has even less than 8GB available, and attempting to load the full model results in out-of-memory errors. While the Orin Nano's Ampere GPU provides 1024 CUDA cores and 32 Tensor cores to accelerate computation, its limited memory bandwidth of roughly 68 GB/s (about 0.07 TB/s) further constrains performance even if workarounds are employed to partially load the model. The model's 13 billion parameters also require substantial memory for activations and the KV cache during inference, on top of the weights, exacerbating the memory bottleneck.
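As a rough sanity check, the weight footprint at a few precisions works out as follows (weights only; activations, KV cache, and the vision encoder all add more):

```python
# Back-of-the-envelope weight memory for a 13B-parameter model at different
# precisions. Decimal GB is used to match the ~26GB figure above.
PARAMS = 13e9

for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.1f} GB of weights")

# FP16: ~26.0 GB, INT8: ~13.0 GB, INT4: ~6.5 GB, all measured against ~8 GB
# of unified memory that is shared with the CPU and the OS on the Orin Nano.
```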
Offloading layers elsewhere does not help much here either: on a Jetson, the GPU and CPU share the same physical memory pool, so "offloading to system RAM" frees nothing, and the only place to spill layers is swap on SD-card or NVMe storage. Streaming weights from storage drags inference speed down drastically, making real-time or interactive applications impractical. The Orin Nano's 15W maximum power mode, designed for efficiency, further limits the achievable computational throughput. The combination of insufficient memory and constrained bandwidth means that LLaVA 1.6 13B is fundamentally a poor fit for the Jetson Orin Nano 8GB without significant compromises.
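To see why bandwidth alone is already limiting, a common back-of-the-envelope model for autoregressive decoding is that each generated token must stream roughly the full set of weights from memory, so throughput is bounded by bandwidth divided by weight size. A sketch of that estimate, assuming about 68 GB/s and ignoring activation and KV-cache traffic:

```python
# Roofline-style upper bound on decode throughput: tokens/s <= bandwidth / weight size.
BANDWIDTH_GB_S = 68.0  # approximate Orin Nano 8GB memory bandwidth

for name, weight_gb in [("FP16 (~26 GB)", 26.0), ("INT4 (~6.5 GB)", 6.5)]:
    print(f"{name}: <= {BANDWIDTH_GB_S / weight_gb:.1f} tokens/s (best case)")

# Even if the FP16 model somehow fit, decoding would be capped near ~2.6 tokens/s;
# spilling layers to swap on storage pushes the real figure far lower still.
```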
Due to these severe memory limitations, running LLaVA 1.6 13B on the Jetson Orin Nano 8GB is not recommended. Quantization reduces the footprint, but even aggressive 4-bit (INT4) quantization leaves roughly 6.5 to 7GB of weights, and adding the vision encoder, KV cache, and OS overhead leaves little or no headroom within the shared 8GB. If you must use the Orin Nano, consider a smaller vision-language model that fits within the memory constraints, or explore cloud-based inference solutions where the model runs on a more powerful server. Alternatively, utilize the Orin Nano for pre-processing and offload the LLaVA inference to another machine.
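For reference, this is roughly what a 4-bit load attempt looks like with Hugging Face transformers. It is only a sketch: it assumes working bitsandbytes and accelerate builds for the Jetson's aarch64 CUDA stack (not a given), and even then the quantized weights plus the vision tower and KV cache are likely to exhaust the shared 8GB:

```python
import torch
from transformers import (
    BitsAndBytesConfig,
    LlavaNextForConditionalGeneration,
    LlavaNextProcessor,
)

# 4-bit NF4 quantization via bitsandbytes; compute still runs in FP16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "llava-hf/llava-v1.6-vicuna-13b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",       # lets accelerate decide placement; on the Jetson
    low_cpu_mem_usage=True,  # everything still shares the same 8GB pool
)
```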
If you are determined to run some version of LLaVA locally, experiment with extreme quantization and spilling layers to swap, but be prepared for very slow inference. Minimize batch size and context length to reduce memory usage. A smaller model, such as a 7B-parameter LLaVA 1.6 variant, is a far more realistic option for this hardware.
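If you go the 7B route, a minimal llama-cpp-python sketch might look like the following. It assumes llama-cpp-python built with CUDA support on the Jetson, a locally downloaded 4-bit GGUF of a 7B LLaVA variant, and its vision projector file; the file names and the choice of chat handler are placeholders, not specific recommendations:

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Placeholder paths: point these at whichever 7B LLaVA GGUF and projector you download.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-f16.gguf")

llm = Llama(
    model_path="llava-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,       # keep the context small to limit the KV cache
    n_batch=64,       # small batch to cap activation memory
    n_gpu_layers=-1,  # offload all layers; reduce if memory pressure appears
)
```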