Can I run LLaVA 1.6 13B on NVIDIA Jetson Orin Nano 8GB?

Verdict: Fail (out of memory). This GPU doesn't have enough VRAM.

GPU VRAM: 8.0 GB
Required: 26.0 GB
Headroom: -18.0 GB

VRAM Usage: 8.0 GB of 8.0 GB (100% used)

Technical Analysis

The NVIDIA Jetson Orin Nano 8GB carries 8GB of LPDDR5, and because the Jetson uses a unified memory architecture, that pool is shared between the GPU, the CPU, and the operating system. LLaVA 1.6 13B in FP16 precision needs roughly 26GB for its weights alone (13 billion parameters × 2 bytes each), so the model cannot be loaded and any attempt produces out-of-memory errors. While the Orin Nano's Ampere architecture provides 1024 CUDA cores and 32 Tensor cores to accelerate computation, its memory bandwidth of roughly 68 GB/s (0.07 TB/s) would constrain performance even if workarounds were employed to partially load the model. On top of the weights, inference needs additional memory for activations and the KV cache, which widens the gap further.
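To make the arithmetic concrete, here is a back-of-the-envelope sketch of the weight footprint at different precisions (weights only; the Q4_K_M bytes-per-weight figure is an approximation for llama.cpp K-quants):

```python
# Back-of-the-envelope weight-memory estimate. Weights only: activations,
# the KV cache, the vision encoder, and OS overhead all come on top.
PARAMS = 13e9  # LLaVA 1.6 13B language-model parameters

BYTES_PER_PARAM = {
    "FP16": 2.0,    # 16-bit floats
    "INT8": 1.0,    # 8-bit quantization
    "Q4_K_M": 0.6,  # ~4.8 bits/weight effective for llama.cpp K-quants (approx.)
}

for fmt, bpp in BYTES_PER_PARAM.items():
    gb = PARAMS * bpp / 1e9
    verdict = "fits" if gb < 8.0 else "does NOT fit"
    print(f"{fmt:>6}: ~{gb:.1f} GB of weights -> {verdict} in 8 GB (before overhead)")

# FP16 (~26 GB) and INT8 (~13 GB) do not fit. Q4_K_M (~7.8 GB) squeaks in
# numerically, but leaves no room for the vision tower, KV cache, or the OS,
# which all share the same 8 GB pool on the Jetson.
```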

On a discrete GPU, layers that do not fit in VRAM can be offloaded to system RAM; on the Jetson that escape hatch does not exist, because the GPU and CPU already share the same 8GB pool. Any overflow would have to spill to swap on storage (microSD or NVMe), whose throughput is orders of magnitude below the LPDDR5 bus, making real-time or interactive applications impractical. The Orin Nano's 15W power budget, designed for efficiency rather than peak throughput, limits compute further. The combination of insufficient memory and constrained bandwidth means that LLaVA 1.6 13B is fundamentally incompatible with the Jetson Orin Nano 8GB without significant compromises.
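For a sense of scale even in the hypothetical case where the weights fit: token generation is memory-bandwidth-bound, since each generated token streams the full weight set through the memory bus, so bandwidth divided by weight size gives an optimistic ceiling on tokens per second. A sketch under that assumption:

```python
# Optimistic tokens/sec ceiling from memory bandwidth alone (decode phase).
# Assumes every token reads all weights exactly once; real throughput is lower.
BANDWIDTH_GBS = 68.0  # Jetson Orin Nano 8GB LPDDR5, ~68 GB/s

for label, weights_gb in [("FP16 13B", 26.0), ("Q4_K_M 13B", 7.8), ("Q4_K_M 7B", 4.1)]:
    ceiling = BANDWIDTH_GBS / weights_gb
    print(f"{label:>10}: <= {ceiling:.1f} tokens/s (theoretical upper bound)")

# FP16 13B: <=2.6 tok/s IF it fit (it doesn't); Q4_K_M 13B: <=8.7 tok/s,
# but its weights alone nearly exhaust the shared 8 GB pool.
```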

Recommendation

Due to the severe memory limitations, running LLaVA 1.6 13B on the Jetson Orin Nano 8GB is not recommended. Quantization reduces the footprint, but a 13B model at 4-bit (Q4_K_M) still occupies roughly 8GB of weights, and on the Jetson the same 8GB pool must also hold the vision encoder, the KV cache, and the operating system. If you must use the Orin Nano, consider a smaller vision-language model that fits within the memory budget, or explore cloud-based inference solutions where the model runs on a more powerful server. Alternatively, utilize the Orin Nano for capture and pre-processing and offload the LLaVA inference to another machine.

If you are determined to run some version of LLaVA locally, experiment with extreme quantization, but be prepared for very slow inference. Keep the batch size at 1 and the context length short: the KV cache grows linearly with context and competes with the weights for the same 8GB pool. A smaller model, such as a 7B variant at 4-bit (roughly 4GB of weights), is the realistic option for this hardware.
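To see why trimming the context length helps, here is a rough KV-cache estimate using the standard formula (2 for K and V × layers × KV heads × head dim × bytes per element × tokens), with the layer and head counts of the LLaMA-2-13B backbone that LLaVA 1.6 13B builds on:

```python
# KV-cache size estimate for the LLaMA-2-13B backbone of LLaVA 1.6 13B.
LAYERS, KV_HEADS, HEAD_DIM = 40, 40, 128  # LLaMA-2-13B; no grouped-query attention
BYTES = 2  # FP16 cache entries

per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # ~0.8 MB per cached token

for ctx in (512, 2048, 4096):
    print(f"ctx {ctx:>4}: KV cache ~{per_token * ctx / 1e9:.2f} GB")

# ctx  512: ~0.42 GB   ctx 2048: ~1.68 GB   ctx 4096: ~3.36 GB
# On an 8 GB shared pool, a long context alone can eat nearly half the memory,
# which is why the recommended settings below cap the context at 512.
```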

Recommended Settings

Batch size: 1
Context length: 512
Inference framework: llama.cpp
Suggested quantization: Q4_K_M (4-bit)
Other settings: enable memory mapping (mmap) so weights page in from storage on demand, and reduce input image resolution to shrink the vision encoder's activation footprint
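As an illustration of how these settings map onto a real invocation, here is a minimal sketch using the llama-cpp-python bindings. The GGUF file names are placeholders, it loads a 7B variant (per the recommendation above, since 13B does not fit), and the exact handler class and defaults may differ across library versions:

```python
# Minimal sketch: applying the recommended settings via llama-cpp-python.
# File names are placeholders; Llava15ChatHandler is the LLaVA-style handler
# shipped with llama-cpp-python (API may vary by version).
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")

llm = Llama(
    model_path="llava-v1.6-7b.Q4_K_M.gguf",  # 7B quant; a 13B model will not fit
    chat_handler=chat_handler,
    n_ctx=512,        # short context per the settings above; note LLaVA 1.6's
                      # image tokens may require a larger window in practice
    n_batch=1,        # recommended batch size
    n_gpu_layers=-1,  # unified memory: GPU buffers draw on the same 8 GB pool
    use_mmap=True,    # page weights from storage instead of copying them
)

out = llm.create_chat_completion(messages=[
    {"role": "user", "content": [
        # local file URL; a base64 data URI also works in recent versions
        {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
        {"type": "text", "text": "Describe this image."},
    ]},
])
print(out["choices"][0]["message"]["content"])
```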

Frequently Asked Questions

Is LLaVA 1.6 13B compatible with NVIDIA Jetson Orin Nano 8GB?
No. LLaVA 1.6 13B needs roughly 26GB of memory in FP16, while the Orin Nano offers only 8GB of LPDDR5 shared across the GPU, CPU, and operating system.
What VRAM is needed for LLaVA 1.6 13B?
LLaVA 1.6 13B requires approximately 26GB of VRAM in FP16 precision.
How fast will LLaVA 1.6 13B run on NVIDIA Jetson Orin Nano 8GB?
LLaVA 1.6 13B is unlikely to run at a usable speed on the NVIDIA Jetson Orin Nano 8GB. Even with aggressive quantization and a minimal context, throughput is capped by the ~68 GB/s memory bandwidth (under ~9 tokens/s for a 4-bit 13B model in the best case), and in practice the model does not fit in the shared 8GB pool at all.