The primary limiting factor in running large language models (LLMs) such as LLaVA 1.6 34B is memory. In FP16 precision, the model's weights alone occupy roughly 68GB (34 billion parameters at 2 bytes each), before accounting for the KV cache, activations, or the vision encoder. The NVIDIA Jetson Orin Nano 8GB provides only 8GB of LPDDR5 memory, shared between the CPU and GPU, leaving a shortfall of roughly 60GB: the model cannot be loaded onto the device, and attempting to do so results in out-of-memory errors. The memory bandwidth of about 68 GB/s (0.07 TB/s), while adequate for smaller models, would also become a severe bottleneck if weights had to be swapped or streamed in during inference, crippling throughput. The Ampere-class GPU, with its CUDA and Tensor cores, is capable but cannot compensate for the fundamental shortage of memory.
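A weights-only estimate makes the gap concrete. The sketch below is a rough calculation, not a measurement: it assumes 1 GB = 10^9 bytes and ignores the KV cache, activations, and the vision tower, so real memory requirements are somewhat higher.

```python
# Back-of-the-envelope check of why the weights alone overflow the device.
# Assumption: weights-only estimate; KV cache, activations, and the vision
# encoder all add further memory on top of these figures.

def weights_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Approximate size of model weights in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bytes_per_param

PARAMS_B = 34          # LLaVA 1.6 34B
DEVICE_MEM_GB = 8      # Jetson Orin Nano 8GB (memory shared by CPU and GPU)

for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("Q4 (~4.5 bpw)", 0.5625)]:
    need = weights_gb(PARAMS_B, bytes_per_param)
    print(f"{label:>14}: ~{need:.0f} GB needed, headroom {DEVICE_MEM_GB - need:+.0f} GB")

# FP16: ~68 GB needed, headroom -60 GB  -> the shortfall described above
# INT8: ~34 GB, Q4: ~19 GB              -> still far above the 8 GB available
```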
Given this gap, running LLaVA 1.6 34B directly on the Jetson Orin Nano 8GB is not feasible, and aggressive quantization alone does not close it: at Q4_K_M (roughly 4.5 bits per weight) the 34B weights still occupy around 19GB, more than twice the device's total memory. The practical path is a smaller model, for example a 7B-class vision-language model quantized to 4-bit, which fits within the 8GB limit with room left for the KV cache and vision encoder (see the sketch below). Offloading layers to the CPU is possible but drastically reduces performance, making it unsuitable for real-time or interactive applications. If the 34B model is a hard requirement, use a GPU with sufficient memory or a cloud-based inference service.
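To illustrate the model-selection trade-off, here is a minimal feasibility sketch. The bytes-per-weight figure for Q4_K_M-style quantization and the 1.5GB reserved for the OS, CUDA context, KV cache, and image embeddings are assumptions chosen for illustration, not measured values; tune them for your workload.

```python
# Rough feasibility check for quantized vision-language models on an 8 GB budget.
# Assumptions: ~0.5625 bytes/weight (~4.5 bits) for Q4_K_M-style quantization,
# ~1.5 GB reserved for OS, CUDA context, KV cache, and image embeddings.

DEVICE_MEM_GB = 8.0
RESERVED_GB = 1.5              # assumed overhead; adjust for your workload
Q4_BYTES_PER_PARAM = 0.5625    # ~4.5 bits per weight

candidates_b = [34, 13, 7, 3]  # parameter counts in billions (e.g. LLaVA variants)

budget = DEVICE_MEM_GB - RESERVED_GB
for n in candidates_b:
    need = n * Q4_BYTES_PER_PARAM
    verdict = "fits" if need <= budget else "does not fit"
    print(f"{n:>3}B @ Q4: ~{need:.1f} GB -> {verdict} in {budget:.1f} GB budget")

# 34B (~19 GB) and 13B (~7.3 GB) overflow the budget; a ~7B model at Q4 (~3.9 GB)
# is the practical ceiling on this device under these assumptions.
```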