The NVIDIA Jetson Orin Nano 8GB, with its Ampere architecture, 1024 CUDA cores, and 32 Tensor Cores, is a suitable platform for running the BGE-Large-EN embedding model. The module's 8GB of LPDDR5 memory (shared between the CPU and GPU rather than dedicated VRAM) provides ample headroom for the model, which needs only about 0.7GB in FP16 precision. That leaves a theoretical 7.3GB for larger batch sizes, though in practice the OS and other processes claim part of this shared pool. While the memory bandwidth of roughly 68 GB/s (0.07 TB/s) is modest, it is sufficient for this relatively small 0.33B-parameter model and allows reasonable inference speeds.
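A quick back-of-the-envelope check confirms the footprint figure. The sketch below assumes the commonly cited ~335M parameter count for BGE-Large-EN; actual allocation will be somewhat higher once activations and runtime overhead are included.

```python
# Rough FP16 memory estimate for BGE-Large-EN (~335M parameters).
# Ballpark arithmetic only; real usage adds activations and framework overhead.
params = 335_000_000           # approximate parameter count of BGE-Large-EN
bytes_per_param_fp16 = 2       # FP16 stores each weight in 2 bytes

weights_gb = params * bytes_per_param_fp16 / 1e9
print(f"Model weights (FP16): ~{weights_gb:.2f} GB")          # ~0.67 GB

total_mem_gb = 8.0             # Orin Nano memory, shared between CPU and GPU
print(f"Theoretical headroom: ~{total_mem_gb - weights_gb:.1f} GB")
```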
The Ampere architecture's Tensor Cores accelerate the matrix multiplications that dominate the BGE-Large-EN workload, boosting performance. The estimated 90 tokens/sec inference speed is a reasonable expectation, though actual throughput will vary with the specific runtime, sequence lengths, and power mode. The suggested batch size of 32 uses the available memory effectively, maximizing throughput without exhausting the shared pool. The Orin Nano's 15W power envelope also makes it well suited to edge deployments where power efficiency is critical.
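To illustrate the batch-size-32 setting, here is a minimal sketch using the `sentence-transformers` library. It assumes a CUDA-enabled PyTorch build is installed on the Jetson and that the model is loaded from the `BAAI/bge-large-en-v1.5` checkpoint; throughput will vary with the selected power mode and clocks.

```python
# Minimal sketch: FP16 encoding with batch_size=32 (assumed setup,
# not a verified on-device benchmark).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
model.half()  # FP16 weights, matching the ~0.7 GB estimate above

texts = [f"example passage {i}" for i in range(256)]
embeddings = model.encode(
    texts,
    batch_size=32,              # batch size suggested above
    normalize_embeddings=True,  # BGE embeddings are typically L2-normalized
)
print(embeddings.shape)         # (256, 1024): BGE-Large outputs 1024-dim vectors
```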
For optimal performance on the Jetson Orin Nano 8GB, use an efficient runtime such as `llama.cpp` or ONNX Runtime, both of which are well suited to resource-constrained devices. Experiment with INT8 or even INT4 quantization to further reduce the memory footprint and potentially increase inference speed, accepting that embedding accuracy may degrade slightly. Monitor memory usage carefully, especially if other applications run concurrently on the shared pool.
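If the ONNX Runtime route is chosen, dynamic INT8 quantization is one low-effort way to shrink the footprint. The sketch below assumes the model has already been exported to ONNX (the file names are placeholders) and that `onnxruntime` with its quantization tools is installed; validate embedding quality on a retrieval task afterwards, since accuracy can drop slightly.

```python
# Hedged sketch: INT8 dynamic quantization of an exported BGE ONNX model.
# "bge-large-en.onnx" is a placeholder path, not a file shipped with the model.
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="bge-large-en.onnx",        # pre-exported model (placeholder)
    model_output="bge-large-en-int8.onnx",  # quantized output (placeholder)
    weight_type=QuantType.QInt8,            # INT8 weights
)

# Load the quantized model. Note: dynamically quantized ops may fall back to
# the CPU provider depending on the ORT build; GPU INT8 on Jetson typically
# goes through the TensorRT execution provider instead.
session = ort.InferenceSession(
    "bge-large-en-int8.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```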
Consider optimizing the input pipeline to minimize data transfer overhead. Pre-process and batch inputs efficiently to fully utilize the available compute resources. If the initial performance is insufficient, profile the application to identify bottlenecks and areas for further optimization, such as kernel tuning or custom operator implementations.
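A coarse wall-clock split between input preparation and inference is often enough to tell whether the pipeline or the model is the bottleneck before reaching for heavier tools such as Nsight Systems. This sketch reuses the hypothetical `model` object from the earlier sentence-transformers example.

```python
# Coarse benchmark: separates input preparation from encode time.
# Not a substitute for a real profiler, but a quick first signal.
import time

def benchmark(model, texts, batch_size=32):
    t0 = time.perf_counter()
    # Length-sorted inputs keep per-batch padding roughly uniform; some
    # frameworks (sentence-transformers included) already do this internally.
    prepared = sorted(texts, key=len)
    t1 = time.perf_counter()
    model.encode(prepared, batch_size=batch_size)
    t2 = time.perf_counter()
    print(f"prep: {t1 - t0:.3f}s, inference: {t2 - t1:.3f}s "
          f"({len(prepared) / (t2 - t1):.1f} texts/sec)")
```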