The NVIDIA Jetson Orin Nano 8GB is fundamentally incompatible with the DeepSeek-V3 model due to a massive memory disparity. DeepSeek-V3, with its 671 billion total parameters, needs roughly 1342GB for its weights alone at FP16 precision (2 bytes per parameter). The Jetson Orin Nano offers only 8GB of LPDDR5 memory, shared between the CPU and GPU, so the model cannot come close to fitting on the device. Even aggressive quantization does not change the picture: at 4-bit precision the weights still occupy roughly 336GB, more than 40 times the Orin Nano's total memory.
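A quick back-of-the-envelope calculation illustrates the gap. The parameter count is DeepSeek-V3's published total; the per-precision byte sizes are the standard values, and activations, KV cache, and runtime overhead would only add to these figures:

```python
# Approximate weight memory for a 671B-parameter model at several precisions.

PARAMS = 671e9          # DeepSeek-V3 total parameter count
ORIN_NANO_MEM_GB = 8    # Jetson Orin Nano 8GB unified memory

bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    weights_gb = PARAMS * nbytes / 1e9
    print(f"{precision}: ~{weights_gb:,.0f} GB "
          f"({weights_gb / ORIN_NANO_MEM_GB:,.0f}x the Orin Nano's memory)")
```

Running this prints roughly 1342GB for FP16, 671GB for FP8, and 336GB for INT4, every one of them orders of magnitude beyond an 8GB device.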
Furthermore, even if capacity were somehow not the issue, the Jetson Orin Nano's memory bandwidth of roughly 68 GB/s would be a severe bottleneck. Because the Orin Nano uses unified memory shared by the CPU and GPU, any weights that do not fit would have to be streamed from far slower storage, and even weights resident in memory must be re-read for every generated token. The Ampere GPU architecture and Tensor Cores of the Orin Nano accelerate compute, but they cannot overcome insufficient memory capacity and bandwidth. The model is simply too large for the available resources, making real-time or even near-real-time inference infeasible.
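A rough roofline-style sketch makes the bandwidth ceiling concrete. It assumes the published ~68 GB/s memory bandwidth, FP16 weights, and DeepSeek-V3's roughly 37B activated (MoE-routed) parameters per token, and it optimistically pretends the weights already fit in memory:

```python
# Upper bound on decode speed if generation is memory-bandwidth bound:
# every generated token requires the active weights to be read once.

BANDWIDTH_GBPS = 68      # Jetson Orin Nano 8GB memory bandwidth (approx.)
ACTIVE_PARAMS = 37e9     # DeepSeek-V3 activated parameters per token
BYTES_PER_PARAM = 2      # FP16

active_weight_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9
tokens_per_second = BANDWIDTH_GBPS / active_weight_gb

print(f"Active weights per token: ~{active_weight_gb:.0f} GB")
print(f"Best-case decode rate:    ~{tokens_per_second:.2f} tokens/s")
# ...and this assumes all 671B parameters are already resident in memory,
# which is impossible on an 8GB device.
```

Even under these generous assumptions the ceiling is below one token per second, before accounting for the fact that the weights cannot be resident at all.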
Given these hardware constraints, directly running DeepSeek-V3 on the Jetson Orin Nano 8GB is not practical. Instead of trying to run the full model, consider smaller, more efficient models designed for edge devices with limited resources; models in roughly the 1-7B parameter range, especially when quantized, fit comfortably within 8GB. Distillation techniques can also produce compact models that retain much of the larger model's performance.
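As one illustration, a small instruction-tuned model can be run on the device with Hugging Face transformers in a few lines. The model name below is only an example, not a specific recommendation; any model of a few billion parameters or less, in FP16 or quantized form, should fit within 8GB:

```python
# Minimal sketch: run a small instruction-tuned model locally on the Jetson.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # illustrative small model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # FP16 keeps a 1.5B model around 3 GB
    device_map="auto",           # place weights on the Orin Nano's GPU
)

prompt = "Explain edge AI in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```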
Alternatively, you could offload inference to a more powerful server with sufficient VRAM and processing power. The Jetson Orin Nano then acts as a thin client, sending requests to the server and receiving the results, which lets you leverage DeepSeek-V3 without overwhelming the limited resources of the edge device. Hosted cloud inference services offer the same pattern without the need to run your own server.
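A minimal client sketch of this pattern, assuming the remote server exposes an OpenAI-compatible chat endpoint (as servers such as vLLM or SGLang do); the URL, model name, and prompt below are placeholders for your own deployment:

```python
# The Orin Nano acts as a client, forwarding a chat request to a remote
# server that actually hosts DeepSeek-V3.
import requests

SERVER_URL = "http://inference-server.local:8000/v1/chat/completions"  # placeholder

payload = {
    "model": "deepseek-v3",  # whatever model name the server registers
    "messages": [{"role": "user", "content": "Summarize this sensor log: ..."}],
    "max_tokens": 256,
}

response = requests.post(SERVER_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

The same request shape works against hosted cloud APIs; only the URL and authentication change.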