The AMD RX 7900 XT, equipped with 20GB of GDDR6 VRAM and an RDNA 3 architecture, offers ample resources for running the BGE-Large-EN embedding model. BGE-Large-EN, with its 0.33 billion parameters, requires a mere 0.7GB of VRAM in FP16 precision. This leaves a substantial 19.3GB of VRAM headroom, ensuring smooth operation even with larger batch sizes or when running other processes concurrently. The RX 7900 XT's memory bandwidth of 0.8 TB/s further contributes to efficient data transfer, minimizing potential bottlenecks during inference.
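The figures above follow from simple arithmetic: FP16 stores two bytes per parameter. A quick sanity-check sketch (model size and VRAM totals taken from the text; real usage adds activations and framework overhead on top of the weights):

```python
def fp16_weight_vram_gb(n_params: float) -> float:
    """Approximate VRAM to hold model weights in FP16 (2 bytes per parameter).

    Ignores activations and framework overhead, which add to the real footprint.
    """
    return n_params * 2 / 1024**3

TOTAL_VRAM_GB = 20.0                        # RX 7900 XT
weights_gb = fp16_weight_vram_gb(0.33e9)    # BGE-Large-EN, ~0.33B parameters
headroom_gb = TOTAL_VRAM_GB - weights_gb
print(f"weights ~{weights_gb:.2f} GB, headroom ~{headroom_gb:.1f} GB")
```

The raw weight figure comes out slightly under the 0.7GB quoted above; the difference is the runtime overhead that any real deployment carries.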
While the RX 7900 XT lacks NVIDIA-style Tensor Cores, RDNA 3 compute units do include WMMA (Wave Matrix Multiply Accumulate) instructions that accelerate the matrix multiplications at the heart of inference, though software support for them is less mature than on NVIDIA hardware. The estimated 63 tokens/second throughput indicates respectable performance, but it is only an estimate; actual results will vary with the inference framework, optimization techniques, and system configuration. Note that the VRAM headroom cannot extend the context window: BGE-Large-EN's BERT-style position embeddings cap input at 512 tokens regardless of available memory, so the spare VRAM is better spent on larger batches.
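To turn the throughput estimate into planning numbers, a back-of-the-envelope helper can be useful (the 63 tokens/second default is the estimate from the text; the corpus size and average document length in the example are illustrative assumptions, not measurements):

```python
def corpus_embed_seconds(n_docs: int, avg_tokens: int,
                         tokens_per_second: float = 63.0) -> float:
    """Rough wall-clock time to embed a corpus at a given token throughput."""
    return n_docs * avg_tokens / tokens_per_second

# Illustrative: 10,000 documents averaging 256 tokens each
secs = corpus_embed_seconds(10_000, 256)
print(f"~{secs / 3600:.1f} hours at 63 tok/s")
```

Running the same arithmetic against a measured throughput from your own benchmark gives a far more trustworthy schedule than any published estimate.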
Given the generous VRAM headroom, prioritize larger batch sizes to improve throughput. Experiment with batch sizes up to the suggested 32, or higher, while monitoring VRAM usage to avoid exceeding capacity. An inference framework with AMD support, such as ONNX Runtime or a ROCm-enabled PyTorch or TensorFlow build, is essential for good performance. Quantization beyond FP16, such as INT8, can further increase inference speed, but validate embedding quality on a representative workload before deploying at lower precision. Finally, install the latest AMD drivers to benefit from the most recent performance optimizations.
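One way to experiment with batch size safely is to budget VRAM explicitly rather than probing for out-of-memory errors. A minimal sketch, assuming a per-item activation cost that you would measure empirically with your framework's memory profiler (the 0.05 GB/item figure below is a placeholder, not a measured value for BGE-Large-EN):

```python
def max_safe_batch(total_vram_gb: float, weights_gb: float,
                   per_item_gb: float, reserve_gb: float = 2.0) -> int:
    """Largest batch that fits after reserving VRAM for weights and overhead."""
    usable = total_vram_gb - weights_gb - reserve_gb
    return max(1, int(usable // per_item_gb))

# per_item_gb is a placeholder; measure real per-sample activation memory
batch = max_safe_batch(20.0, 0.7, per_item_gb=0.05)
```

The `reserve_gb` margin covers driver allocations, fragmentation, and any concurrent processes; shrink it only after observing stable memory usage under load.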