The AMD RX 7900 XT, with its 20GB of GDDR6 VRAM and RDNA 3 architecture, is exceptionally well-suited to running the BGE-Small-EN embedding model. BGE-Small-EN has roughly 33 million (0.03B) parameters and needs only about 0.1GB of VRAM in FP16 precision. That leaves roughly 19.9GB of headroom, enough for large batches and for running multiple instances of the model concurrently without memory pressure. The card's 0.8 TB/s of memory bandwidth keeps data moving quickly between VRAM and the compute units, further helping performance.
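The headroom figures above can be sanity-checked with a quick back-of-the-envelope calculation. The parameter count, the 0.1GB estimate, and the 20GB capacity come from the text; the 2-bytes-per-parameter factor is standard for FP16:

```python
# Back-of-the-envelope VRAM math for BGE-Small-EN on a 20GB card.
PARAMS = 33_000_000          # ~0.03B parameters (BGE-Small-EN)
BYTES_PER_PARAM_FP16 = 2     # FP16 stores each weight in 2 bytes

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9   # raw weight storage
model_gb = 0.1               # estimate incl. activations and runtime overhead
headroom_gb = 20.0 - model_gb

print(f"FP16 weights: {weights_gb:.3f} GB")
print(f"Headroom:     {headroom_gb:.1f} GB")
```

The raw FP16 weights come to about 0.066GB, so the 0.1GB figure leaves comfortable room for activations and framework overhead.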
While the RX 7900 XT lacks the dedicated tensor cores found in NVIDIA GPUs, RDNA 3 compute units include AI accelerators with WMMA (Wave Matrix Multiply-Accumulate) instructions that handle the matrix multiplications in BGE-Small-EN efficiently. Given the model's small size, performance is limited primarily by memory bandwidth and compute throughput rather than VRAM capacity, so expect fast inference and the ability to embed large batches of text concurrently. The estimated 63 tokens/second reflects the raw processing power available and can be improved further with appropriate software and settings.
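To see why bandwidth is not the bottleneck at this scale, consider a rough roofline-style lower bound: the time to stream the FP16 weights from VRAM once per forward pass, using the 0.8 TB/s figure from the text. This is an illustrative simplification, not a benchmark:

```python
# Rough bandwidth floor: time to stream the model's FP16 weights once.
WEIGHTS_BYTES = 33_000_000 * 2      # ~66 MB of FP16 weights
BANDWIDTH_BPS = 0.8e12              # 0.8 TB/s peak memory bandwidth

floor_s = WEIGHTS_BYTES / BANDWIDTH_BPS
print(f"Weight-streaming floor per pass: {floor_s * 1e6:.1f} microseconds")
# The floor is under 100 microseconds, so at batch size 1 real-world
# throughput is dominated by kernel launch overhead and compute, not VRAM
# bandwidth -- hence the advice below to batch aggressively.
```

A floor this small is why batching pays off: amortizing fixed per-launch costs over many inputs recovers most of the hardware's headroom.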
The AMD RX 7900 XT is an excellent choice for running BGE-Small-EN. To maximize performance, use ONNX Runtime with an execution provider that targets AMD GPUs (DirectML on Windows, ROCm on Linux). Experiment with different batch sizes to find the best balance between throughput and latency; since VRAM is not a limiting factor, raising the batch size to the suggested 32 or higher will likely improve overall tokens/second.
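A simple way to run that batch-size experiment is a sweep harness like the sketch below. `embed_batch` is a hypothetical stand-in: replace its body with your real encoder call (for example, a `session.run(...)` on an ONNX Runtime session):

```python
import time

def embed_batch(texts):
    """Stand-in for a real encoder call (e.g. an ONNX Runtime session.run).
    Here it just returns dummy 384-dim vectors, matching BGE-Small-EN's
    embedding width, so the harness is runnable on its own."""
    return [[len(t) * 0.001] * 384 for t in texts]

def sweep(texts, batch_sizes=(8, 16, 32, 64)):
    """Measure texts/second at each batch size and return the results."""
    results = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        for i in range(0, len(texts), bs):
            embed_batch(texts[i:i + bs])
        elapsed = time.perf_counter() - start
        results[bs] = len(texts) / elapsed
    return results

corpus = [f"document number {i}" for i in range(512)]
for bs, tps in sweep(corpus).items():
    print(f"batch={bs:>3}: {tps:,.0f} texts/s")
```

With a real model behind `embed_batch`, throughput typically climbs with batch size until the GPU saturates, which is the crossover point to record.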
Consider using mixed precision (FP16, or INT8 quantization if your chosen framework supports it) to improve performance further. Although BGE-Small-EN is already small, quantization reduces the memory footprint and improves compute efficiency. Monitor GPU utilization; if the GPU is not fully loaded, increase the batch size or run multiple inference processes concurrently.
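The precision trade-off can be made concrete with a quick footprint comparison across the common formats (the parameter count is from the text; the per-precision byte widths are standard):

```python
# Rough weight-storage footprint per precision for a ~33M-parameter model.
PARAMS = 33_000_000
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

for prec, nbytes in BYTES_PER_PARAM.items():
    print(f"{prec}: {PARAMS * nbytes / 1e6:.0f} MB")
# INT8 halves the FP16 footprint. For a model this small the VRAM saving is
# negligible on a 20GB card; the practical benefit is higher compute
# efficiency, provided the runtime has fast INT8 kernels for the GPU.
```

Since VRAM is plentiful here, measure end-to-end throughput before and after quantizing: the win has to come from faster kernels, not from memory savings.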