The AMD RX 7900 XTX, with its 24GB of GDDR6 VRAM and RDNA 3 architecture, is exceptionally well-suited for running the BGE-Large-EN embedding model. BGE-Large-EN is a relatively small model at roughly 0.33 billion parameters, requiring only about 0.7GB of VRAM in FP16 precision. That leaves roughly 23.3GB of headroom for large batch sizes and concurrent workloads. The card's 0.96 TB/s of memory bandwidth keeps the compute units well fed, so memory is unlikely to be the bottleneck for a model this size.
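The arithmetic behind those figures is straightforward; here is a quick back-of-envelope sketch (the parameter count is an approximation, not a measured value):

```python
# Rough FP16 VRAM estimate for BGE-Large-EN on a 24GB card.
# All figures are approximations: ~335M parameters at 2 bytes each.
params = 335_000_000        # approximate BGE-Large-EN parameter count
bytes_per_param = 2         # FP16 = 2 bytes per weight
total_vram_gb = 24.0        # RX 7900 XTX

weights_gb = params * bytes_per_param / 1e9
print(f"model weights: {weights_gb:.2f} GB")                   # ~0.67 GB
print(f"VRAM headroom: {total_vram_gb - weights_gb:.1f} GB")   # ~23.3 GB
```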
While the RX 7900 XTX lacks the dedicated Tensor Cores found in NVIDIA GPUs, RDNA 3 includes AI accelerators with WMMA matrix instructions, and the architecture provides ample compute for efficient inference. The estimated throughput of roughly 63 tokens/second is a solid starting point that can be improved through software-level work such as quantization and optimized inference runtimes. The large VRAM capacity also leaves room to experiment with bigger batch sizes, which can raise overall throughput. Note, however, that AMD's ROCm software stack can present challenges compared to NVIDIA's CUDA ecosystem, so careful driver selection and framework configuration are important.
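Before benchmarking anything, it is worth confirming that a ROCm build of PyTorch actually sees the GPU. ROCm builds reuse the `torch.cuda` namespace, so the familiar calls work unchanged; a minimal sanity check might look like this:

```python
# Verify that a ROCm build of PyTorch can see the RX 7900 XTX.
# On ROCm builds, the torch.cuda API is backed by HIP, so these
# calls work even though no NVIDIA CUDA runtime is present.
import torch

print(torch.__version__)              # ROCm wheels report e.g. "2.x.x+rocm6.x"
print(torch.version.hip)              # HIP version string; None on CUDA builds
print(torch.cuda.is_available())      # True if the ROCm runtime found a GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "AMD Radeon RX 7900 XTX"
```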
Given the ample VRAM available on the RX 7900 XTX, experiment with larger batch sizes to maximize throughput; the estimated batch size of 32 is a reasonable baseline to grow from. Consider optimized inference runtimes such as ONNX Runtime, or even adapting the model to projects like `llama.cpp` (primarily designed for LLMs, but its optimization techniques can carry over). Test different driver versions thoroughly, since AMD driver performance can vary, and monitor GPU utilization to spot bottlenecks. A simple batch-size sweep, as sketched below, is a good way to find the throughput knee.
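A minimal sweep using `sentence-transformers` might look like the following; the model name is the public `BAAI/bge-large-en-v1.5` checkpoint, the sample text is illustrative, and real numbers will depend on sequence length, drivers, and ROCm version:

```python
# Batch-size throughput sweep for BGE-Large-EN with sentence-transformers.
# Assumes a ROCm build of PyTorch; "cuda" maps to the AMD GPU on ROCm.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
texts = ["A short example sentence for embedding benchmarks."] * 2048

model.encode(texts[:64], batch_size=32, show_progress_bar=False)  # warm-up

for batch_size in (32, 64, 128, 256):
    start = time.perf_counter()
    model.encode(texts, batch_size=batch_size, show_progress_bar=False)
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size:4d}: {len(texts) / elapsed:7.1f} sentences/s")
```

Watch `rocm-smi` while the sweep runs; if GPU utilization sags between batches, CPU-side tokenization may be the bottleneck rather than the card itself.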
Consider quantization even though the model is already small: INT8 or lower precision can further reduce VRAM usage and potentially increase inference speed, at the cost of a slight reduction in accuracy. Always validate accuracy after applying any quantization method to confirm it still meets your requirements. Finally, make sure the ROCm version you install matches what your chosen framework supports.
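As a hedged sketch of that workflow, ONNX Runtime's dynamic quantization can produce an INT8 model from an FP32 ONNX export, and a cosine-similarity comparison is a cheap first accuracy check. The file paths below are placeholders (you would first export the model to ONNX, e.g. via `optimum` or `torch.onnx.export`), the export's expected inputs may differ from what the tokenizer emits, and the INT8 model may fall back to the CPU execution provider depending on operator support:

```python
# Sketch: dynamic INT8 quantization of an ONNX export of BGE-Large-EN,
# followed by a cosine-similarity check against the FP32 model.
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType
from transformers import AutoTokenizer

# Placeholder paths: export the FP32 model to ONNX first.
quantize_dynamic(
    model_input="bge-large-en.onnx",
    model_output="bge-large-en.int8.onnx",
    weight_type=QuantType.QInt8,
)

tok = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
inputs = dict(tok("A validation sentence.", return_tensors="np"))

def embed(session):
    # CLS-token embedding from the last hidden state, L2-normalized,
    # matching BGE's recommended pooling.
    out = session.run(None, inputs)[0][:, 0]
    return out / np.linalg.norm(out, axis=1, keepdims=True)

fp32 = ort.InferenceSession("bge-large-en.onnx")
int8 = ort.InferenceSession("bge-large-en.int8.onnx")
cos = float((embed(fp32) * embed(int8)).sum())
print(f"FP32 vs INT8 cosine similarity: {cos:.4f}")  # expect close to 1.0
```

A single-sentence check is only a smoke test; before deploying, validate on a retrieval benchmark that reflects your actual workload.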