The AMD RX 7800 XT, with its 16GB of GDDR6 VRAM and RDNA 3 architecture, exhibits excellent compatibility with the BGE-Large-EN embedding model. BGE-Large-EN, a relatively small model with 0.33 billion parameters, requires only 0.7GB of VRAM when using FP16 precision. This leaves a substantial VRAM headroom of 15.3GB on the RX 7800 XT, ensuring that the model and associated processes can operate comfortably without memory constraints. The RX 7800 XT's memory bandwidth of 0.62 TB/s is also more than adequate for efficiently loading the model weights and processing the data required for inference.
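The footprint and headroom figures above follow from simple arithmetic on the parameter count and precision; a minimal sketch of that calculation (the parameter count and card specs come from the text; no other values are assumed):

```python
# Back-of-the-envelope VRAM estimate for BGE-Large-EN on the RX 7800 XT.
PARAMS = 0.33e9           # BGE-Large-EN parameter count (from the text)
BYTES_PER_PARAM_FP16 = 2  # FP16 stores each weight in 2 bytes
GPU_VRAM_GB = 16.0        # RX 7800 XT total VRAM

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9  # ~0.66 GB, i.e. the ~0.7 GB cited
headroom_gb = GPU_VRAM_GB - weights_gb            # ~15.3 GB left for activations etc.

print(f"weights: {weights_gb:.2f} GB, headroom: {headroom_gb:.1f} GB")
```

Note that this counts only the weights; activations, framework overhead, and the KV of any co-resident models eat into the headroom in practice.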
While the RX 7800 XT lacks the dedicated Tensor Cores found in NVIDIA GPUs, its 3840 stream processors (AMD's analogue of CUDA cores) can still deliver reasonable performance for AI tasks. An estimated throughput of 63 tokens/sec at a batch size of 32 indicates solid inference speed for this model on this GPU. The RDNA 3 architecture includes optimizations for compute workloads, which contributes to the achieved performance, though the absence of dedicated matrix units may leave it behind an otherwise equivalent NVIDIA card with Tensor Cores.
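As a rough sanity check, per-batch latency can be derived from the cited throughput and batch size. A minimal sketch, assuming the 63/sec figure refers to embedded sequences (the unit is ambiguous for embedding models, where throughput is often quoted per sequence rather than per token):

```python
# Estimate how long one batch takes and how many batches fit in a minute,
# taking the figures from the text above at face value.
BATCH_SIZE = 32   # batch size cited in the text
RATE = 63.0       # throughput cited in the text (per-sequence reading assumed)

latency_s = BATCH_SIZE / RATE          # ~0.51 s to embed one batch of 32
batches_per_min = 60.0 / latency_s     # ~118 batches per minute

print(f"~{latency_s:.2f} s/batch, ~{batches_per_min:.0f} batches/min")
```

If the figure is instead per token, per-batch latency scales up by the average sequence length, so pinning down the unit matters before capacity planning.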
Given the ample VRAM headroom, users can experiment with larger batch sizes or run multiple instances of the model concurrently to maximize GPU utilization. While FP16 is sufficient, INT8 quantization may increase inference speed further, at the cost of a slight reduction in accuracy. Monitor GPU utilization and temperature during extended use to ensure the card stays within safe thermal limits, especially given its 263W TDP. For further optimization, explore inference frameworks designed for AMD GPUs, such as those built on ROCm.
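The quantization and multi-instance suggestions above can be quantified with the same weight arithmetic; a minimal sketch, where the 20% per-instance activation overhead is an illustrative assumption, not a measured value:

```python
# Compare FP16 and INT8 weight footprints for BGE-Large-EN, and derive a
# naive upper bound on concurrent instances in the RX 7800 XT's 16 GB.
PARAMS = 0.33e9       # BGE-Large-EN parameter count (from the text)
GPU_VRAM_GB = 16.0    # RX 7800 XT total VRAM

fp16_gb = PARAMS * 2 / 1e9   # ~0.66 GB at 2 bytes/weight
int8_gb = PARAMS * 1 / 1e9   # ~0.33 GB: INT8 halves the weight footprint

# Assumed 20% overhead per instance for activations and runtime buffers.
per_instance_gb = fp16_gb * 1.2
max_instances = int(GPU_VRAM_GB // per_instance_gb)  # naive ceiling, ~20

print(f"FP16 {fp16_gb:.2f} GB, INT8 {int8_gb:.2f} GB, ~{max_instances} instances")
```

In practice the usable instance count is lower, since batch activations grow with batch size and the driver reserves some VRAM, but the estimate shows why this card is nowhere near memory-bound for this model.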