The NVIDIA RTX 3070 Ti, with its 8GB of GDDR6X VRAM, is an excellent match for the BGE-Large-EN embedding model. At 0.33B parameters, BGE-Large-EN requires only about 0.7GB of VRAM in FP16 precision, leaving roughly 7.3GB of headroom for activations, batching, and framework overhead. The card's 0.61 TB/s of memory bandwidth comfortably covers the model's data transfer needs, so bandwidth is unlikely to become a performance bottleneck.
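As a sanity check, the headroom figure follows directly from the parameter count. A back-of-envelope sketch (real usage adds activations and CUDA context on top of the weights):

```python
# Back-of-envelope VRAM estimate for BGE-Large-EN in FP16:
# 2 bytes per parameter; actual usage is somewhat higher once
# activations and the CUDA context are loaded.
PARAMS = 0.33e9        # BGE-Large-EN parameter count
BYTES_PER_PARAM = 2    # FP16
GPU_VRAM_GB = 8.0      # RTX 3070 Ti

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"Weights (FP16): {weights_gb:.2f} GB")               # ~0.66 GB
print(f"Headroom:       {GPU_VRAM_GB - weights_gb:.2f} GB") # ~7.3 GB
```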
Furthermore, the RTX 3070 Ti's 6144 CUDA cores and 192 Tensor cores make for efficient inference, and the Ampere architecture's hardware-accelerated FP16 support benefits BGE-Large-EN directly. The estimated 90 tokens/sec at a batch size of 32 is a realistic expectation given the model size and GPU capabilities, though actual figures will vary with the inference framework and system configuration.
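For a concrete starting point, here is a minimal FP16 inference sketch using sentence-transformers (an assumption; the section doesn't name a framework). `BAAI/bge-large-en` is the public Hugging Face checkpoint for BGE-Large-EN; the sample text and batch count are illustrative:

```python
import time
import torch
from sentence_transformers import SentenceTransformer

# Load BGE-Large-EN on the GPU and cast to FP16 to use
# Ampere's Tensor-core acceleration.
model = SentenceTransformer("BAAI/bge-large-en", device="cuda")
model.half()

sentences = ["Retrieval-augmented generation pairs an embedding model "
             "with a vector store."] * 256

start = time.perf_counter()
embeddings = model.encode(sentences, batch_size=32,
                          normalize_embeddings=True)
elapsed = time.perf_counter() - start

print(f"Embedding shape: {embeddings.shape}")  # (256, 1024)
print(f"Sentences/sec:   {len(sentences) / elapsed:.1f}")
print(f"Peak VRAM:       {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```

Measured throughput on your own corpus is the number that matters, since sequence length and tokenization dominate embedding-model cost.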
Given the ample VRAM headroom, you can experiment with larger batch sizes or even run multiple instances of BGE-Large-EN concurrently on the RTX 3070 Ti. Consider an optimized inference framework such as vLLM or FasterTransformer to maximize throughput. While FP16 provides a good balance of speed and accuracy, if you encounter numerical instability you can switch to BF16, which trades some precision for a wider dynamic range, where your framework supports it; both ideas are sketched below. Monitor GPU utilization and memory to confirm resources are well used and to spot bottlenecks early.
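A minimal sketch of both suggestions, assuming the same sentence-transformers setup as above: cast to BF16 as the fallback dtype and sweep batch sizes while recording peak VRAM (the sizes and sample text are illustrative, not tuned values):

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en", device="cuda")
model.to(torch.bfloat16)  # BF16 fallback; natively supported on Ampere

docs = ["sample passage for batching experiments"] * 2048
for batch_size in (32, 64, 128, 256):
    torch.cuda.reset_peak_memory_stats()
    # convert_to_tensor keeps results as torch tensors, avoiding a
    # BF16-to-NumPy conversion.
    _ = model.encode(docs, batch_size=batch_size, convert_to_tensor=True)
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"batch={batch_size:<4} peak VRAM={peak_gb:.2f} GB")
```

Stop increasing the batch size once peak VRAM approaches the 8GB ceiling or throughput plateaus; `nvidia-smi` gives the same utilization picture from outside the process.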