The NVIDIA RTX 3090 Ti, with its 24GB of GDDR6X VRAM, provides ample memory for running the Phi-3 Medium 14B model, especially with INT8 quantization. The model's weights occupy approximately 14GB in INT8, leaving roughly 10GB of headroom. That headroom covers the KV cache for larger context lengths, bigger batch sizes, or other applications running concurrently. The 3090 Ti's 1.01 TB/s memory bandwidth keeps the compute units fed with weight data from VRAM, which is crucial for sustaining high inference speeds.
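As a back-of-the-envelope check (the figures are rough, and the KV-cache number in the comment is a typical estimate rather than a measurement):

```python
# Rough VRAM budget for Phi-3 Medium (14B) in INT8 on a 24GB card.
PARAMS = 14e9        # parameter count
BYTES_PER_PARAM = 1  # INT8 stores one byte per weight
VRAM_TOTAL_GB = 24.0 # RTX 3090 Ti

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9  # ~14 GB of weights
headroom_gb = VRAM_TOTAL_GB - weights_gb     # ~10 GB left over

# The KV cache grows with context length and batch size; for a model of
# this size it is typically on the order of 1 GB per 4k-token sequence
# (model-dependent), so the headroom comfortably covers long contexts
# or several concurrent requests.
print(f"weights ~{weights_gb:.0f} GB, headroom ~{headroom_gb:.0f} GB")
```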
Furthermore, the RTX 3090 Ti's Ampere architecture, equipped with 10752 CUDA cores and 336 third-generation Tensor cores, is well-suited to the matrix multiplications that dominate large language model inference. Tensor cores, designed specifically for mixed-precision deep learning workloads, deliver a significant speedup over CUDA cores alone. The estimated 60 tokens/sec is a reasonable expectation: single-stream decoding is memory-bandwidth-bound, so the hardware ceiling sits near memory bandwidth divided by model size, and framework overhead accounts for the gap. Actual performance will still vary with the inference framework used and the level of optimization applied.
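The arithmetic behind that ceiling is simple: each generated token has to stream essentially all of the model's weights from VRAM once. A quick sketch:

```python
# Bandwidth-bound ceiling for single-stream (batch-1) decoding:
# each generated token reads roughly all INT8 weights from VRAM once.
BANDWIDTH_GB_S = 1010  # RTX 3090 Ti memory bandwidth
WEIGHTS_GB = 14        # Phi-3 Medium in INT8

ceiling_tps = BANDWIDTH_GB_S / WEIGHTS_GB  # ~72 tokens/sec
print(f"theoretical ceiling ~{ceiling_tps:.0f} tokens/sec")
# Attention, KV-cache reads, and kernel launch overhead eat into this,
# which is why ~60 tokens/sec is a realistic figure in practice.
```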
To maximize performance, use an optimized inference framework such as `llama.cpp` or `vLLM`, both of which are designed to exploit the RTX 3090 Ti's hardware effectively. Experiment with batch sizes to balance throughput against latency: a batch size of 3 is a good starting point, and increasing it may raise aggregate tokens/sec at the cost of higher per-request latency. If your chosen framework supports speculative decoding, it can provide a further boost.
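As a sketch of what a vLLM setup might look like: note that `"your-org/phi-3-medium-gptq"` below is a hypothetical placeholder for a quantized checkpoint (the base FP16 weights, at roughly 28GB, would not fit in 24GB), and the sampling settings are illustrative.

```python
from vllm import LLM, SamplingParams

# "your-org/phi-3-medium-gptq" is a placeholder; substitute a real
# quantized checkpoint and match `quantization` to its format.
llm = LLM(
    model="your-org/phi-3-medium-gptq",
    quantization="gptq",          # or "awq", depending on the checkpoint
    gpu_memory_utilization=0.90,  # leave a sliver of VRAM for the system
    max_model_len=4096,           # cap context to bound KV-cache growth
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM batches requests internally, so submitting several prompts at
# once exercises the batch-size trade-off discussed above.
prompts = ["Summarize INT8 quantization in two sentences."] * 3
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```

With `llama.cpp`, the equivalent knobs are `-ngl` (number of layers to place on the GPU) and `-b` (batch size).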
Also, keep the NVIDIA drivers up to date to benefit from the latest performance improvements and bug fixes, and monitor GPU utilization and memory usage during inference to identify bottlenecks. If the model still doesn't fit in VRAM, or if performance is insufficient, consider quantizing further to INT4 (which shrinks the footprint at some cost in quality) or offloading some layers to the CPU, though offloading will noticeably reduce speed.
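A minimal monitoring loop using NVML's Python bindings (install with `pip install nvidia-ml-py`) is one way to watch for those bottlenecks while inference runs:

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first (only) GPU

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        # High GPU% with flat tokens/sec suggests a compute bottleneck;
        # used VRAM creeping toward 24 GB points at KV-cache pressure.
        print(f"GPU {util.gpu:3d}%  "
              f"VRAM {mem.used / 1e9:5.1f} / {mem.total / 1e9:.1f} GB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```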