The NVIDIA H100 SXM, with its 80GB of HBM3 memory and 3.35 TB/s of memory bandwidth, offers ample resources for running the Phi-3 Medium 14B model, especially when quantized to INT8. At INT8, the weights of Phi-3 Medium 14B occupy roughly 14GB of VRAM, leaving around 66GB of headroom for the KV cache, activations, and batching. That headroom allows for larger batch sizes and longer context lengths, maximizing GPU utilization. The H100's 16,896 CUDA cores and 528 Tensor Cores comfortably cover the model's compute demands, so excellent inference speeds are well within reach.
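As a rough sanity check on those figures, the weight footprint scales linearly with bytes per parameter; the sketch below reproduces the 14GB/66GB split and shows how FP16/BF16 and FP32 compare (weight-only estimates that ignore KV cache and activation memory):

```python
# Back-of-the-envelope weight-memory estimate for Phi-3 Medium 14B on an
# 80GB H100 SXM. Weight-only figures; the KV cache, activations, and
# framework overhead consume additional VRAM.
params = 14e9
total_vram_gb = 80
bytes_per_param = {"INT8": 1, "FP16/BF16": 2, "FP32": 4}

for precision, nbytes in bytes_per_param.items():
    weights_gb = params * nbytes / 1e9
    print(f"{precision:>9}: weights ~{weights_gb:.0f} GB, "
          f"headroom ~{total_vram_gb - weights_gb:.0f} GB")
```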
Furthermore, the H100's high memory bandwidth keeps the per-token cost of streaming weights from HBM low, which is the dominant cost in autoregressive decoding. The Hopper architecture's Tensor Cores accelerate the matrix multiplications at the heart of transformer models like Phi-3. Together, the ample VRAM, high memory bandwidth, and specialized hardware make fast, efficient inference possible, and the estimated 90 tokens/sec reflects what this configuration can realistically deliver.
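To put the 90 tokens/sec estimate in context, a simple bandwidth-roofline calculation gives an upper bound on single-stream decode speed: each generated token requires reading roughly the full weight set from HBM once. This ignores KV-cache traffic and kernel overhead, so it is an optimistic ceiling rather than a prediction:

```python
# Bandwidth-bound ceiling on single-stream decode throughput.
# Assumes every generated token streams the full INT8 weight set from HBM
# once; KV-cache reads and kernel overhead lower this in practice.
hbm_bandwidth_gb_s = 3350   # H100 SXM HBM3, ~3.35 TB/s
weights_gb = 14             # Phi-3 Medium 14B at INT8

ceiling = hbm_bandwidth_gb_s / weights_gb
print(f"Theoretical ceiling: ~{ceiling:.0f} tokens/sec")  # ~239; 90 tok/s sits well below it
```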
Given the H100's capabilities, prioritize maximizing batch size to improve throughput. Experiment with different batch sizes, starting from the estimated 23, and monitor GPU utilization; if the GPU isn't fully utilized, increase the batch size further. A serving framework such as vLLM or NVIDIA's TensorRT-LLM provides further optimization and can raise tokens/sec, as in the sketch below. Techniques like speculative decoding can boost inference speed further still.
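Here is a minimal vLLM offline-inference sketch, assuming an INT8-quantized Phi-3 Medium checkpoint is available; the Hugging Face repo name and the quantization backend are placeholders and should be swapped for whatever export you actually serve. Note that under vLLM's continuous batching, max_num_seqs caps the number of concurrently scheduled sequences rather than fixing a static batch size:

```python
# Minimal vLLM sketch (model ID and quantization backend are illustrative
# assumptions; point these at your own INT8 checkpoint, e.g. a GPTQ- or
# AWQ-quantized export of Phi-3 Medium).
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-medium-4k-instruct",  # assumed HF repo; swap for your INT8 checkpoint
    # quantization="gptq",                       # set to match the checkpoint's quantization format
    max_num_seqs=23,              # starting point from the batch-size estimate above
    gpu_memory_utilization=0.90,  # leave some VRAM for CUDA graphs and fragmentation
    max_model_len=4096,
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain INT8 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

From there, raise max_num_seqs gradually while watching GPU utilization and latency to find the throughput sweet spot for your workload.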
While INT8 quantization offers a good balance of performance and memory usage, you can also run the model in FP16 or BF16 if higher precision is required, at the cost of roughly doubling the weight footprint from about 14GB to about 28GB. For most applications, however, INT8 should provide sufficient accuracy with significant performance benefits.
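For a quick side-by-side of the two precisions, the sketch below loads the model in BF16 and in bitsandbytes INT8 via Hugging Face Transformers; the model ID is an assumption, and bitsandbytes' LLM.int8() is only one of several INT8 schemes, so it may differ from the quantization used by your serving stack:

```python
# Sketch: loading Phi-3 Medium in BF16 vs. bitsandbytes INT8 with Transformers.
# Model ID is assumed; adjust to the checkpoint you actually use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-medium-4k-instruct"  # assumed Hugging Face repo
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Higher precision: ~2 bytes/parameter, roughly 28GB of weights.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# INT8: ~1 byte/parameter, roughly 14GB of weights.
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```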