The NVIDIA H100 PCIe, with its 80GB of HBM2e VRAM and 2.0 TB/s of memory bandwidth, is exceptionally well-suited for running the Qwen 2.5 14B model, especially when quantized to INT8. INT8 quantization reduces the model's weight footprint to approximately 14GB, leaving roughly 66GB of headroom that is shared by the KV cache, activations, and runtime overhead. That headroom allows for larger batch sizes, longer context lengths, and potentially multiple model instances running concurrently. The H100's Hopper architecture, with 14,592 CUDA cores and 456 fourth-generation Tensor Cores, is built for the dense matrix multiplications at the core of transformer models like Qwen 2.5, so very high throughput is achievable.
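The arithmetic behind those figures is simple enough to sanity-check yourself. The sketch below is a back-of-the-envelope estimate only; the parameter count (~14.8B) is an assumption for illustration, and the remaining VRAM is not all usable, since the KV cache and framework buffers also live there.

```python
# Back-of-the-envelope VRAM arithmetic for Qwen 2.5 14B at INT8.
# PARAMS_BILLION is an assumed approximate total parameter count.

PARAMS_BILLION = 14.8          # approximate total parameters (assumption)
BYTES_PER_PARAM_INT8 = 1.0     # INT8 stores one byte per weight
H100_PCIE_VRAM_GB = 80.0       # H100 PCIe capacity

weights_gb = PARAMS_BILLION * BYTES_PER_PARAM_INT8   # 1e9 params * 1 byte ~= 1 GB
headroom_gb = H100_PCIE_VRAM_GB - weights_gb         # shared by KV cache, activations, runtime

print(f"Weight footprint: ~{weights_gb:.0f} GB")
print(f"Remaining VRAM:   ~{headroom_gb:.0f} GB")
```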
Given the H100's capabilities and the model's size, focus on maximizing throughput by experimenting with larger batch sizes. Start from the estimated batch size of 23 and increase it incrementally until tokens/sec shows diminishing returns. A context length of 131,072 tokens is feasible, but monitor performance closely, since longer contexts increase KV-cache usage and per-request latency. For best results, serve the model with a framework such as vLLM or NVIDIA's TensorRT-LLM, both of which are designed to exploit the Hopper architecture. Consider further quantization to INT4 or even NF4 to push batch size and throughput higher, but be mindful of the potential accuracy trade-offs.
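As a starting point for that batch-size sweep, the following is a minimal offline sketch using vLLM. The checkpoint name, quantization method, and context setting are assumptions; substitute whichever INT8 (or INT4) variant you actually deploy.

```python
# Hedged sketch: sweep batch size with vLLM and report tokens/sec.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8",  # assumed INT8 checkpoint
    quantization="gptq",
    max_model_len=131072,          # full 128K context; lower this if latency suffers
    gpu_memory_utilization=0.90,   # leave a margin below the 80GB ceiling
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
prompt = "Explain the Hopper architecture in one paragraph."

# Start near the estimated batch size of 23 and grow until tok/s flattens out.
for batch_size in (23, 32, 48, 64):
    prompts = [prompt] * batch_size
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch_size:3d}  {generated / elapsed:8.1f} tok/s")
```

The knee in that curve, where added batch size stops improving tokens/sec, is a reasonable ceiling for a throughput-oriented deployment; back off from it if per-request latency matters more.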