The NVIDIA H100 SXM, with its 80GB of HBM3 memory and 3.35 TB/s memory bandwidth, is exceptionally well-suited for running the Gemma 2 27B model, especially when quantized. The q3_k_m quantization brings the model's VRAM footprint down to a mere 10.8GB, leaving a substantial 69.2GB of VRAM headroom. This large headroom allows for increased batch sizes and longer context lengths without encountering memory limitations. The H100's Hopper architecture, featuring 16896 CUDA cores and 528 Tensor Cores, provides ample computational power for efficient inference.
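Below is a minimal back-of-envelope sketch of that VRAM budget. The 10.8GB weight figure comes from the estimate above; the per-token KV-cache cost and the reserved overhead are assumed ballpark values that depend on the model config, cache precision, and serving framework.

```python
# Rough VRAM budget for q3_k_m Gemma 2 27B on an 80GB H100 SXM.
TOTAL_VRAM_GB = 80.0
WEIGHTS_GB = 10.8                 # quantized weights (estimate from the text)
KV_BYTES_PER_TOKEN = 0.7e6        # assumed ~0.7 MB/token for an fp16 KV cache
RESERVED_GB = 4.0                 # assumed slack for activations and framework overhead

headroom_gb = TOTAL_VRAM_GB - WEIGHTS_GB - RESERVED_GB
max_cached_tokens = int(headroom_gb * 1e9 / KV_BYTES_PER_TOKEN)
print(f"~{headroom_gb:.1f} GB left for KV cache -> roughly {max_cached_tokens:,} cached tokens")
# Under these assumptions, e.g. a batch of 8 requests at 8K context fits comfortably.
```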
Because single-stream decoding is memory-bandwidth-bound, the H100's 3.35 TB/s of HBM3 bandwidth is what keeps token generation fast; the Tensor Cores themselves are significantly underutilized by a q3_k_m quantized Gemma 2 27B model. This leaves plenty of capacity to step up to a larger model, run multiple models concurrently, or increase the batch size and context length to improve throughput and user experience. The estimated 90 tokens/sec represents a very responsive, practical inference speed for most applications.
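A quick roofline-style sanity check makes the same point: for memory-bound decoding, an upper bound on single-stream tokens/sec is the memory bandwidth divided by the bytes read per token (roughly the quantized weight size). Real throughput lands well below this ceiling once kernel launches, sampling, and KV-cache reads are accounted for, which is consistent with the 90 tokens/sec estimate.

```python
# Bandwidth-bound ceiling for single-stream decode on an H100 SXM.
BANDWIDTH_GBPS = 3350.0     # H100 SXM HBM3 bandwidth in GB/s
WEIGHTS_GB = 10.8           # q3_k_m Gemma 2 27B weights (estimate from the text)

ceiling_tok_s = BANDWIDTH_GBPS / WEIGHTS_GB
print(f"Theoretical ceiling: ~{ceiling_tok_s:.0f} tokens/sec per stream")
# The ~90 tokens/sec estimate above sits well under this ceiling, matching the
# observation that the GPU's compute units are underutilized at this model size.
```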
For optimal performance, put the ample VRAM headroom to work: increase the batch size up to the limit your inference framework and application support, and experiment with longer context lengths if your workload benefits from them. While q3_k_m offers a small memory footprint, consider higher-precision quantizations (e.g., q4_k_m, or even FP16 if running multiple models concurrently isn't a priority) to improve output quality; the H100 has the resources to handle them. If throughput or latency falls short, profile the inference pipeline to pinpoint the actual bottleneck before tuning further.
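As one hedged example, here is how those knobs might look when serving the q3_k_m GGUF through llama-cpp-python, a common route for k-quant files. The model filename and parameter values are illustrative placeholders, not measured optima; tune them against your framework's limits and latency targets.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2-27b-it-Q3_K_M.gguf",  # hypothetical local filename
    n_gpu_layers=-1,   # offload every layer; the H100 has ample VRAM headroom
    n_ctx=8192,        # raise further if your application needs longer context
    n_batch=512,       # prompt-processing batch size; larger values use more VRAM
)

out = llm("Explain HBM3 in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```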
Explore distributed inference options if you plan to scale to even larger models in the future. While the H100 can handle Gemma 2 27B with ease, future models may require distributing the workload across multiple GPUs. Be sure to choose an inference framework that supports distributed inference and is optimized for NVIDIA GPUs.
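For reference, a minimal sketch of tensor-parallel inference with vLLM is shown below, assuming a multi-GPU node; the model ID and parallel degree are illustrative, and a single H100 does not need this for Gemma 2 27B.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-2-27b-it",  # unquantized weights sharded across GPUs
    tensor_parallel_size=2,         # split each layer across two GPUs
)

outputs = llm.generate(
    ["Summarize the Hopper architecture in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```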