The NVIDIA H100 SXM, with its 80GB of HBM3 memory and 3.35 TB/s memory bandwidth, is exceptionally well-suited for running the Gemma 2 27B model, especially when quantized. The q3_k_m quantization brings the model's VRAM footprint down to a mere 10.8GB, leaving a substantial 69.2GB of VRAM headroom. This large headroom allows for increased batch sizes and longer context lengths without encountering memory limitations. The H100's Hopper architecture, featuring 16896 CUDA cores and 528 Tensor Cores, provides ample computational power for efficient inference.
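Below is a minimal back-of-envelope sketch of that VRAM budget. The 10.8GB weight figure comes from the estimate above; the per-token KV-cache cost and the reserved overhead are assumed ballpark values that depend on the model config, cache precision, and serving framework.

```python
# Rough VRAM budget for q3_k_m Gemma 2 27B on an 80GB H100 SXM.
TOTAL_VRAM_GB = 80.0
WEIGHTS_GB = 10.8                 # quantized weights (estimate from the text)
KV_BYTES_PER_TOKEN = 0.7e6        # assumed ~0.7 MB/token for an fp16 KV cache
RESERVED_GB = 4.0                 # assumed slack for activations and framework overhead

headroom_gb = TOTAL_VRAM_GB - WEIGHTS_GB - RESERVED_GB
max_cached_tokens = int(headroom_gb * 1e9 / KV_BYTES_PER_TOKEN)
print(f"~{headroom_gb:.1f} GB left for KV cache -> roughly {max_cached_tokens:,} cached tokens")
# Under these assumptions, e.g. a batch of 8 requests at 8K context fits comfortably.
```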
Because single-stream decoding is memory-bandwidth-bound, the H100's 3.35 TB/s of HBM3 bandwidth is what keeps token generation fast; the Tensor Cores themselves are significantly underutilized by a q3_k_m quantized Gemma 2 27B model. This leaves plenty of capacity to step up to a larger model, run multiple models concurrently, or increase the batch size and context length to improve throughput and user experience. The estimated 90 tokens/sec represents a very responsive, practical inference speed for most applications.
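A quick roofline-style sanity check makes the same point: for memory-bound decoding, an upper bound on single-stream tokens/sec is the memory bandwidth divided by the bytes read per token (roughly the quantized weight size). Real throughput lands well below this ceiling once kernel launches, sampling, and KV-cache reads are accounted for, which is consistent with the 90 tokens/sec estimate.

```python
# Bandwidth-bound ceiling for single-stream decode on an H100 SXM.
BANDWIDTH_GBPS = 3350.0     # H100 SXM HBM3 bandwidth in GB/s
WEIGHTS_GB = 10.8           # q3_k_m Gemma 2 27B weights (estimate from the text)

ceiling_tok_s = BANDWIDTH_GBPS / WEIGHTS_GB
print(f"Theoretical ceiling: ~{ceiling_tok_s:.0f} tokens/sec per stream")
# The ~90 tokens/sec estimate above sits well under this ceiling, matching the
# observation that the GPU's compute units are underutilized at this model size.
```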
For optimal performance, put the ample VRAM headroom to work: increase the batch size up to the limit your inference framework and application support, and experiment with longer context lengths if your workload benefits from them. While q3_k_m offers a small memory footprint, consider higher-precision quantizations (e.g., q4_k_m, or even FP16 if running multiple models concurrently isn't a priority) to improve output quality; the H100 has the resources to handle them. If throughput or latency falls short, profile the inference pipeline to pinpoint the actual bottleneck before tuning further.
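As one hedged example, here is how those knobs might look when serving the q3_k_m GGUF through llama-cpp-python, a common route for k-quant files. The model filename and parameter values are illustrative placeholders, not measured optima; tune them against your framework's limits and latency targets.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2-27b-it-Q3_K_M.gguf",  # hypothetical local filename
    n_gpu_layers=-1,   # offload every layer; the H100 has ample VRAM headroom
    n_ctx=8192,        # raise further if your application needs longer context
    n_batch=512,       # prompt-processing batch size; larger values use more VRAM
)

out = llm("Explain HBM3 in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```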
Explore distributed inference options if you plan to scale to even larger models in the future. While the H100 can handle Gemma 2 27B with ease, future models may require distributing the workload across multiple GPUs. Be sure to choose an inference framework that supports distributed inference and is optimized for NVIDIA GPUs.
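For reference, a minimal sketch of tensor-parallel inference with vLLM is shown below, assuming a multi-GPU node; the model ID and parallel degree are illustrative, and a single H100 does not need this for Gemma 2 27B.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-2-27b-it",  # unquantized weights sharded across GPUs
    tensor_parallel_size=2,         # split each layer across two GPUs
)

outputs = llm.generate(
    ["Summarize the Hopper architecture in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```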