Can I run Gemma 2 27B (q3_k_m) on NVIDIA H100 SXM?

Verdict: Perfect. Yes, you can run this model!

GPU VRAM: 80.0GB
Required: 10.8GB
Headroom: +69.2GB

VRAM Usage

10.8GB of 80.0GB used (~14%)

Performance Estimate

Tokens/sec ~90.0
Batch size 12
Context 8192

Technical Analysis

The NVIDIA H100 SXM, with its 80GB of HBM3 memory and 3.35 TB/s memory bandwidth, is exceptionally well-suited for running the Gemma 2 27B model, especially when quantized. The q3_k_m quantization brings the model's VRAM footprint down to a mere 10.8GB, leaving a substantial 69.2GB of VRAM headroom. This large headroom allows for increased batch sizes and longer context lengths without encountering memory limitations. The H100's Hopper architecture, featuring 16896 CUDA cores and 528 Tensor Cores, provides ample computational power for efficient inference.
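As a rough sanity check on the 10.8GB figure, the weight footprint can be approximated from the parameter count and the effective bits per weight of the quantization. The sketch below is a back-of-the-envelope estimate only: the q3_k_m bits-per-weight value is back-solved from the 10.8GB figure above rather than taken from a GGUF specification, the other values are approximate averages, and the KV cache and activation overhead that consume part of the headroom are ignored.

# Back-of-the-envelope weight footprint: params * bits_per_weight / 8.
# Bits-per-weight values are approximations, not exact GGUF numbers.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(f"q3_k_m: {weight_gb(27.0, 3.2):.1f} GB")   # ~10.8 GB, matching the estimate above
print(f"q4_k_m: {weight_gb(27.0, 4.85):.1f} GB")  # ~16.4 GB
print(f"fp16:   {weight_gb(27.0, 16.0):.1f} GB")  # ~54.0 GB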

Given the H100's high memory bandwidth, weight and KV-cache reads from HBM are fast, which keeps data-transfer stalls to a minimum. Even so, single-stream decoding of a q3_k_m-quantized Gemma 2 27B is memory-bound rather than compute-bound, so the H100's raw compute power is significantly underutilized. That leaves plenty of capacity to run a larger model, serve multiple models concurrently, or raise the batch size and context length to improve throughput and user experience. The estimated ~90 tokens/sec is a very responsive, practical inference speed for most applications.
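To see why decoding leaves compute idle, a simple bandwidth bound is useful: each generated token must stream the full weight set from HBM at least once, so single-stream tokens/sec cannot exceed bandwidth divided by model size. The sketch below uses the figures quoted above and ignores KV-cache traffic, kernel overheads, and batching, so it is an optimistic ceiling rather than a prediction.

# Optimistic single-stream decode ceiling: bandwidth / bytes read per token (the weights).
bandwidth_gb_s = 3350.0   # H100 SXM HBM3, ~3.35 TB/s
weights_gb = 10.8         # q3_k_m footprint from above

ceiling_tok_s = bandwidth_gb_s / weights_gb
print(f"single-stream ceiling: ~{ceiling_tok_s:.0f} tokens/sec")  # ~310; the ~90 tok/s estimate sits well under it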

Recommendation

For optimal performance, leverage the ample VRAM headroom by increasing the batch size to the maximum your inference framework and application support, and experiment with longer context lengths if your application needs them. While q3_k_m keeps the memory footprint small, consider higher-precision quantization levels (e.g., q4_k_m, or even FP16 if running multiple models concurrently isn't a priority) to improve output quality; the H100 has the resources to handle it. If performance falls short of expectations, profile the serving stack to locate the actual bottleneck before optimizing.

Explore distributed inference if you plan to scale to even larger models. While the H100 handles Gemma 2 27B with ease, larger future models may require spreading the workload across multiple GPUs, so choose an inference framework that supports distributed inference and is optimized for NVIDIA GPUs.
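As an illustration of what multi-GPU scaling can look like in a framework such as vLLM, the hypothetical sketch below shards a model across two GPUs with tensor parallelism. The model id and settings are assumptions for illustration; consult your framework's documentation for the exact arguments and for quantization support.

# Hypothetical multi-GPU tensor-parallel setup with vLLM's Python API.
# Model id and sizes are illustrative assumptions, not verified values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-2-27b-it",  # assumed Hugging Face model id
    tensor_parallel_size=2,         # shard weights across 2 GPUs
    max_model_len=8192,
)
outputs = llm.generate(
    ["Summarize tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)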

Recommended Settings

Batch size: 12
Context length: 8192
Other settings: Enable CUDA graph capture; use PyTorch 2.0 or higher; enable XQA
Inference framework: vLLM
Suggested quantization: q4_k_m
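One plausible way to map these settings onto vLLM's Python API is sketched below. The model id, GGUF handling, and memory fraction are assumptions rather than verified values; XQA in particular is a kernel-level optimization whose availability depends on your backend, so it is not shown here.

# Sketch: applying the recommended settings via vLLM's offline LLM class.
# The model id and memory fraction are assumptions for illustration.
from vllm import LLM

llm = LLM(
    model="google/gemma-2-27b-it",  # assumed model id; a GGUF file would need vLLM's GGUF support
    max_model_len=8192,             # context length
    max_num_seqs=12,                # batch size (max concurrent sequences)
    gpu_memory_utilization=0.90,    # reserve most of the 80GB for weights plus KV cache
)
print(llm.generate(["Hello"])[0].outputs[0].text)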

Frequently Asked Questions

Is Gemma 2 27B (27.00B) compatible with NVIDIA H100 SXM?
Yes, Gemma 2 27B is perfectly compatible with the NVIDIA H100 SXM, offering substantial VRAM headroom.
What VRAM is needed for Gemma 2 27B (27.00B)?
With q3_k_m quantization, Gemma 2 27B requires approximately 10.8GB of VRAM.
How fast will Gemma 2 27B (27.00B) run on NVIDIA H100 SXM?
Expect an estimated inference speed of around 90 tokens/sec with the suggested quantization and batch size.