Can I run Mistral 7B (q3_k_m) on NVIDIA H100 PCIe?

Perfect
Yes, you can run this model!
GPU VRAM: 80.0GB
Required: 2.8GB
Headroom: +77.2GB

VRAM Usage: ~3% of 80.0GB used

Performance Estimate

Tokens/sec: ~117.0
Batch size: 32
Context: 32768 tokens

Technical Analysis

The NVIDIA H100 PCIe, with its 80GB of HBM2e VRAM and 2.0 TB/s memory bandwidth, offers ample resources for running the Mistral 7B model, especially in its quantized q3_k_m form, which requires only about 2.8GB of VRAM. This leaves roughly 77.2GB of VRAM headroom, allowing for large batch sizes and concurrent inference tasks. The H100's 14592 CUDA cores and 456 Tensor Cores will significantly accelerate the matrix multiplications and other computations inherent in transformer-based models like Mistral 7B.
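As a rough back-of-the-envelope check, the weight footprint of a quantized model can be estimated from its parameter count and the effective bits per weight. The sketch below assumes roughly 3.2 bits per weight for q3_k_m (a mixed 3/4-bit format) and ignores KV-cache and runtime overhead, so treat the numbers as approximations:

```python
def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only VRAM estimate: parameters * bits / 8, in gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Assumption: q3_k_m averages ~3.2 bits per weight (mixed 3-/4-bit blocks).
print(estimate_weight_vram_gb(7.0, 3.2))   # ~2.8 GB of weights
print(estimate_weight_vram_gb(7.0, 16.0))  # ~14 GB at FP16, still far under 80 GB
```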

The memory bandwidth is crucial for feeding data to the compute units efficiently. With 2.0 TB/s, the H100 can keep the CUDA and Tensor Cores saturated with data, minimizing latency and maximizing throughput. The estimated 117 tokens/sec is a reasonable expectation, but actual performance will depend on factors like the specific inference framework used, the prompt complexity, and the batch size. The high VRAM also allows for experimenting with larger context lengths, potentially exceeding the model's default 32768 tokens, although this might impact performance and should be tested carefully.
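For a memory-bound decode step, an upper bound on single-stream tokens/sec is memory bandwidth divided by the bytes read per token (roughly the weight footprint). This is only a ceiling: real throughput sits well below it, and the ~117 tokens/sec estimate already accounts for kernel, dequantization, and framework overhead. A minimal sketch, assuming the weights are read once per generated token:

```python
# Bandwidth-bound ceiling on decode throughput (ignores KV-cache reads,
# kernel launch overhead, and dequantization cost, so it is optimistic).
memory_bandwidth_gb_s = 2000.0   # H100 PCIe: ~2.0 TB/s
weight_footprint_gb = 2.8        # Mistral 7B q3_k_m weights

ceiling_tokens_per_s = memory_bandwidth_gb_s / weight_footprint_gb
print(f"theoretical ceiling: ~{ceiling_tokens_per_s:.0f} tokens/sec per stream")
# Real single-stream numbers land far below this; batching recovers throughput.
```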

Quantization to q3_k_m reduces the model's memory footprint and computational requirements, making it feasible to run on GPUs with less VRAM. However, it comes at the cost of some accuracy. The H100's sheer power means that even with quantization, a high level of performance can be maintained. Furthermore, the large VRAM allows for multiple instances of the model to be loaded simultaneously, increasing overall throughput if needed.
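If you use a GGUF build of the quantized model, a minimal way to load it with full GPU offload is via the llama-cpp-python bindings. The file name below is a placeholder, and llama-cpp-python must be installed with CUDA support for the offload to take effect:

```python
from llama_cpp import Llama

# Hypothetical local path to the q3_k_m GGUF file.
llm = Llama(
    model_path="mistral-7b-instruct.Q3_K_M.gguf",
    n_gpu_layers=-1,   # offload every layer to the H100
    n_ctx=32768,       # full context window
    n_batch=512,       # prompt-processing batch size
)

out = llm("Explain KV caching in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])
```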

Recommendation

Given the H100's capabilities, focus on maximizing throughput. Start with a batch size of 32 and experiment with larger values to find the best balance between latency and throughput. Consider an optimized inference framework such as vLLM or NVIDIA's TensorRT-LLM to further improve performance. While q3_k_m is efficient, also consider higher-precision formats (e.g., q4_k_m, or even unquantized FP16, which the 80GB of VRAM easily accommodates) to improve output quality with little practical cost on this card. A sketch of a vLLM setup along these lines follows.
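A minimal vLLM sketch is below. The Hugging Face model ID is an assumption (substitute whichever Mistral 7B checkpoint you actually use), and this serves the unquantized FP16 weights, which the 80GB card holds comfortably:

```python
from vllm import LLM, SamplingParams

# Assumed model ID; swap in your preferred Mistral 7B checkpoint.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", max_model_len=32768)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [f"Summarize document {i} in two sentences." for i in range(32)]

# vLLM schedules and batches the 32 requests internally (continuous batching).
for output in llm.generate(prompts, params):
    print(output.outputs[0].text[:80])
```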

If you encounter performance bottlenecks, profile the application to find the source: it could be data loading, pre/post-processing, or the inference kernels themselves, and the fix differs for each. If data loading is slow, for instance, use asynchronous data loading. Also experiment with context length to see how it affects throughput; the 32768-token context can be extended, but test whether performance or accuracy degrades at longer lengths before relying on them. Finally, keep the NVIDIA drivers up to date for best performance. A simple way to measure throughput yourself is shown below.
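Before tuning anything, it helps to measure end-to-end tokens/sec on your own workload rather than relying on the estimate above. A minimal timing harness, assuming the `llm` object from the earlier llama-cpp-python sketch (the prompt is arbitrary):

```python
import time

prompt = "Write a short story about a lighthouse keeper."
n_new_tokens = 256

start = time.perf_counter()
result = llm(prompt, max_tokens=n_new_tokens)
elapsed = time.perf_counter() - start

generated = result["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tokens/sec")
```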

Recommended Settings

Batch Size: 32
Context Length: 32768
Other Settings: enable CUDA graph capture; use asynchronous data loading; profile performance to identify bottlenecks
Inference Framework: vLLM
Suggested Quantization: q4_k_m (if VRAM allows)

Frequently Asked Questions

Is Mistral 7B (7.00B) compatible with NVIDIA H100 PCIe?
Yes, Mistral 7B is perfectly compatible with the NVIDIA H100 PCIe, especially with quantization.
What VRAM is needed for Mistral 7B (7.00B)?
With q3_k_m quantization, Mistral 7B requires approximately 2.8GB of VRAM.
How fast will Mistral 7B (7.00B) run on NVIDIA H100 PCIe?
Expect around 117 tokens/sec with the q3_k_m quantization and a well-optimized inference setup. This can vary depending on the prompt and batch size.