Can I run Llama 3 70B on NVIDIA H100 PCIe?

Result: Fail/OOM (this GPU doesn't have enough VRAM)

GPU VRAM: 80.0 GB
Required: 140.0 GB
Headroom: -60.0 GB

VRAM Usage: 100% used (80.0 GB of 80.0 GB available)

Technical Analysis

The NVIDIA H100 PCIe, while a powerful GPU, falls short of the VRAM required to run Llama 3 70B in FP16 (16-bit floating point) precision. At roughly 2 bytes per parameter, the model's 70 billion weights alone need approximately 140 GB of VRAM to load and serve, while the H100 PCIe provides only 80 GB, a deficit of 60 GB. The model therefore cannot be loaded onto the GPU without modification. The H100's 2.0 TB/s memory bandwidth would be a real asset if the model fit, enabling fast data transfer between memory and the compute units, but in this scenario insufficient VRAM is the primary bottleneck.
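The 140 GB figure follows directly from the parameter count. Here is a rough back-of-the-envelope sketch of that arithmetic in Python; the 2-bytes-per-parameter factor is the FP16 storage cost, and any KV-cache or activation overhead would only widen the gap:

# Rough VRAM estimate for loading Llama 3 70B weights in FP16.
params = 70e9          # 70 billion parameters
bytes_per_param = 2    # FP16 = 16 bits = 2 bytes
weights_gb = params * bytes_per_param / 1e9
h100_pcie_vram_gb = 80

print(f"Weights (FP16): {weights_gb:.0f} GB")                      # ~140 GB
print(f"H100 PCIe VRAM: {h100_pcie_vram_gb} GB")                   # 80 GB
print(f"Headroom:       {h100_pcie_vram_gb - weights_gb:.0f} GB")  # -60 GB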

Recommendation

To run Llama 3 70B on the H100 PCIe, you'll need to significantly reduce the model's memory footprint. Quantization is the most viable option: 4-bit quantization (bitsandbytes or similar) brings the VRAM requirement down to around 35 GB, which fits comfortably within the H100's 80 GB. Alternatively, offloading some layers to system RAM (CPU) is possible, but it will severely impact inference speed. Distributed inference across multiple GPUs is another option, though it requires a more complex setup.
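As a concrete illustration, here is a minimal sketch of loading the model with 4-bit NF4 quantization through Hugging Face transformers and bitsandbytes. It assumes the transformers, accelerate, and bitsandbytes packages are installed and that you have access to the gated meta-llama/Meta-Llama-3-70B-Instruct weights; the prompt is a placeholder.

# Minimal sketch: load Llama 3 70B with 4-bit NF4 quantization (bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4, as suggested above
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for quality/speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # places the quantized weights on the GPU
)

inputs = tokenizer("Explain paged attention in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

At roughly 0.5 bytes per parameter, the NF4 weights occupy about 35 GB, leaving the remaining VRAM for the KV cache and activations.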

Recommended Settings

Batch Size: varies depending on quantization and context length
Context Length: 8192
Other Settings: enable CUDA graphs; use paged attention; optimize tensor parallelism if using multiple GPUs
Inference Framework: vLLM
Suggested Quantization: 4-bit (QLoRA or NF4)
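Putting the settings above together, the sketch below shows one way to serve a pre-quantized 4-bit checkpoint with vLLM. The model path is hypothetical, so substitute a real AWQ or GPTQ build of Llama 3 70B that fits in 80 GB; paged attention and CUDA graphs are vLLM defaults, and tensor parallelism only applies if you shard across multiple GPUs.

# Sketch: serve a hypothetical 4-bit AWQ build of Llama 3 70B with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3-70b-awq",  # hypothetical checkpoint; substitute a real one
    quantization="awq",
    max_model_len=8192,                # context length from the settings above
    gpu_memory_utilization=0.90,       # leave headroom for the paged KV cache
    # tensor_parallel_size=2,          # only if sharding across multiple GPUs
    # enforce_eager=True,              # would disable CUDA graphs; leave unset to keep them
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize paged attention in one sentence."], sampling)
print(outputs[0].outputs[0].text)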

Frequently Asked Questions

Is Llama 3 70B compatible with NVIDIA H100 PCIe?
No, not directly. The H100 PCIe has insufficient VRAM to load the full Llama 3 70B model in FP16.
What VRAM is needed for Llama 3 70B?
Llama 3 70B requires approximately 140 GB of VRAM in FP16. Quantization can significantly reduce this requirement.
How fast will Llama 3 70B run on NVIDIA H100 PCIe?
Performance depends heavily on the quantization level. With 4-bit quantization you can expect reasonable inference speeds; without quantization the model does not fit in VRAM and will not run at all.