The NVIDIA H100 PCIe, while a powerful GPU, falls short of the VRAM requirements for running Llama 3 70B in FP16 (16-bit floating point) precision. At roughly 2 bytes per parameter, the 70B model needs approximately 140GB of VRAM just to load its weights, before accounting for the KV cache and activations. The H100 PCIe provides 80GB of VRAM, leaving a deficit of about 60GB, so the model cannot be loaded onto the GPU without modification. The H100's 2.0 TB/s memory bandwidth would enable fast data transfer between memory and the compute units if the model fit, but insufficient VRAM is the primary bottleneck in this scenario.
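As a rough sanity check, the shortfall follows directly from the parameter count and bytes per parameter; the sketch below counts weights only and ignores KV cache and activation overhead:

```python
# Back-of-the-envelope VRAM estimate (weights only).
params = 70e9            # Llama 3 70B parameter count
bytes_per_param = 2      # FP16 = 2 bytes per parameter
weights_gb = params * bytes_per_param / 1e9

h100_pcie_gb = 80        # H100 PCIe VRAM

print(f"FP16 weights: ~{weights_gb:.0f} GB")                            # ~140 GB
print(f"Shortfall on H100 PCIe: ~{weights_gb - h100_pcie_gb:.0f} GB")   # ~60 GB
```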
To run Llama 3 70B on the H100 PCIe, you'll need to significantly reduce the model's memory footprint. Quantization is the most viable option: 4-bit quantization (via bitsandbytes or a similar library) cuts the weight footprint to roughly 35GB, which fits comfortably within the H100's 80GB. Alternatively, offloading some layers to system RAM is possible, but it severely impacts inference speed because offloaded weights must cross the PCIe bus on every forward pass. Distributed inference across multiple GPUs is another option, but it requires a more complex setup.
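A minimal sketch of the 4-bit route, assuming the transformers, accelerate, and bitsandbytes packages are installed and you have been granted access to the gated meta-llama/Meta-Llama-3-70B-Instruct checkpoint on Hugging Face (the repo name and prompt are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # gated repo; requires approved access

# NF4 4-bit quantization keeps weights at ~0.5 bytes per parameter (~35GB for 70B),
# well within the H100 PCIe's 80GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # places layers on the GPU; spills to CPU RAM only if needed
)

prompt = "Explain the difference between FP16 and 4-bit NF4 quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

With device_map="auto", any layers that don't fit on the GPU are automatically offloaded to system RAM, which is the slower fallback described above; after 4-bit quantization the full model should stay resident on the H100.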