The NVIDIA H100 SXM, with its 80GB of HBM3 memory and 3.35 TB/s of memory bandwidth, is a powerhouse for many AI workloads. However, running Llama 3 70B in FP16 (float16) precision presents a challenge. At 2 bytes per parameter, the model's 70 billion parameters alone require approximately 140GB of VRAM, before accounting for the KV cache and activations. The H100's 80GB capacity falls short by roughly 60GB. This deficit means the weights cannot be loaded entirely onto the GPU, leading to out-of-memory errors or preventing execution altogether.
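A quick back-of-the-envelope check makes the gap concrete. The sketch below is a rough estimate of weight memory only (KV cache and activations would add to these figures), not a measurement:

```python
def weights_vram_gb(num_params: float, bytes_per_param: float) -> float:
    """Weight memory only; KV cache and activations need additional VRAM on top."""
    return num_params * bytes_per_param / 1e9

H100_VRAM_GB = 80

# Llama 3 70B at different weight precisions
for name, bpp in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    need = weights_vram_gb(70e9, bpp)
    print(f"{name}: ~{need:.0f} GB of weights, headroom {H100_VRAM_GB - need:+.0f} GB")

# FP16: ~140 GB of weights, headroom -60 GB
# INT8: ~70 GB of weights, headroom +10 GB
# INT4: ~35 GB of weights, headroom +45 GB
```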
While the H100's architecture, including its 16896 CUDA cores and 528 Tensor Cores, is well-suited to the matrix multiplications that dominate large language model inference, the insufficient VRAM is the primary bottleneck here. Without enough memory to hold the model's weights and activations, that theoretical compute performance cannot be realized. The high memory bandwidth also goes unused in this scenario, since the weights it would stream are simply not resident on the GPU.
In its current configuration, the H100 cannot run Llama 3 70B effectively. The 60GB of negative VRAM headroom means that smaller batch sizes or shorter context lengths will not resolve the fundamental limitation, since those only shrink KV-cache and activation memory, not the weight footprint. Estimated throughput (tokens/sec) and achievable batch size are therefore both zero, because the model cannot be loaded at all.
To run Llama 3 70B on the NVIDIA H100 SXM, you will need to significantly reduce the model's memory footprint. The primary method for achieving this is quantization, which stores the weights at lower precision and proportionally shrinks the VRAM required. Consider 4-bit quantization (e.g., bitsandbytes NF4 or GPTQ), which brings the roughly 140GB of FP16 weights down to around 35-40GB, comfortably within the H100's 80GB; a sketch follows below.
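As one possible starting point, the sketch below loads the model in 4-bit NF4 through the Hugging Face transformers integration with bitsandbytes. The model ID, prompt, and configuration values are illustrative assumptions; exact memory use depends on your library versions, context length, and batch size.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # assumes you have access to the gated repo

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # ~0.5 bytes per parameter for the weights
    bnb_4bit_quant_type="nf4",              # NormalFloat4 generally preserves quality well
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place everything on the single H100 if it fits
)

inputs = tokenizer("The H100 has 80GB of HBM3, so", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```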
Alternatively, explore techniques like offloading some layers to system RAM (CPU). However, this will drastically reduce performance due to the slower memory access speeds. Distributed inference across multiple GPUs, if available, would be another viable option, but it requires significant setup and infrastructure. If neither quantization nor distributed inference is feasible, consider using a smaller model variant of Llama 3 or exploring cloud-based solutions with larger GPU memory capacities.
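For the offloading route, here is a minimal sketch using transformers with accelerate's automatic device map. The memory budgets and offload folder are placeholder assumptions to adapt to your system:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",                         # accelerate fills the GPU first, then spills to CPU
    max_memory={0: "75GiB", "cpu": "120GiB"},  # illustrative budgets; tune to your machine
    offload_folder="offload",                  # disk spillover if CPU RAM is also exhausted
)
# Expect a large throughput hit: offloaded layers are bounded by PCIe and host-memory
# bandwidth rather than the H100's 3.35 TB/s HBM3.
```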