The primary limiting factor in running the FLUX.1 Schnell model (12B parameters) on an NVIDIA RTX 4070 SUPER is VRAM. At FP16 (half-precision floating point), the transformer's weights alone occupy roughly 24GB (12B parameters × 2 bytes each), before accounting for activations, the text encoders, and the VAE. The RTX 4070 SUPER is equipped with 12GB of GDDR6X, leaving a deficit of at least 12GB, so the model cannot be fully loaded onto the GPU and direct inference is not possible. The card's high memory bandwidth (~0.5 TB/s) is irrelevant here because the model does not fit in the available memory, and while the CUDA and Tensor core counts would be adequate if the model could be loaded, the VRAM bottleneck is insurmountable without significant optimization.
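As a quick sanity check, a weight-only back-of-the-envelope calculation shows why FP16 cannot fit and why the quantization discussed below is the only on-device path. The figures come from the section above (12B parameters, 12GB of VRAM); activations, text encoders, and the VAE add further overhead on top of these numbers.

```python
# Weight-only VRAM estimate for the FLUX.1 Schnell transformer.
# Activations, the text encoders, and the VAE are not counted here.
PARAMS = 12e9                                  # transformer parameter count
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}
VRAM_GB = 12                                   # RTX 4070 SUPER

for dtype, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    verdict = "fits" if weights_gb < VRAM_GB else "does not fit"
    print(f"{dtype}: ~{weights_gb:.0f} GB of weights -> {verdict} in {VRAM_GB} GB")
```

FP16 (24GB) and even INT8 (12GB, with no headroom for anything else) overflow the card; only 4-bit weights (~6GB) leave room for the rest of the pipeline.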
Without sufficient VRAM, the system has to fall back to shuttling weights between the GPU and system RAM, which dramatically reduces performance and can make inference unusable in practice. The 77-token prompt limit (the CLIP text encoder's maximum) is small, which helps, but prompt length does not offset the VRAM shortfall. Throughput and feasible batch size are unknown at this point; for a diffusion model these are better expressed as seconds per image or denoising steps per second than tokens/sec, and both would be severely degraded by the memory constraint. Even where offloading to system RAM is possible, PCIe transfers (roughly 32 GB/s on PCIe 4.0 x16) are more than an order of magnitude slower than the GPU's ~0.5 TB/s memory bandwidth, resulting in a very poor user experience.
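If offloading is attempted anyway, Hugging Face `diffusers` exposes it with a single call. The sketch below is illustrative only: it assumes the `FluxPipeline` class from a recent `diffusers` release (with `accelerate` installed), enough free system RAM to hold the offloaded weights, and the placeholder prompt and filename are arbitrary. Each denoising step is then bottlenecked by PCIe transfers rather than by the GPU itself.

```python
# Illustrative sketch: sequential CPU offload keeps the weights in system RAM
# and streams them to the GPU piece by piece. Expect very slow generation.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()  # weights cross the PCIe bus every step

image = pipe(
    "a photo of a red bicycle leaning against a brick wall",  # placeholder prompt
    num_inference_steps=4,   # Schnell is distilled for ~4 steps
    guidance_scale=0.0,
).images[0]
image.save("flux_schnell_offloaded.png")
```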
Given the VRAM limitation, directly running FLUX.1 Schnell in FP16 on the RTX 4070 SUPER is not feasible; to make it work, the model must be aggressively quantized. Consider 8-bit (INT8) or even 4-bit (e.g. NF4) quantization to shrink the VRAM footprint. Note that FLUX.1 is an image diffusion model, so LLM-serving tools such as `llama.cpp` or `text-generation-inference` do not apply here; the relevant tooling lives in the diffusion ecosystem, for example Hugging Face `diffusers` with `bitsandbytes` or `optimum-quanto`, or GGUF-quantized FLUX checkpoints loaded in ComfyUI. Combined with CPU offloading, these can potentially make the model runnable, albeit with some loss of quality and speed.
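A minimal sketch of that approach, assuming a recent `diffusers` release with `bitsandbytes` quantization support (the prompt and output filename are placeholders): quantizing only the 12B transformer to 4-bit NF4 brings its weights down to roughly 6GB, and model-level CPU offload covers the text encoders and VAE.

```python
# Sketch: 4-bit (NF4) quantization of the FLUX.1 Schnell transformer via
# bitsandbytes. Assumes diffusers with quantization support plus bitsandbytes
# and accelerate installed; exact APIs may differ between releases.
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

model_id = "black-forest-labs/FLUX.1-schnell"

# Quantize only the 12B transformer; text encoders and VAE stay in bf16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # keep only the active component on the GPU

image = pipe(
    "a photo of a red bicycle leaning against a brick wall",  # placeholder prompt
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("flux_schnell_nf4.png")
```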
Alternatively, consider cloud-based inference services that offer GPUs with 24GB or more of VRAM, or explore model distillation techniques to create a smaller, more efficient variant of FLUX.1 Schnell that fits within the 12GB limit. If quantization is insufficient or introduces unacceptable quality degradation, switch to a different, smaller model designed to run within your hardware constraints.