Kaiming He's Team's New Work: With the VAE and Private Data Removed, Text-to-Image Generation Gets Even Stronger
Kaiming He's team introduces **MiniT2I**, a minimalist text-to-image (T2I) model that challenges the complexity of mainstream approaches. It drops components commonly considered essential: the VAE encoder-decoder, AdaLN conditioning, auxiliary losses, private training data, and post-training alignment stages such as RL or DPO. Instead, it is trained directly on RGB pixels with a pure flow-matching objective.
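To make the "pure flow-matching objective on RGB pixels" concrete, here is a minimal NumPy sketch of one loss evaluation. It assumes the common linear path between Gaussian noise (t = 0) and the image (t = 1) with a velocity target; the summary does not say which path, time distribution, or prediction target MiniT2I actually uses, and `flow_matching_loss` and `model` are hypothetical names.

```python
import numpy as np

def flow_matching_loss(model, images, rng):
    """Mean-squared velocity error for one batch of RGB images.

    images: array of shape (batch, height, width, 3), e.g. scaled to [-1, 1].
    model:  hypothetical callable (x_t, t) -> predicted velocity, same shape as x_t.
    """
    batch = images.shape[0]
    noise = rng.standard_normal(images.shape)
    # One time value per sample, broadcast over height, width, and channels.
    t = rng.uniform(size=(batch, 1, 1, 1))
    # Assumed linear path: pure noise at t = 0, clean image at t = 1.
    x_t = (1.0 - t) * noise + t * images
    # Along that path the velocity is constant: d(x_t)/dt = images - noise.
    target_velocity = images - noise
    predicted = model(x_t, t)
    return float(np.mean((predicted - target_velocity) ** 2))
```

No encoder, decoder, or auxiliary term appears anywhere in the loss; the only learned component is the denoiser passed in as `model`.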
The model employs a simplified **MM-JiT** Transformer architecture. It removes AdaLN blocks for conditioning and instead prepends two lightweight text adapter blocks to a standard pre-norm Transformer, allowing frozen T5 text features to adapt to the denoiser. Training follows a two-stage, LLM-like paradigm using only public datasets: pre-training on LLaVA-recaptioned CC12M for coverage, followed by fine-tuning on ~120k high-quality image-text pairs.
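The conditioning path described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the MM-JiT implementation: it assumes the adapted text tokens are concatenated with the image patch tokens and processed jointly, uses single-head attention and a ReLU MLP for brevity, and omits timestep conditioning because the summary does not say how it is injected. `PreNormBlock` and `denoise_tokens` are hypothetical names.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class PreNormBlock:
    """Single-head pre-norm block: x + Attn(LN(x)), then x + MLP(LN(x))."""

    def __init__(self, dim, rng):
        scale = dim ** -0.5
        self.wq, self.wk, self.wv, self.wo = (
            rng.standard_normal((dim, dim)) * scale for _ in range(4)
        )
        self.w1 = rng.standard_normal((dim, 4 * dim)) * scale
        self.w2 = rng.standard_normal((4 * dim, dim)) * (4 * dim) ** -0.5

    def __call__(self, x):
        # x has shape (sequence_length, dim); no AdaLN modulation anywhere.
        h = layer_norm(x)
        q, k, v = h @ self.wq, h @ self.wk, h @ self.wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        x = x + (attn @ v) @ self.wo
        h = layer_norm(x)
        return x + np.maximum(h @ self.w1, 0.0) @ self.w2

def denoise_tokens(text_feats, image_tokens, adapters, trunk):
    """Adapt frozen text features, then run text and image tokens jointly."""
    for block in adapters:  # text-only adapter blocks
        text_feats = block(text_feats)
    # Assumption: conditioning happens through joint attention over one sequence.
    tokens = np.concatenate([text_feats, image_tokens], axis=0)
    for block in trunk:
        tokens = block(tokens)
    # Only the image positions are read out as the denoiser's prediction.
    return tokens[text_feats.shape[0]:]
```

Passing a list of two `PreNormBlock` instances as `adapters` mirrors the two text adapter blocks; the text encoder itself stays frozen and sits outside this function.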
With just 258M parameters (B/16), MiniT2I achieves competitive scores (0.87 on GenEval, 84.2 on DPG-Bench), outperforming larger pixel-space models. Scaling to 912M parameters (L/16) yields results comparable to SD3-Medium (~2B parameters) in style, composition, and imagination, though it lags in text rendering and named entities due to public data limitations.
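Working in pixel space means the Transformer's tokens are raw RGB patches rather than VAE latents. The sketch below assumes the "/16" suffix follows the usual ViT naming convention and denotes a 16x16 patch size; the 512x512 input is only an illustrative resolution, and `patchify` and `unpatchify` are hypothetical helper names.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (grid_h, grid_w, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * c)

def unpatchify(tokens, height, width, patch=16, channels=3):
    """Inverse of patchify: reassemble the tokens into an (H, W, C) image."""
    gh, gw = height // patch, width // patch
    x = tokens.reshape(gh, gw, patch, patch, channels)
    x = x.transpose(0, 2, 1, 3, 4)  # (grid_h, patch, grid_w, patch, C)
    return x.reshape(height, width, channels)

image = np.zeros((512, 512, 3))
tokens = patchify(image)
print(tokens.shape)  # (1024, 768)
```

Under these assumptions, a 512x512 image becomes 1024 tokens of 768 raw pixel values each (16 x 16 x 3).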
Key advantages include lower computational cost (~570 GFLOPs vs. ~1379 for latent models) and architectural simplicity. Acknowledged limitations include patch-boundary artifacts in pixel space, side effects of high CFG scales, a resolution ceiling once sequences grow beyond 1024 tokens, and the data bottlenecks noted above. The work demonstrates that high-performance T2I generation is possible with a radically simplified, publicly reproducible baseline.
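For readers unfamiliar with the CFG scale mentioned among the limitations, the following NumPy sketch shows standard classifier-free guidance inside a plain Euler sampler for a velocity-predicting model. It assumes time runs from noise at t = 0 to the image at t = 1; the solver, step count, and default scale are placeholders rather than values reported for MiniT2I, and `model`, `guided_velocity`, and `sample` are hypothetical names.

```python
import numpy as np

def guided_velocity(model, x, t, text, null_text, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the text-conditioned one."""
    v_cond = model(x, t, text)
    v_uncond = model(x, t, null_text)
    return v_uncond + cfg_scale * (v_cond - v_uncond)

def sample(model, shape, text, null_text, cfg_scale=4.0, steps=50, seed=0):
    """Integrate the guided velocity field with fixed-step Euler updates."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # start from pure noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * guided_velocity(model, x, t, text, null_text, cfg_scale)
    return x
```

With `cfg_scale=1.0` the update reduces to the purely conditional prediction. Larger values push samples further along the conditional direction, which tends to improve prompt adherence and is also where side effects of the kind the authors acknowledge can appear.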