New Work from He Kaiming's Team: With the VAE and Private Data Removed, Text-to-Image Generation Gets Even Stronger

marsbit | Published on 2026-06-22 | Last updated on 2026-06-22

Abstract

He Kaiming's team introduces **MiniT2I**, a minimalist text-to-image (T2I) model that challenges the complexity of mainstream approaches. It eliminates components commonly considered essential: the VAE encoder-decoder, AdaLN conditioning mechanisms, auxiliary losses, private training data, and post-training alignment stages like RL/DPO. Instead, it uses a pure flow-matching objective trained directly on RGB pixels. The model employs a simplified **MM-JiT** Transformer architecture. It removes AdaLN blocks for conditioning and instead prepends two lightweight text adapter blocks to a standard pre-norm Transformer, allowing frozen T5 text features to adapt to the denoiser. Training follows a two-stage, LLM-like paradigm using only public datasets: pre-training on LLaVA-recaptioned CC12M for coverage, followed by fine-tuning on ~120k high-quality image-text pairs. With just 258M parameters (B/16), MiniT2I achieves competitive scores (0.87 on GenEval, 84.2 on DPG-Bench), outperforming larger pixel-space models. Scaling to 912M parameters (L/16) yields results comparable to SD3-Medium (~2B parameters) in style, composition, and imagination, though it lags in text rendering and named entities due to public data limitations. Key advantages include lower computational cost (~570 GFLOPs vs. ~1379 for latent models) and architectural simplicity. Acknowledged limitations include patch boundary artifacts in pixel space, side effects of high CFG scales, resolution ceilings for sequence...

The field of text-to-image generation has long been a fiercely competitive red ocean, seemingly with no room left to innovate.

What do you need to train a powerful text-to-image model today?

Following the current mainstream approach, you would need: a pre-trained VAE encoder-decoder, concatenated text encoders, meticulously designed conditional injection mechanisms, massive datasets, RL or DPO alignment phases...

Overall, there seems to be a consensus: text-to-image generation must be this complex.

He Kaiming's team, however, takes a contrarian approach, offering a new perspective in the field of text-to-image models. They have released MiniT2I — a minimalist, pixel-space text-to-image model that deliberately pursues simplicity.

No VAE encoder-decoder, no AdaLN conditional injection, no auxiliary loss functions, no private data, no RL/DPO alignment, just pure flow matching trained directly on pixels. The 258M-parameter B/16 version achieves 0.87 on GenEval and 84.2 on DPG-Bench, surpassing pixel-space models several times its size.

The core proposition of MiniT2I is that if text conditioning is treated as 'context tokens carrying semantic information' and injected into the model, text-to-image generation and class-conditional ImageNet generation are not fundamentally different: the architecture can be similar, the computational requirements comparable, and even the data scale can be brought into line.

  • Paper Title: A Minimalist Baseline for Text-to-Image Generation
  • Technical Blog: https://peppaking8.github.io/#/post/minit2i
  • Open Source Repo: https://github.com/PeppaKing8/minit2i-jax

Technical Approach: Subtraction at Every Step

Direct Pixel-Space Output, No VAE

MiniT2I's first design choice is radical: discard the VAE, perform denoising directly on RGB pixels.
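For readers unfamiliar with the objective, here is a minimal sketch of a flow-matching training pair on raw pixel values. The article does not say which time convention or prediction target MiniT2I uses, so the linear path `x_t = (1 - t) * x + t * eps` with velocity target `eps - x` is an assumption, and the "denoiser" is replaced by two hypothetical stand-in predictors.

```python
import random

def flow_matching_pair(x, eps, t):
    """Build one training pair on an assumed linear path.

    Assumed convention (not confirmed by the article):
    x_t = (1 - t) * x + t * eps, with velocity target v = eps - x.
    """
    x_t = [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]
    v_target = [ei - xi for xi, ei in zip(x, eps)]
    return x_t, v_target

def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

rng = random.Random(0)
# One flattened 16x16 RGB patch with values in [-1, 1]; no VAE involved.
pixels = [rng.uniform(-1.0, 1.0) for _ in range(16 * 16 * 3)]
noise = [rng.gauss(0.0, 1.0) for _ in pixels]
t = rng.random()

x_t, v_target = flow_matching_pair(pixels, noise, t)

# Hypothetical stand-ins for the denoiser's output:
loss_perfect = mse(v_target, v_target)            # a perfect predictor
loss_zero = mse([0.0] * len(v_target), v_target)  # a predictor that outputs zeros
```

The only thing the network would ever see is `x_t` (plus the text tokens); the regression target is defined directly on pixels, which is what makes a separate encoder-decoder unnecessary.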

Latent Diffusion Models are the current mainstream paradigm, first compressing images into a low-dimensional latent space using an autoencoder before diffusion. This makes high-resolution generation feasible, but at the cost of introducing reconstruction error, an extra training phase, and misalignment between the encoder and denoiser objectives.

MiniT2I's choice of pixel space is pragmatic. At 512×512 resolution, dividing the image into 16×16 patches yields 1024 tokens, which keeps the sequence length well within the Transformer's comfort zone. Removing the VAE reduces single-step forward computation from ~1379 GFLOPs to ~570 GFLOPs (B/16 setting) and removes the ceiling imposed by VAE reconstruction accuracy: output quality is limited only by the denoiser's capability.
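The token count follows from simple arithmetic, sketched below. The helper name `patch_grid` is hypothetical, and reading the "/16" and "/32" suffixes as patch sizes follows the usual ViT naming convention, which is an assumption here.

```python
def patch_grid(height, width, patch, channels=3):
    """Return (number of tokens, raw values per token) for non-overlapping patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = (height // patch) * (width // patch)
    values_per_token = patch * patch * channels
    return tokens, values_per_token

tokens_16, dim_16 = patch_grid(512, 512, 16)  # 1024 tokens, 768 values each
tokens_32, dim_32 = patch_grid(512, 512, 32)  # 256 tokens, 3072 values each
```

Halving the patch side quadruples the sequence length, which is why the patch size is the main lever on per-step cost in pixel space.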

Experiments confirm this: Under the same parameter budget, pixel models achieve FID on par with latent space models (18.7 vs 19.0), but with a 5x lower per-step cost.

MM-JiT Architecture: Returning to a Simple Transformer

SD3's MM-DiT uses AdaLN (Adaptive Layer Normalization) within each block to inject timestep and pooled text embeddings into the network — each sub-block needs to compute scale, shift, and gate parameters generated by an extra MLP from the conditioning vectors. This is an elaborate modulation mechanism, but MiniT2I finds it non-essential.

The proposed MM-JiT architecture does two things:

1. Add Two Text Adapter Layers: Insert two lightweight Transformer blocks before joint attention, allowing the frozen T5 features to first 'adapt' to the denoiser's needs.

2. Remove the AdaLN Branch: No longer inject timestep and global text information through an additional path. The model can still perceive noise levels — because the noise-corrupted image itself carries timestep information.
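The claim that a noisy image already reveals its noise level can be illustrated with a toy example. This sketch assumes the linear path `x_t = (1 - t) * x + t * eps` (the article does not state the exact convention) and uses a constant gray image so that all pixel variation comes from the noise; on real images the signal is less clean, and a network would have to rely on richer statistics.

```python
import random
import statistics

def noisy_image(x, t, rng):
    """Mix clean values with unit Gaussian noise on an assumed linear path."""
    return [(1.0 - t) * xi + t * rng.gauss(0.0, 1.0) for xi in x]

rng = random.Random(0)
flat_image = [0.5] * 4096  # constant gray image, flattened

estimates = {}
for t in (0.2, 0.5, 0.8):
    x_t = noisy_image(flat_image, t, rng)
    # For a constant image the pixel standard deviation is roughly t itself.
    estimates[t] = statistics.pstdev(x_t)
```

Each entry of `estimates` lands close to the `t` that produced it, without `t` ever being passed in as a separate input.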

The result is a clean architecture nearly identical to a standard pre-normalization Transformer. Removing AdaLN reduces parameters, allowing for more layers within the same compute budget (12 layers → 17 layers). FID drops from 18.7 to 13.7, and the architecture itself is easier to understand and modify.
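A rough back-of-envelope calculation shows why dropping AdaLN frees room for more layers. It assumes a DiT-style block in which the modulation is a single linear layer from width `d` to `6 * d` (scale, shift and gate for two sub-blocks), an MLP ratio of 4, and ignores biases and norm parameters. The width `d = 768` is hypothetical, and none of these numbers come from the paper.

```python
def block_params(d, mlp_ratio=4, adaln=False):
    """Approximate weight count of one Transformer block (biases and norms ignored)."""
    attn = 4 * d * d                        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d * d             # d -> mlp_ratio*d -> d
    modulation = 6 * d * d if adaln else 0  # assumed: one linear layer d -> 6d
    return attn + mlp + modulation

d = 768  # hypothetical hidden width, for illustration only
budget = 12 * block_params(d, adaln=True)
plain_layers = budget // block_params(d, adaln=False)  # 18
```

Under these assumptions a 12-layer AdaLN stack has the parameter budget of about 18 plain blocks, the same ballpark as the 12 → 17 change reported above; the exact count depends on details not given here, such as the size of the adapter blocks.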

Training Data: Fully Public, Two-Phase

MiniT2I's training data also pursues minimalism:

  • Pre-training: LLaVA-recaptioned CC12M (publicly available VLM re-captioned dataset), 250K steps
  • Fine-tuning: ~120K high-quality image-text pairs (BLIP3o-60K + LAION DALL·E 3 Discord set + ShareGPT-4o-Image), 40K steps

This 'pre-train then fine-tune' two-stage pattern directly mirrors LLM training paradigms: pre-training buys coverage, fine-tuning teaches the model what a good answer is. Ablations show both are indispensable — pre-training alone yields acceptable image quality but poor prompt following; fine-tuning alone makes the model's world too narrow, causing generative diversity to collapse.
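The two-phase recipe can be written down as a small schedule. The step counts and dataset names are taken from the list above; treating the two phases as running back to back under one global step counter is an assumption, and the `Stage` and `stage_at` names are hypothetical. Batch sizes and learning rates are not given in the article, so they are left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    steps: int

SCHEDULE = (
    Stage("pretrain", "LLaVA-recaptioned CC12M", 250_000),
    Stage("finetune", "~120K curated image-text pairs", 40_000),
)

def stage_at(step, schedule=SCHEDULE):
    """Return the stage that a 0-indexed global training step falls in."""
    start = 0
    for stage in schedule:
        if start <= step < start + stage.steps:
            return stage
        start += stage.steps
    raise ValueError(f"step {step} is past the end of the schedule")

total_steps = sum(s.steps for s in SCHEDULE)  # 290000
```

Fine-tuning accounts for 40K of the 290K steps, under 14% of the run, which matches the LLM analogy: most of the budget buys coverage, and a short final phase shapes the output.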

Results: Small Model, Big Performance

In comparisons among pixel-space text-to-image models, MiniT2I offers exceptional value.

MiniT2I-B/16, with only ~600M total parameters (including text encoder), surpasses models 3-4 times its size on GenEval and DPG-Bench. Moreover, training cost is extremely low: the B/32 ablation model required only about 3 days on 8 H100s, with total training FLOPs comparable to a standard 200-epoch ImageNet experiment.

Scaling to L/16 (912M parameters) yields noticeable improvements in style diversity, spatial relationships, and text rendering, achieving quality on imaginative scenes comparable to or even better than SD3-Medium (~2B parameters).

In the more comprehensive PRISM-Bench evaluation, MiniT2I-L/16 performs well in style, composition, and imagination dimensions (79.9, 78.4, 57.9), approaching SD3-Medium levels. However, gaps remain in text rendering (30.6 vs SD3's 50.9) and named entities (60.3 vs 66.3) — the team acknowledges these are inherent limitations of the public data recipe, requiring targeted data to bridge.

Limitations and Outlook

MiniT2I is a proof of concept for a technical path, not a final product. The team honestly points out several unresolved issues:

  • Patch artifacts in pixel space: Measurable discontinuities exist at patch boundaries (gradients 17-22% higher at boundaries than elsewhere), a problem latent-space models do not have.
  • Side effects of CFG in pixel space: High guidance scales (~6) push local tokens away from the data manifold, directly exposing visual artifacts without a decoder's 'smoothing' effect.
  • Resolution ceiling: Works well at 512×512 currently; pushing to 4K+ requires longer sequences or more efficient attention mechanisms.
  • Data bottleneck: Text rendering and named entities remain weaker than industrial systems, requiring specialized data augmentation.
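To make the CFG point above concrete, here is the standard classifier-free guidance combination applied to a few made-up prediction values. The article does not say which guidance variant MiniT2I uses; this sketch assumes the common convention in which a scale of 1 recovers the plain conditional prediction.

```python
def cfg_combine(v_uncond, v_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward (and past) the conditional one."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]

# Made-up per-pixel predictions, for illustration only.
v_uncond = [0.10, -0.20, 0.05]
v_cond = [0.30, -0.10, 0.00]

plain = cfg_combine(v_uncond, v_cond, 1.0)   # equals v_cond
guided = cfg_combine(v_uncond, v_cond, 6.0)  # gap amplified six-fold
```

At a scale of 6 the difference between the two predictions is multiplied by six, so `guided` can land far outside the range of either input. In a latent model that overshoot passes through a decoder before becoming an image; in pixel space it is the image.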

MiniT2I demonstrates that state-of-the-art text-to-image generation is no longer a game only for top industrial labs.

When a 258M-parameter model, trained purely on public data with academic-level compute for just 3 days, can beat opponents several times its size, perhaps text-to-image is undergoing a paradigm shift from 'brute force' to 'distillation'.

"T2I is no longer an insurmountable wall. Welcome to use and improve it, to build a simpler baseline."

This article is from the WeChat public account "机器之心" (Almost Human)

Related Questions

Q: What is the main contribution or innovation of the MiniT2I model proposed by He Kaiming's team?

A: The main contribution is MiniT2I, a minimalist text-to-image baseline model. It removes numerous complex components that are standard in current models (the VAE encoder-decoder, the AdaLN conditional injection mechanism, auxiliary loss functions, and private training data) and relies solely on a flow-matching objective trained directly in pixel space. It demonstrates that a simpler architecture trained on public data can achieve competitive performance against much larger models.

Q: How does the architectural design of MiniT2I's MM-JiT differ from the commonly used MM-DiT in models like SD3?

A: MM-JiT simplifies MM-DiT in two key ways. First, it adds two lightweight text adapter Transformer blocks before joint attention to help frozen T5 features adapt to the denoiser. Second, and more importantly, it deletes the complex AdaLN (Adaptive Layer Normalization) branches used to inject timestep and text conditioning. This results in a cleaner, near-standard pre-norm Transformer architecture, reducing parameters and allowing for more layers within the same compute budget.

Q: What is the core argument for MiniT2I's choice to operate directly in pixel space instead of a latent space like most models?

A: The core argument is simplicity and alignment. Removing the VAE eliminates several issues: reconstruction error, extra training stages, and misalignment between encoder and denoiser objectives. For 512×512 images, patchifying into 1024 tokens of 16×16 pixels keeps the sequence length manageable for Transformers. This direct approach significantly reduces the computational cost per forward pass (~570 vs ~1379 GFLOPs for the B/16 configuration) and removes the upper bound set by reconstruction accuracy, meaning the output quality depends directly on the denoiser's capability.

Q: What were the two stages of data used to train MiniT2I, and why was this two-stage approach necessary?

A: MiniT2I was trained in two stages using only public data: 1) Pre-training on LLaVA-recaptioned CC12M (a VLM-recaptioned dataset) for 250K steps. 2) Fine-tuning on a combined set of ~120K high-quality image-text pairs from sources like BLIP3o-60K, the LAION DALL·E 3 Discord set, and ShareGPT-4o-Image for 40K steps. This 'pre-train then fine-tune' paradigm mirrors LLM training. Ablation studies showed both stages are essential: pre-training alone gives acceptable image quality but poor prompt following, while fine-tuning alone causes a collapse in generation diversity because the model's world is too narrow.

Q: According to the article, what are some of the key limitations or unsolved problems with the MiniT2I approach?

A: The key limitations highlighted include: 1) Patch boundary artifacts in pixel space, leading to measurable discontinuities not present in latent models. 2) Negative side effects of high CFG (Classifier-Free Guidance) scales in pixel space, which push local tokens off the data manifold and manifest as visual flaws. 3) A resolution ceiling, as scaling to 4K+ would require longer sequences or more efficient attention. 4) Data bottlenecks, particularly in text rendering and named entity accuracy, which lag behind industrial systems and would require specialized data to improve.
