He Kaiming's Team's New Work: After Deleting VAE and Private Data, Text-to-Image Generation Becomes Even Stronger

marsbitPublicado em 2026-06-22Última atualização em 2026-06-22

Resumo

KaiMing He's team introduces **MiniT2I**, a minimalist text-to-image (T2I) model that challenges the complexity of mainstream approaches. It eliminates components commonly considered essential: the VAE encoder-decoder, AdaLN conditioning mechanisms, auxiliary losses, private training data, and post-training alignment stages like RL/DPO. Instead, it uses a pure flow-matching objective trained directly on RGB pixels. The model employs a simplified **MM-JiT** Transformer architecture. It removes AdaLN blocks for conditioning and instead prepends two lightweight text adapter blocks to a standard pre-norm Transformer, allowing frozen T5 text features to adapt to the denoiser. Training follows a two-stage, LLM-like paradigm using only public datasets: pre-training on LLaVA-recaptioned CC12M for coverage, followed by fine-tuning on ~120k high-quality image-text pairs. With just 258M parameters (B/16), MiniT2I achieves competitive scores (0.87 on GenEval, 84.2 on DPG-Bench), outperforming larger pixel-space models. Scaling to 912M parameters (L/16) yields results comparable to SD3-Medium (~2B parameters) in style, composition, and imagination, though it lags in text rendering and named entities due to public data limitations. Key advantages include lower computational cost (~570 GFLOPs vs. ~1379 for latent models) and architectural simplicity. Acknowledged limitations include patch boundary artifacts in pixel space, side effects of high CFG scales, resolution ceilings for sequence...

The field of text-to-image generation has long been a fiercely competitive red ocean, seemingly with no room left to innovate.

What do you need to train a powerful text-to-image model today?

Following the current mainstream approach, you would need: a pre-trained VAE encoder-decoder, concatenated text encoders, meticulously designed conditional injection mechanisms, massive datasets, RL or DPO alignment phases...

Overall, there seems to be a consensus: text-to-image generation must be this complex.

He Kaiming's team, however, takes a contrarian approach, offering a new perspective in the field of text-to-image models. They have released MiniT2I — a minimalist, pixel-space text-to-image model that deliberately pursues simplicity.

No VAE encoder-decoder, no AdaLN conditional injection, no auxiliary loss functions, no private data, no RL/DPO alignment, just pure flow matching trained directly on pixels. The 258M-parameter B/16 version achieves 0.87 on GenEval and 84.2 on DPG-Bench, surpassing pixel-space models several times its size.

The core proposition of MiniT2I is: If text conditioning is treated as 'context tokens with semantic information' and injected into the model, text-to-image generation and class-conditional ImageNet generation are not fundamentally that different — the architecture can be similar, computational requirements comparable, and even the scale of data can be aligned.

  • Paper Title: A Minimalist Baseline for Text-to-Image Generation
  • Technical Blog: https://peppaking8.github.io/#/post/minit2i
  • Open Source Repo: https://github.com/PeppaKing8/minit2i-jax

Technical Approach: Subtraction at Every Step

Direct Pixel-Space Output, No VAE

MiniT2I's first design choice is radical: discard the VAE, perform denoising directly on RGB pixels.

Latent Diffusion Models are the current mainstream paradigm, first compressing images into a low-dimensional latent space using an autoencoder before diffusion. This makes high-resolution generation feasible, but at the cost of introducing reconstruction error, an extra training phase, and misalignment between the encoder and denoiser objectives.

MiniT2I's choice of pixel space is pragmatic: For 512×512 resolution, using 16×16 patches to divide the image into 1024 tokens keeps the sequence length well within the Transformer's comfort zone. Removing the VAE reduces single-step forward computation from ~1379 GFLOPs to ~570 GFLOPs (B/16 setting), and eliminates the ceiling on reconstruction accuracy — the output quality is only limited by the denoiser's capability.

Experiments confirm this: Under the same parameter budget, pixel models achieve FID on par with latent space models (18.7 vs 19.0), but with a 5x lower per-step cost.

MM-JiT Architecture: Returning to a Simple Transformer

SD3's MM-DiT uses AdaLN (Adaptive Layer Normalization) within each block to inject timestep and pooled text embeddings into the network — each sub-block needs to compute scale, shift, and gate parameters generated by an extra MLP from the conditioning vectors. This is an elaborate modulation mechanism, but MiniT2I finds it non-essential.

The proposed MM-JiT architecture does two things:

1. Add Two Text Adapter Layers: Insert two lightweight Transformer blocks before joint attention, allowing the frozen T5 features to first 'adapt' to the denoiser's needs.

2. Remove the AdaLN Branch: No longer inject timestep and global text information through an additional path. The model can still perceive noise levels — because the noise-corrupted image itself carries timestep information.

The result is a clean architecture nearly identical to a standard pre-normalization Transformer. Removing AdaLN reduces parameters, allowing for more layers within the same compute budget (12 layers → 17 layers). FID drops from 18.7 to 13.7, and the architecture itself is easier to understand and modify.

Training Data: Fully Public, Two-Phase

MiniT2I's training data also pursues minimalism:

  • Pre-training: LLaVA-recaptioned CC12M (publicly available VLM re-captioned dataset), 250K steps
  • Fine-tuning: ~120K high-quality image-text pairs (BLIP3o-60K + LAION DALL・E 3 Discord set + ShareGPT-4o-Image), 40K steps

This 'pre-train then fine-tune' two-stage pattern directly mirrors LLM training paradigms: pre-training buys coverage, fine-tuning teaches the model what a good answer is. Ablations show both are indispensable — pre-training alone yields acceptable image quality but poor prompt following; fine-tuning alone makes the model's world too narrow, causing generative diversity to collapse.

Results: Small Model, Big Performance

In comparisons among pixel-space text-to-image models, MiniT2I offers exceptional value:

MiniT2I-B/16, with only ~600M total parameters (including text encoder), surpasses models 3-4 times its size on GenEval and DPG-Bench. Moreover, training cost is extremely low: the B/32 ablation model required only about 3 days on 8 H100s, with total training FLOPs comparable to a standard 200-epoch ImageNet experiment.

Scaling to L/16 (912M parameters) yields noticeable improvements in style diversity, spatial relationships, and text rendering, achieving quality on imaginative scenes comparable to or even better than SD3-Medium (~2B parameters).

In the more comprehensive PRISM-Bench evaluation, MiniT2I-L/16 performs well in style, composition, and imagination dimensions (79.9, 78.4, 57.9), approaching SD3-Medium levels. However, gaps remain in text rendering (30.6 vs SD3's 50.9) and named entities (60.3 vs 66.3) — the team acknowledges these are inherent limitations of the public data recipe, requiring targeted data to bridge.

Limitations and Outlook

MiniT2I is a proof of concept for a technical path, not a final product. The team honestly points out several unresolved issues:

  • Patch artifacts in pixel space: Measurable discontinuities exist at patch boundaries (gradients 17-22% higher at boundaries than elsewhere), a problem latent-space models do not have.
  • Side effects of CFG in pixel space: High guidance scales (~6) push local tokens away from the data manifold, directly exposing visual artifacts without a decoder's 'smoothing' effect.
  • Resolution ceiling: Works well at 512×512 currently; pushing to 4K+ requires longer sequences or more efficient attention mechanisms.
  • Data bottleneck: Text rendering and named entities remain weaker than industrial systems, requiring specialized data augmentation.

MiniT2I demonstrates that state-of-the-art text-to-image generation is no longer a game only for top industrial labs.

When a 258M-parameter model, trained on purely public data with academic-level compute for just 3 days, can defeat opponents orders of magnitude larger, perhaps text-to-image is undergoing a paradigm shift from 'brute force' to 'distillation'.

"T2I is no longer an insurmountable wall. Welcome to use and improve it, to build a simpler baseline."

This article is from the WeChat public account "机器之心" (Almost Human)

Criptomoedas em alta

Perguntas relacionadas

QWhat is the main contribution or innovation of the MiniT2I model proposed by He Kaiming's team?

AThe main contribution is proposing MiniT2I, a minimalist text-to-image baseline model. It removes numerous complex components standard in current models—such as the VAE encoder-decoder, the AdaLN conditional injection mechanism, auxiliary loss functions, and private training data—and relies solely on flow matching objectives trained directly on pixel space. It demonstrates that with a simpler architecture and public data, it can achieve competitive performance against much larger models.

QHow does the architectural design of MiniT2I's MM-JiT differ from the commonly used MM-DiT in models like SD3?

AThe MM-JiT architecture in MiniT2I differs from MM-DiT by performing simplification in two key ways. First, it adds two lightweight text adapter Transformer blocks before joint attention to help frozen T5 features adapt to the denoiser. Second, and more importantly, it deletes the complex AdaLN (Adaptive Layer Normalization) branches used to inject timestep and text conditioning. This results in a cleaner, near-standard pre-norm Transformer architecture, reducing parameters and allowing for more layers within the same compute budget.

QWhat is the core argument for MiniT2I's choice to operate directly in pixel space instead of a latent space like most models?

AThe core argument is simplicity and alignment. Removing the VAE eliminates several issues: reconstruction error, extra training stages, and misalignment between encoder and denoiser objectives. For 512x512 images, patchifying into 1024 16x16 tokens keeps the sequence length manageable for Transformers. This direct approach reduces computational cost per forward pass significantly (~570 vs ~1379 GFLOPs for the B/16 configuration) and removes the upper bound of reconstruction accuracy, meaning the output quality depends directly on the denoiser's capability.

QWhat were the two stages of data used to train MiniT2I, and why was this two-stage approach necessary?

AMiniT2I was trained in two stages using only public data: 1) Pre-training on LLaVA-recaptioned CC12M (a VLM-recaptioned dataset) for 250K steps. 2) Fine-tuning on a combined set of ~120K high-quality image-text pairs from sources like BLIP3o-60K, LAION DALL・E 3 Discord set, and ShareGPT-4o-Image for 40K steps. This 'pre-train then fine-tune' paradigm mirrors LLM training. Ablation studies showed both stages are essential: pre-training alone gives good image quality but poor prompt following, while fine-tuning alone causes a collapse in generation diversity due to a limited worldview.

QAccording to the article, what are some of the key limitations or unsolved problems with the MiniT2I approach?

AThe key limitations highlighted include: 1) Patch boundary artifacts in pixel space, leading to measurable discontinuities not present in latent models. 2) Negative side effects of high CFG (Classifier-Free Guidance) scales in pixel space, which push local tokens off the data manifold and manifest as visual flaws. 3) A resolution ceiling, as scaling to 4K+ would require longer sequences or more efficient attention. 4) Data bottlenecks, particularly in text rendering and named entity accuracy, which lag behind industrial systems and would require specialized data to improve.

Leituras Relacionadas

Farewell to Speculation: The Graham Moment of the Crypto Industry

"Farewell to Speculation: The Graham Moment for the Crypto Industry" The article draws a parallel between today's cryptocurrency market and the speculative, unregulated US stock market of the 1920s. That era lacked mandatory corporate disclosure, enabling rampant manipulation and turning stocks into gambling tools. The 1929 crash led to foundational reforms: the Securities Acts of 1933/34 mandated transparent, audited financial reporting, and Benjamin Graham's "Security Analysis" provided a framework for fundamental valuation. Together, they created modern investing, requiring both reliable data and a methodology to value assets. Similarly, the crypto market is currently driven by narratives and speculation. However, it possesses a key advantage: unlike 1920s corporations, blockchain protocols have inherently transparent, on-chain data for revenue, treasury, and activity. The core obstacle is not transparency, but the lack of legal claim to that value. Due to regulatory uncertainty (primarily the Howey Test), most tokens are deliberately stripped of economic rights like profit-sharing to avoid being classified as securities. This creates a paradox where protocols generate revenue, but token holders have no right to it. The turning point, argues the author, is imminent US legislation. The already-passed GENIUS Act provides a framework for stablecoins. The crucial CLARITY Act, currently in advanced legislative stages, aims to clearly categorize digital assets and define their regulatory treatment (SEC vs. CFTC). This would allow developers to legally design tokens with enforceable economic rights, such as profit distribution. If passed, this would enable a shift from speculation to fundamental investment. Analysis would focus on protocol revenue sustainability, network effects, valuation multiples, and the specific rights encoded in a token's contract—mirroring traditional equity analysis. The article notes significant legislative hurdles and timelines (1-3 years for rulemaking post-passage), but emphasizes the direction is set. A deeper challenge remains: building decentralized, legally enforceable governance and ownership structures to protect token holder rights, akin to corporate law. This will be a core development focus. The transformation applies mainly to revenue-generating protocol tokens, not to assets like Bitcoin (digital gold). The article concludes that the industry's question has evolved from "can tokens create value?" to "who gets to allocate that value?". Solving the latter, as in the 1920s, will mark crypto's transition to a legitimate asset class for fundamental investment.

Foresight NewsHá 9m

Farewell to Speculation: The Graham Moment of the Crypto Industry

Foresight NewsHá 9m

Quantum Computing "Manhattan Project" Unveiled: Is the Encryption Industry at a Critical Turning Point?

"Quantum Computing 'Manhattan Project' Launched: Is the Crypto Industry at a Critical Juncture?" On June 22, former U.S. President Donald Trump signed two executive orders. The first mandates all federal agencies upgrade their cryptographic systems to new, quantum-resistant standards by 2030. The second orders the Department of Energy to lead the development of a national quantum computer, signaling a shift from laboratory research to a state-enforced national agenda. This creates a hard deadline. A powerful quantum computer could break current encryption. The threat is compounded by "harvest now, decrypt later" attacks, where encrypted data is stored today for future decryption. Federal agencies must appoint migration officers and complete post-quantum cryptography (PQC) upgrades for key establishment by 2030 and digital signatures by 2031. Procurement rules will also be changed, forcing government contractors to comply. The crypto industry faces a direct threat. Bitcoin's ECDSA signatures are theoretically vulnerable. Research indicates millions of Bitcoin with exposed public keys are at risk if quantum computers advance. While projects like Bitcoin Quantum testnets and efforts by Ethereum, Solana, NEAR, and Zcash are exploring quantum-resistant solutions, achieving consensus in decentralized networks remains a major challenge. The centralized U.S. government has started a 5-year countdown. For decentralized crypto networks, the real test is whether they can complete this anti-quantum upgrade before the theoretical threat becomes a practical reality.

Foresight NewsHá 30m

Quantum Computing "Manhattan Project" Unveiled: Is the Encryption Industry at a Critical Turning Point?

Foresight NewsHá 30m

Coin Stock Barometer丨BitMine's Total Assets and Investment Reach $10.7 Billion, Exceeding ~$9.3 Billion Floating Loss; Strategy Buys Only 520 BTC, Strive Adds Positions Against the Trend (June 23)

This article provides a weekly market update on "coin-equity" trends, focusing on listed companies holding major cryptocurrencies. Key highlights include: **General Market Trends:** Global equities, particularly in the US, Japan, and South Korea, faced significant sell-offs, led by large tech and AI-related stocks. Analysts cite profit-taking and a shift from hype-driven to performance-driven valuation for AI companies. Market focus is on upcoming Micron Technology's earnings. **Cryptocurrency Treasury Updates:** * **Bitcoin (BTC):** Net weekly BTC purchases by listed companies (excluding miners) totaled approximately $86 million, down 13.97% from the prior week. Strategy (formerly MicroStrategy) purchased only 520 BTC for ~$34.9 million, while Strive Asset Management increased its holdings by 759 BTC for ~$50 million. Other notable actions include Mara Holdings adding 1,000 BTC and Capital B shareholders approving a massive financing plan (up to ~$1.2 trillion) to potentially expand its Bitcoin reserves. * **Ethereum (ETH):** BitMine emerged as the largest corporate ETH treasury, holding 5.67 million ETH (4.7% of supply). It purchased an additional 52,203 ETH ($92 million) in the past week. Sharplink completed a $75 million private placement to fund further ETH accumulation and stock buybacks. * **Solana (SOL):** The top five listed companies hold over 15.7 million SOL combined. However, Solmate Infrastructure, a SOL treasury firm, faces a lawsuit from its largest external shareholder alleging board misconduct and self-dealing. * **Other:** Updates include Canton Strategic's $50 million stock buyback plan and Lite Strategy's $1 million strategic investment in LitVM, a Layer-2 network for Litecoin. The article notes that while crypto treasury firms continue fundraising and accumulation, their stocks may struggle to rise against the broader market downturn until Q4.

marsbitHá 44m

Coin Stock Barometer丨BitMine's Total Assets and Investment Reach $10.7 Billion, Exceeding ~$9.3 Billion Floating Loss; Strategy Buys Only 520 BTC, Strive Adds Positions Against the Trend (June 23)

marsbitHá 44m

Trading

Spot
Futuros

Artigos em Destaque

O que é RE

I. Introdução ao Projeto O Re Protocol é um mercado de capitais onchain que conecta capital em stablecoin a resseguros totalmente colateralizados e regulados. O resseguro permite que as companhias de seguros transfiram partes do seu risco para outros seguradores especializados, ajudando-as a absorver grandes perdas e a continuar a fornecer cobertura. II. Informações sobre o Token Nome do token: RE(Re) III. Links Relacionados Website:https://re.xyz/ Exploradores:https://bscscan.com/token/0xd41fdb03ba84762dd66a0af1a6c8540ff1ba5dfb https://etherscan.io/token/0x526526528f35ac738177003b8773b402b8df8143 Twitter:https://twitter.com/re Nota: A introdução ao projeto provém dos materiais publicados ou fornecidos pela equipa oficial do projeto, que é apenas para referência e não constitui aconselhamento de investimento. A HTX não se responsabiliza por quaisquer perdas diretas ou indiretas resultantes.

56 Visualizações TotaisPublicado em {updateTime}Atualizado em 2026.06.18

O que é RE

Como comprar O

Bem-vindo à HTX.com!Tornámos a compra de O1 exchange (O) simples e conveniente.Segue o nosso guia passo a passo para iniciar a tua jornada no mundo das criptos.Passo 1: cria a tua conta HTXUtiliza o teu e-mail ou número de telefone para te inscreveres numa conta gratuita na HTX.Desfruta de um processo de inscrição sem complicações e desbloqueia todas as funcionalidades.Obter a minha contaPasso 2: vai para Comprar Cripto e escolhe o teu método de pagamentoCartão de crédito/débito: usa o teu visa ou mastercard para comprar O1 exchange (O) instantaneamente.Saldo: usa os fundos da tua conta HTX para transacionar sem problemas.Terceiros: adicionamos métodos de pagamento populares, como Google Pay e Apple Pay, para aumentar a conveniência.P2P: transaciona diretamente com outros utilizadores na HTX.Mercado de balcão (OTC): oferecemos serviços personalizados e taxas de câmbio competitivas para os traders.Passo 3: armazena teu O1 exchange (O)Depois de comprar o teu O1 exchange (O), armazena-o na tua conta HTX.Alternativamente, podes enviá-lo para outro lugar através de transferência blockchain ou usá-lo para transacionar outras criptomoedas.Passo 4: transaciona O1 exchange (O)Transaciona facilmente O1 exchange (O) no mercado à vista da HTX.Acede simplesmente à tua conta, seleciona o teu par de trading, executa as tuas transações e monitoriza em tempo real.Oferecemos uma experiência de fácil utilização tanto para principiantes como para traders experientes.

18 Visualizações TotaisPublicado em {updateTime}Atualizado em 2026.06.19

Como comprar O

O que é VERONA

I. Introdução ao Projeto VERONA é uma blockchain construída para todos, em todo o lado, através da abstração da cadeia. Utilizando a sua camada de Abstração Generalizada, a VERONA distingue-se por integrar funcionalidades complexas de blockchain, como contas, assinaturas e interoperabilidade, diretamente ao nível do protocolo. Esta abordagem permite a interação com aplicações de blockchain sem a necessidade de compreender as tecnologias subjacentes.1) Informação Básica Nome: VERONA (VERONA)III. Links Relacionados Link do site oficial: https://xion.burnt.com/ Whitepaper: https://xion.burnt.com/whitepaper.pdf Exploradores: https://explorer.burnt.com/ Mídias Sociais: https://x.com/burnt_xion Nota: A introdução ao projeto provém dos materiais publicados ou fornecidos pela equipa oficial do projeto, que é apenas para referência e não constitui aconselhamento de investimento. A HTX não se responsabiliza por quaisquer perdas diretas ou indiretas resultantes.

5 Visualizações TotaisPublicado em {updateTime}Atualizado em 2026.06.22

O que é VERONA

Discussões

Bem-vindo à Comunidade HTX. Aqui, pode manter-se informado sobre os mais recentes desenvolvimentos da plataforma e obter acesso a análises profissionais de mercado. As opiniões dos utilizadores sobre o preço de O (O) são apresentadas abaixo.

活动图片