New Work from Kaiming He's Team: Drop the VAE and Private Data, and Text-to-Image Gets Stronger

marsbit | Published on 2026-06-22 | Last updated on 2026-06-22

Abstract

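Kaiming He's team has released MiniT2I, a minimalist text-to-image model that challenges the industry's reliance on complex components. The model drops the VAE encoder-decoder, AdaLN conditioning, private data, and reinforcement-learning alignment, and trains directly in pixel space with a flow matching objective. It uses an improved MM-JiT architecture in which a lightweight text adapter replaces the more elaborate conditioning mechanism, making the model simpler and more efficient.

Training uses only public datasets, in two stages: pretraining followed by fine-tuning. The B/16 version, with just 258M parameters, outperforms comparable models several times its size on multiple benchmarks. The scaled-up L/16 version approaches larger state-of-the-art models in style, composition, and related dimensions.

The work shows that text-to-image models can reach strong performance by simplifying the architecture and training pipeline, which may push the field from "piling on" toward "refining". The team also notes the current limits of pixel-space models: patch-boundary artifacts, scaling to high resolution, and data bottlenecks.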

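Text-to-image generation has long been a red ocean, and it looks as if there is nothing left to compete on.

What do you need to train a strong text-to-image model today?

Following the current mainstream recipe, the list is long: a pretrained VAE encoder-decoder, several text encoders concatenated together, a carefully designed conditioning mechanism, massive amounts of data, an RL or DPO alignment stage, and so on.

Overall, everyone seems to accept one premise: text-to-image simply has to be this complicated.

Kaiming He's team has gone the other way and rethought text-to-image models. They released MiniT2I, a pixel-space text-to-image model that is deliberately minimalist.

No VAE encoder-decoder, no AdaLN conditioning, no auxiliary losses, no private data, no RL/DPO alignment: a pure flow matching objective trained directly on pixels. The B/16 version, with 258M parameters, reaches 0.87 on GenEval and 84.2 on DPG-Bench, ahead of comparable pixel-space models several times its size.

MiniT2I's core claim: if the text condition is injected as context tokens that carry semantic information, text-to-image generation and class-conditional ImageNet generation are not fundamentally that different. The architecture can be similar, the compute comparable, and even the data scale can line up.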

  • Paper title: A Minimalist Baseline for Text-to-Image Generation
  • Technical blog: https://peppaking8.github.io/#/post/minit2i
  • Open-source code: https://github.com/PeppaKing8/minit2i-jax

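Technical approach: subtraction at every step

Straight to pixel space, no VAE

MiniT2I's first design choice is already radical: drop the VAE and denoise directly on RGB pixels.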
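To make "a flow matching objective trained directly on pixels" concrete, here is a minimal NumPy sketch of one loss computation. It is not the team's implementation. The interpolation convention, the velocity target, the `model` callable, and the toy image size are all assumptions made for illustration, and text conditioning is left out for brevity.

```python
import numpy as np

def flow_matching_loss(model, images, rng):
    """Loss for one batch of pixel-space flow matching (illustrative sketch).

    Assumed convention (not from the article):
    x_t = (1 - t) * noise + t * image, so the target velocity is image - noise.
    """
    batch = images.shape[0]
    noise = rng.standard_normal(images.shape)
    # One time value per sample, broadcast over height, width, and channels.
    t = rng.uniform(0.0, 1.0, size=(batch, 1, 1, 1))
    x_t = (1.0 - t) * noise + t * images
    target_velocity = images - noise
    # The model sees raw noisy pixels: no VAE latents and, in this sketch,
    # no explicit timestep input.
    predicted_velocity = model(x_t)
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def zero_model(x):
    """Hypothetical stand-in for the denoiser: always predicts zero velocity."""
    return np.zeros_like(x)

rng = np.random.default_rng(0)
images = rng.uniform(-1.0, 1.0, size=(2, 32, 32, 3))  # toy RGB batch in [-1, 1]
loss = flow_matching_loss(zero_model, images, rng)
print(loss)
```

A real training loop would replace `zero_model` with the Transformer and backpropagate through this loss; the point here is only that the regression target lives in pixel space.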

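Latent diffusion is the current mainstream paradigm: an autoencoder first compresses the image into a low-dimensional space, and diffusion runs there. This does make high resolution feasible, but at the cost of reconstruction error, an extra training stage, and a mismatch between the objectives of the encoder and the denoiser.

MiniT2I's reasons for choosing pixel space are pragmatic. At 512×512 resolution, 16×16 patches cut the image into 1024 tokens, a sequence length well within a Transformer's comfort zone. Without the VAE, the compute for a single forward step drops from ~1379 GFLOPs to ~570 GFLOPs (B/16 setting), and reconstruction accuracy no longer imposes a ceiling: the output can be as good as the denoiser is capable of.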
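The token count quoted above follows directly from the patch arithmetic. The sketch below shows one common way to patchify an image with NumPy; the function name and memory layout are illustrative choices, not taken from the MiniT2I code.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    height, width, channels = image.shape
    assert height % patch_size == 0 and width % patch_size == 0
    rows, cols = height // patch_size, width // patch_size
    # (rows, p, cols, p, C) -> (rows, cols, p, p, C) -> (rows * cols, p * p * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

image = np.zeros((512, 512, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (1024, 768)
```

Each of the 1024 tokens holds 16 × 16 × 3 = 768 raw pixel values, which a linear layer would then project to the model width.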

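Experiments bear this out: at the same parameter budget, the pixel model's FID is on par with the latent-space model's (18.7 vs 19.0), while its per-step cost is 5x lower.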

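MM-JiT architecture: back to a plain Transformer

SD3's MM-DiT uses AdaLN (Adaptive Layer Normalization) in every block to inject the timestep and the pooled text embedding into the network. Each sub-block needs scale, shift, and gate parameters, produced from the conditioning vector by an extra MLP. It is an elegant modulation mechanism, but MiniT2I finds that it is not necessary.

The MM-JiT architecture proposed in MiniT2I does two things:

1. Add a two-layer text adapter: before joint attention, insert two lightweight Transformer blocks so that the frozen T5 features first "adapt" to what the denoiser needs.

2. Remove the AdaLN branch: the timestep and global text information are no longer injected through an extra path. The model can still sense the noise level, because the noise-corrupted image itself carries the timestep information.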
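The two changes above can be sketched in a few dozen lines of NumPy. This is a simplified illustration rather than the MM-JiT implementation: it uses single-head attention, a ReLU MLP, normalization without learned parameters, no positional embeddings, and one shared set of weights for text and image tokens. The names `pre_norm_block` and `mm_jit_forward` and all sizes are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalization without learned scale or shift, for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def init_block(rng, dim):
    scale = dim ** -0.5
    return {
        "qkv": rng.standard_normal((dim, 3 * dim)) * scale,
        "out": rng.standard_normal((dim, dim)) * scale,
        "mlp_in": rng.standard_normal((dim, 4 * dim)) * scale,
        "mlp_out": rng.standard_normal((4 * dim, dim)) * (4 * dim) ** -0.5,
    }

def pre_norm_block(params, x):
    """Plain pre-norm block: no scale/shift/gate from a conditioning vector."""
    q, k, v = np.split(layer_norm(x) @ params["qkv"], 3, axis=-1)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # single head for brevity
    x = x + (attn @ v) @ params["out"]
    hidden = np.maximum(layer_norm(x) @ params["mlp_in"], 0.0)  # ReLU stand-in
    return x + hidden @ params["mlp_out"]

def mm_jit_forward(adapter_blocks, joint_blocks, text_tokens, image_tokens):
    # 1. Text adapter: lightweight blocks applied to the text features only.
    for params in adapter_blocks:
        text_tokens = pre_norm_block(params, text_tokens)
    # 2. Joint attention: text and noisy image tokens share one sequence,
    #    so the text reaches the image purely as context tokens.
    x = np.concatenate([text_tokens, image_tokens], axis=0)
    for params in joint_blocks:
        x = pre_norm_block(params, x)
    # Only the image positions are read out as the denoiser's prediction.
    return x[text_tokens.shape[0]:]

rng = np.random.default_rng(0)
dim = 64
adapter = [init_block(rng, dim) for _ in range(2)]
joint = [init_block(rng, dim) for _ in range(3)]
text = rng.standard_normal((8, dim))    # stand-in for projected frozen T5 features
image = rng.standard_normal((16, dim))  # stand-in for embedded noisy patches
out = mm_jit_forward(adapter, joint, text, image)
print(out.shape)  # (16, 64)
```

Note that no timestep enters `mm_jit_forward`: in this design the only signal about the noise level is the noisy image tokens themselves.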

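The result is a clean architecture close to a standard pre-norm Transformer. Removing AdaLN reduces the parameter count, and the same compute budget can instead buy more layers (12 → 17). FID drops from 18.7 to 13.7, and the architecture itself becomes easier to understand and modify.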
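A rough weight count gives a feel for how much room the AdaLN branch takes up. The sketch assumes a DiT-style AdaLN in which each block owns a linear map from the conditioning vector to six modulation vectors, a 4x MLP expansion, and a width of 768. None of these figures come from the article, and the count covers weights only, not FLOPs.

```python
def block_params(dim, with_adaln):
    """Rough weight count for one Transformer block (biases and norms ignored)."""
    attention = 4 * dim * dim  # q, k, v, and output projections
    mlp = 8 * dim * dim        # two linear layers with 4x expansion
    # Assumed DiT-style AdaLN: one linear map from the conditioning vector to
    # 6 modulation vectors (scale, shift, gate for each of the 2 sub-blocks).
    adaln = 6 * dim * dim if with_adaln else 0
    return attention + mlp + adaln

dim = 768  # hypothetical width, chosen only for illustration
with_adaln = 12 * block_params(dim, with_adaln=True)
without_adaln = 17 * block_params(dim, with_adaln=False)
print(with_adaln, without_adaln)
```

Under these assumptions a block with AdaLN carries 18·dim² weights against 12·dim² without it, so 17 plain blocks hold slightly fewer weights than 12 modulated ones.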

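Training data: fully public, two stages

MiniT2I's training data is equally minimalist:

  • Pretraining: LLaVA-recaptioned CC12M (a publicly available dataset recaptioned by a VLM), 250K steps
  • Fine-tuning: ~120K high-quality image-text pairs (BLIP3o-60K + LAION DALL·E 3 Discord set + ShareGPT-4o-Image), 40K steps

This pretrain-then-fine-tune recipe mirrors the LLM training paradigm: pretraining buys coverage, and fine-tuning teaches the model what a good answer looks like. Ablations show that neither can be dropped. With pretraining alone, image quality is acceptable but prompt following is poor; with fine-tuning alone, the model sees too narrow a world and generation diversity collapses.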
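The two-stage recipe can be written down as a small schedule. The dataset names and step counts below come from the list above; the `Stage` class, the `stage_at` helper, and the single global step counter are hypothetical conveniences, not part of the released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    datasets: tuple
    steps: int

SCHEDULE = (
    Stage("pretrain", ("LLaVA-recaptioned CC12M",), 250_000),
    Stage(
        "finetune",
        ("BLIP3o-60K", "LAION DALL·E 3 Discord set", "ShareGPT-4o-Image"),
        40_000,
    ),
)

def stage_at(step, schedule=SCHEDULE):
    """Return the stage that a global training step falls into."""
    start = 0
    for stage in schedule:
        if step < start + stage.steps:
            return stage
        start += stage.steps
    raise ValueError(f"step {step} is past the end of the schedule")

total_steps = sum(stage.steps for stage in SCHEDULE)
print(total_steps)             # 290000
print(stage_at(250_000).name)  # finetune
```

Fine-tuning accounts for 40K of the 290K total steps, which matches the framing that pretraining buys coverage and a short second stage shapes output quality.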

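Results: small model, big performance

Among pixel-space text-to-image models, MiniT2I stands out for its cost-effectiveness.

With only about 600M total parameters (text encoder included), MiniT2I-B/16 beats models with 3-4x its parameter count on GenEval and DPG-Bench. Training is also very cheap: the B/32 ablation model takes only about 3 days on 8 H100s, and its total training FLOPs are comparable to a standard 200-epoch ImageNet experiment.

Scaled up to L/16 (912M parameters), the model improves clearly in style diversity, spatial relations, and text rendering, and its generation quality on imaginative scenes matches or even exceeds that of SD3-Medium (~2B parameters).

On the more comprehensive PRISM-Bench evaluation, MiniT2I-L/16 does well on the style, composition, and imagination dimensions (79.9, 78.4, 57.9), approaching SD3-Medium. It still lags on text rendering (30.6 vs 50.9 for SD3) and named entities (60.3 vs 66.3). The team acknowledges that this is an inherent limit of the public-data recipe and that closing the gap requires dedicated data.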

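Limitations and outlook

MiniT2I is a proof of concept for a technical direction, not a finished product. The team candidly lists several open problems:

  • Patch artifacts in pixel space: there is a measurable discontinuity at patch boundaries (gradients at boundaries are 17-22% higher than elsewhere), a problem latent-space models do not have
  • Side effects of CFG in pixel space: a high guidance scale (~6) pushes local tokens off the data manifold, and with no decoder to smooth things over, this shows up directly as visual flaws
  • Resolution ceiling: the model works well at 512×512 today; pushing to 4K+ requires longer sequences or more efficient attention
  • Data bottleneck: text rendering and named entities remain weaker than in industrial systems and need dedicated data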
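The patch-boundary figure in the first item above suggests a simple diagnostic: compare image gradients that cross a patch edge with those that do not. The sketch below is one plausible way to compute such a ratio. The exact metric the team used is not described here, so the function, its name, and the choice to measure only horizontal gradients are assumptions.

```python
import numpy as np

def boundary_gradient_ratio(image, patch_size=16):
    """Mean |horizontal gradient| across patch boundaries divided by the mean elsewhere.

    `image` is (H, W) or (H, W, C). A ratio near 1 means no measurable seam.
    Only seams between horizontally adjacent patches are measured.
    """
    diffs = np.abs(np.diff(image.astype(np.float64), axis=1))
    # diffs[:, j] compares columns j and j + 1; the pair straddles a patch
    # boundary when j + 1 is a multiple of patch_size.
    columns = np.arange(diffs.shape[1])
    on_boundary = (columns + 1) % patch_size == 0
    return diffs[:, on_boundary].mean() / diffs[:, ~on_boundary].mean()

cols = np.arange(64)
ramp = np.tile(cols.astype(np.float64), (64, 1))  # smooth image, no seams
seamed = ramp + 2.0 * (cols // 16)                # add a jump at each patch edge
print(boundary_gradient_ratio(ramp))    # 1.0
print(boundary_gradient_ratio(seamed))  # 3.0
```

On this scale, the reported 17-22% excess would correspond to a ratio of roughly 1.17 to 1.22 on generated images.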

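MiniT2I shows that text-to-image today is not a game only top industrial labs can play.

When a 258M-parameter model, trained on purely public data for 3 days on academic-scale compute, can beat rivals several times its size, text-to-image may be going through a paradigm shift from "piling on" to "refining".

"T2I is no longer an unscalable wall. You are welcome to use it, improve it, and build simpler baselines."

This article comes from the WeChat official account "机器之心".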

Related Questions

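Q: What is the core architectural difference between the MiniT2I model proposed by Kaiming He's team and conventional text-to-image models?

A: The core difference is MiniT2I's minimalist design. It drops the VAE (variational autoencoder) encoder-decoder that conventional text-to-image models commonly use, and denoises and generates directly in high-dimensional pixel (RGB) space. It also removes the AdaLN (adaptive layer normalization) conditioning mechanism used in models such as SD3, returning to a plainer, easier-to-understand pre-norm Transformer, and handles the text condition with an added lightweight text adapter. This simplification lowers compute overhead and lets the model stack more layers within the same compute budget.

Q: Why does MiniT2I train in pixel space rather than latent space, and what are the advantages and drawbacks?

A: MiniT2I trains in pixel space mainly to eliminate the reconstruction error, extra training stage, and encoder-denoiser objective mismatch that a VAE introduces. Concretely, it turns a 512×512 image into a sequence of 1024 tokens using 16×16 patches, which is still within a Transformer's effective range.

Advantages:

1. The VAE's ceiling on reconstruction accuracy is gone, so model performance depends directly on the denoiser's capability.

2. Compute cost drops significantly (per-step compute in the B/16 configuration falls from about 1379 GFLOPs to about 570 GFLOPs).

Drawbacks:

1. Visible discontinuities (artifacts) can appear at patch boundaries.

2. High-resolution generation (such as 4K) runs into the challenge of growing sequence length.

Q: What is MiniT2I's training data strategy according to the article, and why use two stages?

A: MiniT2I's training data strategy follows the principles of minimalism and openness, in two stages:

1. Pretraining stage: the public LLaVA-recaptioned CC12M dataset, recaptioned by a VLM (vision-language model), for about 250K steps, so that the model learns broad vision-language associations.

2. Fine-tuning stage: about 120K high-quality image-text pairs from BLIP3o-60K, the LAION DALL·E 3 Discord set, and ShareGPT-4o-Image, for about 40K steps, to improve generation quality and prompt following.

The reason for two stages is similar to large language model (LLM) training: pretraining provides broad knowledge coverage, while fine-tuning focuses on the quality and faithfulness of the outputs. Ablations show that neither can be dropped; without one of them, generation diversity or prompt following suffers.

Q: According to the article, how does MiniT2I perform in evaluations, and what are its strengths and weaknesses?

A: MiniT2I, especially the 258M-parameter B/16 version, is highly cost-effective in evaluations.

Strengths:

1. On GenEval (0.87) and DPG-Bench (84.2), it beats comparable pixel-space models several times its size.

2. Training cost is low: the B/32 ablation model needs only about 3 days on 8 H100s.

3. Scaled up to the L/16 version (912M parameters), it does well on style diversity, spatial relations, and imaginative scenes, approaching or exceeding SD3-Medium at about 2B parameters.

Weaknesses:

1. In text rendering and named-entity generation, a clear gap remains to top industrial models such as SD3 (for example, a PRISM-Bench text rendering score of 30.6 vs 50.9).

2. Because it relies on public data, it faces data bottlenecks in specific domains and fine-grained concepts.

3. Pixel-space generation suffers from patch-boundary artifacts and from visual flaws at high CFG guidance scales.

Q: What does the article see as the significance of MiniT2I, and what impact might it have on the text-to-image field?

A: MiniT2I's significance lies in an important proof of concept that challenges the prevailing "pile on complexity" paradigm in text-to-image. It shows that a minimalist architecture (no VAE, simplified conditioning), public data only, and affordable academic-scale compute (3 days of training) are enough to train a strong text-to-image model.

Possible impact on the field:

1. Paradigm shift: moving text-to-image research away from a "piling on" race for model scale, private data, and complex components (such as VAE and RL/DPO), toward architectural "refining", design simplicity, and training efficiency.

2. Lower barrier to entry: letting more academic institutions and resource-limited teams take part in developing and improving frontier text-to-image models, which promotes openness and reproducibility.

3. Inspiration for future directions: pointing to the potential of operating directly in pixel space, simplifying modulation mechanisms, and adopting LLM-style two-stage training, and giving follow-up research a clear, reproducible baseline.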
