Exploring Physical World AGI with "Visual Reasoning", ElorianAI Raises $55 Million

marsbit | Published on 2026-04-23 | Last updated on 2026-04-23

Abstract

ElorianAI, co-founded by ex-Google AI expert Andrew Dai and former Apple AI specialist Yinfei Yang, has raised $55 million in early funding to develop next-generation AI systems with advanced visual reasoning capabilities. While current large models excel in text-based tasks like programming and math, they perform poorly in visual reasoning: even a top model like Gemini 3 Pro only matches a 3-year-old’s ability on a basic visual reasoning benchmark. The key limitation lies in the architecture of current vision-language models (VLMs), which first convert visual inputs into text before reasoning, losing critical spatial and structural information. ElorianAI aims to build a native multimodal model that processes and reasons directly in visual space, enabling deeper understanding of physical relationships, constraints, and environments. The company plans to release a state-of-the-art visual reasoning model in 2026, with potential applications in robotics, disaster management, engineering, healthcare, and AI hardware. By using high-quality, diverse, and synthetically generated data, ElorianAI intends to create models that don’t just perceive but truly understand and reason about the physical world, bringing us closer to visual AGI.

By Alpha Community

Large AI models have surpassed average humans in certain areas, such as programming and mathematics. Reports indicate that almost 100% of Anthropic's internal code is now written by AI, and Google's Gemini Deep Think solved 5 out of 6 problems at IMO 2025, reaching gold-medal level.

However, in visual reasoning, even the leading Gemini 3 Pro only reached the level of a 3-year-old child on BabyVision, a benchmark testing basic visual reasoning abilities.

Why are large models strong in programming and mathematics but weak in visual reasoning? The cause lies in their "thinking process." Vision-language models (VLMs) must first convert visual input into language and then reason over that text. However, many visual tasks cannot be accurately described in words, so the models reason poorly about what they see.

Andrew Dai, who worked at Google DeepMind for 14 years, teamed up with Apple's seasoned AI expert Yinfei Yang to establish a company called Elorian AI. Their goal is to elevate the model's visual reasoning ability from "child level" to "adult level," enabling the model to natively "think" within the "visual space" and thereby advance toward AGI in the physical world.

Elorian AI raised $55 million in early-stage funding co-led by Striker Venture Partners, Menlo Ventures, and Altimeter, with participation from 49 Palms and top AI scientists including Jeff Dean.

Pioneers in Multimodal Models Aim to Equip Visual Models with Reasoning Abilities

Andrew Dai, who is of Chinese descent, holds a bachelor's degree in computer science from Cambridge and a PhD in machine learning from Edinburgh. He interned at Google during his PhD and joined the company in 2012, staying for 14 years until starting his own business.

Image Source: Andrew Dai's LinkedIn

Shortly after joining Google, he co-authored the first paper on language model pre-training and supervised fine-tuning, "Semi-supervised Sequence Learning," with Quoc V. Le. This paper laid the foundation for the birth of GPT. Another foundational paper of his is "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts," which paved the way for the now mainstream MoE architecture.

Image Source: Google

During his time at Google, he was deeply involved in training almost all of the company's large models, from PaLM to Gemini 1.5 and Gemini 2.5. At Jeff Dean's direction, he began leading Gemini's data division (including synthetic data) in 2023, and the team later expanded to hundreds of people.

Image Source: Yinfei Yang's LinkedIn

Co-founding Elorian AI with Andrew Dai is Yinfei Yang, who worked at Google Research for four years, focusing on multimodal representation learning, before joining Apple to lead multimodal model R&D.

Image Source: arXiv

His representative research, "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision," advanced the development of multimodal representation learning.

Elorian AI's co-founders also include Seth Neel, who was an Assistant Professor at Harvard University and is an expert in data and AI.

Why discuss the groundbreaking papers written by Elorian AI's co-founders? Because their goal is not just engineering optimization but a paradigm shift at the foundational architecture level, upgrading AI from text-based intelligent understanding to vision-based intelligent understanding.

Today, despite excelling in text-based tasks, even the most advanced frontier multimodal models still stumble on the most basic visual grounding tasks.

For example, consider how to fit a part precisely into a mechanical device so that it runs more accurately and efficiently. Such spatial, physical tasks are simple for elementary school students but challenging for existing multimodal large models.

This brings us back to biology for clues. In the human brain, vision is the underlying substrate supporting many thinking processes. Humans' ability to use visual and spatial reasoning is far more ancient than language-based logical reasoning.

For instance, teaching someone to navigate a maze using language can be confusing, but drawing a sketch makes it instantly understandable.

Even a bird, without language, can recognize and reason about geographical features through vision to achieve global long-distance migration. This is a strong signal that vision is likely the correct direction for truly advancing machine reasoning.

So imagine encoding this biological visual instinct into AI's genes from the very beginning of model construction: building a native multimodal model that "simultaneously understands and processes text, images, video, and audio" and therefore possesses genuine visual understanding. Andrew Dai and his team aim to build an innate "synesthete," teaching machines not only to "see" the world but also to "understand" it.

To Andrew Dai and his team, a deep understanding of the real "physical world" is the key to achieving the next leap in machine intelligence and ultimately reaching "Visual AGI."

VLMs with Post-Reasoning Are Not the Right Path to Visual Reasoning

Other teams have attempted this before. In fact, Andrew Dai's previous Gemini team was already among the global leaders in the multimodal field. However, traditional multimodal models are still primarily VLMs (vision-language models), built on a "two-step" logic: first converting visual input into language, then performing text-based reasoning (sometimes assisted by external tools).

However, post-reasoning inherently has limitations. On one hand, it is prone to model hallucinations; on the other, many visual tasks cannot be precisely described in words.

Additionally, visual generation models like NanoBanana excel in multimodal generation, but generation ability does not equal reasoning ability. The "thinking" before generation still relies on language models, not native reasoning capability.

To develop models that truly understand the spatial, structural, and relational complexities of the visual world, disruptive innovation at the underlying technology level is necessary.

So, how to innovate? Elorian AI's founders, with years of experience in the multimodal field, approach this by deeply integrating multimodal training with a new architecture specifically designed for multimodal reasoning. They abandon the traditional approach of treating images as static input, instead training models to directly interact with and manipulate visual representations to autonomously parse their structure, relationships, and physical constraints.

Of course, another core element is data, which is crucial to the performance and success of these models.

Andrew Dai stated that they place great importance on data quality, data mix ratios, data sources, and data diversity. They have innovated at the data layer, reconstructing the reasoning chain in visual space, and are extensively and deeply using synthetic data.

Combined, these efforts will give rise to new AI systems that move beyond simple visual "perception" to high-level visual "reasoning."

This AI system could be a visual reasoning foundation model: a highly general model that is exceptionally proficient in one specific capability set, visual reasoning.

As a general foundation model, its application areas should be broad.

First, in the robotics field, it could become the underlying neural center of powerful systems, giving them the ability to operate autonomously in various unfamiliar environments.

For example, sending a robot to handle a sudden safety fault in a hazardous environment requires the robot to make quick and accurate decisions. If the robot lacks a foundation model with deep reasoning capabilities, people wouldn't dare let it randomly press buttons or operate levers. But if it has strong reasoning ability, it might think: "Before operating this panel, maybe I should pull this lever first to activate the safety mechanism."

Furthermore, in disaster management, models with visual reasoning could analyze satellite images to monitor and prevent forest fires. In engineering, they could accurately understand complex blueprints and system diagrams. This ability matters because the physical world operates on fundamentally different principles from the world of pure code. You can't design an airplane wing just by typing a few lines of code.

However, Elorian AI's models and capabilities currently exist only on paper. The company plans to release a model in 2026 that achieves SOTA-level visual reasoning. Only then can we verify whether its results match its claims.

When AI Truly Possesses "Visual Reasoning" Ability, How Will It Change the Physical World?

The technology for enabling AI to understand and influence the real physical world has gone through several iterations.

From image recognition in the traditional CV era, to image generation and multimodal models in the generative AI era, to world models, AI's understanding of the physical world has been continuously enhanced.

Visual reasoning foundation models could take it a step further, because visual reasoning allows AI to understand the physical world more deeply and thereby reach a higher level of machine intelligence.

When models with deep understanding and fine-grained manipulation skills empower the embodied intelligence and AI hardware industries, the application scope of both will expand greatly. For example, robots could perform more reliable industrial production or work in medical care; AI hardware, especially wearable devices, could become smarter personal assistants.

However, underlying these technologies is still data. As Andrew Dai mentioned earlier, data quality, data mix ratios, data sources, and data diversity all determine model performance.

In physical AI, Chinese companies are closer to world leadership, at both the model level and the data level, than they are in text-based large models. If they can leverage their advantages in richer data and application scenarios to iterate faster, then whether in embodied intelligence or AI hardware, and whether in industry, healthcare, or the home, they have a greater opportunity to reach leading levels and potentially produce world-class enterprises.
