Exploring Physical World AGI with "Visual Reasoning", Elorian AI Raises $55 Million

marsbit | Published 2026-04-23 | Last updated 2026-04-23

Abstract

Elorian AI, co-founded by former Google AI expert Andrew Dai and former Apple AI specialist Yinfei Yang, has raised $55 million in early-stage funding to develop next-generation AI systems with advanced visual reasoning capabilities. While current large models excel at text-based tasks such as programming and math, they perform poorly at visual reasoning: even a top model like Gemini only matches a 3-year-old's ability on basic visual benchmarks. The key limitation lies in the architecture of current vision-language models (VLMs), which first convert visual inputs into text before reasoning, losing critical spatial and structural information. Elorian AI aims to build a native multimodal model that processes and reasons directly in visual space, enabling deeper understanding of physical relationships, constraints, and environments. The company plans to release a state-of-the-art visual reasoning model in 2026, with potential applications in robotics, disaster management, engineering, healthcare, and AI hardware. By using high-quality, diverse data, including synthetically generated data, Elorian AI intends to create models that do not just perceive but truly understand and reason about the physical world, bringing us closer to visual AGI.

By Alpha Community

Large AI models have surpassed the average human in certain areas, such as programming and mathematics. Reports indicate that Anthropic now writes almost 100% of its code with AI internally, and Google's Gemini Deep Think solved 5 out of 6 problems at IMO 2025, reaching gold-medal level.

However, in visual reasoning, even the leading Gemini 3 Pro only reached the level of a 3-year-old child on BabyVision, a benchmark testing basic visual reasoning abilities.

Why are large models strong at programming and mathematics but weak at visual reasoning? The cause lies in their "thinking process." Visual Language Models (VLMs) must first convert visual input into language and then reason over that text. Many visual tasks, however, cannot be described accurately in words, so the models' visual reasoning suffers.

Andrew Dai, who worked at Google DeepMind for 14 years, teamed up with Apple's seasoned AI expert Yinfei Yang to establish a company called Elorian AI. Their goal is to elevate the model's visual reasoning ability from "child level" to "adult level," enabling the model to natively "think" within the "visual space" and thereby advance toward AGI in the physical world.

Elorian AI raised $55 million in early-stage funding co-led by Striker Venture Partners, Menlo Ventures, and Altimeter, with participation from 49 Palms and top AI scientists including Jeff Dean.

Pioneers in Multimodal Models Aim to Equip Visual Models with Reasoning Abilities

Andrew Dai, who is of Chinese descent, holds a bachelor's degree in computer science from Cambridge and a PhD in machine learning from Edinburgh. He interned at Google during his PhD and joined the company in 2012, staying for 14 years until starting his own business.

Shortly after joining Google, he co-authored the first paper on language model pre-training and supervised fine-tuning, "Semi-supervised Sequence Learning," with Quoc V. Le. This paper laid the foundation for the birth of GPT. Another of his foundational papers is "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts," which paved the way for the now-mainstream MoE architecture.

During his time at Google, he was deeply involved in almost every large-model training effort, from PaLM to Gemini 1.5 and Gemini 2.5. At Jeff Dean's direction, he began leading Gemini's data division (including synthetic data) in 2023, and the team later grew to hundreds of people.

Co-founding Elorian AI with Andrew Dai is Yinfei Yang, who worked at Google Research for four years, focusing on multimodal representation learning, before joining Apple to lead multimodal model R&D.

His representative research, "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision," advanced the development of multimodal representation learning.

Elorian AI's co-founders also include Seth Neel, who was an Assistant Professor at Harvard University and is an expert in data and AI.

Why discuss the groundbreaking papers written by Elorian AI's co-founders? Because their goal is not just engineering optimization but a paradigm shift at the foundational architecture level, upgrading AI from text-based intelligent understanding to vision-based intelligent understanding.

The current state of AI models is that, despite excelling in text-based tasks, even the most advanced frontier multimodal large models still stumble on the most basic visual grounding tasks.

For example, consider how to fit a part precisely into a mechanical device so that it runs more accurately and efficiently. Spatial, physical tasks like this are simple for elementary school students but challenging for existing multimodal large models.

This brings us back to biology for clues. In the human brain, vision is the underlying substrate supporting many thinking processes. Humans' ability to use visual and spatial reasoning is far more ancient than language-based logical reasoning.

For instance, teaching someone to navigate a maze using language can be confusing, but drawing a sketch makes it instantly understandable.

Even a bird, without language, can recognize and reason about geographical features through vision to achieve global long-distance migration. This is a strong signal that vision is likely the correct direction for truly advancing machine reasoning.

So imagine encoding this biological visual instinct into an AI's genes from the very start of model construction: building a native multimodal model that "simultaneously understands and processes text, images, video, and audio" and therefore has genuine visual understanding. Andrew Dai and his team aim to build an innate "synesthete," teaching machines not only to "see" the world but also to "understand" it.

To Andrew Dai and his team, a deep understanding of the real "physical world" is the key to achieving the next leap in machine intelligence and ultimately reaching "Visual AGI."

VLMs with Post-Reasoning Are Not the Right Path to Visual Reasoning

There have been teams attempting this before. In fact, Andrew Dai's previous Gemini team was already among the global leaders in the multimodal field. However, traditional multimodal models are still primarily VLMs (Visual Language Models), built on a "two-step" logic: first converting visual input into language, then performing text-based reasoning (sometimes assisted by external tools).

However, post-reasoning inherently has limitations. On one hand, it is prone to model hallucinations; on the other, many visual tasks cannot be precisely described in words.

Additionally, visual generation models like Nano Banana excel at multimodal generation, but generation ability does not equal reasoning ability. The "thinking" before generation still relies on language models, not on a native reasoning capability.

To develop models that truly understand the spatial, structural, and relational complexities of the visual world, disruptive innovation at the underlying technology level is necessary.

So how do they innovate? Elorian AI's founders, drawing on years of experience in the multimodal field, deeply integrate multimodal training with a new architecture designed specifically for multimodal reasoning. They abandon the traditional approach of treating images as static input and instead train models to directly interact with and manipulate visual representations, so that the models autonomously parse structure, relationships, and physical constraints.

Of course, another core element is data, which is crucial to the performance and success of these models.

Andrew Dai stated that they place great importance on data quality, data mix ratios, data sources, and data diversity. They have innovated at the data layer, reconstructing the reasoning chain in visual space, and are extensively and deeply using synthetic data.

Combined, these efforts will give rise to new AI systems that move beyond simple visual "perception" to high-level visual "reasoning."

This AI system could be a visual reasoning foundation model: a highly general model that is exceptionally proficient in one specific capability set, visual reasoning.

As a general foundation model, its application areas should be broad.

First, in robotics, it could become the underlying neural center of powerful systems, giving them the ability to operate autonomously in a variety of unfamiliar environments.

For example, sending a robot to handle a sudden safety fault in a hazardous environment requires the robot to make quick and accurate decisions on the spot. If the robot lacks a foundation model with deep reasoning capabilities, people would not dare let it press buttons or operate levers at random. But if it has strong reasoning ability, it might think: "Before operating this panel, maybe I should pull this lever first to activate the safety mechanism."

Furthermore, in disaster management, models with visual reasoning could analyze satellite images to monitor and prevent forest fires. In engineering, they could accurately interpret complex visual blueprints and system diagrams. This ability matters because the physical world operates on fundamentally different principles from the world of pure code. You cannot design an airplane wing just by typing a few lines of code.

However, Elorian AI's models and capabilities are currently still on paper. They plan to release a model in 2026 that achieves SOTA level in visual reasoning. At that time, we can verify if their results match their claims.

When AI Truly Possesses "Visual Reasoning" Ability, How Will It Change the Physical World?

To enable AI to understand and influence the real physical world, technology has iterated several times.

From image recognition in the traditional CV era, to image generation models/multimodal models in generative AI, to world models, the understanding of the physical world has been continuously enhanced.

Visual reasoning foundation models could take this a step further, because visual reasoning allows AI to understand the physical world more deeply and thereby reach a higher level of machine intelligence.

Imagine models with deep understanding and fine-grained manipulation skills empowering the embodied intelligence and AI hardware industries: the range of applications would expand greatly. For example, robots could perform more reliable industrial production or work in medical care; AI hardware, especially wearable devices, could become smarter personal assistants.

However, underlying these technologies is still data. As Andrew Dai mentioned earlier, data quality, data mix ratios, data sources, and data diversity all determine model performance.

In physical AI, Chinese companies are closer to world leadership, at both the model level and the data level, than they are in text-based large models. If they can leverage their richer data and application scenarios to iterate faster, then in embodied intelligence and AI hardware alike, whether applied in industry, healthcare, or homes, they have a greater opportunity to reach leading levels and potentially produce world-class enterprises.

Related Questions

Q: According to the article, how do current Vision Language Models (VLMs) work, and what are their limitations?

A: VLMs process visual input by first converting it into language and then performing text-based reasoning. Their limitation is that many visual tasks cannot be accurately described in text, which leads to poor visual reasoning capabilities.

Q: Who are the founders of Elorian AI and what are their backgrounds?

A: The founders are Andrew Dai, who worked at Google DeepMind for 14 years, and Yinfei Yang, an AI expert who worked at Google Research and Apple. Andrew Dai contributed to foundational papers on language model pre-training and the MoE architecture, while Yinfei Yang focused on multimodal representation learning. The article also names Seth Neel, formerly an Assistant Professor at Harvard University, as a co-founder.

Q: How does Elorian AI plan to improve AI's visual reasoning capabilities?

A: Elorian AI aims to develop a native multimodal model that processes text, images, video, and audio simultaneously. They focus on integrating multimodal training with new architectures designed for visual reasoning, directly interacting with visual representations to parse structures and physical constraints, and using high-quality, diverse synthetic data.

Q: What potential applications are mentioned for AI with advanced visual reasoning skills?

A: Applications include robotics for autonomous operations in unfamiliar environments, disaster management through satellite image analysis, engineering by interpreting complex visual diagrams, and enhancing AI hardware like wearable devices for personal assistance.

Q: When does Elorian AI plan to release their model, and what is the expected achievement?

A: Elorian AI plans to release a model in 2026 that achieves state-of-the-art (SOTA) performance in visual reasoning, aiming to elevate capabilities from 'child-level' to 'adult-level'.
