Exploring Physical World AGI with "Visual Reasoning", Elorian AI Raises $55 Million

marsbit, published on 2026-04-23, last updated on 2026-04-23

Summary

Elorian AI, co-founded by former Google AI expert Andrew Dai and former Apple AI expert Yinfei Yang, has raised $55 million in early-stage funding to develop next-generation AI systems with advanced visual reasoning capabilities. While current large models excel at text-based tasks such as programming and math, they perform poorly at visual reasoning: even a top model like Gemini only matches a 3-year-old's ability on a basic visual reasoning benchmark. The key limitation lies in the architecture of current vision-language models (VLMs), which first convert visual inputs into text before reasoning, losing critical spatial and structural information. Elorian AI aims to build a native multimodal model that processes and reasons directly in visual space, enabling deeper understanding of physical relationships, constraints, and environments. The company plans to release a state-of-the-art visual reasoning model in 2026, with potential applications in robotics, disaster management, engineering, healthcare, and AI hardware. By using high-quality, diverse data, including extensive synthetic data, Elorian AI intends to create models that do not just perceive but truly understand and reason about the physical world, bringing us closer to visual AGI.

By Alpha Community

Large AI models have surpassed the average human in certain areas, such as programming and mathematics. Reports indicate that almost 100% of programming inside Anthropic is now done by AI, and Google's Gemini Deep Think solved 5 out of 6 problems at IMO 2025, reaching gold-medal level.

However, in visual reasoning, even the leading Gemini 3 Pro only reached the level of a 3-year-old child on BabyVision, a benchmark testing basic visual reasoning abilities.

Why are large models strong in programming and mathematics but weak in visual reasoning? The cause lies in their "thinking process." Vision-language models (VLMs) must first convert visual input into language and only then perform text-based reasoning. Many visual tasks, however, cannot be accurately described in words, so the models' visual reasoning remains poor.

Andrew Dai, who worked at Google DeepMind for 14 years, teamed up with Apple's seasoned AI expert Yinfei Yang to establish a company called Elorian AI. Their goal is to elevate the model's visual reasoning ability from "child level" to "adult level," enabling the model to natively "think" within the "visual space" and thereby advance toward AGI in the physical world.

Elorian AI raised $55 million in early-stage funding co-led by Striker Venture Partners, Menlo Ventures, and Altimeter, with participation from 49 Palms and top AI scientists including Jeff Dean.

Pioneers in Multimodal Models Aim to Equip Visual Models with Reasoning Abilities

Andrew Dai, who is of Chinese descent, holds a bachelor's degree in computer science from Cambridge and a PhD in machine learning from Edinburgh. He interned at Google during his PhD and joined the company in 2012, staying for 14 years until starting his own business.


Shortly after joining Google, he co-authored "Semi-supervised Sequence Learning" with Quoc V. Le, the first paper on language model pre-training and supervised fine-tuning. This paper helped lay the foundation for the birth of GPT. Another foundational paper of his is "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts," which paved the way for the now-mainstream MoE architecture.


During his time at Google, he was deeply involved in almost all of the company's large model training efforts, from PaLM to Gemini 1.5 and Gemini 2.5. At Jeff Dean's direction, he began leading Gemini's data division (including synthetic data) in 2023, and the team later expanded to hundreds of people.


Co-founding Elorian AI with Andrew Dai is Yinfei Yang, who worked at Google Research for four years, focusing on multimodal representation learning, before joining Apple to lead multimodal model R&D.


His representative research, "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision," advanced the development of multimodal representation learning.

Elorian AI's co-founders also include Seth Neel, who was an Assistant Professor at Harvard University and is an expert in data and AI.

Why discuss the groundbreaking papers written by Elorian AI's co-founders? Because their goal is not just engineering optimization but a paradigm shift at the foundational architecture level, upgrading AI from text-based intelligent understanding to vision-based intelligent understanding.

Today, despite excelling at text-based tasks, even the most advanced frontier multimodal models still stumble on the most basic visual grounding tasks.

Consider, for example, working out how to fit a part precisely into a mechanical device so that it runs more accurately and efficiently. Spatial, physical tasks like this are simple for elementary school students but challenging for existing multimodal large models.

This brings us back to biology for clues. In the human brain, vision is the underlying substrate supporting many thinking processes. Humans' ability to use visual and spatial reasoning is far more ancient than language-based logical reasoning.

For instance, teaching someone to navigate a maze using language can be confusing, but drawing a sketch makes it instantly understandable.

Even a bird, without language, can recognize and reason about geographical features through vision to achieve global long-distance migration. This is a strong signal that vision is likely the correct direction for truly advancing machine reasoning.

So imagine encoding this biological visual instinct into AI's genes from the very beginning of model construction: building a native multimodal model that "simultaneously understands and processes text, images, video, and audio" and therefore possesses genuine visual understanding. Andrew Dai and his team aim to build an innate "synesthete," teaching machines not only to "see" the world but also to "understand" it.

To Andrew Dai and his team, a deep understanding of the real "physical world" is the key to achieving the next leap in machine intelligence and ultimately reaching "Visual AGI."

VLMs with Post-Reasoning Are Not the Right Path to Visual Reasoning

Other teams have attempted this before. In fact, Andrew Dai's previous Gemini team was already among the global leaders in the multimodal field. However, traditional multimodal models are still primarily vision-language models (VLMs), built on a "two-step" logic: first converting visual input into language, then performing text-based reasoning (sometimes assisted by external tools).

However, post-reasoning inherently has limitations. On one hand, it is prone to model hallucinations; on the other, many visual tasks cannot be precisely described in words.

Additionally, visual generation models like Nano Banana excel at multimodal generation, but generation ability does not equal reasoning ability. The "thinking" that precedes generation still relies on language models, not on native reasoning capability.

To develop models that truly understand the spatial, structural, and relational complexities of the visual world, disruptive innovation at the underlying technology level is necessary.

So, how to innovate? Elorian AI's founders, with years of experience in the multimodal field, approach this by deeply integrating multimodal training with a new architecture specifically designed for multimodal reasoning. They abandon the traditional approach of treating images as static input, instead training models to directly interact with and manipulate visual representations to autonomously parse their structure, relationships, and physical constraints.

Of course, another core element is data, which is crucial to the performance and success of these models.

Andrew Dai stated that they place great importance on data quality, data mix ratios, data sources, and data diversity. They have innovated at the data layer, reconstructing the reasoning chain in visual space, and are extensively and deeply using synthetic data.

Combined, these efforts will give rise to new AI systems that move beyond simple visual "perception" to high-level visual "reasoning."

This AI system could be a visual reasoning foundation model: a model that is highly general yet exceptionally proficient in one specific capability set, visual reasoning.

As a general foundation model, its application areas should be broad.

First, in the robotics field, it could become the underlying neural center of powerful systems, giving them the ability to operate autonomously in various unfamiliar environments.

For example, sending a robot to handle a sudden safety fault in a hazardous environment requires the robot to make quick and accurate decisions on the spot. If the robot lacks a foundation model with deep reasoning capabilities, people wouldn't dare let it randomly press buttons or operate levers. But if it has strong reasoning ability, it might think: "Before operating this panel, maybe I should pull this lever first to activate the safety mechanism."

Furthermore, in disaster management, models with visual reasoning could analyze satellite images to monitor and prevent forest fires. In engineering, they could accurately understand complex visual blueprints and system diagrams. This ability matters because the physical world operates on fundamentally different principles from the world of pure code. You can't design an airplane wing just by typing a few lines of code.

However, Elorian AI's models and capabilities currently exist only on paper. The company plans to release a model in 2026 that achieves SOTA performance in visual reasoning. Only then can we verify whether the results match the claims.

When AI Truly Possesses "Visual Reasoning" Ability, How Will It Change the Physical World?

The technology for enabling AI to understand and influence the real physical world has gone through several iterations.

From image recognition in the traditional CV era, to image generation and multimodal models in generative AI, to world models, AI's understanding of the physical world has been continuously enhanced.

Visual reasoning foundation models could take this a step further, because visual reasoning allows AI to understand the physical world more deeply and thereby reach a higher level of machine intelligence.

When models capable of deep understanding and fine-grained operation empower the embodied intelligence and AI hardware industries, the application scope of both will expand greatly. For example, robots could perform more reliable industrial production or work in medical care; AI hardware, especially wearable devices, could become smarter personal assistants.

However, underlying these technologies is still data. As Andrew Dai mentioned earlier, data quality, data mix ratios, data sources, and data diversity all determine model performance.

In physical AI, Chinese companies are closer to the world's leading edge, at both the model level and the data level, than they are in text-based large models. If they can leverage their richer data and application scenarios to iterate faster, then in both embodied intelligence and AI hardware, and across industry, healthcare, and the home, they have a greater opportunity to reach leading levels and potentially produce world-class enterprises.


Related Questions

Q: How do current vision-language models (VLMs) work according to the article, and what are their limitations?

A: VLMs process visual input by first converting it into language and then performing text-based reasoning. Their limitation is that many visual tasks cannot be accurately described in text, leading to poor visual reasoning capabilities.

Q: Who are the founders of Elorian AI and what are their backgrounds?

A: The founders include Andrew Dai, a former Google DeepMind researcher with 14 years of experience, and Yinfei Yang, an AI expert who worked at Google Research and Apple. Andrew Dai contributed to foundational papers on language model pre-training and the MoE architecture, while Yinfei Yang focused on multimodal representation learning. Seth Neel, a former Assistant Professor at Harvard University, is also a co-founder.

Q: How does Elorian AI plan to improve AI's visual reasoning capabilities?

A: Elorian AI aims to develop a native multimodal model that processes text, images, video, and audio simultaneously. The company focuses on integrating multimodal training with a new architecture designed for multimodal reasoning, training models to interact directly with visual representations to parse structure and physical constraints, and using high-quality, diverse data, including synthetic data.

Q: What potential applications are mentioned for AI with advanced visual reasoning skills?

A: Applications include robotics for autonomous operation in unfamiliar environments, disaster management through satellite image analysis, engineering through interpretation of complex blueprints and system diagrams, and smarter AI hardware such as wearable devices that act as personal assistants.

Q: When does Elorian AI plan to release its model, and what is the expected achievement?

A: Elorian AI plans to release a model in 2026 that achieves state-of-the-art (SOTA) performance in visual reasoning, aiming to elevate capabilities from "child level" to "adult level."
