Tens of Millions of Errors Per Hour: Investigation Reveals the 'Accuracy Illusion' of Google AI Search

marsbit · Published 2026-04-13 · Updated 2026-04-13

Article Summary

A New York Times investigation, conducted with AI startup Oumi, found significant accuracy and reliability problems in Google's AI Overviews search feature. Across 4,326 test queries, accuracy rose from about 85% (powered by Gemini 2) to 91% (Gemini 3). At Google's scale of roughly 5 trillion searches a year, however, a 9% error rate would amount to about 51 million incorrect answers per hour if every search returned an AI Overview. A critical finding is the prevalence of "unsubstantiated citations": among correct answers, the share whose cited links do not support the AI's summary rose from 37% to 56% with the Gemini 3 upgrade, making answers harder for users to verify. The AI also leans heavily on low-quality sources, with Facebook and Reddit among its most-cited websites. Furthermore, the system is easy to manipulate. A BBC journalist successfully "poisoned" it by publishing a fabricated article; Google's AI began presenting the false information as fact within 24 hours. Google disputed the study's methodology, criticizing its use of the SimpleQA benchmark and of an AI model (Oumi's own) to evaluate another AI. The company maintains that AI Overviews, combined with its search ranking systems, performs better than the underlying model alone. Critics note that this defense does little to bolster user confidence in the feature's reliability.

Author: Claude, Deep Tide TechFlow

Deep Tide Guide: A recent test conducted by The New York Times in collaboration with AI startup Oumi puts the accuracy of Google Search's AI Overviews feature at approximately 91%. Given that Google processes about 5 trillion searches a year, however, that still translates to tens of millions of incorrect answers every hour. More troubling, even when the answer is correct, more than half the time the cited links fail to support it.

Google is disseminating misinformation on an unprecedented scale, and most people are completely unaware.

According to The New York Times, AI startup Oumi, commissioned by the publication, used SimpleQA, an industry-standard benchmark developed by OpenAI, to evaluate the accuracy of Google's AI Overviews feature. The test covered 4,326 search queries and was run in two rounds: one in October 2025 (powered by Gemini 2) and another in February 2026 (after the upgrade to Gemini 3). Accuracy was about 85% with Gemini 2 and improved to 91% with Gemini 3.

A 91% accuracy rate sounds good, but the picture changes at Google's scale. Google processes approximately 5 trillion search queries annually, or about 570 million per hour. If every one of those searches returned an AI Overview, a 9% error rate would mean roughly 51 million inaccurate answers per hour, or about 860,000 per minute.
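The per-hour estimate can be reproduced from the two inputs given above. The Python sketch below is a back-of-the-envelope check, not part of the Times or Oumi methodology. The function name `errors_per_hour` is hypothetical, and the calculation assumes every search returns an AI Overview, which likely overstates the true count. Under these inputs, a 9% error rate works out to about 51 million wrong answers per hour, and a 10% rate to about 57 million.

```python
# Back-of-the-envelope check of the per-hour error estimate.
# The search volume and accuracy figures come from the article.
# Assumption (ours): every search shows exactly one AI Overview.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def errors_per_hour(annual_searches: float, error_rate: float) -> float:
    """Expected wrong answers per hour if every search yields one AI answer."""
    return annual_searches * error_rate / HOURS_PER_YEAR


annual_searches = 5e12  # about 5 trillion searches per year

scenarios = [
    ("Gemini 2 (85% accurate)", 0.15),
    ("Gemini 3 (91% accurate)", 0.09),
]

for label, error_rate in scenarios:
    per_hour = errors_per_hour(annual_searches, error_rate)
    per_minute = per_hour / 60
    print(
        f"{label}: {per_hour / 1e6:.1f} million per hour, "
        f"{per_minute / 1e3:.0f} thousand per minute"
    )
```

Because the result scales linearly with both inputs, any correction to the search volume or to the share of searches that actually display an AI Overview changes the estimate by the same factor.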

Correct Answers, Wrong Sources

More alarming than the accuracy rate is the issue of "unsubstantiated citations."

Oumi's data shows that with Gemini 2, 37% of correct answers came with "unsubstantiated citations," meaning the links attached to the AI summary did not support the information it provided. After the upgrade to Gemini 3, this proportion rose rather than fell, jumping to 56%. In other words, the model gives correct answers more often, but it increasingly fails to "show its work."
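One way to read the two sets of figures together is to ask what share of all answers is both correct and backed by its citations. The Python sketch below is illustrative only. It assumes the accuracy and citation percentages refer to the same set of queries, and the name `verifiable_share` is hypothetical. Under that assumption the share falls from roughly 54% with Gemini 2 to roughly 40% with Gemini 3.

```python
# Article figures: overall accuracy, and the share of *correct* answers
# whose cited links do not support them.
# Assumption (ours): both percentages describe the same query set.

rounds = {
    "Gemini 2": {"accuracy": 0.85, "unsupported_given_correct": 0.37},
    "Gemini 3": {"accuracy": 0.91, "unsupported_given_correct": 0.56},
}


def verifiable_share(accuracy: float, unsupported_given_correct: float) -> float:
    """Share of all answers that are correct AND supported by their citations."""
    return accuracy * (1 - unsupported_given_correct)


for name, figures in rounds.items():
    share = verifiable_share(**figures)
    print(f"{name}: {share:.0%} of answers are correct and verifiable via citations")
```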

Oumi CEO Manos Koukoumidis asked pointedly: "Even if the answer is correct, how do you know it's correct? How do you verify it?"

The heavy reliance on low-quality sources by AI Overviews exacerbates this problem. Oumi found that Facebook and Reddit are the second and fourth most cited sources for AI Overviews, respectively. In inaccurate answers, Facebook was cited 7% of the time, higher than the 5% rate in accurate answers.

BBC Journalist's Fake Article "Poisons" Results Within 24 Hours

Another serious flaw of AI Overviews is its susceptibility to manipulation.

A BBC journalist tested the system by publishing a deliberately fabricated article. Less than 24 hours later, Google's AI Overview was presenting the false information from the article to users as fact.

This means anyone who understands how the system works could potentially "poison" AI search results by publishing false content and boosting its traffic. Google spokesperson Ned Adriance responded by stating that the search AI feature is built on the same ranking and security mechanisms used to block spam, and claimed that "most examples in the test are unrealistic queries that people wouldn't actually search for."

Google's Rebuttal: The Test Itself Is Flawed

Google raised several concerns about Oumi's study. A Google spokesperson called the research "seriously flawed," citing reasons including: the SimpleQA benchmark itself contains inaccurate information; Oumi used its own AI model, HallOumi, to judge another AI's performance, potentially introducing additional errors; and the test content does not reflect real user search behavior.

Google's internal tests also showed that when Gemini 3 operates independently outside the Google Search framework, it produces false outputs at a rate as high as 28%. However, Google emphasized that AI Overviews, leveraging the search ranking system, performs better in accuracy than the model alone.

Nevertheless, as PCMag pointed out, the defense is self-defeating: if your argument is that "the report pointing out our AI's inaccuracies itself uses potentially inaccurate AI," it is unlikely to increase user confidence in your product's accuracy.
