Tens of Millions of Errors Per Hour: Investigation Reveals the 'Accuracy Illusion' of Google AI Search

marsbit, published 2026-04-13, updated 2026-04-13

Article Summary

A New York Times investigation, conducted with AI startup Oumi, reveals significant accuracy and reliability problems with Google's AI Overviews search feature. Across more than 4,300 test queries, accuracy improved from 85% (powered by Gemini 2) to 91% (Gemini 3). However, at Google's scale of roughly 5 trillion searches a year, a 9% error rate would mean roughly 51 million incorrect answers per hour if every search returned an AI Overview. A critical finding is the prevalence of "unsubstantiated citations." Among correct answers, the share whose cited links do not support the AI's summary rose from 37% to 56% with the Gemini 3 upgrade, making it harder for users to verify information. The AI relies heavily on low-quality sources, with Facebook and Reddit among its most-cited websites. The system is also easy to manipulate: a BBC journalist "poisoned" it by publishing a fabricated article, and Google's AI began presenting the false information as fact within 24 hours. Google disputed the study's methodology, criticizing its use of the SimpleQA benchmark and of an AI model (Oumi's own) to evaluate another AI. The company maintains that AI Overviews, combined with its search ranking systems, performs better than the underlying model alone. Critics note that this defense does little to bolster user confidence in the feature's reliability.

Author: Claude, Deep Tide TechFlow

Deep Tide Guide: A recent test conducted by The New York Times in collaboration with AI startup Oumi shows that the accuracy rate of Google Search's AI Overviews feature is approximately 91%. However, given Google's scale of about 5 trillion searches a year, this translates to tens of millions of incorrect answers generated every hour. More troublingly, even when the answers are correct, more than half of them cite links that fail to support their conclusions.

Google is disseminating misinformation on an unprecedented scale, and most people are completely unaware.

According to The New York Times, AI startup Oumi, commissioned by the publication, used the industry-standard test SimpleQA, developed by OpenAI, to evaluate the accuracy of Google's AI Overviews feature. The test covered 4,326 search queries, conducted in two rounds: one in October last year (powered by Gemini 2) and another in February this year (upgraded to Gemini 3). The results showed that Gemini 2's accuracy was about 85%, which improved to 91% with Gemini 3.

91% sounds good, but it is a different story at Google's scale. Google processes approximately 5 trillion search queries a year, or about 570 million per hour. At a 9% error rate, and if every search returned an AI Overview, that would be roughly 51 million inaccurate answers per hour, or about 860,000 per minute.
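
As a rough check on the per-hour arithmetic, the sketch below divides the annual search volume by the hours in a year and applies an error rate. It assumes that every search returns an AI Overview, which the study does not establish, and the function and constant names are illustrative only. The estimate is sensitive to the rate used, so it prints both 9% and 10%.

```python
# Back-of-the-envelope check of the hourly error estimate.
# Assumption (not established by the study): every search returns an AI Overview.
ANNUAL_SEARCHES = 5e12       # ~5 trillion searches per year, as cited in the article
HOURS_PER_YEAR = 365 * 24    # 8,760 hours


def wrong_answers_per_hour(error_rate: float) -> float:
    """Estimated inaccurate answers per hour at a given error rate."""
    return ANNUAL_SEARCHES / HOURS_PER_YEAR * error_rate


for rate in (0.09, 0.10):
    per_hour = wrong_answers_per_hour(rate)
    print(f"{rate:.0%}: {per_hour / 1e6:.1f} million/hour, "
          f"{per_hour / 60 / 1e3:.0f} thousand/minute")
```

Under these assumptions, a 9% rate gives about 51.4 million inaccurate answers per hour (about 856 thousand per minute), while a 10% rate gives about 57.1 million per hour.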

Correct Answers, Wrong Sources

More alarming than the accuracy rate is the issue of "unsubstantiated citations."

Oumi's data shows that with Gemini 2, 37% of correct answers carried "unsubstantiated citations," meaning the links attached to the AI summary did not support the information it gave. After the upgrade to Gemini 3, that share rose rather than fell, to 56%. In other words, the model gives correct answers more often, but it is increasingly failing to "show its work."
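
One way to read the two figures together is to ask what share of all answers were both correct and backed by their citations. The sketch below multiplies the reported accuracy by the share of correct answers whose citations did support them. It assumes the reported percentages are exact and apply to the full set of 4,326 queries in each round, so the results are approximations, not figures from the study.

```python
TOTAL_QUERIES = 4326  # queries in the test, as reported

# Per round: (accuracy, share of correct answers with unsubstantiated citations)
ROUNDS = {
    "Gemini 2": (0.85, 0.37),
    "Gemini 3": (0.91, 0.56),
}


def correct_and_supported_share(accuracy: float, unsupported_share: float) -> float:
    """Share of all answers that are correct AND supported by their citations."""
    return accuracy * (1 - unsupported_share)


for name, (accuracy, unsupported) in ROUNDS.items():
    share = correct_and_supported_share(accuracy, unsupported)
    approx_queries = round(TOTAL_QUERIES * share)
    print(f"{name}: {share:.0%} of answers (~{approx_queries} queries) "
          "correct and supported by citations")
```

On these assumptions, the share of answers that were both correct and verifiable from their citations fell from roughly 54% with Gemini 2 to roughly 40% with Gemini 3, even as headline accuracy rose.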

Oumi CEO Manos Koukoumidis pointedly questioned: "Even if the answer is correct, how do you know it's correct? How do you verify it?"

The heavy reliance on low-quality sources by AI Overviews exacerbates this problem. Oumi found that Facebook and Reddit are the second and fourth most cited sources for AI Overviews, respectively. In inaccurate answers, Facebook was cited 7% of the time, higher than the 5% rate in accurate answers.

BBC Journalist's Fake Article "Poisons" Results Within 24 Hours

Another serious flaw of AI Overviews is its susceptibility to manipulation.

A BBC journalist tested the system with a deliberately fabricated false article. In less than 24 hours, Google's AI Overview presented the false information from the article as fact to users.

This means anyone who understands how the system works could potentially "poison" AI search results by publishing false content and boosting its traffic. Google spokesperson Ned Adriance responded by stating that the search AI feature is built on the same ranking and security mechanisms used to block spam, and claimed that "most examples in the test are unrealistic queries that people wouldn't actually search for."

Google's Rebuttal: The Test Itself Is Flawed

Google raised several concerns about Oumi's study. A Google spokesperson called the research "seriously flawed," citing reasons including: the SimpleQA benchmark itself contains inaccurate information; Oumi used its own AI model, HallOumi, to judge another AI's performance, potentially introducing additional errors; and the test content does not reflect real user search behavior.

Google's internal tests also showed that when Gemini 3 operates independently outside the Google Search framework, it produces false outputs at a rate as high as 28%. However, Google emphasized that AI Overviews, leveraging the search ranking system, performs better in accuracy than the model alone.

Nevertheless, as PCMag pointed out, the rebuttal undercuts itself: if your defense is that "the report pointing out our AI's inaccuracies itself uses potentially inaccurate AI," it is unlikely to increase user confidence in your product's accuracy.

Related Q&A

Q: What was the accuracy rate of Google's AI Overviews feature as tested by Oumi, and how many errors does this translate to per hour given Google's search volume?

A: The test found an accuracy rate of 91%. Given Google's annual volume of about 5 trillion searches, the 9% error rate works out to roughly 51 million inaccurate answers per hour, assuming every search returned an AI Overview.

Q: According to the Oumi study, what was the trend in "unsubstantiated citations" between the Gemini 2 and Gemini 3 versions of AI Overviews?

A: The share of correct answers with "unsubstantiated citations" (where the provided links did not support the AI's answer) increased from 37% with Gemini 2 to 56% with the upgraded Gemini 3.

Q: Which low-quality websites were identified as major sources frequently cited by Google's AI Overviews?

A: Facebook and Reddit were identified as the second and fourth most frequently cited sources by the AI Overviews feature.

Q: How did a BBC journalist demonstrate the vulnerability of Google's AI Overviews to manipulation?

A: A BBC journalist tested the system by publishing a deliberately fabricated article. Within 24 hours, Google's AI Overviews began presenting the false information from that article as a factual answer to user queries.

Q: What were Google's main criticisms of the Oumi study's methodology?

A: Google called the study "seriously flawed," stating that the SimpleQA benchmark itself contains inaccuracies, that using Oumi's own AI model to judge another AI could introduce errors, and that the test queries did not reflect real user search behavior.
