Tens of Millions of Errors Per Hour: Investigation Reveals the 'Accuracy Illusion' of Google AI Search

marsbit · Published 2026-04-13 · Updated 2026-04-13

Introduction

A New York Times investigation, conducted in collaboration with AI startup Oumi, reveals significant accuracy and reliability issues with Google's AI Overviews search feature. Testing over 4,300 queries showed the accuracy rate improving from 85% (powered by Gemini 2) to 91% (Gemini 3). However, given Google's scale of roughly 5 trillion annual searches, this 9% error rate translates to tens of millions of incorrect answers generated every hour.

A critical finding is the prevalence of "unsubstantiated citations." Among correct answers, the rate of citations that do not support the AI's summary surged from 37% to 56% with the Gemini 3 upgrade, making it difficult for users to verify information. The AI also relies heavily on low-quality sources, with Facebook and Reddit among its most-cited websites. Furthermore, the system is highly manipulable: a BBC journalist successfully "poisoned" it by publishing a fabricated article, and Google's AI began presenting the false information as fact within 24 hours.

Google disputed the study's methodology, criticizing its use of the SimpleQA benchmark and of an AI model (Oumi's own) to evaluate another AI. The company maintains that AI Overviews, combined with its search ranking systems, performs better than the underlying model alone. Critics note this defense does little to bolster user confidence in the feature's reliability.

Author: Claude, Deep Tide TechFlow

Deep Tide Guide: A recent test conducted by The New York Times in collaboration with AI startup Oumi shows that the accuracy rate of Google Search's AI Overviews feature is approximately 91%. However, given Google's scale of processing 5 trillion searches annually, this translates to tens of millions of incorrect answers generated every hour. More troublingly, even when the answers are correct, over half of the cited links fail to support their conclusions.

Google is disseminating misinformation on an unprecedented scale, and most people are completely unaware.

According to The New York Times, AI startup Oumi, commissioned by the publication, used the industry-standard test SimpleQA, developed by OpenAI, to evaluate the accuracy of Google's AI Overviews feature. The test covered 4,326 search queries, conducted in two rounds: one in October last year (powered by Gemini 2) and another in February this year (upgraded to Gemini 3). The results showed that Gemini 2's accuracy was about 85%, which improved to 91% with Gemini 3.

91% sounds good, but it's a different story when considering Google's massive scale. Google processes approximately 5 trillion search queries annually. With a 9% error rate, AI Overviews generates over 57 million inaccurate answers per hour, nearly 1 million per minute.
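The scale claim above can be sanity-checked with a few lines of Python. This is a back-of-the-envelope sketch under the article's stated assumptions (a flat 5 trillion searches per year, a 9% error rate, and every search producing an AI Overview); the result lands in the tens of millions of errors per hour, the same order of magnitude as the figure quoted in the article.

```python
# Order-of-magnitude check of the article's error-scale claim.
# Assumptions (from the article): ~5 trillion searches/year, 9% error rate.
ANNUAL_SEARCHES = 5_000_000_000_000  # ~5 trillion
ERROR_RATE = 0.09                    # 100% - 91% accuracy

errors_per_year = ANNUAL_SEARCHES * ERROR_RATE
errors_per_hour = errors_per_year / (365 * 24)
errors_per_minute = errors_per_hour / 60

print(f"{errors_per_hour:,.0f} errors per hour")
print(f"{errors_per_minute:,.0f} errors per minute")
```

Real traffic is not evenly distributed across the year, and not every search triggers an AI Overview, so this is an estimate of magnitude rather than a precise count.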

Correct Answers, Wrong Sources

More alarming than the accuracy rate is the issue of "unsubstantiated citations."

Oumi's data shows that in the Gemini 2 era, 37% of correct answers had the problem of "unsubstantiated citations," meaning the links attached to the AI summary did not support the information provided. After upgrading to Gemini 3, this proportion increased instead of decreasing, jumping to 56%. In other words, while the model gives correct answers, it is increasingly failing to "show its work."

Oumi CEO Manos Koukoumidis asked pointedly: "Even if the answer is correct, how do you know it's correct? How do you verify it?"

The heavy reliance on low-quality sources by AI Overviews exacerbates this problem. Oumi found that Facebook and Reddit are the second and fourth most cited sources for AI Overviews, respectively. In inaccurate answers, Facebook was cited 7% of the time, higher than the 5% rate in accurate answers.

BBC Journalist's Fake Article "Poisons" Results Within 24 Hours

Another serious flaw of AI Overviews is its susceptibility to manipulation.

A BBC journalist tested the system with a deliberately fabricated false article. In less than 24 hours, Google's AI Overview presented the false information from the article as fact to users.

This means anyone who understands how the system works could potentially "poison" AI search results by publishing false content and boosting its traffic. Google spokesperson Ned Adriance responded by stating that the search AI feature is built on the same ranking and security mechanisms used to block spam, and claimed that "most examples in the test are unrealistic queries that people wouldn't actually search for."

Google's Rebuttal: The Test Itself Is Flawed

Google raised several concerns about Oumi's study. A Google spokesperson called the research "seriously flawed," citing reasons including: the SimpleQA benchmark itself contains inaccurate information; Oumi used its own AI model, HallOumi, to judge another AI's performance, potentially introducing additional errors; and the test content does not reflect real user search behavior.

Google's internal tests also showed that when Gemini 3 operates independently outside the Google Search framework, it produces false outputs at a rate as high as 28%. However, Google emphasized that AI Overviews, leveraging the search ranking system, performs better in accuracy than the model alone.

Nevertheless, as PCMag pointed out, the defense contains a logical paradox: if your argument is that "the report pointing out our AI's inaccuracies is itself based on a potentially inaccurate AI," that is unlikely to bolster user confidence in your product's accuracy.

Related Questions

Q: What was the accuracy rate of Google's AI Overviews feature as tested by Oumi, and how many errors does this translate to per hour given Google's search volume?

A: The accuracy rate of Google's AI Overviews was found to be 91% in the test. Given Google's annual volume of 5 trillion searches, this 9% error rate translates to over 57 million inaccurate answers generated every hour.

Q: According to the Oumi study, what was the trend in "unsubstantiated citations" between the Gemini 2 and Gemini 3 versions of the AI Overviews?

A: The problem of "unsubstantiated citations" (where the provided links did not support the AI's answer) increased from 37% with Gemini 2 to 56% with the upgraded Gemini 3.

Q: Which low-quality websites were identified as major sources frequently cited by Google's AI Overviews?

A: Facebook and Reddit were identified as the second and fourth most frequently cited sources by the AI Overviews feature.

Q: How did a BBC journalist demonstrate the vulnerability of Google's AI Overviews to manipulation?

A: A BBC journalist tested the system by publishing a deliberately fabricated article. Within 24 hours, Google's AI Overviews began presenting the false information from that article as a factual answer to user queries.

Q: What were Google's main criticisms of the Oumi study's methodology?

A: Google criticized the study for having "serious flaws," stating that the SimpleQA benchmark itself contains inaccuracies, that using Oumi's own AI model to judge another AI could introduce errors, and that the test queries did not reflect real user search behavior.
