Tens of Millions of Errors Per Hour: Investigation Reveals the 'Accuracy Illusion' of Google AI Search

marsbit · Published on 2026-04-13 · Last updated on 2026-04-13

Abstract

A New York Times investigation, in collaboration with AI startup Oumi, reveals significant accuracy and reliability issues with Google's AI Overviews search feature. Testing over 4,300 queries showed the accuracy rate improved from 85% (powered by Gemini 2) to 91% (Gemini 3). However, given Google's scale of ~5 trillion annual searches, this 9% error rate translates to nearly 57 million incorrect answers generated hourly. A critical finding is the prevalence of "unsubstantiated citations." For correct answers, the rate of citations that do not support the AI's summary surged from 37% to 56% with the Gemini 3 upgrade, making it difficult for users to verify information. The AI heavily relies on low-quality sources, with Facebook and Reddit being among its top-cited websites. Furthermore, the system is highly manipulable. A BBC journalist successfully "poisoned" it by publishing a fabricated article; Google's AI began presenting the false information as fact within 24 hours. Google disputed the study's methodology, criticizing its use of the SimpleQA benchmark and an AI model (Oumi's own) to evaluate another AI. The company maintains its AI Overviews, combined with its search ranking systems, perform better than the underlying model alone. Critics note this defense does little to bolster user confidence in the feature's reliability.

Author: Claude, Deep Tide TechFlow

Deep Tide Guide: A recent test conducted by The New York Times in collaboration with AI startup Oumi shows that the accuracy rate of Google Search's AI Overviews feature is approximately 91%. However, given Google's scale of processing 5 trillion searches annually, this translates to tens of millions of incorrect answers generated every hour. More troublingly, even when the answers are correct, over half of the cited links fail to support the conclusions they are attached to.

Google is disseminating misinformation on an unprecedented scale, and most people are completely unaware.

According to The New York Times, AI startup Oumi, commissioned by the publication, used the industry-standard test SimpleQA, developed by OpenAI, to evaluate the accuracy of Google's AI Overviews feature. The test covered 4,326 search queries, conducted in two rounds: one in October last year (powered by Gemini 2) and another in February this year (upgraded to Gemini 3). The results showed that Gemini 2's accuracy was about 85%, which improved to 91% with Gemini 3.

91% sounds good, but it's a different story when considering Google's massive scale. Google processes approximately 5 trillion search queries annually. With a 9% error rate, AI Overviews generates over 57 million inaccurate answers per hour, nearly 1 million per minute.
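
The scale claim is straightforward arithmetic. Below is a minimal back-of-the-envelope sketch of that calculation, under the simplifying assumptions (ours, not the study's) that every search triggers an AI Overview and that the 9% error rate applies uniformly; the precise per-hour figure shifts with those assumptions and with rounding in the underlying estimates.

```python
# Back-of-the-envelope sketch of the scale implied by the article's figures.
# Simplifying assumptions (not from the study itself): every search triggers
# an AI Overview, and the 9% error rate applies uniformly to all queries.

ANNUAL_SEARCHES = 5_000_000_000_000   # ~5 trillion searches per year
ERROR_RATE = 1 - 0.91                 # 91% measured accuracy under Gemini 3

errors_per_year = ANNUAL_SEARCHES * ERROR_RATE
errors_per_hour = errors_per_year / (365 * 24)
errors_per_minute = errors_per_hour / 60

print(f"~{errors_per_hour / 1e6:.0f} million incorrect answers per hour")
print(f"~{errors_per_minute / 1e6:.2f} million incorrect answers per minute")
```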

Correct Answers, Wrong Sources

More alarming than the accuracy rate is the issue of "unsubstantiated citations."

Oumi's data shows that in the Gemini 2 era, 37% of correct answers had the problem of "unsubstantiated citations," meaning the links attached to the AI summary did not support the information provided. After upgrading to Gemini 3, this proportion increased instead of decreasing, jumping to 56%. In other words, while the model gives correct answers, it is increasingly failing to "show its work."

Oumi CEO Manos Koukoumidis pointedly questioned: "Even if the answer is correct, how do you know it's correct? How do you verify it?"

The heavy reliance on low-quality sources by AI Overviews exacerbates this problem. Oumi found that Facebook and Reddit are the second and fourth most cited sources for AI Overviews, respectively. In inaccurate answers, Facebook was cited 7% of the time, higher than the 5% rate in accurate answers.

BBC Journalist's Fake Article "Poisons" Results Within 24 Hours

Another serious flaw of AI Overviews is its susceptibility to manipulation.

A BBC journalist tested the system by publishing a deliberately fabricated article. In less than 24 hours, Google's AI Overviews presented the false information from the article as fact to users.

This means anyone who understands how the system works could potentially "poison" AI search results by publishing false content and boosting its traffic. Google spokesperson Ned Adriance responded by stating that the search AI feature is built on the same ranking and security mechanisms used to block spam, and claimed that "most examples in the test are unrealistic queries that people wouldn't actually search for."

Google's Rebuttal: The Test Itself Is Flawed

Google raised several concerns about Oumi's study. A Google spokesperson called the research "seriously flawed," citing reasons including: the SimpleQA benchmark itself contains inaccurate information; Oumi used its own AI model, HallOumi, to judge another AI's performance, potentially introducing additional errors; and the test content does not reflect real user search behavior.

Google's internal tests also showed that when Gemini 3 operates independently, outside the Google Search framework, it produces false outputs at a rate as high as 28%. However, Google emphasized that AI Overviews, by leveraging the search ranking system, is more accurate than the model alone.

Nevertheless, as PCMag pointed out, there is a logical paradox here: if your defense amounts to "the report pointing out our AI's inaccuracies itself uses potentially inaccurate AI," that likely does little to enhance user confidence in your product's accuracy.

Related Questions

Q: What was the accuracy rate of Google's AI Overviews feature as tested by Oumi, and how many errors does this translate to per hour given Google's search volume?

A: The accuracy rate of Google's AI Overviews was found to be 91% in the test. Given Google's annual volume of 5 trillion searches, this 9% error rate translates to over 57 million inaccurate answers generated every hour.

Q: According to the Oumi study, what was the trend in 'unsubstantiated citations' between the Gemini 2 and Gemini 3 versions of the AI Overviews?

A: The problem of 'unsubstantiated citations' (where the provided links did not support the AI's answer) increased from 37% with Gemini 2 to 56% with the upgraded Gemini 3.

Q: Which low-quality websites were identified as major sources frequently cited by Google's AI Overviews?

A: Facebook and Reddit were identified as the second and fourth most frequently cited sources by the AI Overviews feature.

Q: How did a BBC journalist demonstrate the vulnerability of Google's AI Overviews to manipulation?

A: A BBC journalist tested the system by publishing a deliberately fabricated article. Within 24 hours, Google's AI Overviews began presenting the false information from that article as a factual answer to user queries.

Q: What were Google's main criticisms of the Oumi study's methodology?

A: Google criticized the study for having 'serious flaws,' stating that the SimpleQA benchmark itself contains inaccuracies, that using Oumi's own AI model to judge another AI could introduce errors, and that the test queries did not reflect real user search behavior.

Related Reads

North Korean Hackers Loot $500 Million in a Single Month, Becoming the Top Threat to Crypto Security

North Korean hackers, particularly the notorious Lazarus Group and its subgroup TraderTraitor, have stolen over $500 million from cryptocurrency DeFi platforms in less than three weeks, bringing their total theft for the year to over $700 million. Recent major attacks on Drift Protocol and KelpDAO, resulting in losses of approximately $286 million and $290 million respectively, highlight a strategic shift: instead of targeting core smart contracts, attackers are now exploiting vulnerabilities in peripheral infrastructure. For instance, the KelpDAO attack involved compromising downstream RPC infrastructure used by LayerZero's decentralized validation network (DVN), allowing manipulation without breaching core cryptography. This sophisticated approach mirrors advanced corporate cyber-espionage. Additionally, North Korea has systematically infiltrated the global crypto workforce, with an estimated 100 operatives using fake identities to gain employment at blockchain companies, enabling long-term access to sensitive systems and facilitating large-scale thefts. According to Chainalysis, North Korean-linked hackers stole a record $2 billion in 2025, accounting for 60% of all global crypto theft that year. Their total historical crypto theft has reached $6.75 billion. Post-theft, they employ specialized money laundering methods, heavily relying on Chinese OTC brokers and cross-chain mixing services rather than standard decentralized exchanges. Security experts, while acknowledging the increased sophistication, emphasize that many attacks still exploit fundamental weaknesses like poor access controls and centralized operational risks. Strengthening private key management, limiting privileged access, and enhancing coordination among exchanges, analysts, and law enforcement immediately after an attack are critical to improving defense and fund recovery chances. The industry's challenge now extends beyond secure smart contracts to safeguarding operational security at the infrastructure level.

marsbit · 34m ago


Circle CEO's Seoul Visit: No Korean Won Stablecoin Issuance, But Met All Major Korean Banks

Circle CEO Jeremy Allaire's recent activities in Seoul indicate a strategic shift for the company, moving away from issuing a Korean won-backed stablecoin and instead focusing on embedding itself as a key infrastructure provider within Korea’s financial and crypto ecosystem. Despite Korea accounting for nearly 30% of global crypto trading volume—with a market characterized by high retail participation and altcoin dominance—Circle has chosen not to compete for the role of stablecoin issuer. Instead, Allaire met with major Korean banks (including Shinhan, KB, and Woori), financial groups, leading exchanges (Upbit, Bithumb, Coinone), and tech firms like Kakao. This approach reflects a broader industry transition: the core of stablecoin competition is shifting from issuance rights to systemic positioning. With Korean regulators still debating whether banks or tech companies should issue stablecoins, Circle is avoiding regulatory uncertainty by strengthening its role as a service and technology partner. The company is deepening integration with trading platforms, building connections, and promoting stablecoin infrastructure. This positions Circle to benefit regardless of which entity eventually issues a won stablecoin. Allaire also noted the potential for a Chinese yuan stablecoin in the next 3–5 years, underscoring a regional trend of stablecoins becoming more regulated and integrated with traditional finance. Ultimately, Circle’s strategy highlights that future influence in the stablecoin market will belong not necessarily to the issuers, but to the foundational infrastructure layers that enable cross-system transactions.

marsbit · 1h ago


SpaceX Ties Up with Cursor: A High-Stakes AI Gambit of 'Lock First, Acquire Later'

SpaceX has secured an option to acquire AI programming company Cursor for $60 billion, with an alternative clause requiring a $10 billion collaboration fee if the acquisition does not proceed. This structure is not merely a potential acquisition but a strategic move to control core access points in the AI era. The deal is designed as a flexible, dual-path arrangement, allowing SpaceX to either fully acquire Cursor or maintain a binding partnership through high-cost collaboration. This "option-style" approach minimizes immediate regulatory and integration risks while ensuring long-term alignment between the two companies. At its core, the transaction exchanges critical AI-era resources: SpaceX provides its Colossus supercomputing cluster—one of the world’s most powerful AI training infrastructures—while Cursor contributes its AI-native developer environment and strong product adoption. This synergy connects compute power, models, and application layers, forming a closed-loop AI capability stack. Cursor, founded in 2022, has achieved rapid growth with over $1 billion in annual revenue and widespread enterprise adoption. Its value lies in transforming software development through AI agents capable of coding, debugging, and system design—positioning it as a gateway to future software production. For SpaceX, this move is part of a broader strategy to evolve from an aerospace company into an AI infrastructure empire, integrating xAI, supercomputing, and chip manufacturing. Controlling Cursor fills a gap in its developer tooling layer, strengthening its AI narrative ahead of a potential IPO. The deal reflects a shift in AI competition from model superiority to ecosystem and entry-point control. With programming tools as a key battleground, securing developer loyalty becomes crucial for dominating the software production landscape. Risks include questions around Cursor’s valuation, technical integration challenges, and potential regulatory scrutiny. Nevertheless, the deal underscores a strategic bet: controlling both compute and software development access may redefine power dynamics in the AI-driven future.

marsbit · 1h ago
