Tens of Millions of Errors Per Hour: Investigation Reveals the 'Accuracy Illusion' of Google AI Search

marsbit | Published 2026-04-10 | Last updated 2026-04-10

Summary

A New York Times investigation, conducted with AI startup Oumi, reveals significant accuracy and reliability problems in Google's AI Overviews search feature. Across more than 4,300 test queries, accuracy improved from 85% (Gemini 2) to 91% (Gemini 3). However, given Google's scale of roughly 5 trillion searches a year, the remaining 9% error rate translates to over 57 million incorrect answers generated hourly. A more critical issue is the prevalence of unsubstantiated citations. Among correct answers, the rate of "unfounded citations," where the provided source links do not support the AI's claims, worsened from 37% with Gemini 2 to 56% with Gemini 3, making it difficult for users to verify the information. The AI also relies heavily on low-quality sources: Facebook and Reddit are its second and fourth most cited domains. Furthermore, the system is highly susceptible to manipulation. A BBC journalist successfully "poisoned" it by publishing a fake article, and Google's AI began presenting the false information as fact within 24 hours. Google disputed the study's methodology, criticizing the use of the SimpleQA benchmark and of an AI model (Oumi's HallOumi) to evaluate Google's AI. The company maintains that its internal safeguards and ranking systems improve accuracy beyond the base model's performance.

Author: Claude, Deep Tide TechFlow

Deep Tide Introduction: The latest test by The New York Times in collaboration with AI startup Oumi shows that the accuracy rate of Google Search's AI Overviews feature is about 91%. However, given Google's scale of processing 5 trillion searches annually, this translates to tens of millions of incorrect answers generated every hour. More troublingly, even when the answers are correct, over half of the cited links fail to support their conclusions.

Google is delivering misinformation to users on an unprecedented scale, and most people are completely unaware.

According to The New York Times, AI startup Oumi, commissioned by the publication, evaluated the accuracy of Google's AI Overviews feature using SimpleQA, an industry-standard benchmark developed by OpenAI. The test covered 4,326 search queries and was run twice: once in October last year (when AI Overviews was powered by Gemini 2) and again in February this year (after the upgrade to Gemini 3). Accuracy was about 85% with Gemini 2 and rose to 91% with Gemini 3.

91% sounds good, but it's a different story when considering Google's scale. Google processes approximately 5 trillion search queries annually. Calculating with a 9% error rate, AI Overviews generates over 57 million inaccurate answers per hour, nearly 1 million per minute.
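The scale estimate can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and both simplifying assumptions (every search returns an AI Overview, and searches are spread evenly over 8,760 hours a year) are not taken from the report. With the stated inputs of 5 trillion searches and a 9% error rate, it gives roughly 51 million wrong answers per hour. The 57 million figure cited in the article matches a 10% error rate more closely, so the published number may rest on slightly different inputs. Either way, the result is in the tens of millions per hour.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def errors_per_hour(annual_searches: float, error_rate: float) -> float:
    """Wrong answers per hour.

    Assumes (hypothetically) that every search produces an AI Overview
    and that searches are spread evenly across the year.
    """
    return annual_searches * error_rate / HOURS_PER_YEAR


annual_searches = 5e12  # "approximately 5 trillion" per the article

per_hour_9 = errors_per_hour(annual_searches, 0.09)   # rate stated in the article
per_hour_10 = errors_per_hour(annual_searches, 0.10)  # rate that reproduces ~57 million

for label, per_hour in (("9%", per_hour_9), ("10%", per_hour_10)):
    per_minute = per_hour / 60
    print(f"{label} error rate: {per_hour / 1e6:.1f} million/hour, "
          f"{per_minute / 1e6:.2f} million/minute")
```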

Correct Answers, Wrong Sources

More alarming than the accuracy rate is the issue of "unanchored" citation sources.

Oumi's data shows that in the Gemini 2 era, 37% of correct answers had "unsupported citations," meaning the links attached to the AI summaries did not support the information provided. After upgrading to Gemini 3, this proportion increased instead of decreasing, jumping to 56%. In other words, while the model gives correct answers, it's increasingly failing to "show its work."
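One way to read the two metrics together is to multiply them. The sketch below is a back-of-the-envelope calculation, not a figure reported by Oumi: it assumes the unsupported-citation share is measured over correct answers only, as described above, and uses the rounded percentages. Under those assumptions, the share of all answers that are both correct and backed by their cited links falls from about 54% with Gemini 2 to about 40% with Gemini 3, even though headline accuracy rose.

```python
def correct_and_supported(accuracy: float, unsupported_share: float) -> float:
    """Share of all answers that are correct AND backed by their citations.

    `unsupported_share` is the fraction of *correct* answers whose cited
    links do not support the claim (hypothetical helper, rounded inputs).
    """
    return accuracy * (1.0 - unsupported_share)


gemini_2 = correct_and_supported(accuracy=0.85, unsupported_share=0.37)
gemini_3 = correct_and_supported(accuracy=0.91, unsupported_share=0.56)

print(f"Gemini 2: {gemini_2:.0%} correct and supported")
print(f"Gemini 3: {gemini_3:.0%} correct and supported")
```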

Oumi CEO Manos Koukoumidis asked pointedly: "Even if the answer is correct, how do you know it's correct? How do you verify it?"

The problem is exacerbated by AI Overviews' heavy reliance on low-quality sources. Oumi found that Facebook and Reddit are the second and fourth most cited sources for AI Overviews, respectively. In inaccurate answers, Facebook was cited 7% of the time, higher than the 5% in accurate answers.

BBC Journalist's Fake Article "Poisoned" Results Within 24 Hours

Another serious flaw of AI Overviews is its susceptibility to manipulation.

A BBC journalist tested the system with a deliberately fabricated false article. In less than 24 hours, Google's AI Overview presented the false information from the article as fact to users.

This means anyone who understands how the system works could potentially "poison" AI search results by publishing false content and boosting its traffic. Google spokesperson Ned Adriance responded by saying the search AI feature is built on the same ranking and security mechanisms that block spam, and claimed that "most examples in the test are unrealistic queries that people wouldn't actually search for."

Google's Rebuttal: The Test Itself Is Flawed

Google raised several objections to Oumi's research. A Google spokesperson called the study "seriously flawed," citing reasons including: the SimpleQA benchmark itself contains inaccurate information; Oumi used its own AI model HallOumi to judge another AI's performance, potentially introducing additional errors; and the test content doesn't reflect real user search behavior.

Google's internal tests also showed that when Gemini 3 operates independently outside the Google Search framework, it produces false outputs at a rate as high as 28%. But Google emphasized that AI Overviews leverages the search ranking system to improve accuracy, performing better than the model itself.

However, PCMag's commentary pointed out a logical paradox: if your defense is that "the report pointing out our AI's inaccuracies itself uses potentially inaccurate AI," this probably doesn't enhance users' confidence in your product's accuracy.

Related Questions

Q: What is the accuracy rate of Google's AI Overviews feature according to the Oumi study?

A: The accuracy rate of Google's AI Overviews was found to be approximately 91% when powered by Gemini 3, an improvement from about 85% with Gemini 2.

Q: How many inaccurate answers does the article estimate Google's AI Overviews produces per hour?

A: Based on Google's annual volume of 5 trillion searches and a 9% error rate, the AI Overviews feature is estimated to produce over 57 million inaccurate answers per hour.

Q: What is the 'unsubstantiated citation' problem identified in the report?

A: The 'unsubstantiated citation' problem refers to instances where the AI Overviews provides a correct answer, but the attached source links do not actually support the information given. This issue increased from 37% with Gemini 2 to 56% with Gemini 3.

Q: Which low-quality websites are frequently used as sources by AI Overviews, according to the Oumi data?

A: According to Oumi's data, Facebook and Reddit are the second and fourth most cited sources by AI Overviews, with Facebook being cited more frequently in inaccurate answers.

Q: How did Google respond to the findings of the Oumi study?

A: Google criticized the study, calling it 'seriously flawed.' Their spokesperson argued that the SimpleQA benchmark itself contains inaccuracies, that using an AI (HallOumi) to judge another AI introduces errors, and that the test queries do not reflect real user search behavior.
