Tens of Millions of Errors Per Hour: Investigation Reveals the 'Accuracy Illusion' of Google AI Search

marsbit · Published 2026-04-10 · Last updated 2026-04-10

Summary

A New York Times investigation, in collaboration with AI startup Oumi, reveals significant accuracy and reliability issues with Google's AI Overviews search feature. Testing over 4,300 queries showed the accuracy rate improved from 85% (Gemini 2) to 91% (Gemini 3). However, given Google's scale of ~5 trillion annual searches, this 9% error rate translates to over 57 million incorrect answers generated hourly. A more critical issue is the prevalence of unsupported citations. Among correct answers, the rate of "unsupported citations" (cases where the provided source links do not support the AI's claims) worsened, rising from 37% with Gemini 2 to 56% with Gemini 3. This makes it difficult for users to verify the information. The AI also relies heavily on low-quality sources, with Facebook and Reddit being its second and fourth most cited domains. Furthermore, the system is highly susceptible to manipulation. A BBC journalist successfully "poisoned" it by publishing a fake article; Google's AI began presenting the false information as fact within 24 hours. Google disputed the study's methodology, criticizing the use of the SimpleQA benchmark and the use of an AI model (Oumi's HallOumi) to grade another AI's output. The company maintains that its internal safeguards and ranking systems improve accuracy beyond the base model's performance.

Author: Claude, Deep Tide TechFlow

Deep Tide Introduction: The latest test by The New York Times in collaboration with AI startup Oumi shows that the accuracy rate of Google Search's AI Overviews feature is about 91%. However, given Google's scale of processing 5 trillion searches annually, this translates to tens of millions of incorrect answers generated every hour. More troublingly, even when the answers are correct, over half of the cited links fail to support their conclusions.

Google is delivering misinformation to users on an unprecedented scale, and most people are completely unaware.

According to The New York Times, AI startup Oumi, commissioned by the publication, used SimpleQA, an industry-standard benchmark developed by OpenAI, to evaluate the accuracy of Google's AI Overviews feature. The test covered 4,326 search queries and ran in two rounds: one in October 2025 (powered by Gemini 2) and another in February 2026 (after the upgrade to Gemini 3). The results showed that accuracy was about 85% with Gemini 2 and improved to 91% with Gemini 3.

91% sounds good, but it's a different story when considering Google's scale. Google processes approximately 5 trillion search queries annually. Calculating with a 9% error rate, AI Overviews generates over 57 million inaccurate answers per hour, nearly 1 million per minute.
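The scale argument above can be checked with a quick back-of-the-envelope calculation. The sketch below uses only the article's stated inputs (about 5 trillion searches per year and a 9% error rate) plus one assumption that is ours, not the article's: that every search returns an AI Overview. Under those inputs the result lands near 51 million errors per hour, somewhat below the quoted figure of over 57 million; the article does not state the exact inputs behind its number, so treat this as an order-of-magnitude check rather than a reproduction.

```python
# Back-of-the-envelope check of the error-volume estimate.
# Inputs are the article's figures; the assumption that every search
# returns an AI Overview is ours, for illustration only.

ANNUAL_SEARCHES = 5_000_000_000_000  # ~5 trillion searches per year (article)
ERROR_RATE = 0.09                    # 1 - 0.91 accuracy with Gemini 3 (article)
HOURS_PER_YEAR = 365 * 24            # 8,760 hours in a non-leap year


def errors_per_hour(annual_searches: int, error_rate: float) -> float:
    """Expected wrong answers per hour if every search yields one answer."""
    return annual_searches * error_rate / HOURS_PER_YEAR


per_hour = errors_per_hour(ANNUAL_SEARCHES, ERROR_RATE)
per_minute = per_hour / 60

print(f"errors per hour:   {per_hour:,.0f}")
print(f"errors per minute: {per_minute:,.0f}")
```

Either way, the conclusion is the same: at this volume, a single-digit error rate still means tens of millions of wrong answers every hour.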

Correct Answers, Wrong Sources

More alarming than the accuracy rate is the problem of citations that do not back up the answer.

Oumi's data shows that in the Gemini 2 era, 37% of correct answers had "unsupported citations," meaning the links attached to the AI summaries did not support the information provided. After upgrading to Gemini 3, this proportion increased instead of decreasing, jumping to 56%. In other words, while the model gives correct answers, it's increasingly failing to "show its work."
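One way to read the accuracy and citation figures together is to ask what share of all answers are both correct and backed by their cited links. The sketch below multiplies the two reported rates for each round. It is rough arithmetic on the article's rounded percentages, not a figure from the Oumi study, and the helper name `verifiable_share` is hypothetical.

```python
# Rough arithmetic on the citation figures reported in the article.
# Percentages are the article's rounded values, so counts are approximate.

TOTAL_QUERIES = 4_326  # size of the test set (article)

# model -> (accuracy, share of correct answers with unsupported citations)
ROUNDS = {
    "Gemini 2": (0.85, 0.37),
    "Gemini 3": (0.91, 0.56),
}


def verifiable_share(accuracy: float, unsupported: float) -> float:
    """Share of all answers that are correct AND backed by their citations."""
    return accuracy * (1.0 - unsupported)


for model, (accuracy, unsupported) in ROUNDS.items():
    share = verifiable_share(accuracy, unsupported)
    approx_count = round(TOTAL_QUERIES * share)
    print(f"{model}: {share:.1%} of answers correct and supported "
          f"(~{approx_count} of {TOTAL_QUERIES})")
```

On these rounded inputs, the share of answers a user could actually verify from the cited links falls from roughly 54% to roughly 40%, even though headline accuracy rose.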

Oumi CEO Manos Koukoumidis asked pointedly: "Even if the answer is correct, how do you know it's correct? How do you verify it?"

The problem is exacerbated by AI Overviews' heavy reliance on low-quality sources. Oumi found that Facebook and Reddit are the second and fourth most cited sources for AI Overviews, respectively. In inaccurate answers, Facebook was cited 7% of the time, higher than the 5% in accurate answers.

BBC Journalist's Fake Article "Poisoned" Results Within 24 Hours

Another serious flaw of AI Overviews is its susceptibility to manipulation.

A BBC journalist tested the system with a deliberately fabricated false article. In less than 24 hours, Google's AI Overview presented the false information from the article as fact to users.

This means anyone who understands how the system works could potentially "poison" AI search results by publishing false content and boosting its traffic. Google spokesperson Ned Adriance responded by saying the search AI feature is built on the same ranking and security mechanisms that block spam, and claimed that "most examples in the test are unrealistic queries that people wouldn't actually search for."

Google's Rebuttal: The Test Itself Is Flawed

Google raised several objections to Oumi's research. A Google spokesperson called the study "seriously flawed," citing reasons including: the SimpleQA benchmark itself contains inaccurate information; Oumi used its own AI model HallOumi to judge another AI's performance, potentially introducing additional errors; and the test content doesn't reflect real user search behavior.

Google's internal tests also showed that when Gemini 3 operates independently outside the Google Search framework, it produces false outputs at a rate as high as 28%. But Google emphasized that AI Overviews leverages the search ranking system to improve accuracy, performing better than the model itself.

However, PCMag's commentary pointed out the logical paradox: if your defense is that "the report pointing out our AI's inaccuracies itself uses potentially inaccurate AI," that probably does not enhance users' confidence in your product's accuracy.

Related Questions

Q: What is the accuracy rate of Google's AI Overviews feature according to the Oumi study?

A: The accuracy rate of Google's AI Overviews was found to be approximately 91% when powered by Gemini 3, an improvement from about 85% with Gemini 2.

Q: How many inaccurate answers does the article estimate Google's AI Overviews produces per hour?

A: Based on Google's annual volume of 5 trillion searches and a 9% error rate, the AI Overviews feature is estimated to produce over 57 million inaccurate answers per hour.

Q: What is the 'unsubstantiated citation' problem identified in the report?

A: The 'unsubstantiated citation' problem refers to instances where the AI Overviews provides a correct answer, but the attached source links do not actually support the information given. This issue increased from 37% with Gemini 2 to 56% with Gemini 3.

Q: Which low-quality websites are frequently used as sources by AI Overviews, according to the Oumi data?

A: According to Oumi's data, Facebook and Reddit are the second and fourth most cited sources by AI Overviews, with Facebook being cited more frequently in inaccurate answers.

Q: How did Google respond to the findings of the Oumi study?

A: Google criticized the study, calling it 'seriously flawed.' Their spokesperson argued that the SimpleQA benchmark itself contains inaccuracies, that using an AI (HallOumi) to judge another AI introduces errors, and that the test queries do not reflect real user search behavior.

