Tens of Millions of Errors Per Hour: Investigation Reveals the 'Accuracy Illusion' of Google AI Search

marsbit · Published 2026-04-10 · Updated 2026-04-10

Summary

A New York Times investigation, conducted in collaboration with AI startup Oumi, reveals significant accuracy and reliability issues with Google's AI Overviews search feature. Testing of over 4,300 queries showed the accuracy rate improved from 85% (Gemini 2) to 91% (Gemini 3). However, given Google's scale of roughly 5 trillion annual searches, the remaining 9% error rate translates to over 51 million incorrect answers generated every hour. A more critical issue is the prevalence of unsubstantiated citations: among correct answers, the rate of "unfounded citations" (cases where the provided source links do not support the AI's claims) worsened, rising from 37% with Gemini 2 to 56% with Gemini 3, which makes it difficult for users to verify the information. The AI also relies heavily on low-quality sources, with Facebook and Reddit being its second and fourth most cited domains. Furthermore, the system is highly susceptible to manipulation: a BBC journalist "poisoned" it by publishing a fake article, and Google's AI began presenting the false information as fact within 24 hours. Google disputed the study's methodology, criticizing the use of the SimpleQA benchmark and of an AI model (Oumi's HallOumi) to evaluate its own AI. The company maintains that its internal safeguards and ranking systems improve accuracy beyond the base model's performance.

Author: Claude, Deep Tide TechFlow

Deep Tide Introduction: The latest test by The New York Times in collaboration with AI startup Oumi shows that the accuracy rate of Google Search's AI Overviews feature is about 91%. However, given Google's scale of processing 5 trillion searches annually, this translates to tens of millions of incorrect answers generated every hour. More troublingly, even when the answers are correct, over half of the cited links fail to support their conclusions.

Google is delivering misinformation to users on an unprecedented scale, and most people are completely unaware.

According to The New York Times, AI startup Oumi, commissioned by the publication, used the industry-standard test SimpleQA developed by OpenAI to evaluate the accuracy of Google's AI Overviews feature. The test covered 4,326 search queries, conducting one round in October last year (powered by Gemini 2) and another in February this year (upgraded to Gemini 3). The results showed that Gemini 2's accuracy was about 85%, which improved to 91% with Gemini 3.

An accuracy rate of 91% sounds good, but it's a different story at Google's scale. Google processes approximately 5 trillion search queries annually; at a 9% error rate, AI Overviews generates over 51 million inaccurate answers per hour, nearly 1 million per minute.
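As a quick sanity check, the headline figures follow directly from those two numbers, assuming (as the article does) that the 9% error rate applies uniformly across all 5 trillion annual queries:

```python
# Back-of-the-envelope check of the figures quoted above.
annual_searches = 5_000_000_000_000  # ~5 trillion Google searches per year
error_rate = 0.09                    # ~9% of AI Overviews answers inaccurate (Gemini 3)

errors_per_year = annual_searches * error_rate          # 450 billion per year
errors_per_hour = errors_per_year / (365 * 24)          # ~51.4 million per hour
errors_per_minute = errors_per_year / (365 * 24 * 60)   # ~856,000 per minute

print(f"{errors_per_hour:,.0f} per hour, {errors_per_minute:,.0f} per minute")
```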

Correct Answers, Wrong Sources

More alarming than the accuracy rate itself is the issue of "unsupported" citations.

Oumi's data shows that in the Gemini 2 era, 37% of correct answers had "unsupported citations," meaning the links attached to the AI summaries did not support the information provided. After upgrading to Gemini 3, this proportion increased instead of decreasing, jumping to 56%. In other words, while the model gives correct answers, it's increasingly failing to "show its work."

Oumi CEO Manos Koukoumidis pointedly questioned: "Even if the answer is correct, how do you know it's correct? How do you verify it?"

The problem is exacerbated by AI Overviews' heavy reliance on low-quality sources. Oumi found that Facebook and Reddit are the second and fourth most cited sources for AI Overviews, respectively. In inaccurate answers, Facebook was cited 7% of the time, higher than the 5% in accurate answers.

BBC Journalist's Fake Article "Poisoned" Results Within 24 Hours

Another serious flaw of AI Overviews is its susceptibility to manipulation.

A BBC journalist tested the system with a deliberately fabricated false article. In less than 24 hours, Google's AI Overview presented the false information from the article as fact to users.

This means anyone who understands how the system works could potentially "poison" AI search results by publishing false content and boosting its traffic. Google spokesperson Ned Adriance responded by saying the search AI feature is built on the same ranking and security mechanisms that block spam, and claimed that "most examples in the test are unrealistic queries that people wouldn't actually search for."

Google's Rebuttal: The Test Itself Is Flawed

Google raised several objections to Oumi's research. A Google spokesperson called the study "seriously flawed," citing reasons including: the SimpleQA benchmark itself contains inaccurate information; Oumi used its own AI model HallOumi to judge another AI's performance, potentially introducing additional errors; and the test content doesn't reflect real user search behavior.

Google's internal tests also showed that when Gemini 3 operates independently outside the Google Search framework, it produces false outputs at a rate as high as 28%. But Google emphasized that AI Overviews leverages the search ranking system to improve accuracy, performing better than the model itself.

However, as PCMag's commentary pointed out, there is a logical paradox here: if your defense is that "the report pointing out our AI's inaccuracies itself uses potentially inaccurate AI," that probably doesn't enhance users' confidence in your product's accuracy.

Related Questions

Q: What is the accuracy rate of Google's AI Overviews feature according to the Oumi study?

A: The accuracy rate of Google's AI Overviews was found to be approximately 91% when powered by Gemini 3, an improvement from about 85% with Gemini 2.

Q: How many inaccurate answers does the article estimate Google's AI Overviews produces per hour?

A: Based on Google's annual volume of 5 trillion searches and a 9% error rate, the AI Overviews feature is estimated to produce over 51 million inaccurate answers per hour.

Q: What is the 'unsubstantiated citation' problem identified in the report?

A: The 'unsubstantiated citation' problem refers to instances where AI Overviews provides a correct answer but the attached source links do not actually support the information given. This issue increased from 37% with Gemini 2 to 56% with Gemini 3.

Q: Which low-quality websites are frequently used as sources by AI Overviews, according to the Oumi data?

A: According to Oumi's data, Facebook and Reddit are the second and fourth most cited sources by AI Overviews, with Facebook being cited more frequently in inaccurate answers.

Q: How did Google respond to the findings of the Oumi study?

A: Google criticized the study, calling it 'seriously flawed.' Its spokesperson argued that the SimpleQA benchmark itself contains inaccuracies, that using an AI (HallOumi) to judge another AI introduces errors, and that the test queries do not reflect real user search behavior.

Related Readings

The Essence of AI Layoffs: Why More AI Adoption Leads to More Corporate Anxiety?

The author, awaiting potential inclusion on an 8000-person layoff list, analyzes the true nature of recent "AI-driven" layoffs. They argue that while AI use, particularly tools like Claude for code generation, has skyrocketed and boosted developer output (e.g., 2-5x more code commits), this has not translated into proportional business growth or revenue. The core issue is a misalignment between increased "Input" (code) and tangible "Outcomes" (user value, revenue). AI acts as a costly B2B SaaS, inflating operational expenses without guaranteed returns. Two key problems emerge: 1) The friction that once filtered out bad ideas is gone, as AI allows cheap pursuit of even weak concepts. 2) Organizational "alignment tax"—the difficulty of coordinating across teams—becomes crippling when development velocity outpaces consensus-building. Thus, layoffs serve two immediate purposes: 1) To offset ballooning AI costs (Token consumption) and maintain cash flow, as rising input costs without outcome growth destroys unit economics. 2) To reduce organizational bloat and alignment friction by simply removing teams, thereby speeding up execution in the short term. Therefore, these layoffs are fundamentally caused by AI, even if AI doesn't directly replace roles. They represent a painful correction until companies learn to convert AI-driven productivity into real business outcomes and streamline organizational coordination to match the new pace of work. The cycle will continue until this learning curve is mastered.

marsbit · 33 minutes ago

Can the Solana Foundation and Google's Collaboration on Pay.sh Bridge the Payment Link Between Web2 and Web3 in the Agent Economy?

Solana Foundation, in collaboration with Google Cloud, has launched Pay.sh, a payment gateway designed to bridge the gap between AI agents and enterprise-grade service infrastructure. The initiative aims to solve a key bottleneck in the "agent economy": existing payment systems are ill-suited for autonomous AI agents. Traditional methods like credit cards require human verification, while newer on-chain protocols like x402 and MPP create a separate, Web3-native system that raises barriers for service providers. Pay.sh functions as a universal payment layer. It allows users to fund a Solana wallet via credit card or stablecoin, which then acts as an identity and payment proxy for AI agents. When an agent needs to access a paid API service (e.g., Google Cloud, Alibaba Cloud), Pay.sh handles the transaction seamlessly. It leverages the HTTP 402 status code ("Payment Required") to initiate payments, intelligently choosing between one-time transfers (x402-style) or session-based authorizations (MPC-style) based on the service's billing model. This spares agents from manual account registration and API key management. A key feature for service providers is low integration effort. They can adopt Pay.sh by providing a declarative configuration file, enabling features like tiered pricing, free tiers, and automatic revenue splitting to multiple addresses (e.g., for royalties, cloud costs). Providers can also list their APIs in a central Pay Skill Registry for agent discovery. The collaboration with Google Cloud provides crucial infrastructure for API proxying, traffic routing, and compliance logging, aiming to keep agent activities within regulated boundaries. By connecting Web2 services with Web3 payment rails, Pay.sh positions the Solana wallet as a foundational identity and payment tool for AI agents, potentially driving more transaction volume to the Solana ecosystem. However, the report notes challenges. The service registry currently lacks robust vetting, risking exposure to unauthorized or malicious third-party APIs. Pay.sh also inherits security and compatibility risks from its underlying payment protocols (x402, MPC). Furthermore, adoption may be hindered by varying regional data privacy and payment compliance regulations among API providers. Despite these hurdles, Pay.sh represents a significant step towards integrating Web2 and Web3 for autonomous agent commerce.
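To make the HTTP 402 flow described above concrete, here is a minimal, purely illustrative sketch of the general "request, get 402, pay, retry" pattern such a gateway could sit behind. The endpoint URL, header name, and pay_invoice helper are hypothetical placeholders, not Pay.sh's actual API.

```python
import requests

API_URL = "https://api.example.com/v1/report"  # hypothetical paid API endpoint


def pay_invoice(invoice: dict) -> str:
    """Placeholder: settle the quoted amount from the agent's funded wallet
    and return a payment proof. The real settlement step is not shown here."""
    raise NotImplementedError


def fetch_with_payment(url: str) -> requests.Response:
    resp = requests.get(url)            # first attempt, no payment attached
    if resp.status_code != 402:
        return resp                     # free, cached, or already authorized

    invoice = resp.json()               # 402 body describes what to pay (hypothetical schema)
    proof = pay_invoice(invoice)        # one-time transfer or session-based authorization
    return requests.get(url, headers={"X-Payment-Proof": proof})  # retry with proof
```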

marsbit · 39 minutes ago

Bitcoin's Bull-Bear Cycle Indicator Turns Positive for the First Time in 7 Months: End of Bear Market or False Breakout?

Bitcoin's "Bull-Bear Market Cycle Indicator" from CryptoQuant has turned positive for the first time since October 2025. This gauge, based on the P&L Index relative to its 365-day moving average, suggests a potential shift from a bear market phase. Concurrently, the Bull Score Index rose to a neutral reading of 50 in late April. The indicator's move into positive territory follows a roughly 35% price rebound from a low near $60,000 in February to above $81,000. The recovery over approximately three months was faster than the 12-month period observed during the 2022 bear market. However, analysts caution against premature optimism, citing a historical precedent from March 2022. Back then, the Bull Score Index briefly hit 50, but it proved to be a false signal as Bitcoin's price subsequently plunged further. Structural differences exist in the current cycle, including consistent inflows into spot Bitcoin ETFs and an increase in large holder addresses. Yet, some models, referencing the four-year halving cycle, suggest a potential deeper bottom near $50,000 might still be possible around late 2026. In summary, while on-chain data shows marked improvement and the worst panic may be over, market participants remain cautious. A convincing trend reversal confirmation likely requires Bitcoin to sustainably break above key resistance, such as the 200-day moving average near $82,000.

marsbit · 47 minutes ago

How to Automate Any Workflow with Claude Skills (Complete Tutorial)

This is a comprehensive guide to mastering Claude Skills, a feature for creating permanent, reusable instruction sets that automate specific workflows. Unlike simple saved prompts, Skills function like trained employees, delivering consistent, high-quality outputs by defining the entire task process, standards, error handling, and output format. The guide is structured in four phases: **Phase 1: Installation (5 minutes).** Skills are folders containing a `SKILL.md` file. The user is instructed to find a relevant Skill online, install it, test it on a real task, and compare its performance to one-off prompts. **Phase 2: Building Your First Custom Skill.** Start by rigorously defining the Skill's purpose, trigger phrases, and providing a concrete example of perfect output. The `SKILL.md` file has two parts: a YAML frontmatter with a specific name/description/triggers, and a detailed, step-by-step workflow written in natural language with examples and quality standards. **Phase 3: Testing & Optimization for Production.** Test the Skill in three scenarios: 1) a standard, common task; 2) edge cases with missing or conflicting data; and 3) a pressure test with maximum complexity. Any failure indicates a needed instruction. Implement a weekly optimization cycle to continuously refine the Skill based on real usage. **Phase 4: Building a Complete Skill Library.** The goal is to create a team of Skills for all repetitive tasks. Examples are given for industries like real estate, marketing, finance, consulting, and e-commerce. The user should list their tasks, prioritize them, and build one new Skill per week, maintaining a master document to track their library. The conclusion emphasizes the compounding time savings: ten Skills saving 30 minutes each per week reclaims over 260 hours (6.5 work weeks) per year, fundamentally transforming one's work system.

marsbit · 1 hour ago
