Tens of Millions of Errors Per Hour: Investigation Reveals the 'Accuracy Illusion' of Google AI Search

marsbit | Published on 2026-04-10 | Last updated on 2026-04-10

Abstract

A New York Times investigation, conducted with AI startup Oumi, reveals significant accuracy and reliability problems in Google's AI Overviews search feature. Across more than 4,300 test queries, accuracy improved from 85% (Gemini 2) to 91% (Gemini 3). However, given Google's scale of roughly 5 trillion searches a year, the remaining 9% error rate translates to over 57 million incorrect answers generated hourly. A more critical issue is the prevalence of unsupported citations, where the provided source links do not back up the AI's claims. Among correct answers, the share with unsupported citations rose from 37% with Gemini 2 to 56% with Gemini 3, which makes it difficult for users to verify the information. The AI also relies heavily on low-quality sources, with Facebook and Reddit being its second and fourth most cited domains. Furthermore, the system is highly susceptible to manipulation: a BBC journalist successfully "poisoned" it by publishing a fake article, and Google's AI began presenting the false information as fact within 24 hours. Google disputed the study's methodology, criticizing both the SimpleQA benchmark and the use of an AI model (Oumi's HallOumi) to grade another AI's output. The company maintains that its internal safeguards and ranking systems improve accuracy beyond the base model's performance.

Author: Claude, Deep Tide TechFlow

Deep Tide Introduction: The latest test by The New York Times, conducted with AI startup Oumi, shows that Google Search's AI Overviews feature is about 91% accurate. However, given that Google processes about 5 trillion searches annually, this translates to tens of millions of incorrect answers every hour. More troublingly, even when the answers are correct, more than half come with cited links that fail to support them.

Google is delivering misinformation to users on an unprecedented scale, and most people are completely unaware.

According to The New York Times, AI startup Oumi, commissioned by the publication, used the industry-standard test SimpleQA developed by OpenAI to evaluate the accuracy of Google's AI Overviews feature. The test covered 4,326 search queries, conducting one round in October last year (powered by Gemini 2) and another in February this year (upgraded to Gemini 3). The results showed that Gemini 2's accuracy was about 85%, which improved to 91% with Gemini 3.

91% sounds good, but it's a different story when considering Google's scale. Google processes approximately 5 trillion search queries annually. Calculating with a 9% error rate, AI Overviews generates over 57 million inaccurate answers per hour, nearly 1 million per minute.
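The scale argument can be checked with a short back-of-the-envelope calculation. The sketch below takes the two reported figures (about 5 trillion searches a year and a 9% error rate) and assumes, purely for illustration, that every search returns an AI Overview and that traffic is spread evenly over the year; neither assumption comes from the study. Under those assumptions the direct product is roughly 51 million wrong answers per hour, while the 57 million figure is closer to what a 10% error rate would yield, so the exact inputs behind the published estimate may differ slightly.

```python
# Back-of-the-envelope check of the hourly error estimate.
# Assumptions (not from the study): every search returns an AI Overview,
# and searches are spread evenly across the year.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def wrong_answers_per_hour(annual_searches: float, error_rate: float) -> float:
    """Expected number of inaccurate answers per hour at a given error rate."""
    return annual_searches * error_rate / HOURS_PER_YEAR


annual_searches = 5e12  # about 5 trillion searches per year, as reported
error_rate = 0.09       # 100% minus the 91% accuracy measured with Gemini 3

per_hour = wrong_answers_per_hour(annual_searches, error_rate)
per_minute = per_hour / 60

print(f"{per_hour / 1e6:.1f} million wrong answers per hour")       # 51.4
print(f"{per_minute / 1e3:.0f} thousand wrong answers per minute")  # 856

# For comparison: a 10% error rate gives about 57.1 million per hour.
print(f"{wrong_answers_per_hour(annual_searches, 0.10) / 1e6:.1f} million per hour at 10%")
```

Either way, the order of magnitude (tens of millions of wrong answers per hour) is unchanged.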

Correct Answers, Wrong Sources

More alarming than the accuracy rate is the problem of unsupported citations.

Oumi's data shows that in the Gemini 2 era, 37% of correct answers had "unsupported citations," meaning the links attached to the AI summaries did not support the information provided. After upgrading to Gemini 3, this proportion increased instead of decreasing, jumping to 56%. In other words, while the model gives correct answers, it's increasingly failing to "show its work."
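Combining the two reported rates shows why this matters. The sketch below multiplies each model's accuracy by the share of correct answers whose citations do support them, giving the fraction of all answers that are both correct and verifiable from the attached links. It assumes that "supported" is simply the complement of the reported unsupported rate and that both rates were measured on the same query set; the function name is illustrative, not part of Oumi's tooling.

```python
# Share of all answers that are both correct and backed by their cited links.
# Assumption: the "supported" share among correct answers is the complement
# of the reported unsupported-citation rate.


def correct_and_supported(accuracy: float, unsupported_among_correct: float) -> float:
    """Fraction of all answers that are correct AND have supporting citations."""
    return accuracy * (1.0 - unsupported_among_correct)


results = {
    "Gemini 2": correct_and_supported(0.85, 0.37),  # 0.85 * 0.63
    "Gemini 3": correct_and_supported(0.91, 0.56),  # 0.91 * 0.44
}

for model, share in results.items():
    print(f"{model}: about {share:.0%} of answers are correct and supported")
```

On these numbers the share of answers a user could actually verify from the citations falls from roughly 54% to roughly 40%, even though headline accuracy rose.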

Oumi CEO Manos Koukoumidis asked pointedly: "Even if the answer is correct, how do you know it's correct? How do you verify it?"

The problem is exacerbated by AI Overviews' heavy reliance on low-quality sources. Oumi found that Facebook and Reddit are the second and fourth most cited sources for AI Overviews, respectively. In inaccurate answers, Facebook was cited 7% of the time, higher than the 5% in accurate answers.

BBC Journalist's Fake Article "Poisoned" Results Within 24 Hours

Another serious flaw of AI Overviews is its susceptibility to manipulation.

A BBC journalist tested the system by publishing a deliberately fabricated article. In less than 24 hours, Google's AI Overviews was presenting the article's false information to users as fact.

This means anyone who understands how the system works could potentially "poison" AI search results by publishing false content and boosting its traffic. Google spokesperson Ned Adriance responded by saying the search AI feature is built on the same ranking and security mechanisms that block spam, and claimed that "most examples in the test are unrealistic queries that people wouldn't actually search for."

Google's Rebuttal: The Test Itself Is Flawed

Google raised several objections to Oumi's research. A Google spokesperson called the study "seriously flawed," citing reasons including: the SimpleQA benchmark itself contains inaccurate information; Oumi used its own AI model HallOumi to judge another AI's performance, potentially introducing additional errors; and the test content doesn't reflect real user search behavior.

Google's internal tests also showed that when Gemini 3 operates independently outside the Google Search framework, it produces false outputs at a rate as high as 28%. But Google emphasized that AI Overviews leverages the search ranking system to improve accuracy, performing better than the model itself.

However, PCMag's commentary pointed out the logical paradox: if your defense is that "the report pointing out our AI's inaccuracies itself uses potentially inaccurate AI," this probably doesn't enhance users' confidence in your product's accuracy.

Related Questions

Q: What is the accuracy rate of Google's AI Overviews feature according to the Oumi study?

A: The accuracy rate of Google's AI Overviews was found to be approximately 91% when powered by Gemini 3, an improvement from about 85% with Gemini 2.

Q: How many inaccurate answers does the article estimate Google's AI Overviews produces per hour?

A: Based on Google's annual volume of 5 trillion searches and a 9% error rate, the AI Overviews feature is estimated to produce over 57 million inaccurate answers per hour.

Q: What is the 'unsubstantiated citation' problem identified in the report?

A: The 'unsubstantiated citation' problem refers to instances where AI Overviews provides a correct answer, but the attached source links do not actually support the information given. This issue increased from 37% with Gemini 2 to 56% with Gemini 3.

Q: Which low-quality websites are frequently used as sources by AI Overviews, according to the Oumi data?

A: According to Oumi's data, Facebook and Reddit are the second and fourth most cited sources by AI Overviews, with Facebook being cited more frequently in inaccurate answers.

Q: How did Google respond to the findings of the Oumi study?

A: Google criticized the study, calling it 'seriously flawed.' Their spokesperson argued that the SimpleQA benchmark itself contains inaccuracies, that using an AI (HallOumi) to judge another AI introduces errors, and that the test queries do not reflect real user search behavior.

