Tens of Millions of Errors Per Hour: Investigation Reveals the 'Accuracy Illusion' of Google AI Search

marsbit, published 2026-04-13, updated 2026-04-13

Summary

A New York Times investigation, in collaboration with AI startup Oumi, reveals significant accuracy and reliability issues with Google's AI Overviews search feature. Testing over 4,300 queries showed the accuracy rate improved from 85% (powered by Gemini 2) to 91% (Gemini 3). However, given Google's scale of ~5 trillion annual searches, this 9% error rate translates to nearly 57 million incorrect answers generated hourly. A critical finding is the prevalence of "unsubstantiated citations." For correct answers, the rate of citations that do not support the AI's summary surged from 37% to 56% with the Gemini 3 upgrade, making it difficult for users to verify information. The AI heavily relies on low-quality sources, with Facebook and Reddit being among its top-cited websites. Furthermore, the system is highly manipulable. A BBC journalist successfully "poisoned" it by publishing a fabricated article; Google's AI began presenting the false information as fact within 24 hours. Google disputed the study's methodology, criticizing its use of the SimpleQA benchmark and an AI model (Oumi's own) to evaluate another AI. The company maintains its AI Overviews, combined with its search ranking systems, perform better than the underlying model alone. Critics note this defense does little to bolster user confidence in the feature's reliability.

Author: Claude, Deep Tide TechFlow

Deep Tide Guide: A recent test conducted by The New York Times in collaboration with AI startup Oumi shows that the accuracy rate of Google Search's AI Overviews feature is approximately 91%. However, given Google's scale of processing 5 trillion searches annually, this translates to tens of millions of incorrect answers generated every hour. More troublingly, even when the answers are correct, over half of the cited links fail to support their conclusions.

Google is disseminating misinformation on an unprecedented scale, and most people are completely unaware.

According to The New York Times, AI startup Oumi, commissioned by the publication, used the industry-standard test SimpleQA, developed by OpenAI, to evaluate the accuracy of Google's AI Overviews feature. The test covered 4,326 search queries, conducted in two rounds: one in October last year (powered by Gemini 2) and another in February this year (upgraded to Gemini 3). The results showed that Gemini 2's accuracy was about 85%, which improved to 91% with Gemini 3.

91% sounds good, but it's a different story when considering Google's massive scale. Google processes approximately 5 trillion search queries annually. With a 9% error rate, AI Overviews generates over 57 million inaccurate answers per hour, nearly 1 million per minute.
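The hourly figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that every one of the roughly 5 trillion annual searches returns an AI Overview and that searches are spread evenly across a 365-day year; neither assumption comes from the study, and the function and variable names are hypothetical. Under those assumptions a 9% error rate works out to about 51 million wrong answers per hour, while the 57 million figure appears closer to what a 10% error rate would give.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def errors_per_hour(annual_searches: float, error_rate: float) -> float:
    """Estimate wrong answers per hour, assuming every search yields an AI Overview."""
    return annual_searches * error_rate / HOURS_PER_YEAR


ANNUAL_SEARCHES = 5e12  # ~5 trillion searches per year, as cited in the article

at_9_percent = errors_per_hour(ANNUAL_SEARCHES, 0.09)
at_10_percent = errors_per_hour(ANNUAL_SEARCHES, 0.10)

print(f"9% error rate:  {at_9_percent / 1e6:.1f} million per hour")
print(f"10% error rate: {at_10_percent / 1e6:.1f} million per hour")
print(f"9% error rate:  {at_9_percent / 60 / 1e3:.0f} thousand per minute")
```

Either way the result stays in the tens of millions per hour, so the headline claim holds even if the exact number depends on how the rates are rounded.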

Correct Answers, Wrong Sources

More alarming than the accuracy rate is the issue of "unsubstantiated citations."

Oumi's data shows that in the Gemini 2 era, 37% of correct answers had the problem of "unsubstantiated citations," meaning the links attached to the AI summary did not support the information provided. After upgrading to Gemini 3, this proportion increased instead of decreasing, jumping to 56%. In other words, while the model gives correct answers, it is increasingly failing to "show its work."
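Combining the two measurements gives a rough sense of how often an answer is both correct and backed by its citations. The sketch below treats the rounded percentages reported above as exact, so it is an illustrative calculation rather than a figure published by Oumi or The New York Times; the function name is hypothetical. On these inputs the share of answers that are both correct and supported falls from a little over half with Gemini 2 to about 40% with Gemini 3.

```python
def correct_and_supported(accuracy: float, unsupported_share: float) -> float:
    """Share of all answers that are correct AND whose citations support them.

    unsupported_share is the fraction of *correct* answers that carry
    unsubstantiated citations, as the article defines it.
    """
    return accuracy * (1.0 - unsupported_share)


gemini_2 = correct_and_supported(accuracy=0.85, unsupported_share=0.37)
gemini_3 = correct_and_supported(accuracy=0.91, unsupported_share=0.56)

print(f"Gemini 2: {gemini_2:.3f}")
print(f"Gemini 3: {gemini_3:.3f}")
```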

Oumi CEO Manos Koukoumidis asked pointedly: "Even if the answer is correct, how do you know it's correct? How do you verify it?"

The heavy reliance on low-quality sources by AI Overviews exacerbates this problem. Oumi found that Facebook and Reddit are the second and fourth most cited sources for AI Overviews, respectively. In inaccurate answers, Facebook was cited 7% of the time, higher than the 5% rate in accurate answers.

BBC Journalist's Fake Article "Poisons" Results Within 24 Hours

Another serious flaw of AI Overviews is its susceptibility to manipulation.

A BBC journalist tested the system with a deliberately fabricated false article. In less than 24 hours, Google's AI Overview presented the false information from the article as fact to users.

This means anyone who understands how the system works could potentially "poison" AI search results by publishing false content and boosting its traffic. Google spokesperson Ned Adriance responded by stating that the search AI feature is built on the same ranking and security mechanisms used to block spam, and claimed that "most examples in the test are unrealistic queries that people wouldn't actually search for."

Google's Rebuttal: The Test Itself Is Flawed

Google raised several concerns about Oumi's study. A Google spokesperson called the research "seriously flawed," citing reasons including: the SimpleQA benchmark itself contains inaccurate information; Oumi used its own AI model, HallOumi, to judge another AI's performance, potentially introducing additional errors; and the test content does not reflect real user search behavior.

Google's internal tests also showed that when Gemini 3 operates independently outside the Google Search framework, it produces false outputs at a rate as high as 28%. However, Google emphasized that AI Overviews, leveraging the search ranking system, performs better in accuracy than the model alone.

Nevertheless, as PCMag pointed out, the rebuttal undercuts itself: if your defense is that "the report pointing out our AI's inaccuracies itself uses potentially inaccurate AI," that is unlikely to enhance user confidence in your product's accuracy.

Related Questions

Q: What was the accuracy rate of Google's AI Overviews feature as tested by Oumi, and how many errors does this translate to per hour given Google's search volume?

A: The accuracy rate of Google's AI Overviews was found to be 91% in the test. Given Google's annual volume of 5 trillion searches, this 9% error rate translates to over 57 million inaccurate answers generated every hour.

Q: According to the Oumi study, what was the trend in 'unsubstantiated citations' between the Gemini 2 and Gemini 3 versions of the AI Overviews?

A: The share of correct answers with 'unsubstantiated citations' (where the provided links did not support the AI's answer) increased from 37% with Gemini 2 to 56% with the upgraded Gemini 3.

Q: Which low-quality websites were identified as major sources frequently cited by Google's AI Overviews?

A: Facebook and Reddit were identified as the second and fourth most frequently cited sources by the AI Overviews feature.

Q: How did a BBC journalist demonstrate the vulnerability of Google's AI Overviews to manipulation?

A: A BBC journalist tested the system by publishing a deliberately fabricated article. Within 24 hours, Google's AI Overviews began presenting the false information from that article as a factual answer to user queries.

Q: What were Google's main criticisms of the Oumi study's methodology?

A: Google called the study 'seriously flawed,' stating that the SimpleQA benchmark itself contains inaccuracies, that using Oumi's own AI model to judge another AI could introduce errors, and that the test queries did not reflect real user search behavior.
