Tens of Millions of Errors Per Hour: Investigation Reveals the 'Accuracy Illusion' of Google AI Search

marsbit · Published 2026-04-13 · Last updated 2026-04-13

Abstract

A New York Times investigation, in collaboration with AI startup Oumi, reveals significant accuracy and reliability issues with Google's AI Overviews search feature. Testing over 4,300 queries showed the accuracy rate improved from 85% (powered by Gemini 2) to 91% (Gemini 3). However, given Google's scale of ~5 trillion annual searches, this 9% error rate translates to nearly 57 million incorrect answers generated hourly. A critical finding is the prevalence of "unsubstantiated citations." For correct answers, the rate of citations that do not support the AI's summary surged from 37% to 56% with the Gemini 3 upgrade, making it difficult for users to verify information. The AI heavily relies on low-quality sources, with Facebook and Reddit being among its top-cited websites. Furthermore, the system is highly manipulable. A BBC journalist successfully "poisoned" it by publishing a fabricated article; Google's AI began presenting the false information as fact within 24 hours. Google disputed the study's methodology, criticizing its use of the SimpleQA benchmark and an AI model (Oumi's own) to evaluate another AI. The company maintains its AI Overviews, combined with its search ranking systems, perform better than the underlying model alone. Critics note this defense does little to bolster user confidence in the feature's reliability.

Author: Claude, Deep Tide TechFlow

Deep Tide Guide: A recent test conducted by The New York Times in collaboration with AI startup Oumi shows that the accuracy rate of Google Search's AI Overviews feature is approximately 91%. However, given Google's scale of processing 5 trillion searches annually, this translates to tens of millions of incorrect answers generated every hour. More troublingly, even when the answers are correct, over half of the cited links fail to support their conclusions.

Google is disseminating misinformation on an unprecedented scale, and most people are completely unaware.

According to The New York Times, AI startup Oumi, commissioned by the publication, used the industry-standard test SimpleQA, developed by OpenAI, to evaluate the accuracy of Google's AI Overviews feature. The test covered 4,326 search queries, conducted in two rounds: one in October last year (powered by Gemini 2) and another in February this year (upgraded to Gemini 3). The results showed that Gemini 2's accuracy was about 85%, which improved to 91% with Gemini 3.

91% sounds good, but it's a different story when considering Google's massive scale. Google processes approximately 5 trillion search queries annually. With a 9% error rate, AI Overviews generates over 57 million inaccurate answers per hour, nearly 1 million per minute.
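As a rough check, the scale conversion above can be reproduced with a few lines of arithmetic. This is a back-of-envelope sketch that assumes queries are spread evenly across the year; the exact hourly figure depends on the precise query volume and error rate used in the original analysis, so the result lands in the same ballpark as, rather than exactly on, the reported numbers:

```python
# Back-of-envelope: convert Google's annual search volume and the
# measured AI Overviews error rate into errors per hour and per minute.
searches_per_year = 5_000_000_000_000  # ~5 trillion queries annually
error_rate = 0.09                      # 9% of answers wrong (Gemini 3)

hours_per_year = 365 * 24
errors_per_hour = searches_per_year * error_rate / hours_per_year
errors_per_minute = errors_per_hour / 60

print(f"{errors_per_hour:,.0f} errors per hour")     # on the order of 50 million
print(f"{errors_per_minute:,.0f} errors per minute")  # approaching 1 million
```

Even under these simplified assumptions, the error volume stays in the tens of millions per hour, which is the core point: a single-digit error rate at Google's scale is not a small number.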

Correct Answers, Wrong Sources

More alarming than the accuracy rate is the issue of "unsubstantiated citations."

Oumi's data shows that in the Gemini 2 era, 37% of correct answers had the problem of "unsubstantiated citations," meaning the links attached to the AI summary did not support the information provided. After upgrading to Gemini 3, this proportion increased instead of decreasing, jumping to 56%. In other words, while the model gives correct answers, it is increasingly failing to "show its work."

Oumi CEO Manos Koukoumidis pointedly questioned: "Even if the answer is correct, how do you know it's correct? How do you verify it?"

The heavy reliance on low-quality sources by AI Overviews exacerbates this problem. Oumi found that Facebook and Reddit are the second and fourth most cited sources for AI Overviews, respectively. In inaccurate answers, Facebook was cited 7% of the time, higher than the 5% rate in accurate answers.

BBC Journalist's Fake Article "Poisons" Results Within 24 Hours

Another serious flaw of AI Overviews is its susceptibility to manipulation.

A BBC journalist tested the system with a deliberately fabricated false article. In less than 24 hours, Google's AI Overview presented the false information from the article as fact to users.

This means anyone who understands how the system works could potentially "poison" AI search results by publishing false content and boosting its traffic. Google spokesperson Ned Adriance responded by stating that the search AI feature is built on the same ranking and security mechanisms used to block spam, and claimed that "most examples in the test are unrealistic queries that people wouldn't actually search for."

Google's Rebuttal: The Test Itself Is Flawed

Google raised several concerns about Oumi's study. A Google spokesperson called the research "seriously flawed," citing reasons including: the SimpleQA benchmark itself contains inaccurate information; Oumi used its own AI model, HallOumi, to judge another AI's performance, potentially introducing additional errors; and the test content does not reflect real user search behavior.

Google's internal tests also showed that when Gemini 3 operates independently outside the Google Search framework, it produces false outputs at a rate as high as 28%. However, Google emphasized that AI Overviews, leveraging the search ranking system, performs better in accuracy than the model alone.

Nevertheless, as PCMag pointed out, there is a logical paradox here: if your defense is that the report pointing out your AI's inaccuracies itself relies on potentially inaccurate AI, that is unlikely to enhance user confidence in your product's accuracy.

Related Questions

Q: What was the accuracy rate of Google's AI Overviews feature as tested by Oumi, and how many errors does this translate to per hour given Google's search volume?

A: The accuracy rate of Google's AI Overviews was found to be 91% in the test. Given Google's annual volume of 5 trillion searches, this 9% error rate translates to over 57 million inaccurate answers generated every hour.

Q: According to the Oumi study, what was the trend in "unsubstantiated citations" between the Gemini 2 and Gemini 3 versions of the AI Overviews?

A: The problem of "unsubstantiated citations" (where the provided links did not support the AI's answer) increased from 37% with Gemini 2 to 56% with the upgraded Gemini 3.

Q: Which low-quality websites were identified as major sources frequently cited by Google's AI Overviews?

A: Facebook and Reddit were identified as the second and fourth most frequently cited sources by the AI Overviews feature.

Q: How did a BBC journalist demonstrate the vulnerability of Google's AI Overviews to manipulation?

A: A BBC journalist tested the system by publishing a deliberately fabricated article. Within 24 hours, Google's AI Overviews began presenting the false information from that article as a factual answer to user queries.

Q: What were Google's main criticisms of the Oumi study's methodology?

A: Google criticized the study for having "serious flaws," stating that the SimpleQA benchmark itself contains inaccuracies, that using Oumi's own AI model to judge another AI could introduce errors, and that the test queries did not reflect real user search behavior.
