AI Prediction Record: Want to Make Money in Prediction Markets with AI? But It Might Not Even Have Read the Question Clearly

Odaily Planet Daily | Published on 2026-01-04 | Last updated on 2026-01-04

Abstract

An experiment tested whether AI (Gemini 2.5 Pro and Grok 4 Fast) could profitably predict outcomes on crypto prediction markets, pitting it against successful human traders. The AIs were given event titles, descriptions, and answer choices (Yes/No), with instructions to use web searches for news and expert analysis while ignoring market data. They were to reason logically and output only a final answer. After 21 settled predictions, Grok achieved the highest win rate at 75%, outperforming the humans (66.7%) and Gemini (52.4%).

However, analysis revealed significant flaws in the AIs' reasoning:

* **Temporal Confusion:** Gemini sometimes misjudged the current date, leading to erroneous conclusions.
* **Lack of Depth:** Grok often relied on immediate, surface-level information instead of identifying deeper patterns or historical context.
* **Over-reliance on Assumption:** Gemini occasionally based predictions on subjective common-sense assumptions rather than the specific evidence found.
* **Misinterpreting Conditions:** Both AIs sometimes failed to correctly parse the precise settlement criteria for a market, leading to logical errors despite having the correct information.

The conclusion is that while Grok's win rate was impressive, its underlying reasoning process is often flawed and requires significant refinement to be reliably profitable.

Original | Odaily Planet Daily (@OdailyChina)

Author | Nan Zhi (@Assassin_Malvo)

With most other sectors having been discredited, prediction markets have become one of the few sectors in the Crypto space that are still experiencing positive growth. On November 20, Nan Zhi began applying last year's approach of tracking smart money in Meme coins to prediction markets, and achieved good results in the initial phase.

In early December, coinciding with the launch of Gemini 3 Pro, while testing related models, the thought arose: could AI be used to analyze and predict prediction markets? A comparison was set up pitting humans against AI to see which side's predictions were more accurate.

Prediction markets are often introduced as a mechanism that moves the market closer to the "truth" by "allowing insightful people to bet with real money." However, some argue that Crypto + prediction markets allow "insiders" to safely profit from information asymmetry, thereby driving the market towards the "insider outcome." This is essentially a clash between two viewpoints: "wisdom of the crowd" and "truth is held by the few." AI prediction leans towards "wisdom of the crowd," and therefore requires a large amount of available knowledge and insights.

Therefore, Gemini and Grok were the first AI models chosen, because they draw on Google and the X platform, respectively, which gives them the most direct access to vast amounts of knowledge and insights. Recently, Nan Zhi added the combination of "Douban (Douyin Knowledge)", but because few prediction questions have involved it so far, it is not covered in this article.

Basic Rules

  • AI Versions: Gemini 2.5 Pro (with built-in Google Search), Grok 4 Fast (called via OpenRouter, native search function enabled)
  • Question Selection: Humans select the questions to bet on, AI follows with predictions, but the Crypto category is excluded.
  • Input Content: The official question (Title), the official description (Description), and the available answers (in practice only Yes and No)

Note: Polymarket's questions are divided into major categories (Events) and subcategories (Markets). Major Events are broad questions like "Who will be the next Fed Chair?" or "When will Strategy sell Bitcoin?". Under each Event, there are N sub-markets, such as "Will Hassett become the next Fed Chair?" or "Will Strategy sell Bitcoin before March 31, 2026?". To align with human predictions, Markets were chosen as the questions for AI judgment, without inputting other options. For example, the AI is only asked to judge "Will Hassett become the next Fed Chair?", not to choose the most likely one from N candidates.

  • Prompt Design:
      • Require the AI to search for the latest news, official announcements, and expert analysis reports
      • Prohibit the AI from using prediction market data
      • Require judgments based on "evidence" and logical reasoning
      • Allow only Yes or No as the answer, accompanied by a paragraph explaining the reasoning
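The rules above can be sketched as a small prompt builder and answer parser. This is a minimal illustration, not the author's actual prompt: the `Market` class, the function names, and the exact instruction wording are all hypothetical, and the article does not publish the prompt text.

```python
from dataclasses import dataclass


@dataclass
class Market:
    """One binary market (field names are illustrative, not Polymarket's schema)."""
    title: str
    description: str
    outcomes: tuple = ("Yes", "No")


def build_prompt(market: Market) -> str:
    """Assemble a prompt that follows the four prompt-design rules."""
    options = " or ".join(market.outcomes)
    return (
        f"Question: {market.title}\n"
        f"Description and settlement rules: {market.description}\n"
        f"Possible answers: {options}\n\n"
        "Instructions:\n"
        "1. Search for the latest news, official announcements, and expert analysis.\n"
        "2. Do not use prediction market prices or odds as evidence.\n"
        "3. Reason logically from the evidence you find.\n"
        f"4. Start your reply with exactly one word ({options}), "
        "then give one paragraph explaining your reasoning.\n"
    )


def parse_answer(reply: str, outcomes=("Yes", "No")):
    """Return the outcome the reply starts with, or None if it is ambiguous."""
    words = reply.strip().split()
    if not words:
        return None
    # Drop punctuation and markdown emphasis around the first word.
    first = words[0].strip(".,:;!*\"'").lower()
    for outcome in outcomes:
        if first == outcome.lower():
            return outcome
    return None
```

Returning `None` for replies that do not open with one of the allowed answers keeps ambiguous output out of the win-rate tally instead of guessing which side the model meant.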

Current Results

Among the predicted questions, 21 have been settled. Grok has the highest win rate at 75%, humans are at 66.7%, and Gemini is the lowest at 52.4%. Current results can be viewed on the relevant website.
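As a quick arithmetic check on these figures, the sketch below searches for whole-number records that round to each reported win rate. The function name and the assumption that rates are rounded to one decimal place are illustrative; the article reports only the percentages, not the underlying win counts.

```python
def matching_records(win_rate_pct: float, max_questions: int):
    """List (wins, total) pairs whose win rate rounds to win_rate_pct at one decimal."""
    matches = []
    for total in range(1, max_questions + 1):
        for wins in range(total + 1):
            if round(100 * wins / total, 1) == win_rate_pct:
                matches.append((wins, total))
    return matches


settled = 21
for name, pct in [("Grok", 75.0), ("Humans", 66.7), ("Gemini", 52.4)]:
    # Keep only the records that use all 21 settled questions.
    full_set = [(w, t) for w, t in matching_records(pct, settled) if t == settled]
    print(name, pct, full_set)
```

Under these assumptions, 66.7% and 52.4% correspond to 14 and 11 correct answers out of 21, while no whole number of wins out of 21 yields 75%. One reading is that Grok's rate was computed over fewer questions (for example 15 of 20), but the article does not say so.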

What Mistakes Did the AI Make?

Gemini Occasionally Misjudges the Current Time

In the question "Will Trump's approval rating hit 35% in 2025?", Gemini asserted that it was still the first half of 2025, reasoned that anything could therefore happen, and effectively gave an arbitrary answer.

However, when the author directly asked Gemini to output the current time using a program, Gemini could provide the correct answer. It is still unclear why such an erroneous time perception occurred.
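One possible mitigation, which the article does not report testing, is to state the current date explicitly at the top of every prompt rather than relying on the model to infer it. The helper name and header wording below are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def with_current_date(prompt: str, now: Optional[datetime] = None) -> str:
    """Prefix the prompt with an explicit statement of today's date (UTC by default)."""
    if now is None:
        now = datetime.now(timezone.utc)
    header = (
        f"Today's date is {now:%Y-%m-%d} (UTC). "
        "Treat this as the current date when judging deadlines.\n\n"
    )
    return header + prompt
```

Passing `now` explicitly also makes it possible to replay a question as of the day it was originally asked.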

AI Lacks Depth of Thought

In the question "Gemini 3.0 Flash released by December 16?", Grok considered only current information. Its reasoning was that "official sources recently only mentioned Gemini 3 Pro and related 2.5 versions, rarely mentioning 3 Flash, therefore evidence is insufficient to judge."

Gemini, by contrast, pointed out that "Gemini 1.0 was released in December 2023, and the experimental version of Gemini 2.0 Flash was launched in December 2024. Continuing this pattern, launching a 3.0 version by the end of 2025 is logical," and also found "a leaked demo about 'Gemini 3.0 Flash' circulating in online communities recently (December 14, 2025), further enhancing the possibility of its imminent public release."

Although Gemini's final answer to this question turned out to be wrong, the gap in the breadth of information the two models drew on is evident.

AI Relies on Common Sense Rather Than Evidence + Logic for Inference

In the question "Trump approval Up or Down this week?", Gemini stated that "predicting the approval rating for a single week more than a year later is highly uncertain," which again shows the time misjudgment issue. Gemini then said that "in any ordinary week, the probability of events causing a slight decrease in support is likely slightly higher than the probability of positive events significantly boosting support," and concluded that a decrease was more likely. This conclusion rested solely on subjective common-sense assumptions.

On the same question, Grok based its judgment on news reports and polling data regarding "government shutdown, economic concerns, immigration policy disputes, and negative backlash from comments on Rob Reiner's death," which is the behavior the prompt was designed to elicit.

Incorrect Judgment of Settlement Conditions

In the question "Will Trump release the Epstein files by December 20?", both Gemini and Grok were aware that "the government will release 'hundreds of thousands of pages' of documents on Friday (December 19th)." The settlement conditions clearly stated "if the government publicly releases any files related to Epstein's illegal activities that were not public before the listed date, it is judged as Yes."

However, under this condition, Gemini stated that "completing the release of 'all' files before December 20th is impossible," clearly misjudging the conditions required for settlement, thus giving the wrong answer.
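The difference between the two readings can be made concrete with a short sketch. The function names, the inclusive deadline, and the `total_batches` value are hypothetical illustrations; the only facts taken from the article are the December 19 release and the "any files" settlement wording.

```python
from datetime import date


def resolves_yes(release_dates, deadline):
    """'Any' reading, as in the quoted rule: one new release by the deadline is enough."""
    return any(d <= deadline for d in release_dates)


def resolves_yes_misread(release_dates, deadline, total_batches):
    """'All' reading (the error described above): every batch must be out by the deadline."""
    on_time = [d for d in release_dates if d <= deadline]
    return len(on_time) == total_batches


deadline = date(2025, 12, 20)          # assumed inclusive for illustration
releases = [date(2025, 12, 19)]        # the Friday release mentioned in the article

print(resolves_yes(releases, deadline))                            # True
print(resolves_yes_misread(releases, deadline, total_batches=3))   # False
```

With identical evidence, the two readings give opposite answers, which is why the settlement wording has to be parsed before any reasoning about the news.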

Summary

In summary, Grok's prediction win rate has surpassed that of the smart money players who have made hundreds of thousands or even millions of dollars in prediction markets. However, a closer examination of its prediction logic shows that there are still many areas that need guidance and correction.

Related Questions

Q: What was the main purpose of the experiment described in the article regarding AI and prediction markets?

A: The experiment was to test whether AI could be used to analyze and predict outcomes in prediction markets, pitting human predictions against AI predictions to see which was more accurate.

Q: Which two AI models were primarily used in the initial experiment, and what was a key feature they utilized?

A: The two primary AI models used were Gemini 2.5 Pro and Grok 4 Fast. A key feature they utilized was their native web search capabilities to gather the latest news and information.

Q: According to the article, what was one of the errors made by the AI models that led to incorrect predictions?

A: One error was that Gemini sometimes misjudged the current date, leading to flawed reasoning based on an incorrect temporal context.

Q: What was the reported success rate (win rate) for the Grok AI model in the settled predictions?

A: The Grok AI model achieved a win rate of 75% in the settled predictions.

Q: In the context of the prediction market, what does the 'Event' and 'Market' structure refer to?

A: An 'Event' is a broad category or question (e.g., 'Who will be the next Fed Chair?'), while a 'Market' is a specific, binary sub-question within that event (e.g., 'Will Hassett be the next Fed Chair?'). The AI was tested on these specific Market questions.
