AI Prediction Record: Want to Make Money in Prediction Markets with AI? But It Might Not Even Have Read the Question Clearly

Odaily星球日报 | Published on 2026-01-04 | Last updated on 2026-01-04

Abstract

An experiment tested whether AI (Gemini 2.5 Pro and Grok 4 Fast) could profitably predict outcomes on crypto prediction markets, pitting it against successful human traders. The AIs were given event titles, descriptions, and answer choices (Yes/No), with instructions to use web searches for news and expert analysis while ignoring market data. They were to reason logically and output only a final answer. After 21 settled predictions, Grok achieved the highest win rate at 75%, outperforming the humans (66.7%) and Gemini (52.4%). However, analysis revealed significant flaws in the AI's reasoning:

* **Temporal Confusion:** Gemini sometimes misjudged the current date, leading to erroneous conclusions.
* **Lack of Depth:** Grok often relied on immediate, surface-level information instead of identifying deeper patterns or historical context.
* **Over-reliance on Assumption:** Gemini occasionally based predictions on subjective common-sense assumptions rather than the specific evidence found.
* **Misinterpreting Conditions:** Both AIs sometimes failed to correctly parse the precise settlement criteria for a market, leading to logical errors despite having the correct information.

The conclusion is that while Grok's win rate was impressive, its underlying reasoning process is often flawed and requires significant refinement to be reliably profitable.

Original | Odaily Planet Daily (@OdailyChina)

Author | Nan Zhi (@Assassin_Malvo)

After most narratives in crypto proved hollow, prediction markets have become one of the few sectors in the space still showing positive growth. On November 20, Nan Zhi began applying last year's approach of tracking smart money in meme coins to prediction markets, and the early results were good.

In early December, around the launch of Gemini 3 Pro and while testing related models, a question came up: could AI be used to analyze and predict prediction markets? A comparison was set up between humans and AI to see whose predictions were more accurate.

Prediction markets are often introduced as a mechanism that moves the market closer to the "truth" by "allowing insightful people to bet with real money." However, some argue that crypto plus prediction markets let "insiders" safely profit from information asymmetry, pushing the market toward the "insider outcome." This is essentially a clash between two views: "wisdom of the crowd" and "truth is held by the few." AI prediction leans toward "wisdom of the crowd," and so depends on a large body of available knowledge and insight.

Therefore, Gemini and Grok were chosen first, because they draw on Google and the X platform respectively, which gives them the most direct access to vast amounts of knowledge and insight. Recently, Nan Zhi added "Doubao (Douyin knowledge)" as another combination, but because only a limited number of prediction questions involve it, it is not covered in this article.

Basic Rules

AI Versions: Gemini 2.5 Pro (with built-in Google Search) and Grok 4 Fast (called via OpenRouter, with native search enabled)
Question Selection: Humans select the questions to bet on and the AI then predicts the same questions; the Crypto category is excluded.
Input Content: The official question (title), the official description, and the available answers (in practice only Yes and No)

Note: Polymarket's questions are divided into major categories (Events) and subcategories (Markets). Major Events are broad questions like "Who will be the next Fed Chair?" or "When will Strategy sell Bitcoin?". Under each Event, there are N sub-markets, such as "Will Hassett become the next Fed Chair?" or "Will Strategy sell Bitcoin before March 31, 2026?". To align with human predictions, Markets were chosen as the questions for AI judgment, without inputting other options. For example, the AI is only asked to judge "Will Hassett become the next Fed Chair?", not to choose the most likely one from N candidates.

Prompt Design:
Require the AI to search for the latest news, official announcements, and expert analysis reports
Prohibit any use of prediction market data
Require judgments based on "evidence" and logical reasoning
Allow only Yes or No as output, along with a paragraph explaining the reasoning
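
The prompt rules above can be sketched in code. This is a minimal illustration, not the author's actual prompt: the `Market` class, `build_prompt`, `parse_answer`, and the exact rule wording are all hypothetical, and no model API is called here.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Market:
    """One Polymarket sub-market as described above (hypothetical container)."""
    title: str
    description: str
    outcomes: Tuple[str, ...] = ("Yes", "No")


def build_prompt(market: Market) -> str:
    """Assemble a prompt that encodes the four rules from the list above."""
    options = " or ".join(market.outcomes)
    rules = [
        "Search for the latest news, official announcements, and expert analysis.",
        "Do not use prediction market data in any form.",
        "Base the judgment on evidence and explicit logical reasoning.",
        f"Put exactly one of {options} on the first line, then one paragraph of reasoning.",
    ]
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return (
        f"Question: {market.title}\n"
        f"Description: {market.description}\n\n"
        f"Rules:\n{numbered}\n"
    )


def parse_answer(reply: str, outcomes: Tuple[str, ...] = ("Yes", "No")) -> Optional[str]:
    """Return the outcome named by the first word of the reply, or None if unclear."""
    match = re.match(r"\s*(\w+)", reply)
    if not match:
        return None
    first_word = match.group(1).lower()
    for outcome in outcomes:
        if outcome.lower() == first_word:
            return outcome
    return None


# Usage: the description string is a placeholder, not Polymarket's real text.
market = Market(
    title="Will Hassett become the next Fed Chair?",
    description="(official market description goes here)",
)
print(build_prompt(market))
print(parse_answer("No. Recent reporting does not support it."))
```

Returning `None` for an unparseable reply keeps ambiguous outputs out of the win-rate count instead of silently scoring them as a guess.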

Current Results

Among the predicted questions, 21 have been settled. Grok has the highest win rate at 75%, humans are at 66.7%, and Gemini is the lowest at 52.4%. Current results can be viewed on the relevant website.
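
As a quick arithmetic check on these figures, the sketch below asks which win counts are consistent with each reported percentage. The helper `matching_records` and its rounding tolerance are assumptions made for illustration. The article does not state how many settled questions each participant answered, so the reading that Grok answered fewer than 21 is only one possible explanation.

```python
def matching_records(win_rate_pct: float, total: int, tol: float = 0.05) -> list:
    """Return every win count w in 0..total whose rate matches the reported
    percentage to one decimal place (within tol percentage points)."""
    return [w for w in range(total + 1) if abs(100 * w / total - win_rate_pct) <= tol]


print(matching_records(66.7, 21))  # [14] -> humans: 14 of 21
print(matching_records(52.4, 21))  # [11] -> Gemini: 11 of 21
print(matching_records(75.0, 21))  # []   -> no win count out of 21 gives 75%

# Totals up to 21 for which a 75% win rate is possible.
print([t for t in range(1, 22) if matching_records(75.0, t)])  # [4, 8, 12, 16, 20]
```

Under these assumptions, the human and Gemini figures fit 21 settled questions exactly, while a 75% rate only fits a total that is a multiple of four, so the three win rates may not share the same denominator.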

What Mistakes Did the AI Make?

Gemini Occasionally Misjudges the Current Time

In the question "Will Trump's approval rating hit 35% in 2025?", Gemini stated that it was currently the first half of 2025 and that anything was therefore still possible, and it gave an essentially random answer.

However, when the author directly asked Gemini to output the current time by running a program, it gave the correct answer. It is still unclear why its sense of time went wrong in the prediction itself.
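
One commonly used mitigation for this kind of error is to state the current date explicitly in the prompt instead of leaving the model to infer it. The article does not report trying this, so the sketch below is an assumption, and `with_current_date` is a hypothetical helper.

```python
from datetime import datetime, timezone
from typing import Optional


def with_current_date(prompt: str, now: Optional[datetime] = None) -> str:
    """Prepend an explicit UTC date so the model does not have to guess it."""
    now = now or datetime.now(timezone.utc)
    header = f"Today's date (UTC) is {now:%Y-%m-%d}. Treat this as the current date.\n\n"
    return header + prompt


# Usage with the question from the example above.
print(with_current_date("Will Trump's approval rating hit 35% in 2025?"))
```

Whether this would have prevented the misjudgment described above is untested here.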

AI Lacks Depth of Thought

In the question "Gemini 3.0 Flash released by December 16?", Grok considered only current information. Its reasoning was that "official sources recently only mentioned Gemini 3 Pro and related 2.5 versions, rarely mentioning 3 Flash, therefore evidence is insufficient to judge."

Whereas Gemini pointed out that "Gemini 1.0 was released in December 2023, and the experimental version of Gemini 2.0 Flash was launched in December 2024. Continuing this pattern, launching a 3.0 version by the end of 2025 is logical," and also discovered "a leaked demo about 'Gemini 3.0 Flash' circulating in online communities recently (December 14, 2025), further enhancing the possibility of its imminent public release."

Gemini's final answer on this question was actually wrong, but the gap in the breadth of information the two models drew on is obvious.

AI Relies on Common Sense Rather Than Evidence + Logic for Inference

In the question "Trump approval Up or Down this week?", Gemini stated that "predicting the approval rating for a single week more than a year later is highly uncertain," which again shows the time misjudgment issue. Gemini then said that "in any ordinary week, the probability of events causing a slight decrease in support is likely slightly higher than the probability of positive events significantly boosting support," and concluded that a decrease was more likely. That conclusion rested solely on subjective common-sense assumptions.

On the same question, Grok based its judgment on news reports and polling data regarding "government shutdown, economic concerns, immigration policy disputes, and negative backlash from comments on Rob Reiner's death," which is the behavior the prompt was designed to produce.

Incorrect Judgment of Settlement Conditions

In the question "Will Trump release the Epstein files by December 20?", both Gemini and Grok were aware that "the government will release 'hundreds of thousands of pages' of documents on Friday (December 19th)." The settlement conditions clearly stated "if the government publicly releases any files related to Epstein's illegal activities that were not public before the listed date, it is judged as Yes."

However, Gemini stated that "completing the release of 'all' files before December 20th is impossible." It clearly misread the settlement condition, which requires any previously non-public files rather than all of them, and so gave the wrong answer.
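
The difference between the two readings of the settlement condition can be made concrete. This is a simplified sketch: the `Release` class and both functions are hypothetical, the deadline is treated as inclusive by assumption, and the second release date is invented purely to illustrate a later batch.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Release:
    day: date
    previously_public: bool


def settles_yes_any(releases: List[Release], deadline: date) -> bool:
    """Condition as quoted: Yes if ANY previously non-public files are out by the deadline."""
    return any(r.day <= deadline and not r.previously_public for r in releases)


def settles_yes_all(releases: List[Release], deadline: date) -> bool:
    """The misreading: Yes only if ALL files are out by the deadline."""
    return bool(releases) and all(r.day <= deadline for r in releases)


deadline = date(2025, 12, 20)
releases = [
    Release(date(2025, 12, 19), previously_public=False),  # the Friday release
    Release(date(2026, 1, 15), previously_public=False),   # hypothetical later batch
]
print(settles_yes_any(releases, deadline))  # True
print(settles_yes_all(releases, deadline))  # False
```

With the same facts, the "any" reading settles as Yes while the "all" reading settles as No, which is the gap between the stated condition and Gemini's interpretation.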

Summary

In summary, Grok's prediction win rate has surpassed that of the smart money players who have earned hundreds of thousands or even millions of dollars in prediction markets. However, a closer look at its prediction logic shows that many areas still need guidance and correction.
