AI Prediction Record: Want to Make Money in Prediction Markets with AI? It May Not Even Have Read the Question Carefully

marsbit · Published 2026-01-04 · Updated 2026-01-04

Introduction

Based on an experiment comparing AI predictions against human "smart money" on Polymarket, this article investigates whether AI can reliably profit in prediction markets. The author tested Google's Gemini 2.5 Pro and xAI's Grok (via OpenRouter), both equipped with web search, on 21 resolved non-crypto market questions. The core finding is a divergence in performance: Grok achieved the highest win rate at 75%, outperforming humans (66.7%) and significantly beating Gemini (52.4%). However, a detailed analysis of the AI's reasoning revealed critical flaws. Gemini frequently misjudged the current date, leading to erroneous conclusions. Both models sometimes relied on superficial or commonsense assumptions instead of deep, evidence-based logic. A major failure mode was misinterpreting the specific settlement conditions of a market, such as confusing "any files" with "all files" being released. While Grok's results are promising, the experiment concludes that AI often fails to fully comprehend the question's nuances, highlighting a significant gap between raw information retrieval and true contextual understanding needed for reliable prediction.

Author|Nan Zhi (@Assassin_Malvo)

After many narratives were shown to be hollow, prediction markets have become one of the few sectors in Crypto still showing positive growth. On November 20, Nan Zhi began applying last year's approach of tracking smart money in Meme coins to find smart money in prediction markets, with good early results.

In early December, coinciding with the launch of Gemini 3 Pro, the idea arose while testing related models: could AI be used to analyze and predict prediction markets, pitting humans against AI to see which side makes more accurate predictions?

When introducing prediction markets, they are often described as moving the market closer to the "truth" by "allowing insightful people to place real-money bets." However, some argue that Crypto + prediction markets allow "insiders" to safely profit from information asymmetry, thereby driving the market towards the "insider outcome." This is essentially a clash between the views of "wisdom of the crowd" and "truth is in the hands of the few." AI prediction leans more towards "wisdom of the crowd," thus requiring a large amount of available knowledge and insights.

Therefore, in selecting the AI models, Gemini and Grok were initially chosen because they rely on Google and the X platform, respectively, giving them the most direct access to vast amounts of knowledge and insight. Recently, Nan Zhi added the "Douban (Douyin knowledge)" combination, but since few prediction questions involve it, it is not covered in this article.

Basic Rules

  • AI Versions: Gemini 2.5 Pro (with built-in Google Search), Grok 4 Fast (called via OpenRouter, native search enabled)
  • Question Selection: Humans choose the betting questions, AI follows with predictions, but the Crypto category is excluded.
  • Input Content: Official question (title), official description (Description), optional answers (actually only Yes and No)

Note: Polymarket's questions are divided into major categories (Events) and subcategories (Markets). Major Events are broad questions like "Who will be the next Fed Chair?" or "When will Strategy sell Bitcoin?" Under each Event, there are N sub-markets, such as "Will Hassett become the next Fed Chair?" or "Will Strategy sell Bitcoin before March 31, 2026?" To align with human predictions, Markets were chosen as the questions for AI judgment, without inputting other options. For example, the AI is only asked to judge "Will Hassett become the next Fed Chair?" rather than asking it to choose the most likely candidate from N possibilities.

  • Prompt Design:
  • Require the AI to search for the latest news, official announcements, and expert analysis reports
  • Prohibit the use of prediction market data (prices and odds)
  • Make judgments based on evidence and logical reasoning
  • Only allow Yes or No outputs, accompanied by a paragraph explaining the reasoning
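The prompt constraints above can be sketched as a simple template builder. This is an illustrative assumption, not the author's published prompt; the function name and exact wording are hypothetical.

```python
def build_prompt(title: str, description: str) -> str:
    """Assemble a prediction prompt from a market's title and description.

    Illustrative sketch only; the author's actual prompt is not published.
    """
    rules = (
        "- Search for the latest news, official announcements, and expert analysis.\n"
        "- Do NOT use prediction-market prices or odds as evidence.\n"
        "- Reason from evidence and logic only.\n"
        "- Output exactly 'Yes' or 'No', followed by one paragraph of reasoning."
    )
    return (
        f"Question: {title}\n"
        f"Settlement rules: {description}\n\n"
        f"Follow these rules:\n{rules}"
    )
```

This template would then be sent to each model (e.g. through OpenRouter's chat completions endpoint) with only the Market-level title and description, matching the rule of not showing the AI the other sub-markets under the same Event.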

Current Results

Among the predicted questions, 21 have been settled. Grok has the highest win rate at 75%, humans at 66.7%, and Gemini the lowest at 52.4%. Current results can be viewed on the relevant website.
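The reported percentages follow from simple ratios over the settled questions. The counts below are back-solved from the rounded figures and are assumptions, not published tallies; note that 75% is not k/21 for any integer k, so Grok's rate presumably rests on a different denominator (e.g. 15 of 20).

```python
def win_rate(correct: int, settled: int) -> float:
    """Return the win rate as a percentage rounded to one decimal place."""
    return round(100 * correct / settled, 1)

# Back-solving from the article's rounded figures (assumed counts):
# humans: 14 of 21 correct -> 66.7%; Gemini: 11 of 21 -> 52.4%.
```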

What Mistakes Did the AI Make?

Gemini Occasionally Misjudges the Current Time

In the question "Will Trump's approval rating hit 35% in 2025?", Gemini claimed it was currently the first half of 2025, concluded that anything was therefore possible, and gave an essentially arbitrary answer.

However, when the author directly asked Gemini to output the current time using a program, Gemini could provide the correct answer. It is still unclear why such an erroneous time perception occurred.

AI Lacks Depth of Thought

In the question "Gemini 3.0 Flash released by December 16?", Grok based its judgment on "official sources have recently mentioned only Gemini 3 Pro and the related 2.5 versions, with barely any mention of 3 Flash, so the evidence is insufficient to judge," considering only immediate information.

Whereas Gemini pointed out "Gemini 1.0 was released in December 2023, and the experimental version of Gemini 2.0 Flash was launched in December 2024. Continuing this pattern, a 3.0 version release by the end of 2025 is logical," and also noted "a leaked demo about 'Gemini 3.0 Flash' circulating in online communities recently (December 14, 2025), further enhancing the possibility of its imminent public release."

Although Gemini's conclusion turned out to be wrong, this question clearly shows the difference in the breadth of information the two models drew on.

AI Relies on Common Sense Rather Than Evidence + Logic for Inference

In the question "Trump approval Up or Down this week?", Gemini stated that "predicting the approval rating for a single week more than a year later is highly uncertain," once again exhibiting the time-misjudgment issue. Gemini then argued that "in any ordinary week, the probability of events causing a slight dip in support is slightly higher than the probability of positive events significantly boosting it," and so predicted a decline. The conclusion rested entirely on subjective common-sense assumptions.

In this question, Grok based its judgment on news reports and polling data regarding "government shutdown, economic concerns, immigration policy disputes, and negative backlash from comments on Rob Reiner's death," which aligned with the design expectations.

Incorrect Judgment of Settlement Conditions

In the question "Will Trump release the Epstein files by December 20?", both Gemini and Grok already knew that "the government will release 'hundreds of thousands of pages' of documents on Friday (December 19th)." The settlement conditions clearly stated "if the government publicly releases any files related to Epstein's illegal activities that were not public before the listed date, it will be judged as Yes."

However, under this condition, Gemini argued that "completing the release of 'all' files by December 20th is impossible," clearly misjudging what the settlement required and thus giving the wrong answer.
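The distinction Gemini missed, "any files" versus "all files", can be made precise with a toy predicate. The function names and data are hypothetical, for illustration only:

```python
def settles_yes(released: set[str], previously_public: set[str]) -> bool:
    """Correct reading: Yes if ANY newly released file was not public before."""
    return any(f not in previously_public for f in released)

def misread_all(released: set[str], all_known: set[str]) -> bool:
    """Gemini's misreading: Yes only if ALL known files have been released."""
    return all_known <= released
```

Under the correct reading, a partial release of "hundreds of thousands of pages" already settles the market Yes, even if many more files remain unreleased; under the misreading, the same release resolves No.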

Summary

In summary, Grok's prediction win rate has surpassed that of smart-money players who have earned hundreds of thousands or even millions of dollars in prediction markets. However, a closer look at its prediction logic shows there are still many areas that can be guided and corrected.

Related Questions

Q: What was the main purpose of the author's experiment with AI in prediction markets?

A: The author aimed to test whether AI could be used to analyze and predict outcomes in prediction markets, pitting AI predictions against human predictions to see which was more accurate.

Q: Which two AI models were initially selected for the experiment and why?

A: Gemini and Grok were initially selected because they rely on Google and the X platform, respectively, allowing them to directly access vast amounts of knowledge and insights.

Q: What was the key instruction given to the AI models regarding the data they could use for their predictions?

A: The AI models were instructed to search for the latest news, official announcements, and expert analysis reports, but were strictly prohibited from using prediction market data itself.

Q: What was one of the common errors the Gemini model made during the predictions?

A: The Gemini model occasionally misjudged the current date, leading to flawed reasoning, such as incorrectly believing it was still the first half of 2025.

Q: Which AI model achieved the highest win rate in the experiment, and what was its performance?

A: Grok achieved the highest win rate at 75%, outperforming both the human win rate of 66.7% and Gemini's 52.4%.

