Is Polymarket's Pricing Wrong? 200 AI Agent Simulation of Crisis Yields Unexpected Answer

marsbit, published 2026-03-18, last updated 2026-03-18

Summary

An experiment used MiroFish, an open-source multi-agent simulation platform, to model the geopolitical crisis in the Strait of Hormuz and compare the results with Polymarket's prediction market. The system generated 200 AI agents (government officials, media, energy firms, financial traders, and civilians) and simulated 7 days of social media interaction in a Twitter-like environment, based on a 5,800-character background brief.

Key findings:

· Organic, free-form discussions among agents produced an average probability of 47.9% for the strait reopening by the end of April 2026, significantly higher than Polymarket's market-derived probability of 31%.

· When agents were individually questioned in a formal "interview" setting, they converged on overly optimistic responses (60–75% across categories), reflecting a cooperation bias.

· The predictions closest to market pricing came from a minority of pessimistic agents (e.g., Iranian officials, financial analysts, academics) who organically expressed probabilities near 22%.

· The simulation revealed a structural divide: public and official statements tend toward optimism, while genuine risk assessments emerge from unstructured, adversarial discourse.

The study suggests that natural interaction among specialized agents can generate valuable signals, but LLM bias and limited context remain constraints. Future work will expand data scope, use stronger models, and increase agent diversity.

Original Title: how I run 200 AI agents on the hormuz crisis with Mirofish, and compare it to polymarket

Original Author: The Smart Ape

Original Compilation: Peggy, BlockBeats

Editor's Note: When AI begins to simulate a public opinion field, the act of prediction itself is quietly changing.

This article documents an experiment on the situation in the Strait of Hormuz: the author used MiroFish to build a simulation system composed of 200 agents, allowing governments, media, energy companies, traders, and ordinary people to coexist in a simulated social network, forming judgments through continuous interaction, debate, and information dissemination, and comparing this group result with Polymarket's market pricing.

The results were not consistent. The group discussion was overall optimistic, while the market was significantly more pessimistic; in free expression, a minority of pessimists were actually closer to the real pricing; and once placed in an interview setting, almost all agents converged to more moderate, cooperative expressions.

This split is not unfamiliar. In the real world, public statements often tend towards stability and optimism, while true risk assessments are hidden in actions and informal expressions. In other words, what people say, what they think, and how they bet with money are often three different systems.

In such a structure, the most valuable signals often come not from consensus, but from those voices that seem out of place amidst the noise.

Below is the original text:

I used MiroFish to simulate the situation in the Strait of Hormuz over the coming weeks. The tool suits this kind of problem because it can run complex scenario simulations: it puts multiple participants, each with its own role and incentives, into the same system and lets the agents spar and debate continuously until they arrive at something close to a consensus.

Below are the specific steps I took to run this simulation and the final results I obtained. Anyone can reproduce it; the key is just knowing which steps to follow.

First, MiroFish is an open-source project from a Chinese research team. You feed it a batch of documents; it builds a knowledge graph, generates agent personas from that graph, and then releases those agents into a simulated Twitter environment. There they post, retweet, comment, like, and argue with each other. After the simulation ends, you can also interview each agent individually to see its position and reasoning.

You input a crisis scenario, and it generates a debate around that event; from this debate, you can extract a prediction.

I aimed it at an ongoing Polymarket market question: Will maritime transport in the Strait of Hormuz return to normal by the end of April 2026?

So, I fed all this information to MiroFish, generated 200 agent roles—including governments, media, military, energy companies, traders, and ordinary citizens—and had them argue for 7 simulated days in a simulated environment. Finally, I compared their output with the market pricing.

Overall configuration:

· Model: GPT-4o mini, offering the best balance of cost and effect for a scenario with 200 agents

· Memory System: Zep Cloud, for storing agent memories and knowledge graph

· Simulation Engine: OASIS (Twitter clone environment provided by Camel-AI)

· Hardware: Mac mini M4 Pro, 24GB RAM

· Runtime: ~49 minutes, completing 100 simulation rounds

· Cost: ~$3 to $5 in API calls

· Seed Material: A 5800-character brief compiled from Wikipedia, CNBC, Al Jazeera, Forbes, and Reuters, covering the military timeline, blockade status, oil prices, economic losses, diplomatic efforts, and factors related to the GCC's $3.2 trillion investment. That is, the core information needed for the agents to form judgments was included.

How to Reproduce This Process (Step-by-Step Instructions)

If you also want to run it yourself, below are the complete steps I actually took. The entire setup process takes about 2 hours, with API costs around $3 to $5; if you increase the number of rounds or agents, the cost will be higher.

What You Need to Prepare

· Python 3.12 (do not use 3.14; tiktoken errors out on that version)

· Node.js 22 and above

· An OpenAI API Key (GPT-4o mini is cheap enough, suitable for this scenario)

· A Zep Cloud account (the free tier is sufficient for small-scale simulations)

· A machine with decent RAM. I used a Mac mini M4 Pro with 24GB RAM, but 16GB should suffice

Step 1: Install MiroFish

Then configure your .env file:

OPENAI_API_KEY=sk-your-key

OPENAI_BASE_URL=link

OPENAI_MODEL=gpt-4o-mini

ZEP_API_KEY=your-zep-key
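Before launching anything, it can help to confirm that all four variables are actually set. The following is a minimal sketch using only the Python standard library; the helper names `parse_env` and `missing_keys` are hypothetical and are not part of MiroFish, and the sample text deliberately leaves `OPENAI_BASE_URL` empty to show what a failed check looks like.

```python
# Hypothetical pre-flight check for the .env file; not part of MiroFish.
REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENAI_BASE_URL", "OPENAI_MODEL", "ZEP_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Sample content with placeholder values; OPENAI_BASE_URL is left empty on purpose.
sample = """
OPENAI_API_KEY=sk-your-key
OPENAI_BASE_URL=
OPENAI_MODEL=gpt-4o-mini
ZEP_API_KEY=your-zep-key
"""

problems = missing_keys(parse_env(sample))
print("missing or empty:", problems)
```

To check a real file, pass `pathlib.Path(".env").read_text()` to `parse_env` instead of the sample string.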

Step 2: Create a project and upload your seed document

The seed document is the most important part of the entire process; it determines what information the agents know about the current situation. I prepared a brief of about 5800 characters, covering the military timeline, blockade status, oil prices, economic losses, diplomatic efforts, and the impact level of GCC investment, sourced from Wikipedia, CNBC, Al Jazeera, Forbes, and Reuters.

Step 3: Generate the Ontology

This step tells MiroFish what types of entities it should identify and what relationships might exist between them.

I ended up generating 10 types of entities: Countries, Military, Diplomats, Business Entities, Media Organizations, Economic Entities, Organizations, Individuals, Infrastructure, Prediction Markets; and 6 types of relationships. If the automatically generated results don't fit your scenario well, you can adjust them manually.

Step 4: Build the Knowledge Graph

This step uses Zep Cloud. MiroFish sends the seed document and ontology to Zep, which is responsible for extracting entities and building the graph.

This process takes a minute or two. I ended up with a graph containing 65 nodes and 85 edges, connecting elements like countries, people, organizations, commodities, etc.

Step 5: Generate Agents

MiroFish generates a complete personality profile for each entity based on the knowledge graph, including MBTI personality type, age, country, posting style, emotional triggers, taboo topics, and institutional memory.

I initially generated 43 core agents from the knowledge graph. Afterwards, the system can expand these core roles to your desired total number. I finally set the total number of agents to 200, adding more diverse civilian roles, such as crypto traders, airline pilots, professors, students, social activists, etc.

Step 6: Prepare the Simulation Environment

This step generates the complete simulation configuration, including agent action schedules, initial seed posts, and time parameters. MiroFish automatically selects a relatively reasonable default setup, such as peak active hours, sleep times, and posting frequencies for different types of agents.

My configuration was: Simulate 168 hours (7 days), 100 rounds (each round represents 1 hour), use only the Twitter scene, and set individual active timetables for different agents.

Step 7: Start the simulation.

Then wait. Using GPT-4o mini for 200 agents and 100 simulation rounds took me about 49 minutes. You can monitor progress via the API or simply check the logs.

Throughout the process, the agents run autonomously: they observe the timeline and decide whether to post, quote-retweet, comment, repost, like, or simply scroll through the feed. The entire process requires no manual intervention.

Step 8 (Optional): Interview Agents

After the simulation ends, the system enters command mode. Here you can interview a specific agent individually or interview all agents at once.

Analysis

MiroFish first reads the seed document and automatically generates an ontology structure (including 10 entity types and 6 relationship types); then, based on these definitions, it extracts a knowledge graph (containing 65 nodes and 85 edges). On this basis, it builds a complete personality profile for each entity, including MBTI personality type, age, country, posting style, emotional triggers, and institutional memory.

Finally, 43 core agents were generated from the knowledge graph, and expanded to a total of 200 agents, introducing more diverse civilian roles to enhance the overall simulation's diversity and realism.

Specific composition:

· 140 civilian agents: crypto traders, airline pilots, supply chain managers, students, social activists, professors, etc.

· 16 diplomatic/government roles: Iranian Foreign Minister, Saudi Foreign Minister, Omani Foreign Minister, Bahraini Prime Minister, Chinese Foreign Minister, EU, UN, etc.

· 15 media organizations: Reuters, CNN, Bloomberg, Al Jazeera, BBC, Fox, Wall Street Journal, etc.

· 10 energy/shipping related: OPEC, Platts, QatarEnergy, Aramco, Maersk, etc.

· 7 financial institutions: Polymarket, Kalshi, Goldman Sachs, JPMorgan, Citadel, ADIA, etc.

· 2 military/political roles: Trump, Iranian Revolutionary Guard Commander
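The breakdown above can be tallied in a few lines. This is only a bookkeeping sketch using the counts as listed; the category labels are shortened for readability. Note that the six listed categories add up to 190, so 10 of the 200 agents are not covered by this breakdown, and the source does not say where they fall.

```python
# Agent counts exactly as listed in the composition breakdown.
TOTAL_AGENTS = 200
composition = {
    "civilian": 140,
    "diplomatic/government": 16,
    "media": 15,
    "energy/shipping": 10,
    "financial": 7,
    "military/political": 2,
}

listed = sum(composition.values())
unlisted = TOTAL_AGENTS - listed

for name, count in composition.items():
    # Share of the full 200-agent population, not of the listed subtotal.
    print(f"{name:<24}{count:>4}  {count / TOTAL_AGENTS:.1%}")
print(f"listed: {listed}, not covered by the breakdown: {unlisted}")
```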

During the 7-day (100-round) simulation, it produced:

· 1,888 posts

· 6,661 logged actions (every action an agent took)

· 1,611 quote retweets (agents responding to and sparring with each other)

· 4,051 refreshes (just browsing the feed)

· 311 instances of doing nothing (choosing to watch)

· 208 likes and 207 reposts

· 70 original viewpoints (new independent positions or judgments)

Overall, the system produces not simple text generation but something closer to a simulation of social behavior: most of the time, agents are observing, digesting information, and interacting rather than continuously producing output. This is closer to how behavior is distributed in a real public opinion arena: a small amount of original content, overlaid with a large amount of retelling, sparring, and emotional feedback.

Agents spent most of their time reading and quoting others' opinions rather than actively creating new content.
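The "mostly reading, rarely originating" pattern can be made concrete from the reported counts. A small sketch, assuming the 6,661 figure is the right denominator for all actions and the 1,888 figure for posts; the grouping of refreshes and do-nothing turns as "passive" is an interpretation made here, not a category from the simulation logs.

```python
# Counts reported for the 7-day, 100-round run.
TOTAL_ACTIONS = 6661
TOTAL_POSTS = 1888
actions = {
    "quote_retweet": 1611,
    "refresh": 4051,
    "do_nothing": 311,
    "like": 208,
    "repost": 207,
    "original_post": 70,
}

# "Passive" here means the agent looked at the feed or sat the round out.
passive = actions["refresh"] + actions["do_nothing"]
passive_share = passive / TOTAL_ACTIONS

# How much of the posted content was a genuinely new position.
original_share_of_posts = actions["original_post"] / TOTAL_POSTS

# The three posting-type counts happen to sum to the reported post total.
posting_types = actions["quote_retweet"] + actions["repost"] + actions["original_post"]

print(f"passive share of all actions: {passive_share:.1%}")
print(f"original viewpoints as share of posts: {original_share_of_posts:.1%}")
print(f"quote retweets + reposts + originals = {posting_types}")
```

On these numbers, roughly two thirds of all logged actions are passive, and fewer than one post in twenty introduces a new position.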

The entire group showed a clear bias in emotional propagation: optimistic views were more easily amplified and shared, while more pessimistic judgments, even if logically closer to reality, often spread less and had weaker volume.

More interestingly, 19 agents spontaneously gave specific probability judgments during the posting process, not because they were asked to, but as a natural evolution of the discussion.

The average probability formed spontaneously by the group was 47.9%, while the probability given by the Polymarket market was 31%, a difference of 16.9 percentage points.

During the simulation, some agents even changed their positions over the 100 rounds of interaction.

After the simulation, I used MiroFish's interview function to ask the same question to the 43 core agents: What do you think is the probability (0-100%) that maritime transport in the Strait of Hormuz will return to normal by the end of April 2026?

The result: 31 of the 43 agents gave specific numbers, while 12 chose to refuse to answer. It is worth noting that the most cautious voices often chose self-censorship rather than giving a clear prediction—which, incidentally, is also closer to the behavior of these institutions in reality.

The average for each category was above 60%: Military 75%, Media 69%, Energy 66%, Finance 65%, Diplomacy 61%. The market's number was 31.5%.
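To see how far each interview category sits from the market, the figures above can be compared directly. A short sketch using only the reported numbers; the unweighted mean across categories is computed here for illustration and is not the per-agent average, since the source does not give the number of respondents in each category.

```python
from statistics import mean

MARKET_PCT = 31.5  # market figure quoted alongside the interview results

# Category averages from the interviews, in percent.
interview_avg = {
    "Military": 75,
    "Media": 69,
    "Energy": 66,
    "Finance": 65,
    "Diplomacy": 61,
}

# Gap to the market in percentage points, per category.
gaps = {category: avg - MARKET_PCT for category, avg in interview_avg.items()}
for category, gap in gaps.items():
    print(f"{category:<10} {interview_avg[category]:>3}%  (+{gap:.1f} pts vs market)")

# Unweighted mean of the five category averages (not weighted by respondents).
category_mean = mean(interview_avg.values())
print(f"unweighted category mean: {category_mean:.1f}%")

# 31 of the 43 interviewed agents gave a number.
response_rate = 31 / 43
print(f"response rate: {response_rate:.1%}")
```

Even the least optimistic category, Diplomacy, lands almost 30 points above the market.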

The naturally evolved group result (organic) and the interview result present two completely different pictures.

This is the most critical finding.

Interview results appear more optimistic. When agents post freely, the views of bears (pessimists) are often louder and more specific; but when you interview them one-on-one, due to cooperation preferences, almost everyone gives a judgment of 60%–70%.

The naturally evolved result (organic) is more reliable. When a financial advisor posting fiercely in a discussion says 'I estimate 65%', that is a judgment formed during interaction, whereas an agent answering a question in an interview is essentially pattern matching.

The pessimists in those natural expressions are actually the best predictors. The 7 agents that gave probabilities ≤30% in the simulation (Iranian FM, Chinese FM, Kalshi, Platts, an economics professor, an Iranian student, an anti-war activist) had a mean of 22%, differing from the Polymarket result by less than 10 percentage points. Expertise + Natural Expression = Closest to the market.
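The two organic signals can be set against the market price in the same way. A minimal sketch using the figures quoted in the text (47.9% for the whole group, 22% for the seven pessimists, 31% for Polymarket); the labels are shorthand introduced here.

```python
MARKET_PCT = 31.0  # Polymarket probability quoted for the organic comparison

# Organic (free-posting) signals, in percent.
signals = {
    "organic group average": 47.9,
    "organic pessimists (<=30%)": 22.0,
}

# Signed gap to the market in percentage points, rounded to one decimal.
gaps = {label: round(pct - MARKET_PCT, 1) for label, pct in signals.items()}
for label, gap in gaps.items():
    print(f"{label:<28} {signals[label]:>5.1f}%  gap {gap:+.1f} pts")

# The signal with the smallest absolute gap is the one closest to the market.
closest = min(gaps, key=lambda label: abs(gaps[label]))
print("closest to market:", closest)
```

The group average overshoots the market by 16.9 points, while the pessimist subset undershoots it by 9, which is what "less than 10 percentage points" refers to.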

More crucially, this is not just an AI phenomenon; real-world actors are the same.

You interview any national leader about a crisis, they will say 'We are committed to peace,' 'We are optimistic about a resolution.' This is standard rhetoric, what must be said in front of the camera. But if you look at what they are actually doing: military deployments, sanctions, asset freezes, divestment—their actions often tell a completely different story.

The Saudi Crown Prince will tell Reuters 'We believe in diplomacy,' while at the same time, his sovereign wealth fund is reviewing its $3.2 trillion US asset allocation. The Iranian President will say 'Peace is our common goal,' but the Iranian Revolutionary Guard is laying mines in the Strait. Trump will say 'We'll see,' while rejecting every ceasefire proposal.

This simulation inadvertently reproduced the same structural split: when agents post freely, argue, respond, and disseminate information, the expert groups among them gradually converge in the 20%–30% range—more pessimistic, and closer to reality; but once you bring them into a conference room and formally ask 'What is your prediction?', they immediately switch to diplomatic mode: 65%–70%, significantly more optimistic.

Natural posting is more like private behavior and non-public dialogue; interview results are more like press conferences. If you really want to know what someone thinks, don't ask them directly—watch their behavior when no one is scoring.

What's Next

This is just a preliminary test. The goal was not to give a definitive prediction, but to see which signals are useful in such group simulations, where distortion occurs, and which parts are worth optimizing.

Now we have the answer: naturally evolved discussions can produce effective signals, interviews cannot; pessimists are the signal source; and GPT-4o mini's cooperation preference is indeed a problem.

The next experiment will include several upgrades.

First, larger seed data. Instead of just a 5800-character brief, the next run will draw on over 20 years of historical background: events related to Hormuz, Iran-US conflict escalation, past oil crises, GCC diplomatic changes, and so on. This is essentially the background a real geopolitical analyst would have in mind before making a judgment.

Second, a stronger model. GPT-4o mini is sufficient for validation at a cost of $3, but a stronger model should allow agents to think more like the roles themselves, rather than falling back on default expressions like 'I am optimistic about dialogue' at critical moments.

Finally, more agents. 200 is good, but it can be expanded further: more diverse ordinary roles, more regional voices, more edge cases. The more participants, the richer the discussion structure, and the more valuable the final signal.


Related Questions

Q: What was the main finding of the experiment comparing the AI agent simulation to Polymarket's prediction?

A: The main finding was that the naturally evolved discussion among AI agents produced an average probability of 47.9% for the Strait of Hormuz reopening, which was significantly more optimistic than Polymarket's market pricing of 31%. However, a small subset of pessimistic agents in the free discussion converged on a probability of around 22%, which was much closer to the market's prediction.

Q: What tool was used to create and run the simulation with 200 AI agents?

A: The tool used was MiroFish, an open-source project from a Chinese research team. It was used to build a knowledge graph, generate the agent personas, and run the simulation in a simulated Twitter environment called OASIS.

Q: What was the key difference between the agents' 'organic' discussion and their responses in a formal interview?

A: In the 'organic' free discussion, agents expressed a wider range of opinions, with some being very pessimistic. In a formal one-on-one interview setting, nearly all agents converged on a more optimistic and cooperative diplomatic tone, giving probabilities between 60-70%, showing a significant divergence from their natural, unprompted expressions.

Q: Which group of agents provided predictions that were closest to the Polymarket outcome?

A: A small group of 7 pessimistic agents (including the Iranian Foreign Minister, Chinese Foreign Minister, Kalshi, Platts, an economics professor, an Iranian student, and an anti-war activist) who gave probabilities of ≤30% in the organic discussion. Their average of 22% was within 10 percentage points of Polymarket's 31%.

Q: What are the planned improvements for the next iteration of this experiment?

A: The planned improvements are: 1) Using a larger seed dataset with over 20 years of historical context. 2) Using a more powerful AI model than GPT-4o mini to reduce its default cooperative bias. 3) Increasing the number of agents to create a more diverse and richer discussion structure.
