Just Now, Chinese AI Enters Top 2 in Global Programming, Only Claude Remains Ahead

marsbit | Published on 2026-05-27 | Last updated on 2026-05-27

Abstract

**China's AI Ranks Second Globally in Programming, Trailing Only Claude** Today, Alibaba's Qwen3.7-Max scored 1541 on the Code Arena benchmark, securing fourth place globally and surpassing top models like GPT-5.5 and Gemini 3.5 Flash. It is now the only non-Claude model among the top positions, which makes Alibaba the number two vendor behind Anthropic and its Opus models. Before this official ranking, Qwen3.7-Max had already gained recognition overseas. In practical tests, it outperformed rivals on tasks like creating a self-training Tetris AI and generating complex 3D models, often at a significantly lower cost. Developers praised its ability, especially when integrated with tools like Hermes Agent and OpenCode, to effectively replace models such as GPT-5.5. In a hands-on challenge to create a 3D racing game from a detailed prompt, Qwen3.7-Max delivered a playable HTML file on the first attempt, requiring only a minor bug fix. It alone included a start menu and sound effects, details the other models missed. While competitors like Gemini 3.5 Flash and Claude Opus 4.6 produced less polished or functional versions, and GPT-5.5 had its own quirks, Qwen3.7-Max stood out for its initial completeness and playability. This performance stems from its design as an "Agent Foundation Model," built for long-duration, autonomous task execution. Internal tests show it can run continuously for 35 hours, making 1158 tool calls without context degradation or instruction drift. Key technical ...

Today, the latest Code Arena leaderboard is out!

Qwen3.7-Max, with a score of 1541 points, broke into the global top four, surpassing top-tier models like GPT-5.5 and Gemini 3.5 Flash.

Ahead of it, only Claude Opus 4.7 and Opus 4.6 remain.

In other words, in the global arena for programming models, Alibaba is the only Chinese player at the top table and the only vendor there besides Anthropic, which puts it at number two among model makers.

Qwen3.7-Max Breaks into Global Top Five

The Only Non-Claude Model

Even before the Code Arena leaderboard was released, Qwen3.7-Max had already made a name for itself among overseas developer communities.

Atomic Chat conducted a head-to-head comparison, pitting Opus 4.7, GPT-5.5, and Qwen3.7-Max against each other on a task to write a self-training Tetris AI.

The result? At a token cost of just $1.32, Qwen3.7-Max outperformed both Opus 4.7 and GPT-5.5 and improved performance by 56%.

Another overseas developer had Qwen3.7-Max build a 3D model of the universe, and the result was described as stunning.

In the task of generating a "3D Pixel Art Miniature Pagoda Model," Qwen3.7-Max's output speed and quality were also comprehensively superior.

Developer Paul Couvert praised Qwen3.7-Max highly, stating that once integrated with Hermes Agent and OpenCode, it could basically replace GPT-5.5 and Opus 4.7.

Programming, A True Contender

However, scores are one thing; real-world testing is another.

We arranged a hardcore "Racing Game" challenge for Qwen3.7-Max.

Given a detailed prompt, Qwen3.7-Max quickly output a playable HTML file.

The first version had a small bug: the A/D steering keys were reversed.

But after a second round of simple conversational adjustment, a fully featured 3D racing game was up and running.

The moment it opened, to be honest, was a bit of a shock.

Four cars race together on a 3-lap circular track with over 100 coins scattered along it, and hitting obstacles causes slowdowns and loss of control.

The post-race results panel had everything: ranking, time, coins collected, and fastest lap.

But what was truly surprising were two details that only Qwen3.7-Max got right.

One was the start screen. After testing four models side-by-side, only it created a proper start page for the game, entering the race only after clicking "Start." The other three went straight into racing without even a title screen.

The other was sound effects. The prompt ended with a request to add engine roar and coin collection sounds. Out of the four models, only it acted on this bonus request, adding engine sounds and coin dings.

Now let's look at the performance of the other contestants.

Gemini 3.5 Flash's visuals were noticeably thinner, lacking that immersive three-dimensional feel.

The UI layout was also problematic: dashboard information was spread across the four corners of the screen, leaving no clear visual focus.

In contrast, Qwen3.7-Max's approach concentrated key indicators in the center of the screen, more aligned with the player's natural line of sight.

Claude Opus 4.6's result was somewhat... hard to describe.

Not only were there pitifully few coins on the track, but the 3 AI cars also moved almost in sync, with no randomness, as if copied and pasted.

Finally, GPT-5.5.

Its visual quality was indeed better than the previous two, and the controls felt smoother.

But for some reason, coins were made into yellow "donuts"...

The shape is a minor issue. The key point is that Gemini 3.5 Flash, Claude Opus 4.6, and GPT-5.5 all required several rounds of bug fixes to get all functions running.

Only Qwen3.7-Max's first-round generation was basically playable.

Similar benchmark scores, solid real-world performance, and a fraction of the price: what remains is for developers to vote with their feet.

The "Foundation" Model for the Agent Era

The reason Qwen3.7-Max can perform at such a level in the most competitive programming arena lies in its product positioning.

A few days ago, when Alibaba released Qwen3.7-Max, they gave it a very special label: Agent Foundation Model.

It was designed from the start for long-duration autonomous task execution.

Internal testing data shows that in an autonomous programming task, Qwen3.7-Max ran continuously for 35 hours, executing 1158 tool calls.

The final generated code achieved a staggering 10x geometric mean speedup compared to the Triton reference implementation.
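To make the metric concrete: a geometric mean speedup multiplies the per-kernel speedup ratios and takes the n-th root, so a single outlier kernel cannot dominate the result the way it would in an arithmetic mean. The sketch below is illustrative only; the function name and the timings are invented for this example and are not Qwen's internal test data.

```python
import math


def geometric_mean_speedup(reference_ms, optimized_ms):
    """Geometric mean of per-kernel speedups (reference time / optimized time)."""
    if len(reference_ms) != len(optimized_ms) or not reference_ms:
        raise ValueError("need two non-empty lists of equal length")
    if any(t <= 0 for t in list(reference_ms) + list(optimized_ms)):
        raise ValueError("timings must be positive")
    # Average the logs of the ratios, then exponentiate back.
    log_sum = sum(math.log(ref / opt) for ref, opt in zip(reference_ms, optimized_ms))
    return math.exp(log_sum / len(reference_ms))


# Hypothetical timings in milliseconds, invented for illustration only.
reference = [8.0, 20.0, 50.0]   # baseline kernels
optimized = [2.0, 2.0, 2.0]     # generated kernels

print(geometric_mean_speedup(reference, optimized))  # approximately 10.0
```

With these made-up numbers the per-kernel speedups are 4x, 10x, and 25x. Their geometric mean is 10x, while the arithmetic mean would be 13x.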

Even more impressive is its "endurance": even 30 hours into the reasoning process, the model remained sharp, continuously uncovering new room for optimization.

Throughout, there was zero context degradation, zero instruction drift, and zero infinite loops!

To be clear, the difficulty isn't in the 1000 tool calls themselves. Since MCP (Model Context Protocol) became widespread, calling tools 1000 times isn't that rare.

The difficulty lies in 35 hours of coherent reasoning.

Most models break down on long tasks: either the context becomes increasingly messy and they forget the goals set at the beginning, or they enter infinite loops, repeatedly attempting the same failed solution.
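The second failure mode is easy to picture in code. The sketch below is a hypothetical guard that an agent harness might run over its own tool-call log to notice when the same failing call keeps being retried. The log format, the function name, and the threshold are all assumptions made for illustration; the article does not describe how Qwen3.7-Max or its harness handles this.

```python
from collections import Counter


def find_repeated_failure(tool_calls, max_repeats=3):
    """Return the first (tool, args) pair that failed `max_repeats` times, else None.

    `tool_calls` is an iterable of (tool, args, succeeded) tuples (hypothetical format).
    """
    failures = Counter()
    for tool, args, succeeded in tool_calls:
        if succeeded:
            continue
        key = (tool, args)
        failures[key] += 1
        if failures[key] >= max_repeats:
            return key
    return None


# Hypothetical log: the same failing test run is retried three times.
log = [
    ("run_tests", "kernel_a", False),
    ("edit_file", "kernel_a.py", True),
    ("run_tests", "kernel_a", False),
    ("run_tests", "kernel_a", False),
]

print(find_repeated_failure(log))  # ('run_tests', 'kernel_a')
```

A check like this only catches exact repeats. A model that rephrases the same failed idea each time would slip past it, which is part of why coherent long-horizon behavior is hard to bolt on from the outside.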

Qwen3.7-Max has made "continuously doing the right thing" a reality.

Revealing the Core Technology

We understand that this leap in programming for Qwen3.7-Max likely stems from upgrades in two key training methods.

First, Environmental Expansion.

During programming training for Qwen3.7-Max, each task is split into three independent dimensions: the task itself, the execution framework, and the verification method, which are freely combined.

The same problem might be solved within the Claude Code framework, sometimes in OpenClaw, and other times with a different verification method.

The effect is like an intern being rotated through all project teams. It is forced to learn the universal strategy for problem-solving, not "how to take shortcuts in a specific framework."

This explains a counterintuitive phenomenon: Qwen3.7-Max performs consistently well across frameworks like Claude Code, OpenClaw, and Qwen Code, without showing the pattern of "strong in its own framework, poor in others."
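The three-dimension split described above amounts to sampling from a Cartesian product. The sketch below shows the idea under stated assumptions: the framework names come from the article, while the task names, verifier names, and function names are invented placeholders, and nothing here reflects Alibaba's actual training pipeline.

```python
import itertools
import random

# Framework names are taken from the article; tasks and verifiers are hypothetical.
TASKS = ["fix_failing_test", "implement_feature"]
FRAMEWORKS = ["Claude Code", "OpenClaw", "Qwen Code"]
VERIFIERS = ["unit_tests", "rubric_review"]


def all_environments(tasks, frameworks, verifiers):
    """Every (task, framework, verifier) combination."""
    return list(itertools.product(tasks, frameworks, verifiers))


def sample_environment(rng, tasks, frameworks, verifiers):
    """Pick each dimension independently, so no task is tied to one framework."""
    return (rng.choice(tasks), rng.choice(frameworks), rng.choice(verifiers))


envs = all_environments(TASKS, FRAMEWORKS, VERIFIERS)
print(len(envs))  # 12 = 2 tasks x 3 frameworks x 2 verifiers

rng = random.Random(0)  # seeded for reproducibility
print(sample_environment(rng, TASKS, FRAMEWORKS, VERIFIERS))
```

Because the dimensions are chosen independently, the number of distinct training environments grows multiplicatively with each new framework or verifier added, without writing any new tasks.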

Second, Long-Horizon Autonomous Execution.

In training, the team introduced a "Dynamic Accumulative Survival Game" framework.

This means making the model perform over a thousand steps of continuous decision-making in a continuously changing simulated environment, establishing its own hypotheses, adjusting strategies based on feedback, and avoiding "context corruption" from running too long.

Here's a telling data point: in the YC-Bench simulation of running a startup for a full year, Qwen3.7-Max achieved $2.08 million in revenue, roughly double the previous generation's $1.05 million.

More crucially, it demonstrated strategic evolution: autonomously adjusting direction mid-term during a crisis, identifying and blocking malicious clients, eventually converging to a stable execution loop.

This is what underpins the 35-hour kernel optimization case, and it explains why Qwen3.7-Max achieved a speedup in 96% of scenarios on Kernel Bench L3.

And programming is just the first battlefield. This foundation of long-horizon reasoning and tool calling points to a greater ambition: a universal Agent foundation.

The Programming Finals Have a New Disruptor

Since its launch, Code Arena has always tested hard skills: multi-step reasoning, tool orchestration, and complete project delivery, all of them real, Agent-level challenges.

Today, with a score of 1541 points, Qwen3.7-Max wedged itself into fourth place, positioned between Opus 4.6 Thinking and Opus 4.6.

On this track where Claude has dominated for over half a year, it has given its answer: Chinese models are not just followers; they can also be definers.

The global programming model competition is no longer a one-man show in Silicon Valley.

References:

https://arena.ai/leaderboard/code/webdev

This article is from the WeChat public account "AI Era Insights" (新智元), author: ASI启示录

Related Questions

Q: According to the article, what is the global ranking and score of Qwen3.7-Max on the Code Arena leaderboard?

A: According to the article, Qwen3.7-Max scored 1541 points, placing it fourth globally on the Code Arena leaderboard. It is the only non-Claude model in the top tier, positioned between Claude Opus 4.6 Thinking and Opus 4.6.

Q: What key characteristic of Qwen3.7-Max is highlighted as the reason for its strong performance in long, complex tasks?

A: The article highlights that Qwen3.7-Max is specifically positioned as an "Agent Foundation Model." It is designed for long-duration autonomous task execution. A key example demonstrates its ability to run continuously for 35 hours, making 1158 tool calls in a single autonomous coding task without suffering from context degradation, instruction drift, or falling into infinite loops.

Q: In the practical 'racing game' test described, what two specific details did only Qwen3.7-Max successfully implement compared to other models like GPT-5.5, Claude Opus 4.6, and Gemini 3.5 Flash?

A: In the 'racing game' development challenge, only Qwen3.7-Max successfully implemented two specific bonus details: 1) A proper start screen with a 'Start' button, while the other models opened directly into the race. 2) Sound effects for engine noise and collecting coins, which were requested in the prompt but only delivered by Qwen3.7-Max.

Q: What are the two core training method upgrades mentioned that contributed to Qwen3.7-Max's programming capabilities?

A: The article credits two core training method upgrades: 1) **Environmental Expansion**: During programming training, each task is decomposed into three independent dimensions (the task itself, the execution framework, and the verification method) which are freely combined. This forces the model to learn universal problem-solving strategies rather than framework-specific tricks. 2) **Long-Horizon Autonomous Execution**: The training introduced a 'Dynamic Accumulative Survival Game' framework, where the model makes over a thousand consecutive decisions in a changing simulated environment, requiring it to build hypotheses, adjust strategies based on feedback, and avoid 'context corruption' over extended periods.

Q: What does the article claim is the broader implication of Qwen3.7-Max's performance in the global AI coding competition?

A: The article claims that Qwen3.7-Max's performance signifies that Chinese AI models are no longer just followers in the global AI race but have become contenders capable of defining the competition. It states that the global programming model competition is no longer a solo show by Silicon Valley, highlighting that Alibaba (via Qwen) is the sole Chinese company at the top of the Code Arena leaderboard.
