Just Now, Chinese AI Enters Top 2 in Global Programming, Only Claude Remains Ahead

marsbit · Published on 2026-05-27 · Last updated on 2026-05-27

Abstract

**China's AI Ranks Second Globally in Programming, Trailing Only Claude** Today, Alibaba's Qwen3.7-Max achieved a score of 1541 on the Code Arena benchmark, securing fourth place globally and surpassing top models like GPT-5.5 and Gemini 3.5 Flash. It is the only non-Claude model among the top four, which puts Alibaba second among model developers, behind only Anthropic and its Opus models. Before this official ranking, Qwen3.7-Max had already gained recognition overseas. In practical tests, it outperformed rivals on tasks like creating a self-training Tetris AI and generating complex 3D models, often at a significantly lower cost. Developers praised its ability, especially when integrated with tools like Hermes Agent and OpenCode, to effectively replace models such as GPT-5.5. In a hands-on challenge to create a 3D racing game from a detailed prompt, Qwen3.7-Max delivered a playable HTML file on the first attempt, requiring only a minor bug fix. It alone included a start menu and sound effects, details missed by the other models. While competitors like Gemini 3.5 Flash and Claude Opus 4.6 produced less polished or functional versions, and GPT-5.5 had its own quirks, Qwen3.7-Max stood out for its initial completeness and playability. This performance stems from its design as an "Agent Foundation Model," built for long-duration, autonomous task execution. Internal tests show it can run continuously for 35 hours, making 1158 tool calls without context degradation or instruction drift. Key technical ...

Today, the latest Code Arena leaderboard is out!

Qwen3.7-Max, with a score of 1541 points, broke into the global top four, surpassing top-tier models like GPT-5.5 and Gemini 3.5 Flash.

Ahead of it, only Claude Opus 4.7 and Opus 4.6 variants remain.

In other words, ranked by developer rather than by individual model, Alibaba sits second in the global arena for programming models, behind only Anthropic, and it is the only Chinese player to make it to this top table.

Qwen3.7-Max Breaks into Global Top Four

The Only Non-Claude Model

Even before the Code Arena leaderboard was released, Qwen3.7-Max had already made a name for itself among overseas developer communities.

Atomic Chat conducted a head-to-head comparison, pitting Opus 4.7, GPT-5.5, and Qwen3.7-Max against each other on a task to write a self-training Tetris AI.

The result? Qwen3.7-Max outperformed both Opus 4.7 and GPT-5.5, at a token cost of just $1.32 and with a reported 56% improvement in performance.

Another overseas developer had Qwen3.7-Max build a 3D model of the universe, and the result was described as stunning.

On the task of generating a "3D Pixel Art Miniature Pagoda Model," Qwen3.7-Max also came out ahead on both output speed and quality.

Developer Paul Couvert went further, saying that once integrated with Hermes Agent and OpenCode, Qwen3.7-Max could basically replace GPT-5.5 and Opus 4.7.

Programming, A True Contender

However, scores are one thing; real-world testing is another.

We arranged a hardcore "Racing Game" challenge for Qwen3.7-Max.

Given a detailed prompt, Qwen3.7-Max quickly produced a playable HTML file.

The first version had a small bug: the A/D steering keys were reversed.

But after a second round of simple conversational tweaking, a fully featured 3D racing game was up and running.

The moment it opened, to be honest, was a bit of a shock.

Four cars race together on a 3-lap circular track, with over 100 coins scattered along it, and hitting obstacles causes slowdowns and loss of control.

The post-race results panel had everything: ranking, time, coins collected, and fastest lap.

But what was truly surprising were two details that only Qwen3.7-Max got right.

One was the start screen. Of the four models tested side by side, only it created a proper start page for the game, entering the race only after the player clicks "Start." The other three went straight into racing without even a title screen.

The other was sound effects. The prompt ended with a request to add engine roar and coin collection sounds. Of the four models, only it delivered on this bonus request, adding engine sounds and coin dings.

Now let's look at the performance of the other contestants.

Gemini 3.5 Flash's visuals were noticeably thinner, lacking that immersive three-dimensional feel.

The UI layout was also problematic: dashboard information was spread across the four corners of the screen, leaving the player with no clear visual focus.

In contrast, Qwen3.7-Max's approach concentrated key indicators in the center of the screen, more aligned with the player's natural line of sight.

Claude Opus 4.6's result was somewhat... hard to describe.

Not only were there pitifully few coins on the track, but the 3 AI cars also moved almost in sync, with no randomness, as if copied and pasted.

Finally, GPT-5.5.

Its visual quality was indeed better than the previous two, and the controls felt smoother.

But for some reason, coins were made into yellow "donuts"...

The shape is a minor issue. The key point is that Gemini 3.5 Flash, Claude Opus 4.6, and GPT-5.5 all required several rounds of bug fixes to get all functions running.

Only Qwen3.7-Max's first-round generation was basically playable.

Similar benchmark scores, solid real-world performance, and a fraction of the price. The rest is up to developers, who will vote with their feet.

The "Foundation" Model for the Agent Era

The reason Qwen3.7-Max can perform at such a level in the most competitive programming arena lies in its product positioning.

A few days ago, when Alibaba released Qwen3.7-Max, they gave it a very special label: Agent Foundation Model.

It was built from the start for long-duration autonomous task execution.

Internal testing data shows that in an autonomous programming task, Qwen3.7-Max ran continuously for 35 hours, executing 1158 tool calls.

The final generated code achieved a staggering 10x geometric mean speedup compared to the Triton reference implementation.
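For readers unfamiliar with the metric, a geometric mean speedup multiplies the per-kernel speedups together and takes the n-th root, so a single outlier kernel cannot dominate the average the way it would in an arithmetic mean. The Python sketch below only illustrates the calculation. The function name and the timings are hypothetical and are not taken from Alibaba's internal test.

```python
import math


def geometric_mean_speedup(baseline_ms, optimized_ms):
    """Geometric mean of per-kernel speedups (baseline time / optimized time)."""
    if not baseline_ms or len(baseline_ms) != len(optimized_ms):
        raise ValueError("need two non-empty lists of equal length")
    # Averaging the logs and exponentiating equals the n-th root of the product.
    log_sum = sum(math.log(b / o) for b, o in zip(baseline_ms, optimized_ms))
    return math.exp(log_sum / len(baseline_ms))


# Hypothetical kernel timings in milliseconds, for illustration only.
baseline = [8.0, 50.0, 12.5]   # reference implementation
optimized = [2.0, 2.5, 2.5]    # generated code

speedups = [b / o for b, o in zip(baseline, optimized)]
print("per-kernel speedups:", speedups)
print("arithmetic mean:", sum(speedups) / len(speedups))
print("geometric mean:", geometric_mean_speedup(baseline, optimized))
```

In this made-up data the one kernel with a 20x speedup pulls the arithmetic mean well above the geometric mean, which is why the geometric mean is the usual choice for summarizing ratios.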

Even more impressive is its "endurance": more than 30 hours into the run, the model remained sharp and kept uncovering new room for optimization.

Throughout, there was zero context degradation, zero instruction drift, and zero infinite loops!

To be clear, the difficulty isn't in the 1,000-plus tool calls themselves. Since MCP became widespread, calling tools 1,000 times isn't that rare.

The difficulty lies in 35 hours of coherent reasoning.

Most models break down on long tasks: either the context becomes increasingly messy and they forget the goals set at the beginning, or they get stuck in a loop, repeatedly attempting the same failed solution.

Qwen3.7-Max has made "continuously doing the right thing" a reality.
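From the harness side, the failure mode of retrying the same failed solution is at least easy to detect. The Python sketch below is a hypothetical guard, not something Alibaba has described: it counts identical failed tool calls and flags the run once the same call has failed a set number of times. The class name, tool name, and threshold are all assumptions made for illustration.

```python
import json
from collections import Counter


class RepeatedFailureGuard:
    """Flags an agent run that keeps retrying the same failed tool call."""

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self._failures = Counter()

    @staticmethod
    def _key(tool, args):
        # Sort keys so the same arguments always map to the same key.
        return tool + ":" + json.dumps(args, sort_keys=True)

    def record(self, tool, args, succeeded):
        """Record one call; return True if the agent looks stuck."""
        key = self._key(tool, args)
        if succeeded:
            self._failures.pop(key, None)  # a success clears the streak
            return False
        self._failures[key] += 1
        return self._failures[key] >= self.max_repeats


# Hypothetical trace: the same test command fails three times in a row.
guard = RepeatedFailureGuard(max_repeats=3)
for attempt in range(1, 4):
    stuck = guard.record("run_tests", {"path": "kernel.py"}, succeeded=False)
    print(f"attempt {attempt}: stuck={stuck}")
```

A guard like this only catches exact repeats. Noticing that a model is cycling through slightly different versions of the same bad idea is a much harder problem, which is part of why coherent 35-hour runs are notable.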

Revealing the Core Technology

We understand that this leap in programming for Qwen3.7-Max likely stems from upgrades in two key training methods.

First, Environmental Expansion.

During programming training for Qwen3.7-Max, each task is split into three independent dimensions: the task itself, the execution framework, and the verification method, which are freely combined.

The same problem might be solved in the Claude Code framework one time, in OpenClaw another time, and under a different verification method the next.

The effect is like an intern being rotated through all project teams. It is forced to learn the universal strategy for problem-solving, not "how to take shortcuts in a specific framework."

This explains a counterintuitive phenomenon: Qwen3.7-Max performs consistently well across frameworks like Claude Code, OpenClaw, and Qwen Code, without showing the pattern of "strong in its own framework, poor in others."
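The free combination of the three dimensions can be pictured as a Cartesian product. The Python sketch below is only an illustration under assumptions: the task and verifier names are invented, the framework names are the ones mentioned above, and nothing here reflects Alibaba's actual training pipeline.

```python
import itertools
import random

# Hypothetical task and verifier names; the frameworks are those named in the article.
tasks = ["fix_failing_test", "add_cli_flag"]
frameworks = ["Claude Code", "OpenClaw", "Qwen Code"]
verifiers = ["unit_tests", "output_diff"]


def build_environments(tasks, frameworks, verifiers):
    """Every (task, framework, verifier) combination becomes one training environment."""
    return list(itertools.product(tasks, frameworks, verifiers))


environments = build_environments(tasks, frameworks, verifiers)
print(len(environments), "environments from",
      len(tasks), "tasks,", len(frameworks), "frameworks,", len(verifiers), "verifiers")

# Sample a small batch with a fixed seed so the same task can show up
# under different frameworks and verification methods.
rng = random.Random(0)
for task, framework, verifier in rng.sample(environments, k=4):
    print(f"{task} | {framework} | {verifier}")
```

Because every task is paired with every framework, a shortcut that only works in one framework stops paying off, which matches the intern-rotation analogy above.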

Second, Long-Horizon Autonomous Execution.

In training, the team introduced a "Dynamic Accumulative Survival Game" framework.

This has the model make over a thousand consecutive decisions in a constantly changing simulated environment, forming its own hypotheses, adjusting its strategy based on feedback, and avoiding the "context corruption" that comes from running too long.

Here's a telling data point: in the YC-Bench simulation of running a startup for a full year, Qwen3.7-Max achieved $2.08 million in revenue, roughly double that of the previous generation ($1.05 million).

More crucially, it demonstrated strategic evolution: autonomously adjusting direction mid-term during a crisis, identifying and blocking malicious clients, eventually converging to a stable execution loop.

This is what underpins the 35-hour kernel optimization case, and it explains why, on Kernel Bench L3, Qwen3.7-Max achieved a speedup in 96% of scenarios.

And programming is just the first battlefield. This foundation of long-horizon reasoning and tool calling points to a greater ambition—a universal Agent foundation.

The Programming Finals Have a New Disruptor

Since its launch, Code Arena has always tested hard skills: multi-step reasoning, tool orchestration, complete project delivery—all real, Agent-level challenges.

Today, with a score of 1541 points, Qwen3.7-Max wedged itself into fourth place, positioned between Opus 4.6 Thinking and Opus 4.6.

On this track where Claude has dominated for over half a year, it has given its answer: Chinese models are not just followers; they can also be definers.

The global programming model competition is no longer a one-man show in Silicon Valley.

References:

https://arena.ai/leaderboard/code/webdev

This article is from the WeChat public account "AI Era Insights" (新智元), author: ASI启示录
