Over the past decade, the advancement of AI has primarily relied on one path: feeding more data and computing power into larger models, allowing experience to accumulate within neural network parameters. This path has led to the leap in large models after ChatGPT, but it has also left behind a persistent challenge: as models become increasingly powerful, the reasons behind their successes and failures often remain difficult to explain and correct.
Recent experiments by OpenAI engineer Weng Jiayi suggest another possibility: given a clear objective, a runnable environment, and a feedback loop, AI can improve not only by training models but also by "autonomously modifying code."
On May 8, 2026, Weng Jiayi systematically documented this set of experiments in his personal blog "Learning Beyond Gradients," releasing the code repository, CSV experiment logs, and video replays alongside it. He has long focused on reinforcement learning and post-training infrastructure, participated in the initial launch of ChatGPT, and contributed to projects including GPT-4, GPT-4 Turbo, GPT-4o, the o-series, and GPT-5. Before joining OpenAI, he earned his bachelor's degree from the Department of Computer Science at Tsinghua University and his master's degree from Carnegie Mellon University. He is also a core author of the open-source reinforcement learning library Tianshou and the high-performance parallel environment engine EnvPool.
Image: Generated by AI
He had Codex repeatedly write policy code, run environments, read logs, review replays, and locate failures, then modify the code, add tests, and continue evaluating. After many iterations, Codex "cultivated" a set of pure Python programmatic strategies: it reached the theoretical perfect score of 864 points in Atari Breakout, and in robot-control simulation environments such as MuJoCo Ant and HalfCheetah it produced results close to those of common deep reinforcement learning algorithms.
The truly significant aspect of these experiments lies in a core question: When the coding agent is sufficiently capable, must learning necessarily occur within neural network weights?
In this experimental setup, experience is written into code, tests, logs, and replays, becoming a software system that can be read, modified, reviewed, and audited. If this direction continues to hold, the next step for Agentic AI might not only be training larger models but also enabling models to participate in maintaining a continuously evolving engineering system.
01
From 387 Points to a Perfect Score: An Engineering Loop
Weng Jiayi wrote in his blog that the starting point for this experiment was actually an engineering need. While maintaining EnvPool in his spare time, he needed a way to test whether a game environment was functioning correctly without running a neural network every time, since putting neural networks in CI is too expensive. The original question was: could he write cheap, reproducible heuristic rules, clearly better than a random policy, that drive the environment into information-rich states?
He used Codex (base model gpt-5.4) to attempt a completely rule-based version. The initial prompt was very direct: "Write a strategy that can solve Breakout." The result was unsatisfactory, and a low score by itself carried no information: the action semantics could be wrong, the state detection could be wrong, the evaluation process could be wrong, or the policy structure itself could simply be too weak.
Subsequently, Weng Jiayi changed the task format. He no longer asked Codex to simply deliver a policy.py file; instead, he required it to maintain a complete loop: probe actions and observations, write state detectors, write the policy, run complete episodes, record trials.jsonl and summary.csv, generate videos or curves, inspect failure modes, modify the policy, simplify code, and run regressions.
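In code terms, that loop is little more than an evaluation harness wrapped around an editable policy file. Below is a minimal sketch, assuming a Gym-style environment (the `gymnasium` API with an ALE Breakout environment installed) and a placeholder `act` function standing in for the policy Codex keeps rewriting; the file names `trials.jsonl` and `summary.csv` come from the blog, everything else is illustrative:

```python
import csv
import json

import gymnasium as gym  # assumes gymnasium + ale-py are installed


def act(obs):
    """Placeholder policy. In the real loop, this is the file Codex keeps editing."""
    return 0  # NOOP


def run_episodes(env_id="ALE/Breakout-v5", n_episodes=3, seed=0):
    env = gym.make(env_id)
    records = []
    for ep in range(n_episodes):
        obs, _info = env.reset(seed=seed + ep)
        ep_return, steps, done = 0.0, 0, False
        while not done:
            obs, reward, terminated, truncated, _info = env.step(act(obs))
            ep_return += reward
            steps += 1
            done = terminated or truncated
        rec = {"episode": ep, "seed": seed + ep, "return": ep_return, "steps": steps}
        records.append(rec)
        # per-trial log the agent reads back when hunting for failure modes
        with open("trials.jsonl", "a") as f:
            f.write(json.dumps(rec) + "\n")
    # aggregate summary for quick before/after comparison across code changes
    with open("summary.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)
    env.close()
    return records
```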
The experimental log for Breakout clearly recorded this process. In the first round, Codex confirmed the action space and observation shape, identified the colors of the ball, paddle, and bricks from the RGB frames, and then used those image-derived labels to scan the 128-byte Atari RAM for the bytes encoding the same state. The initial baseline scored only 99 points; after adding tunnel-offset logic, the score rose to 387.
387 points was a deceptively high local optimum. The strategy could stably hit the ball, but the ball path was trapped in a periodic loop: no lives were lost, but no new bricks could be broken, and the score was stuck. If a human were writing the code, they might continue fine-tuning the "accuracy of hitting the ball." Codex watched the video and the last few dozen steps of the trajectory, and identified the problem as a lack of disturbance in the ball's path.
Image: Atari Breakout gameplay. The player controls the bottom paddle to bounce a ball, breaking layers of colored bricks above. Codex achieved the theoretical perfect score of 864 points in this game.
Codex then added a mechanism to "break the cycle": if no reward was received for a long time, periodically add an offset to the landing point prediction to knock the ball out of the local loop. The score jumped from 387 to 507. During further iterations, a new problem emerged: for fast low balls, conventional interception would cause the paddle to "over-lead" and drift away. Codex added a `fast_low_ball_lead_steps=3` parameter, and the score jumped from 507 to 839. The final improvement from 839 to 864 resembled maintaining an already complex system: trying deadband, serve offset, stuck offset, brick balance bias, lookahead steps; many directions were ineffective. The final useful change was a late-stage condition: "After the first wall of bricks is cleared, enable the stuck offset only when the ball is far from the paddle, and gradually release it when the ball is close."
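The mechanisms in the log are easy to picture as code. The sketch below is a reconstruction, not Weng's actual policy: a geometric controller that projects the ball down to paddle height, plus the cycle-breaking offset and the fast-low-ball adjustment described above. Only the parameter name `fast_low_ball_lead_steps` and its value 3 come from the log; every other name, threshold, and sign convention is guessed:

```python
NOOP, FIRE, RIGHT, LEFT = 0, 1, 2, 3   # Breakout's minimal ALE action set


def predict_landing_x(ball_x, ball_y, vx, vy, paddle_y, width):
    """Project the ball down to paddle height, unfolding bounces off the side walls."""
    if vy <= 0:                        # ball moving up: hold position for now
        return ball_x
    x = ball_x + vx * (paddle_y - ball_y) / vy
    x %= 2 * width                     # unfold reflections into a 2*width period
    return 2 * width - x if x > width else x


def paddle_action(s, cfg):
    target = predict_landing_x(s.ball_x, s.ball_y, s.vx, s.vy, s.paddle_y, s.width)
    # cycle breaker: after a long rewardless stretch, bias the intercept point
    if s.steps_since_reward > cfg.stuck_after and s.ball_y < s.paddle_y - cfg.far:
        target += cfg.stuck_offset
    # fast low balls: adjust the intercept by a few steps of ball motion
    # (trigger condition and sign guessed; only the name and value are from the log)
    if s.vy > cfg.fast_vy and s.ball_y > cfg.low_y:
        target += s.vx * cfg.fast_low_ball_lead_steps   # = 3 in the final config
    if target < s.paddle_x - cfg.deadband:
        return LEFT
    if target > s.paddle_x + cfg.deadband:
        return RIGHT
    return NOOP
```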
The final RAM default configuration stably output 864 / 864 / 864 points across three episodes, reaching the theoretical limit of Breakout. Codex then migrated the same geometric controller to a pure vision input version—without reading RAM, relying solely on RGB segmentation to identify the paddle, ball, and brick balance. The vision version initially scored 310 points, then 428 points, and reached 864 points after the seventh local episode, corresponding to 14,504 local policy environment steps.
Image: Sample efficiency curve of Codex on Breakout. The blue line is the version that reads game memory (RAM), and the red line is the vision-only version (Vision). The RAM version experienced several jumps: 99 → 387 → 507 → 839 → 864, finally reaching the perfect score for the first time at episode 81, with a cumulative 1.5 million environment steps; the Vision version, migrating the mature structure from the RAM version, reached 864 points with only 7 episodes and approximately 14,500 environment steps.
Weng Jiayi specifically noted that this should not be understood as "the vision input started from scratch and reached a perfect score using only 14.5K steps." The actual process was that Codex first discovered the geometric controller, cycle-breaking mechanism, and late-stage offset release in the RAM version. Once the structure was stable, the state reading layer was switched from RAM to RGB. The 14.5K steps represent the migration budget for the vision version.
02
Defining Heuristic Learning
Finding a name for this evolving "software policy" was harder than writing the first version of the policy. Weng Jiayi eventually named the process Heuristic Learning (HL) and called the object it maintains a Heuristic System (HS).
By his blog's definition, an HL policy is made of program code. Like today's common deep reinforcement learning, HL has a loop of state, action, feedback, and update. The difference is that the object being updated is software structure, not neural network parameters; the feedback, digested by the coding agent, can come from environment rewards, test cases, logs, videos, replays, or human feedback; and the update is not backpropagation but the coding agent directly editing the policy, state detectors, tests, configurations, or memories.
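The contrast with gradient descent can be caricatured in a few lines: the "update" is an accept-or-reject decision on a proposed edit, not a parameter step. In the toy sketch below, `propose_edit` stands in for the coding agent and the policy is reduced to an editable parameter table; all names and the toy objective are hypothetical:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def evaluate(policy):
    """Stand-in for running full episodes over fixed seeds and reading the score."""
    # Toy objective: pretend the best config is lead=3, deadband=2.
    return -abs(policy["lead"] - 3) - abs(policy["deadband"] - 2)


def propose_edit(policy):
    """Stand-in for the coding agent. In HL this is a code diff informed by
    logs, replays, and failing tests, not a gradient step."""
    candidate = dict(policy)
    key = random.choice(list(candidate))
    candidate[key] += random.choice([-1, 1])
    return candidate


def heuristic_learning(policy, iterations=50):
    best = evaluate(policy)
    for _ in range(iterations):
        candidate = propose_edit(policy)
        score = evaluate(candidate)
        if score > best:                      # keep the edit only if evaluation improves;
            policy, best = candidate, score   # rejected diffs still leave logs behind
    return policy, best


print(heuristic_learning({"lead": 0, "deadband": 0}))
```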
It should be added that the concept of "using programs rather than neural networks as policies" did not originate with Weng Jiayi. Academic work on programmatic RL has been going on for years: the PROPEL framework from Rice University and Caltech in 2019 studied reinforcement learning with policies represented as short programs in a symbolic language; the 2021 LEAPS work went further by learning a program embedding space, combining differentiable program policies with RL training; HPRL (Hierarchical Programmatic Reinforcement Learning), presented at ICML 2023, lets a meta-policy compose multiple programs; and the 2024 LLM-GS framework from National Taiwan University and Microsoft uses an LLM's programming ability and commonsense reasoning to guide the search for programmatic RL policies.
The consensus from this line of research is that, compared with neural policies, programmatic policies offer better interpretability, formal verifiability, and generalization to unseen scenarios.
Weng Jiayi's substantive contribution this time is treating the coding agent as the engineering channel for maintaining the heuristic system. Past programmatic RL relied either on hand-designed domain-specific languages or on search algorithms over restricted program spaces; Weng Jiayi instead uses Codex to fold code, logs, tests, video replays, and parameter tuning into a single agent workflow, drastically cutting the iteration cost of programmatic policies. In other words, he is arguing for a new engineering path: once the coding agent is capable enough, heuristic strategies that were once deemed "too expensive to maintain" may become cost-effective again.
Weng Jiayi provided a comparison table in his blog to spell out the differences between HL and deep RL. Policy form: rules, state machines, controllers, model predictive control (MPC), and macro actions composed into code, versus neural network parameters. State form: explicit variables, detectors, and caches, versus observation vectors read by the network. Feedback form: tests, logs, and replays all count as valid signals, versus a primarily fixed reward function. Memory form: explicitly stored trials, summaries, failure reasons, and version diffs, versus essentially nothing in on-policy algorithms and a replay buffer in off-policy ones.
This comparison demonstrates that HL possesses some engineering attributes: the policy is interpretable and can be translated into natural language; sample efficiency is measured in units of "one effective code change," not slow gradient updates; old capabilities can become regression tests, fixed-seed replays, or golden cases; overfitting to training seeds or test loopholes can be constrained through simplification, regression checks, and multi-seed evaluation; old capabilities don't have to reside solely in weights but can also reside in rule sets and tests, which partly addresses the catastrophic forgetting problem that neural networks have long struggled to solve.
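The regression-test point in particular maps onto completely standard tooling. A minimal pytest-style sketch, where `run_episode` is a hypothetical harness and the golden scores are pinned from past runs:

```python
import pytest


def run_episode(env_id, seed):
    """Hypothetical harness: run the current policy for one fixed-seed episode.
    Stubbed here; in the real loop this replays the environment."""
    return 864


# Golden scores pinned when the capability was first achieved. A later patch
# that silently breaks an old behavior now fails CI, instead of vanishing the
# way a regressed capability vanishes inside retrained weights.
GOLDEN = {0: 864, 1: 864, 2: 864}


@pytest.mark.parametrize("seed,expected", sorted(GOLDEN.items()))
def test_breakout_golden(seed, expected):
    assert run_episode("ALE/Breakout-v5", seed=seed) >= expected
```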
03
Bulk Validation on Atari57: Boundaries and Shortcomings
If focusing only on Breakout, the story could easily be simplified to "AI wrote a perfect strategy." But Weng Jiayi didn't stop at Breakout; he scaled this Codex workflow in bulk to Atari57, running 57 games, two observation modes, and three repetitions each, totaling 342 "unattended" search trajectories.
The experimental design was quite rigorous. Each game was tested with two input methods, one reading game memory directly and the other seeing only the screen, and each method was independently repeated three times, yielding the 342 unattended trajectories. Every Codex agent received the same prompt template and explored actions, wrote code, ran experiments, and recorded results entirely on its own, with no human hints. The constraints were strict: no training neural networks, no reading game source code, no exploiting hidden information, and every step spent on debugging and trial-and-error counted toward the total cost, precisely to prevent Codex from "peeking at the answers" in any way.
To measure results, the field commonly uses a metric called HNS (Human-Normalized Score): each game's raw score is rescaled so that a random policy maps to 0 and average human player performance maps to 1, making scores comparable across games.
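Concretely, HNS rescales a raw score between two published per-game reference points, the random-policy score and the average-human score. A one-function sketch (the Breakout reference values in the comment are the commonly cited ones from the DQN literature):

```python
def human_normalized_score(score, random_score, human_score):
    """HNS: 0.0 at random-policy level, 1.0 at average-human level."""
    return (score - random_score) / (human_score - random_score)


# Breakout reference points as reported in the DQN literature (random 1.7,
# human 31.8); a score of 864 is therefore far above 1.0, i.e. superhuman.
print(human_normalized_score(864, 1.7, 31.8))  # ~28.6
```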
Image: Sample efficiency comparison on the full Atari57 suite. The x-axis is environment steps (log scale), and the y-axis is HNS (Human-Normalized Score, where 1.0 indicates reaching average human player level). Codex's vision input version (red line) significantly outperforms the PPO baseline (blue/gray dashed lines) in early-stage efficiency, reaching 0.81 at 9.7 million steps, comparable to PPO's level around 10 million steps; Codex's memory input version (purple line) converges at 0.59.
Measured by this standard, Codex's early-stage efficiency appears quite impressive. With only 1 million environment steps consumed, Codex's median HNS for vision input had already reached 0.32, and for memory input, 0.26, significantly higher than that of classical reinforcement learning algorithms like PPO at the same stage. By 9.7 million steps, Codex's vision version reached 0.81, already close to PPO's level of approximately 0.88 to 0.92 at 10 million steps. If allowed to aggregate by selecting the better-performing input method for each game, Codex's median HNS was 0.83, OpenAI Baselines PPO2 was 0.80, and CleanRL EnvPool PPO was 0.98—essentially a tie.
However, Weng Jiayi himself drew a sober boundary: this compares only environment-interaction efficiency, without accounting for the cost of Codex reading logs, writing code, and watching videos. "Running fast" does not equal "low total cost," and the latter remains a black box for now.
More noteworthy is that Codex's performance across the 57 games was not uniform. In games with clear geometric structures like Breakout, Boxing, and Krull, both heuristic strategies and deep reinforcement learning could significantly surpass human levels; in games with clear rules like Asterix, Jamesbond, and Tennis, heuristic strategies were even stronger; but in fast-paced, complex-pattern games like Atlantis, VideoPinball, RoadRunner, and StarGunner, PPO still dominated.
The most cautionary counterexample is Montezuma's Revenge. This is a notorious "hard nut to crack" in reinforcement learning, where the protagonist needs to find keys, avoid enemies, and open doors in a complex underground labyrinth, with extremely sparse reward signals—a classic "long-term planning + failure recovery" challenge. Codex did score 400 points in this game, but examining the policy file it generated reveals that it's not a true "strategy" but a hardcoded sequence of 86 actions corresponding to 1,769 environment steps: more like memorizing a fixed route than learning to navigate a maze. Weng Jiayi specifically noted: "This is a boundary case and should not be understood as a generic Montezuma strategy."
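The shape of such a file is worth seeing, because it makes the limitation obvious. The stub below is invented for illustration (the real sequence of 86 actions spanning 1,769 steps lives in Weng's repo); the point is that the "policy" indexes a script by step count and never reads the observation:

```python
# An open-loop script: (action, repeat) pairs consulted by step counter only.
SCRIPT = [(2, 12), (0, 4), (3, 30), (1, 2)]   # illustrative entries, not the real route
FLAT = [a for a, n in SCRIPT for _ in range(n)]


def act(obs, t):
    """Ignores the observation entirely: perturb the run mid-route and the
    'policy' has no way to recover, which is why it does not generalize."""
    return FLAT[t] if t < len(FLAT) else 0    # NOOP once the script runs out
```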
Montezuma exposes the expressive limits of Heuristic Learning. Ordinary programmatic strategies are essentially reactive logic of "when you see this state, do that action," and they struggle with tasks that demand strict action sequences, resuming a plan from an intermediate state, and long-horizon planning. Such tasks require not just more if-else statements but program structures closer to "macro-action composition + recoverable search state + long-term memory." The lesson is that even a very capable coding agent cannot fit some problems into ordinary reactive code.
04
If the Paradigm Holds, What Are the Industrial Implications?
Zooming out to an industrial perspective: if the Heuristic Learning path truly holds, meaning coding agents can stably maintain programmatic strategies that surpass handcrafted rules and approach RL baselines, where does its practical significance lie?
The first application point is robot control, especially in structurally stable scenarios. The framework Weng Jiayi outlined in his blog is a hierarchical division of labor: joint-level HL, limb-level HL, full-body balance HL, and task-level HL. Lower levels handle safety and low-latency control, middle levels handle gait and contact, and higher levels handle tasks and long-term memory. The coding agent doesn't need to "understand walking"; it acts more like an update channel bolted onto the system, receiving failure videos, sensor streams, and simulation results, and rewriting that feedback into code, parameters, protection rules, and memories.
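One plausible way to wire that hierarchy, sketched under heavy assumptions (the layer names come from the blog; the composition and all code names are invented): each layer is cheap rule-based code whose rules the agent edits offline, and a control tick threads state through the layers from slow to fast:

```python
class Layer:
    """One HL layer: cheap rule-based control whose rules and parameters the
    coding agent rewrites offline from failure videos, logs, and sim results."""

    def __init__(self, rules, default):
        self.rules = rules            # editable (condition, action) pairs, not weights
        self.default = default

    def step(self, x):
        for cond, action in self.rules:
            if cond(x):
                return action(x)
        return self.default(x)        # fall through to a safe default


def control_tick(state, task, balance, limb, joint):
    """One tick of the hierarchy: higher layers set goals, lower layers run
    fast and enforce safety. The wiring is a guess at one plausible composition."""
    goal = task.step(state)                  # task level: long-horizon decisions
    gait = limb.step((state, goal))          # limb level: gait and contact
    safe_cmd = balance.step((state, gait))   # full-body level: balance envelope
    return joint.step((state, safe_cmd))     # joint level: low-latency actuation
```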
In scenarios like warehouse AGVs, inspection robots, factory robotic arms, and standardized sorting, where the environment structure is relatively fixed and the safety boundaries are clear, core control strategies could be solidified into lightweight code. Robots would then not need to run a large policy network for every action step; deployment-side reliance on high-power GPU inference cards would decrease, with more of the load handled by traditional controllers and local program logic.
This doesn't mean robots don't need GPUs; perception, localization, mapping, and semantic understanding still rely on neural networks. What changes is the role of the GPU, shifting from "burning compute for end-to-end action decisions every second" to "playing a periodic role in perception, offline simulation, policy generation, and anomaly analysis."
The second application point is the auditability of safety-critical scenarios. The most troublesome engineering problem with neural policies is the inability to locate the cause after a failure. When a robotic arm suddenly fails at a certain angle, a vehicle misjudges in an edge case, or a medical robot acts abnormally in a rare posture, engineers cannot answer "which weight caused this error." Ultimately, they can only add data, retrain, run regression tests, and bet that the new model hasn't introduced new problems.
If the policy exists in code form, state variables, conditional branches, failure logs, and regression tests are all visible; a dangerous action can be hardcoded to be prohibited, a corner case can be written as a test, and an erroneous state transition can be individually patched. This doesn't make the system inherently safer, but it allows safety issues to enter normal software engineering workflows for the first time—they can be code-reviewed, intercepted by CI, and responded to by SRE on-call. In fields requiring regulation and liability division, like autonomous driving, industrial robotic arms, and medical robots, this auditability itself is of commercial value.
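What "hardcode a prohibition" means in practice is a guard in front of the policy plus a test that pins it. A minimal sketch with invented limit values and helper names:

```python
JOINT_LIMIT = 2.3   # rad; illustrative safety envelope, not a real robot spec


def pushes_outward(torque, angle):
    """Hypothetical helper: does this command drive the joint further out?"""
    return torque * angle > 0


def guarded(torque, angle):
    """Hard veto in front of the policy: one reviewable, testable line of code,
    rather than a behavior buried in weights."""
    if abs(angle) > JOINT_LIMIT and pushes_outward(torque, angle):
        return 0.0                  # brake instead of executing the command
    return torque


def test_limit_is_vetoed():
    # the corner case, written down as a regression test CI can intercept
    assert guarded(torque=1.0, angle=2.5) == 0.0
    assert guarded(torque=-1.0, angle=2.5) == -1.0   # moving back inward is allowed
```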
The third application point is the engineering of continual learning and online learning. Weng Jiayi presented this as the main argument of the entire blog post. Catastrophic forgetting in neural networks is a structural problem: learning new things washes away old capabilities. HL also experiences forgetting, but in a more engineering form: a new rule fixes one failure mode but breaks an old scenario; a new memory repeatedly leads the agent in the wrong direction; a test range is too narrow, and the policy learns to exploit it; a patch modifies a shared interface, and old calling paths silently fail.
These problems don't disappear automatically, but they are issues that software engineering has dealt with for decades, with existing toolchains—regression testing, version diffs, fixed-seed replays, golden traces, and explicitly recorded failure directions.
A healthy HS must perform two operations simultaneously: absorbing new feedback and compressing historical patches. An HS that only grows without reduction will eventually become a "code ball of mud" no one dares to touch. In other words, HL transforms the mathematical problem of "how to update parameters" into the engineering problem of "how to maintain a software system that continuously absorbs feedback."
The latter is not necessarily easier, but it is closer to the existing boundaries of human capability.
The fourth application point is capability accumulation in Agent products. What current Agent products lack most are stable tool invocation, reliable execution chains, reusable failure experience, and auditable task records. If HL's logic holds, the experience an Agent accumulates during execution would settle into code assets reusable across sessions, users, and tasks. Such assets plug directly into existing DevOps processes, and they also mean that Agents from different companies and teams could share heuristics without sharing models, something the neural network approach cannot offer.
However, it must be emphasized: all four application points depend on further validation of the HL path on more complex tasks. Breakout and Ant are relatively clean environments. Real robots face changes in ground friction, lighting variations, actuator delays, and sensor noise—none of which have been systematically evaluated in public materials. The Montezuma counterexample has already shown that long-horizon tasks require program forms beyond ordinary if-else. How far this vision can go depends on the next phase of experiments.
05
Technical Debt Shifts from Weights to Code
Weng Jiayi's assessment in his blog is measured. He wrote that HL cannot accomplish everything neural networks can do; it is limited by what code can express, especially in complex perception and long-horizon generalization. With today's understanding, he cannot imagine an agent using pure Python code without any neural networks to solve ImageNet. The truly worthwhile question is how to combine neural networks with HL to jointly address Online Learning and Continual Learning.
The division of labor he proposes borrows from the System 1 / System 2 framework: specialized shallow neural networks take on part of System 1, responsible for fast perception, classification, and object state estimation; HL also takes on part of System 1, responsible for processing fresh data, rules, tests, replays, memories, safety boundaries, and local recovery; the LLM agent acts as System 2, providing feedback and improvement data to HL, and periodically extracting information from data generated by HL to update itself.
If deep learning over the past decade has proven that "experience can be compressed into weights," then the hypothesis Weng Jiayi proposes this time is another proposition: in the era of coding agents, experience might once again become readable, modifiable, testable software.
This article is from the WeChat public account "Tencent Technology," author: Xiao Jing, editor: Xu Qingyang