Codex Goal Mode Usage Guide: How to Make AI Continuously Pursue a Specific Objective

marsbit. Published on 2026-06-06. Last updated on 2026-06-06.

Abstract

OpenAI's Codex "goal mode" (/goal) transforms the AI from a reactive code assistant into a proactive execution agent capable of working autonomously for hours or even days to achieve a defined objective. To maximize its effectiveness, follow these key principles:

1. **Define Clear, Verifiable Exit Criteria:** The goal prompt should be a concise, measurable success condition, not a lengthy specification. Use quantifiable metrics like "reduce build time by 30%" or "achieve 100% test parity."

2. **Provide Initial Guidance and Tools:** Direct Codex toward likely problem areas and specify available tools (e.g., browsers, testing environments) to prevent it from exploring unproductive paths.

3. **Enable Progress Measurement:** Equip Codex with ways to track advancement, such as comparison tools for visual tasks or evaluation sets, so that it can gauge its own progress.

4. **Use a Realistic Execution Environment:** For tasks like performance optimization, provide access to environments that closely mimic production (e.g., similar configs, databases) to yield valid results.

5. **Be Cautious with Visual Goals:** Avoid vague "pixel-perfect" instructions. Instead, supplement visual references with functional checklists or design system specifications to prevent Codex from obsessing over minor details.

6. **Implement Progress Tracking:** For long-running tasks, have Codex commit code to draft PRs...

Editor's Note: This article is by Dominik Kundel of OpenAI's Developer Relations team and summarizes his experience with Codex's goal mode (/goal). It is less about a prompting technique than about a role shift in AI programming tools: Codex is no longer only a code assistant that responds to single-turn instructions; it is becoming an agent that keeps working toward a clear goal.

In /goal mode, what matters is not writing longer and more detailed requirements but giving Codex clear, verifiable exit criteria. Examples include "reduce deployment time by 30%," "achieve 100% test parity," and "lower LCP below 2.5 seconds." Such metrics let Codex judge whether the task is complete and keep it from endless trial and error against a vague objective. Users also need to provide enough direction, tools, and a realistic environment so that Codex can measure progress and verify results, rather than settling for a solution that only seems to work locally or under hypothetical conditions.

The article particularly warns that visual tasks are most likely to trap Codex in a quagmire of details. Instead of demanding "100% pixel-perfect replication," it's better to break down visual objectives into functional checklists, design system specifications, and evaluable metrics. For long-term tasks spanning hours or even days, continuous tracking is also needed via commits, draft PRs, progress documents, Slack updates, or side chats to avoid ending up with a pile of untraceable changes.

The informational value of this article lies in redefining /goal as a "long-term task management mechanism." When AI can execute continuously for dozens or even hundreds of hours, the developer's core competency also shifts: not just making AI generate code, but defining goals for it, establishing measurement systems, configuring execution environments, and finally performing review and reflection. In other words, AI programming is moving from "writing prompts" to "managing a continuously working engineering executor."

The following is the original text:

We launched goal mode (or /goal) to help you get Codex to continuously advance towards a specific outcome. After you set a goal, Codex will keep working until the goal is achieved—whether that takes a few hours or a few days. People have already had Codex work on the same goal for over 120 consecutive hours.

Goal mode is extremely powerful. To maximize its effectiveness, here are 7 things worth noting when using /goal.

Set Clear, Verifiable Criteria

The prompt you enter when activating goal mode serves both as the initial prompt and, more importantly, as the exit criteria for this goal. After each round of work, Codex will check: has this goal been completed?

Therefore, your goal prompt shouldn't be overly long but should focus on a clear criterion: under what conditions can this goal be considered achieved.

In most cases, a good goal includes a specific numerical metric that the model can use to judge completion. For example:

"Reduce build and deployment time by 30%."

"Migrate this feature from TypeScript to Rust, achieving 100% test parity."

"Optimize the application scaffolding so that the Largest Contentful Paint (a metric measuring the speed of loading the main page content) in production is below 2.5 seconds."

The prompt doesn't always need to include numbers, but numbers generally make the later steps easier.

If you're still unsure how to define the goal, or want to brainstorm the project with Codex first, you don't have to start the conversation in goal mode.
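To make the idea of a numeric exit criterion concrete, here is a minimal Python sketch of a goal such as "reduce build and deployment time by 30%" expressed as a check. The class name `ReductionGoal`, its fields, and the 600-second baseline are all hypothetical illustrations, not part of Codex or /goal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReductionGoal:
    """Hypothetical exit criterion of the form 'reduce X by N percent'."""

    baseline_seconds: float  # measurement taken before any work started
    target_reduction: float  # 0.30 means "30% faster than the baseline"

    def threshold_seconds(self) -> float:
        # Largest measurement that still counts as success.
        return self.baseline_seconds * (1.0 - self.target_reduction)

    def is_met(self, measured_seconds: float) -> bool:
        # The goal is complete once a measurement is at or below the threshold.
        return measured_seconds <= self.threshold_seconds()


# Illustrative numbers only: a 600 s baseline and a 30% target.
goal = ReductionGoal(baseline_seconds=600.0, target_reduction=0.30)
print(goal.is_met(450.0))  # 25% faster, not enough
print(goal.is_met(400.0))  # about 33% faster, goal reached
```

A goal phrased this way gives every round of work the same yes-or-no question to answer, which is the property the examples above share.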

Codex can set its own goals. You can start a normal conversation first, and when you're ready for Codex to start executing, then have Codex set the goal based on the preceding discussion.

You can also edit the goal at any time: click the edit button in the Codex app, or use /goal again in the CLI.

Provide Guidance Whenever Possible

A prompt like "Reduce build and deployment time by 30%" sounds cool and might lead Codex to some creative solutions. But if you already have a rough idea of where the problem might lie, this kind of prompt could also send Codex down the wrong path.

So, whenever possible, it's best to tell Codex where to start investigating, which tools can be used to accomplish the goal, or give other hints to prevent it from heading in the wrong direction.

For example, my colleague @reach_vb did this in an experiment: he told Codex it could use the Chrome browser to access Google Colab and explained some acceptable constraints, such as allowing it to generate its own dataset when training models.

Similarly, if you want to shorten build times and already know where most of the time is being spent, it's best to point Codex to that area in the prompt first.

Another approach is to let Codex do some preliminary research in plan mode first and have it create a plan file to record potential solutions. Then, have your goal reference this plan.

Make Progress Measurable

If your goal is ambitious, or Codex has many ways to approach it gradually, you need to give Codex the tools to measure progress.

For some tasks, this comes for free. When optimizing build times or improving test coverage, for example, Codex can usually already use the relevant tools or will naturally create them.

But for other goals, you're better off brainstorming with Codex first: which tools help judge progress? Or give it some hints about how it can confirm whether it's moving towards the goal. For instance, creating a visual diff tool for two screenshots, or creating an evaluation set for the agent you're debugging.

I once asked Codex to recreate some components based on a video, and Codex created a tool for itself to compare screenshots and check for differences. Later, it continuously iterated on this tool, adding different diff modes.
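The article does not show the tool Codex built, so the following is only a rough sketch of what a simplest-possible screenshot diff could look like. It assumes images are already decoded into rows of RGB tuples (a real tool would load them with an imaging library), and the name `diff_ratio` and its `tolerance` parameter are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels, assumed already decoded


def diff_ratio(a: Image, b: Image, tolerance: int = 0) -> float:
    """Return the fraction of pixels where any channel differs by more than tolerance."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have identical dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    changed = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            if any(abs(ca - cb) > tolerance for ca, cb in zip(pixel_a, pixel_b)):
                changed += 1
    return changed / total


white, black = (255, 255, 255), (0, 0, 0)
reference = [[white, white], [white, black]]
candidate = [[white, white], [black, black]]
print(diff_ratio(reference, candidate))  # 0.25: one of four pixels differs
```

A single number like this gives Codex something to drive down over successive iterations. The different diff modes mentioned above would be refinements on the same idea.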

Depending on the task, you also need to consider whether there are additional criteria that need to be measured or checked. Otherwise, Codex might think the task is complete, but from your perspective, it's not actually finished.

For example, Codex might directly crop the design reference image and embed it into the page to achieve "pixel-perfect" replication of a certain UI; or it might reduce the test coverage to make the test pass rate reach 100%. These are not the completion methods you actually want.
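One way to guard against the second shortcut is to check the headline metric together with the things that must not get worse. The Python sketch below is an illustration under assumptions: the `SuiteRun` record and the `goal_reached` function are hypothetical names, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteRun:
    """Hypothetical summary of one test-suite run."""

    total: int       # number of tests collected
    passed: int      # number of tests that passed
    coverage: float  # line coverage as a fraction between 0.0 and 1.0


def goal_reached(baseline: SuiteRun, current: SuiteRun) -> bool:
    """A 100% pass rate only counts if no tests or coverage were sacrificed."""
    all_pass = current.total > 0 and current.passed == current.total
    no_tests_removed = current.total >= baseline.total
    coverage_held = current.coverage >= baseline.coverage
    return all_pass and no_tests_removed and coverage_held


# Illustrative numbers only.
baseline = SuiteRun(total=200, passed=180, coverage=0.82)
gamed = SuiteRun(total=150, passed=150, coverage=0.61)   # failing tests deleted
honest = SuiteRun(total=205, passed=205, coverage=0.84)  # failures actually fixed

print(goal_reached(baseline, gamed))   # False
print(goal_reached(baseline, honest))  # True
```

Stating such guard conditions in the goal itself, or in a script the goal refers to, leaves less room for a result that satisfies the letter of the goal but not its intent.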

Create a Realistic Environment

If you want Codex to make truly effective progress towards a goal, it needs to run in a sufficiently realistic environment.

In practice, this means: if you want to optimize deployment time or latency issues, Codex should have access to deployment and testing environments, and these environments should simulate production as closely as possible. That is, using the same tech stack, the same configuration flags, and similar databases.

For example, we once worked on optimizing build and deployment times for developers.openai.com. We were already using deployment previews, so Codex could deploy to these preview environments and view the related logs. The problem was that our preview deployments had some build paths disabled compared to the full production environment.

Therefore, Codex ultimately had to perform manual deployments, deploying the code to environments closer to production configuration, to truly check the issues.

Similarly, you can also let Codex use computer use (the ability for the model to operate real application interfaces) to test actual applications. To optimize some performance issues on iOS, @dimillian even used physical devices to obtain the most accurate testing environment.
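A gap like the one between preview and production deployments described above is easier to spot when the two configurations are compared explicitly. The following Python sketch assumes both configurations are available as flat dictionaries. The function name `config_drift` and every flag name and value in the example are hypothetical, not the actual settings of any real site.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # marker for a key that exists in only one environment


def config_drift(
    production: Dict[str, Any], candidate: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing key to (production value, candidate value)."""
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(production) | set(candidate)):
        prod_value = production.get(key, MISSING)
        cand_value = candidate.get(key, MISSING)
        if prod_value != cand_value:
            drift[key] = (prod_value, cand_value)
    return drift


# Hypothetical flags for illustration only.
production = {"runtime": "node20", "build_search_index": True, "minify": True}
preview = {"runtime": "node20", "build_search_index": False, "minify": True}

drift = config_drift(production, preview)
print(drift)  # {'build_search_index': (True, False)}
```

An empty result does not prove the environments behave identically, but a non-empty one is a cheap early warning that measurements taken in the test environment may not carry over.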

Set Visual Goals Cautiously

Giving Codex a visual goal, like "100% pixel-perfect replication of this UI based on this image," is indeed tempting. But depending on the specific setup, it can also cause trouble.

If you don't provide proper guidance and constraints, Codex might get bogged down in certain details, neglecting the overall goal. For example, if the reference image contains some graphic elements, and you expect Codex to generate these elements—whether SVG icons or images—it might spend a lot of effort on "how to precisely replicate these assets," rather than correctly breaking down the entire problem.

Additionally, Codex needs tools to perform visual comparisons correctly. That means more image input and higher overall token consumption, yet it does not necessarily give Codex a simple way to identify the improvements that actually matter.

Therefore, images are usually better suited as goal context, not the sole completion criterion. You should find other ways for Codex to judge whether the goal has been achieved, such as functional checklists, implementation specifications, compliance with design systems, etc.

Track Progress

If Codex ends up working in the background for hours or even days, perhaps even running on another machine, it's easy to forget exactly where it has progressed and what work has already been done.

Depending on the goal, I've found the following methods helpful:

· Have Codex commit code at key milestones and push to a draft PR. This is especially useful when you're working on a website and have preview deployments.

· Have Codex keep a progress deliverable up to date. It could be an HTML file you keep open in the in-app browser, a page deployed via Sites for team viewing, a rendered progress chart, or just a regular Markdown file.

· Instruct Codex to actively publish progress updates. You can also write this into the goal: when significant progress is made, have Codex send an update to a Slack channel or wherever else you want to log progress.

· Use other chat windows to inquire about status. If you just want a quick overview of the current state, you can run /side to start a new side chat and ask questions there. Because it forks from the current thread, it has all the context up to that point but has a short lifespan.

· Another alternative in the Codex app is: open a regular new chat, have Codex read another goal thread, and answer your questions. This method is especially powerful if you have Codex set up an automated task to regularly check progress.
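For the "regular Markdown file" option above, a small helper that Codex is told to call at each milestone is often enough. This is a minimal sketch under assumptions: the function name `append_progress`, the file name `PROGRESS.md`, and the milestone entries are all made up for illustration, and the optional `now` argument is assumed to be a UTC time.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def append_progress(
    log_path: Path, milestone: str, metric: str, now: Optional[datetime] = None
) -> str:
    """Append one timestamped bullet to a Markdown progress log and return it."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"- {stamp}: {milestone} ({metric})\n"
    if not log_path.exists():
        # Start a new log with a heading the first time it is used.
        log_path.write_text("# Goal progress\n\n", encoding="utf-8")
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry


# Demo in a temporary directory; the entries are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "PROGRESS.md"
    append_progress(log, "Cached dependency install", "build 600s -> 510s",
                    datetime(2026, 6, 6, 9, 0, tzinfo=timezone.utc))
    append_progress(log, "Parallelized asset build", "build 510s -> 415s",
                    datetime(2026, 6, 6, 13, 30, tzinfo=timezone.utc))
    contents = log.read_text(encoding="utf-8")

print(contents)
```

Because the log is append-only and each line carries the metric at that point, it doubles as a record of which attempts moved the number, which helps during the later review step.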

Clean Up and Finalize the Result

Great, the goal is finally completed! Now can you just toss the result to the team and call it a day?

Usually, especially in optimization tasks, I find it helpful to have Codex review and reflect on the work it has done. You can first run a local code review with /review, but it's also worth having Codex reflect more deeply: What paths did it attempt to achieve the goal? Which attempts were effective? Which were ineffective? Then clean up the code accordingly.

Because Codex will keep working until the goal is reached, it might have tried methods that weren't good enough or even completely ineffective, and these leftover changes might still be in the final code.

Set a Goal for Your Next Task Too

Codex's goal feature is an incredibly powerful tool that can help you solve some of the most meaningful engineering challenges. But only when you provide the right environment and instructions can it reach the goal more efficiently.

What have you done with /goal?

Related Questions

Q: What is the most important factor in setting a goal for Codex's goal mode according to the article?

A: The most important factor is setting a clear, verifiable exit criterion, such as a quantifiable metric, to allow Codex to determine when the goal is complete.

Q: What are two specific examples of good, measurable goals provided in the article for using /goal mode?

A: Two examples are: 'Reduce build and deployment time by 30%' and 'Achieve 100% test parity when migrating a feature from TypeScript to Rust.'

Q: Why should visual goals like 'pixel-perfect UI recreation' be set cautiously with /goal mode?

A: Because Codex can get stuck on minute details, consume excessive tokens, and lack proper tools for effective visual comparison, making it better to use functional checklists or design system conformance as success criteria.

Q: What are two methods suggested in the article for tracking progress during a long-running /goal task?

A: Two methods are: having Codex commit code and create draft Pull Requests, and instructing Codex to post progress updates to a Slack channel or other communication platform.

Q: What final step is recommended after Codex completes a /goal, especially for optimization tasks?

A: It is recommended to have Codex review and reflect on its work, using tools like /review, to identify effective and ineffective attempts and clean up any residual code from unsuccessful paths.
