How an Anthropic Engineer Saved 300 Million Tokens in a Week: A Claude Code Caching Guide

marsbit | Published 2026-05-24 | Updated 2026-05-24

Introduction

Anthropic engineers reveal how prompt caching in Claude Code dramatically reduces token consumption. By reusing already-processed context, cached tokens cost only 10% of regular input tokens. The author saved over 300 million tokens in a week, with 91 million cached tokens in a single day counted as roughly 9 million for billing. Caching works via prefix matching across three layers: system (instructions, tools), project (CLAUDE.md, rules), and conversation (chat history). As long as the beginning of a request matches cached content, Claude reuses it instead of reprocessing.

Key points for users:

- Claude Code's cache TTL is 1 hour (vs. 5 minutes for API/Sub-agents). Avoid pausing a session beyond this.
- Switching models (including enabling "Opus plan" mode) breaks cache, forcing a full reprocess.
- For task switching, use a clear session handoff instead of letting an old session expire.
- Place large documents in Projects, not directly in chat, for better caching.

High cache reuse benefits both users (longer sessions) and Anthropic (lower costs). Monitoring cache hit rates is crucial internally. By managing context as an asset and avoiding cache-breaking habits, users can make their Claude Code sessions more efficient and cost-effective.

Editor's Note: For many Claude Code users, the first impression is that Tokens are consumed too quickly and that long sessions can easily eat up the quota. But from the perspective of an Anthropic engineer, what truly affects cost is often not how much code you write, but whether the system consistently reuses context that has already been processed.

The core of this article is how to save Tokens through caching mechanisms. The author reused over 300 million Tokens through caching in one week, with a single-day cache hit reaching 91 million. Since the cost of a cached Token is only 10% of a regular input Token, this means 91 million cached Tokens are billed roughly equivalent to 9 million regular Tokens. The reason Claude Code long sessions seem more "durable" isn't because the model works for free, but because a large amount of repeated context is successfully reused.

The key to prompt caching is "don't break the cache." Claude Code caches system prompts, tool definitions, CLAUDE.md, project rules, and conversation history in layers; as long as the prefix of subsequent requests remains consistent, Claude can directly read from the cache instead of reprocessing the entire context. Anthropic internally also monitors prompt cache hit rates because it affects not only user quotas but also directly impacts model service costs and operational efficiency.

For ordinary users, you don't need to understand all the underlying details, just master a few key habits: don't let a session sit idle for more than 1 hour; perform a clean session handoff when switching tasks; avoid frequently switching models; put large documents into Projects instead of repeatedly pasting them into the conversation.

This article is less about a Token-saving trick and more about providing a Claude Code usage approach closer to an engineer's mindset: treat context as an asset to manage, let the cache be continuously reused, and make long sessions do less repetitive computation.

The following is the original text:

I saved over 300 million Tokens this week, including 91 million in a single day.

I didn't change any settings. This is just prompt caching working normally in the background.

But after I truly understood what caching is and how to avoid "breaking" it, with the same usage quota, my sessions could last longer. So, here is a compiled 80/20 introductory guide to Claude Code prompt caching, without delving into deep API-level details.

TL;DR

Cached Tokens cost only 10% of regular input Tokens. 91 million cached Tokens are billed as roughly the equivalent of 9 million regular Tokens.

Claude Code subscription cache TTL is 1 hour; API default is 5 minutes; Sub-agent is always 5 minutes.

Caching is divided into three layers: system, project, and conversation.

Switching models mid-conversation breaks the cache, including enabling "opus plan" mode.

How is caching actually billed?

Every cached Token costs 10% of a regular input Token.

So, when my dashboard shows 91 million Tokens hitting cache on a particular day, the actual billing is roughly equivalent to processing only 9 million Tokens. This is also why extending a long Claude Code session feels almost "free" compared with running without a cache.
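The arithmetic behind that figure can be sketched in a few lines. This is only an illustration of the 10% rule stated above; the function name and the constant are made up for the example, and integer percentages are used to avoid floating-point noise.

```python
# From the article: a cached Token is billed at 10% of a regular input Token.
CACHE_READ_PERCENT = 10


def billed_equivalent(cached_tokens: int, percent: int = CACHE_READ_PERCENT) -> float:
    """Return how many regular input Tokens cost the same as `cached_tokens` cache reads."""
    return cached_tokens * percent / 100


if __name__ == "__main__":
    day = 91_000_000  # single-day cache reads reported in the article
    equivalent = billed_equivalent(day)
    print(f"{day:,} cached Tokens are billed like {equivalent:,.0f} regular Tokens")
    print(f"Token-equivalents avoided: {day - equivalent:,.0f}")
```

The exact result is 9.1 million, which the article rounds to "roughly 9 million".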

Two numbers on the dashboard are worth focusing on:

Cache create: The one-time cost incurred when writing content to the cache. It starts to take effect in the next round of conversation.
Cache read: Tokens Claude reuses from the cache, such as your CLAUDE.md, tool definitions, previous messages, etc. Reading them from the cache costs one tenth of reprocessing them as input.

If your Cache read number is high, it means you are effectively utilizing the cache; if this number is low, it means you are repeatedly paying for the same context.
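One way to turn "high" and "low" into a number is to compute what share of all input-side Tokens came from cache reads. The sketch below assumes a hypothetical record with the four figures the dashboard shows; the field names and the sample values are invented for illustration and are not the dashboard's actual schema or the author's data.

```python
from dataclasses import dataclass


@dataclass
class DailyUsage:
    # Hypothetical field names mirroring the four dashboard numbers.
    input_tokens: int
    output_tokens: int
    cache_create_tokens: int
    cache_read_tokens: int


def cache_read_share(usage: DailyUsage) -> float:
    """Fraction of input-side Tokens (fresh + cache writes + cache reads) served from cache."""
    total_input_side = (
        usage.input_tokens + usage.cache_create_tokens + usage.cache_read_tokens
    )
    if total_input_side == 0:
        return 0.0
    return usage.cache_read_tokens / total_input_side


if __name__ == "__main__":
    # Invented sample values, not real measurements.
    sample = DailyUsage(
        input_tokens=1_000,
        output_tokens=500,
        cache_create_tokens=1_000,
        cache_read_tokens=8_000,
    )
    print(f"Cache read share: {cache_read_share(sample):.0%}")
```

Output Tokens are left out of the ratio on purpose, since caching only affects the input side.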

Anthropic's Thariq once said something that left a deep impression on me: "We actually monitor prompt cache hit rates, and if the hit rate gets too low, it triggers an alert, even declaring a SEV-level incident."

He also wrote a great X article. When cache hit rates are high, four things happen simultaneously: Claude Code feels faster, Anthropic's service costs decrease, your subscription quota seems more durable, and long coding sessions become more realistic.

But if the hit rate is low, everyone loses.

So, the incentives are actually aligned: Anthropic wants your cache hit rate higher, and you yourself want the hit rate higher. What truly holds things back are some seemingly insignificant habits that quietly reset the cache.

How does the cache grow with each conversation turn?

Caching relies on prefix matching.

Without getting too deep into technical details, you just need to understand one thing: as long as the content before a certain position is completely identical to what's already cached, Claude can reuse those cached Tokens.
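A toy model of prefix matching makes the rule concrete. The sketch compares two requests block by block and counts how many leading blocks are identical; the real system works on Tokens rather than named blocks, so the block labels and the function name here are purely illustrative.

```python
from typing import Sequence


def reusable_prefix_len(cached: Sequence[str], request: Sequence[str]) -> int:
    """Count the leading blocks of `request` that are identical to `cached`."""
    matched = 0
    for old, new in zip(cached, request):
        if old != new:
            break  # the first difference ends the reusable prefix
        matched += 1
    return matched


if __name__ == "__main__":
    cached = ["system", "tools", "CLAUDE.md", "msg1", "reply1"]

    # Appending a new message keeps the whole prefix intact.
    print(reusable_prefix_len(cached, cached + ["msg2"]))

    # Changing an early block means everything after it is no longer reusable.
    edited = ["system", "tools-v2", "CLAUDE.md", "msg1", "reply1", "msg2"]
    print(reusable_prefix_len(cached, edited))
```

In the second case only the first block still matches, even though most of the later blocks are unchanged. That is why a small edit near the start of the context is so much more costly than a long message at the end.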

According to the Claude Code documentation, a brand new session typically unfolds like this:

First conversation turn: No cache exists yet. The system prompt, your project context (like CLAUDE.md, memory, rules), and your first message are all processed from scratch and written to the cache.

Second conversation turn: All content from the first turn is now cached. Claude only needs to process your new reply and the next message. This round is much cheaper.

Third conversation turn: Same logic. Previous conversation remains in the cache, only the latest round of interaction needs reprocessing.
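The three turns above can be sketched as a small simulation. It assumes, as a simplification, that every Token processed in one turn is available from the cache in the next; the Token counts are invented for the example.

```python
def simulate_session(base_context: int, turn_sizes: list[int]) -> list[dict]:
    """Track cache reads vs. freshly processed Tokens per turn.

    base_context: system prompt plus project context, processed once on turn 1.
    turn_sizes:   new Tokens added on each turn (latest reply plus new message).
    """
    rows = []
    cached = 0
    for turn, new_tokens in enumerate(turn_sizes, start=1):
        fresh = new_tokens + (base_context if turn == 1 else 0)
        rows.append({"turn": turn, "cache_read": cached, "fresh": fresh})
        cached += fresh  # everything processed so far is cached for the next turn
    return rows


if __name__ == "__main__":
    # Invented sizes: 20,000 Tokens of base context, then three short turns.
    for row in simulate_session(20_000, [500, 800, 600]):
        print(row)
```

After the first turn, the freshly processed portion stays small while the cache-read portion keeps growing, which is the pattern a healthy long session should show.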

According to Thariq's X article, the cache itself can be divided into three layers:

System layer: Includes base instructions, tool definitions (read, write, bash, grep, glob), and output style. This layer is cached globally.

Project layer: Includes CLAUDE.md, memory, project rules. This layer is cached per project.

Conversation layer: Includes replies and messages, growing with each conversation turn.

If anything in the system or project layer changes mid-session, everything must be recached from scratch. This is the most "expensive" operation. Imagine you're already on the 16th message and then suddenly change the system prompt or pause for more than an hour: all Tokens from message 1 onward need to be reprocessed.
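To get a feel for how expensive such a rebuild is, the sketch below compares the cost of one turn with and without a cache hit, measured in regular-input-Token equivalents. It uses only the 10% rule from this article and deliberately ignores any extra charge for writing to the cache; the context size is an invented example.

```python
CACHE_READ_PERCENT = 10  # from the article: cache reads cost 10% of regular input


def turn_cost(context_tokens: int, new_tokens: int, cache_hit: bool) -> float:
    """Cost of one turn in regular-input-Token equivalents.

    Simplification: the one-time cost of writing to the cache is not modeled.
    """
    if cache_hit:
        context_cost = context_tokens * CACHE_READ_PERCENT / 100
    else:
        context_cost = float(context_tokens)  # whole context reprocessed as input
    return context_cost + new_tokens


if __name__ == "__main__":
    context, new = 150_000, 1_000  # invented sizes for a long session
    warm = turn_cost(context, new, cache_hit=True)
    cold = turn_cost(context, new, cache_hit=False)
    print(f"warm turn: {warm:,.0f}  cold turn: {cold:,.0f}  ratio: {cold / warm:.1f}x")
```

The longer the session, the larger the gap, because the accumulated context dominates the cost of each turn.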

The confusion between 1 hour and 5 minutes

This is the most easily misunderstood point.

Claude Code subscription: Default TTL is 1 hour.

Claude API: Default TTL is 5 minutes. You can pay a higher cost to increase it to 1 hour.
Sub-agent on any plan: Always 5 minutes.

Claude.ai web chat: Not officially documented. Likely same as subscription, but I haven't confirmed.

A few months ago, many people complained that Claude subscription quotas were being consumed too quickly. Some thought Anthropic had quietly reduced TTL from 1 hour to 5 minutes without notifying users. But that wasn't the case; Claude Code's TTL remains 1 hour.

The problem is that the Claude Code and API documentation are kept separate and describe two different things, which leads to much confusion.

If you're running many Sub-agent workflows, or using the API directly, the 5-minute figure is important. But for 95% of Claude Code users, what really matters is that 1-hour window.
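The TTL values listed above can be captured in a small lookup. This is a minimal sketch that assumes the TTL window is measured from the most recent request; the dictionary keys and the function name are hypothetical labels, not official identifiers.

```python
from datetime import datetime, timedelta

# TTL values as stated in the article; the keys are illustrative labels only.
CACHE_TTL = {
    "claude_code_subscription": timedelta(hours=1),
    "api_default": timedelta(minutes=5),
    "sub_agent": timedelta(minutes=5),
}


def cache_still_warm(last_request: datetime, now: datetime, surface: str) -> bool:
    """Return True if the idle time is still inside the TTL window for `surface`."""
    return now - last_request < CACHE_TTL[surface]


if __name__ == "__main__":
    last = datetime(2026, 5, 24, 9, 0)
    now = datetime(2026, 5, 24, 9, 40)  # 40 minutes idle
    for surface in CACHE_TTL:
        print(surface, cache_still_warm(last, now, surface))
```

A 40-minute coffee break is harmless in a Claude Code subscription session but would already have expired a cache with a 5-minute TTL.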

Three habits that cover 95% of users

The following are parts I find truly useful for daily use.

Don't pause too long

If you've been idle for more than an hour, previous content has mostly expired from the cache. Your next message will rebuild the cache. In such cases, instead of resuming an old session that has "gone cold," it's often cheaper to do a clean handoff and start a new session.

When switching tasks, just start fresh

/compact or /clear inherently break the cache, so it's better to use that moment for a true reset.

I made a session handoff skill to replace /compact. It summarizes what we've completed, what pending decisions remain, which files are most important, and where to continue next. Then I run /clear, paste this summary in, and can proceed as if nothing was interrupted.

The compact command sometimes runs slowly too. This handoff skill usually finishes in under a minute.
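The handoff skill itself is not reproduced in this article, but the four things it summarizes suggest a simple structure. The following is a hypothetical sketch of such a note, not the author's actual skill; the class, field names, and output layout are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Hypothetical handoff note covering the four items described above."""

    completed: list[str] = field(default_factory=list)
    pending_decisions: list[str] = field(default_factory=list)
    key_files: list[str] = field(default_factory=list)
    next_step: str = ""

    def render(self) -> str:
        """Render the note as plain text to paste into a fresh session."""
        sections = [
            ("Completed", self.completed),
            ("Pending decisions", self.pending_decisions),
            ("Key files", self.key_files),
        ]
        lines = ["SESSION HANDOFF"]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items or ["(none)"])
        lines.append(f"Continue with: {self.next_step or '(not set)'}")
        return "\n".join(lines)


if __name__ == "__main__":
    note = Handoff(
        completed=["Refactored the login form"],
        key_files=["src/auth/login.py"],
        next_step="Add tests for the error states",
    )
    print(note.render())
```

Keeping the note short matters: it becomes the start of the new session's conversation layer, so everything in it is processed once and then reused on every later turn.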

In Claude chat, put large documents into Projects

The caching mechanism on Claude.ai isn't officially documented in great detail, but Projects clearly use different optimizations compared to regular chat threads. So, if you need to paste large documents, it's better to put them in a Project rather than directly into the chat.

Which operations quietly break the cache?

A few things can completely reset the cache without obvious warning.

Switching models: Caching relies on prefix matching, and each model has its own cache. Switching models therefore means the next request reads the full history with no cache hits.

"Opus plan" mode: This setting uses Opus for planning and Sonnet for execution. I recommended it in some token optimization videos for a reason. But it's important to understand that each plan switch is essentially a model switch, meaning the cache must be rebuilt. In the long run, it still helps extend session quota, but you need to know what's happening under the hood.

Editing CLAUDE.md mid-session is okay: This change doesn't take effect immediately; it applies on the next restart. Therefore, the currently running cache isn't affected.
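Why a model switch costs a full rebuild can be shown with a toy cache that keeps one cached prefix per model. This is a simplified illustration of the "each model has its own cache" statement above, not a description of the real implementation; the model labels are placeholders and TTL expiry is not modeled.

```python
class PerModelPrefixCache:
    """Toy cache: one cached block sequence per model name."""

    def __init__(self) -> None:
        self._cached: dict[str, list[str]] = {}

    def request(self, model: str, blocks: list[str]) -> int:
        """Return how many leading blocks were cache hits, then cache this request."""
        previous = self._cached.get(model, [])
        hits = 0
        for old, new in zip(previous, blocks):
            if old != new:
                break
            hits += 1
        self._cached[model] = list(blocks)
        return hits


if __name__ == "__main__":
    cache = PerModelPrefixCache()
    history = ["system", "project", "msg1"]
    print(cache.request("model-a", history))  # 0: nothing cached yet

    history += ["reply1", "msg2"]
    print(cache.request("model-a", history))  # 3: the first turn is reused

    history += ["reply2", "msg3"]
    print(cache.request("model-b", history))  # 0: model-b has never seen this prefix
```

The history is identical in the third request, yet the hit count drops to zero because the lookup happens under a different model.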

My free Token dashboard

The screenshot I showed earlier is from a token dashboard.

It's a simple GitHub repo. You give the link to Claude Code, have it deploy locally on localhost, and it will read all your past session records instead of starting statistics from scratch. You immediately see daily input, output, cache create, and cache read data.

One thing to note: this dashboard counts Token data on your local device. If you switch from desktop to laptop, the numbers won't match exactly. Each device has its own statistical view.

Summary

Prompt caching is something you can research deeply. Thariq's article covers it more completely than here; if you want the full picture, it's worth reading.

But you don't need to understand all the details to benefit. You just need to grasp the key 80/20: cached Tokens cost one tenth as much as regular input Tokens; Claude Code TTL is 1 hour; switching models breaks the cache; making a clean handoff between tasks is usually more cost-effective than forcing an old session back to life after it "expires."

Related Questions

Q: What is the core mechanism described in the article for significantly reducing Claude Code costs?

A: The core mechanism is prompt caching, where previously processed context (system prompts, tool definitions, project files, conversation history) is stored and reused. If a new request's prefix matches the cached content, Claude reads from the cache instead of reprocessing, costing only 10% of a standard input token.

Q: What is the key financial benefit of high cache read rates mentioned in the article?

A: Tokens read from the cache are billed at only 10% of the cost of standard input tokens. This means high cache read rates make a Claude Code subscription last significantly longer, as repeated context is not fully re-processed and paid for.

Q: What are the three main layers of prompt caching in Claude Code, according to the article?

A: The three layers are: 1. System Layer (global cache for base instructions, tool definitions like read/write/bash, and output style). 2. Project Layer (cached per project, includes CLAUDE.md, memory, project rules). 3. Conversation Layer (cached per session, includes the growing history of messages and replies).

Q: What is the single most common user action that will completely reset the prompt cache in an active Claude Code session?

A: Switching the model mid-conversation. Because the cache relies on prefix matching and each model has its own cache, switching models forces Claude to reprocess the entire context from scratch with no cache hits.

Q: What practical habit does the author recommend for switching tasks to better preserve caching benefits?

A: Perform a clean session handoff instead of letting a session idle past the TTL. The author uses a custom "session handoff skill" to summarize progress, pending decisions, and key files before using /clear. This summary is pasted into the new session, providing continuity while allowing a fresh, efficiently cached session to begin.
