VIBE

Prompt caching in LLMs, clearly explained

x.com
Other · Free · ARTICLE · 2mo ago

About

A case study on how Claude achieves 92% cache hit-rate

Every time an AI agent takes a step, it sends the entire conversation history back to the LLM. That includes the system instructions, the tool definitions, and the project context it already processed three turns ago. All of it gets re-read, re-processed, and re-billed on every single turn.

For long-running agentic workflows, this redundant computation is often the most expensive line item in your entire AI infrastructure. A system prompt with 20,000 tokens running over 50 turns means 1 million tokens of redundant computation billed at full price, producing zero new value. And that cost compounds across every user and every session.

The fix is prompt caching. But to use it well, you need to understand what’s actually happening under the hood.

Static vs. Dynamic context

Before you can optimize a prompt, you need to understand what changes and what doesn’t. Every agent request has two fundamentally different parts:

The static prefix that stays identical across turns: system instructions, tool definitions, project context, and behavioral guidelines.

The dynamic suffix that grows with every turn: user messages, assistant responses, tool outputs, and terminal observations.

This split is what makes prompt caching possible. The infrastructure stores the mathematical state of the static prefix so that subsequent requests sharing that exact prefix can skip the computation entirely and read from memory. Once you internalize this, every architectural decision in this article becomes obvious.

How does the KV Cache work?

To understand why caching is so effective, you need to know what the transformer actually does when it processes your prompt. Every LLM inference request has two phases:

The prefill phase handles the entire input prompt. It runs dense matrix multiplications across all tokens in context to build the model’s internal representation. This is compute-bound and expensive.

The decode phase generates toke
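The cost arithmetic and the static/dynamic split described in the excerpt can be sketched in a few lines of Python. This is an idealized illustration, not how any particular provider implements caching: the function names `prefix_tokens_processed` and `shared_prefix_len` are hypothetical, the "tokens" are placeholder strings, and the model assumes the prefix is computed once and then always served from cache. Real systems add details such as cache expiry and separate pricing for cache reads and writes, which vary by provider.

```python
def prefix_tokens_processed(prefix_tokens: int, turns: int, cached: bool) -> int:
    """Static-prefix tokens run through prefill over a whole session.

    Idealized assumption: with caching, the prefix is computed on the
    first turn and read from the cache on every later turn.
    """
    if cached:
        return prefix_tokens
    return prefix_tokens * turns


def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Length of the longest identical leading run of two token lists.

    Only this leading run can be reused; the first differing token ends it.
    """
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


# The figures from the article: a 20,000-token prefix over 50 turns.
without_cache = prefix_tokens_processed(20_000, 50, cached=False)  # 1_000_000
with_cache = prefix_tokens_processed(20_000, 50, cached=True)      # 20_000

# Placeholder "tokens" standing in for the static prefix and a growing suffix.
static_prefix = ["<system>", "<tools>", "<project>"]
turn_1 = static_prefix + ["user: hi"]
turn_2 = static_prefix + ["user: hi", "assistant: hello", "user: next"]

reusable = shared_prefix_len(turn_1, turn_2)  # 4: the prefix plus the earlier turn
```

In this sketch the second request shares its first four entries with the first, so only the two new entries would need fresh prefill work. Changing anything near the start of the request would shrink the shared run to the entries before the change.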

Why it made the leaderboard

A concrete teardown of how Claude hits a 92% cache hit-rate, and why it matters: every agent step resends the whole history, so caching is the line between affordable and ruinous. Read it if your agent loops are quietly burning tokens.
