hunter paulson
LLM API providers are charging you twice for output tokens
2026-07-04
Under agentic inference patterns, current APIs charge you twice for output tokens: once at output price when they are generated, and again at cache write price when they are written to the prompt prefix cache on the subsequent API call.
It doesn’t have to be this way. Open source inference engines, SGLang and vLLM, retain the KV cache for both prompts and generations.
LLM API providers’ inference engines are almost surely capable of this under the hood as well; however, current APIs don’t give us a way to request the retention of KVs for output tokens.
There needs to be a way for us to signal, or even pay, them to store output tokens as well.
LLM APIs only allow us to request that computation done for input tokens is retained so that it can be reused on future requests with the same prompt prefix. This is called prompt caching.
This worked fine for chatbots (see appendix), but it is no longer sufficient for current ‘agentic’ inference patterns, because agent loops always[1] reuse the output from the previous result on the very next API request.
Here is what this looks like at Anthropic’s Fable 5[2] API prices:
[Figure legend: cache write; written to cache; cache read; new this step]
Notice that you pay for assistant tokens, with tool calls, twice.
For every token generated before the last API request in the agent loop, you pay for output tokens at least twice. First you pay for the tokens as they are generated (result 1). Then on the very next request, after you execute the tools and append the tool results to the context, you pay to add the tokens to the ‘prompt prefix cache’ (request 2). After that, you pay to read them from cache on every following request until compaction.
~16-25% more than advertised
Naively, if we pay output price upon generation and then cache write input price on the next request, we can add up the two costs to get an estimate of how much we are really paying for ‘agentic output tokens’[3].
output + cache WRITE input = cost of 'agentic' output tokens
$50.00 + $12.50 = $62.50 / 1M tokens
$62.50 / $50.00 = 1.25 => 25% more than advertised output price
output + cache WRITE input = cost of 'agentic' output tokens
$30.00 + $5.00 = $35.00 / 1M tokens
$35.00 / $30.00 = 1.1667 => 16.67% more than advertised output price
note: we have to pay for every token in every request, so ‘agentic output tokens’ will always have some cost for each request. However, as we will see, that cost should be at cache read pricing, not cache write which is 10-12.5x more expensive.
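The arithmetic above can be sketched as a small helper. The function name and its arguments are illustrative and not part of any provider SDK; the prices are the per-1M-token figures quoted in the two examples above.

```python
def agentic_output_cost(output_price: float, cache_write_price: float) -> tuple[float, float]:
    """Return (effective price per 1M output tokens, markup over advertised output price).

    Assumes every output token is billed once at output price when generated
    and once more at cache write price on the next request.
    """
    effective = output_price + cache_write_price
    markup = effective / output_price - 1.0
    return effective, markup


# USD per 1M tokens: (output price, cache write price) from the examples above.
for output_price, cache_write_price in [(50.00, 12.50), (30.00, 5.00)]:
    effective, markup = agentic_output_cost(output_price, cache_write_price)
    print(f"${effective:.2f} per 1M tokens, {markup:.2%} over the advertised output price")
```

This deliberately ignores the cache read cost on later requests, which is paid under either scheme.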
From the perspective of the prompt prefix cache there is no difference between input and output tokens; it is all just tokens.
Just like they retain the KV cache blocks for input tokens, providers can retain the KV cache blocks created during generation so we no longer have to pay for cache write on the next call.
Let’s take a look at what goes on under the hood with the prompt prefix cache for our two requests in the example above.
[Figure legend: input token; output token; cache write; retained; cache read; not retained]
During inference, Key and Value vectors (KVs) are generated for every token. KVs for input tokens are generated all at once during prefill, while the K and V for each output token are generated one at a time during autoregressive decode[4].
After decode is complete (because the LLM completed a tool call), the KVs are retained in blocks[5] so they can be reused, instead of recomputed, during the subsequent request.
However, only KV blocks for input tokens are retained, meaning that output tokens from request 1 must be recomputed during the prefill stage of request 2.
If this sounds redundant, that’s because it is. There is nothing preventing providers from retaining the KV blocks for the output tokens too. In fact, this is exactly what the open source inference engines SGLang[6] and vLLM[7] do.
Never prefill tokens you just decoded
Instead of discarding the KVs for output tokens we can just retain their blocks with the rest of the prompt prefix cache at the end of request 1.
Then on our next request (request 2) all of our tokens from request 1 will be in the prompt prefix cache, saving us from recomputing anything during prefill.
Notice how there is no longer any overlap between prefill and decode across requests.
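To make the difference concrete, here is a toy model of the two retention policies. It is a sketch under simplifying assumptions: it counts tokens rather than KV blocks, assumes the cache never expires or evicts, and the function name and step sizes are made up for illustration. It is not how any particular inference engine is implemented.

```python
def prefill_tokens_per_request(steps, retain_output):
    """Count the tokens that must be prefilled (recomputed) on each request.

    steps: list of (new_input_tokens, output_tokens), one tuple per API request
           in an append-only agent loop.
    retain_output: if True, KVs produced during decode stay in the prefix cache.
    """
    cached = 0    # tokens whose KVs are currently in the prefix cache
    context = 0   # total tokens in the conversation so far
    prefilled = []
    for new_input, output in steps:
        context += new_input
        prefilled.append(context - cached)  # everything not cached is prefilled
        cached = context                    # KVs for the whole prompt are retained
        context += output                   # decode appends the output tokens
        if retain_output:
            cached = context                # KVs from decode are retained too
    return prefilled


steps = [(2000, 500), (300, 400), (250, 100)]
print(prefill_tokens_per_request(steps, retain_output=False))  # [2000, 800, 650]
print(prefill_tokens_per_request(steps, retain_output=True))   # [2000, 300, 250]
```

With output retention, each request only prefills the tokens that are genuinely new, such as tool results.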
If API providers retain output tokens in the prompt prefix cache then we only have to pay cache read price for every subsequent request that contains them.
[Figure legend: cache write & new this step; written to cache; cache read]
Notice how now we only pay full price for new input or output tokens the first time they appear. From then on they are always ‘cache read input tokens’.
~12.9-18.4% on output tokens, depending on the provider.
Now that we understand how prompt prefix caching should work, let’s estimate how much we should ideally be paying.
output + cache WRITE input = CURRENT cost of output tokens
$50.00 + $12.50 = $62.50 / 1M tokens
output + cache READ input = IDEAL cost of output tokens
$50.00 + $1.00 = $51.00 / 1M tokens
$62.50 - $51.00 = $11.50 extra per 1M output tokens
$51.00 / $62.50 = 0.816 => 18.4% savings over what we pay today
output + cache WRITE input = CURRENT cost of output tokens
$30.00 + $5.00 = $35.00 / 1M tokens
output + cache READ input = IDEAL cost of output tokens
$30.00 + $0.50 = $30.50 / 1M tokens
$35.00 - $30.50 = $4.50 extra per 1M output tokens
$30.50 / $35.00 = 0.871 => 12.9% savings over what we pay today
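The same comparison can be run over a whole agent loop instead of a single token. The sketch below is a rough model, not a billing implementation: `Prices` and `loop_cost` are hypothetical names, every new token is assumed to be written to the cache, and cache expiry and compaction are ignored. The prices are the Fable 5 figures used above (output $50.00, cache write $12.50, cache read $1.00 per 1M tokens).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """USD per 1M tokens."""
    output: float
    cache_write: float
    cache_read: float


def loop_cost(steps, prices, retain_output):
    """Total USD cost of an append-only agent loop.

    steps: list of (new_input_tokens, output_tokens), one tuple per API request.
    retain_output: if True, output tokens enter the cache with no write charge
                   (the 'ideal' scheme); if False, they are billed as a cache
                   write on the following request (the current scheme).
    """
    cached = 0          # tokens already in the prompt prefix cache
    pending_output = 0  # previous output tokens not yet in the cache
    total = 0.0
    for new_input, output in steps:
        written = new_input + pending_output
        total += cached * prices.cache_read
        total += written * prices.cache_write
        total += output * prices.output
        cached += written
        if retain_output:
            cached += output
            pending_output = 0
        else:
            pending_output = output
    return total / 1_000_000


fable = Prices(output=50.00, cache_write=12.50, cache_read=1.00)
# 1M output tokens followed by one more request that carries them as context.
steps = [(0, 1_000_000), (0, 0)]
print(loop_cost(steps, fable, retain_output=False))  # 62.5
print(loop_cost(steps, fable, retain_output=True))   # 51.0
```

Passing a longer list of steps shows how the gap grows with the number of tool-calling turns.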
Time to first token (TTFT):
No[8] recomputation means less computation during prefill. Since prefill is compute bound, less computation means users get the first token faster.
Cache Hit Rate:
The SGLang RadixAttention paper defines the “cache hit rate as number of cached prompt tokens / number of prompt tokens”.
Since output tokens are part of the subsequent prompt, it is easy to see that having them cached will improve the cache hit rate, up to its theoretical limit.
However, it is impossible to reach 100% under this definition, since entirely new tokens (e.g. tool results or new user prompts) are appended to the context on each model request. Instead, we should measure cache read input tokens / total tokens at the end of the previous request, which gives a clearer signal of true cache reuse and makes it easier to see when we invalidate the prompt prefix cache.
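A small sketch of the two metrics side by side. The function names and the token counts are illustrative assumptions; only the two ratios themselves come from the text above.

```python
def prompt_hit_rate(cached_prompt_tokens: int, prompt_tokens: int) -> float:
    """Hit rate as quoted above: cached prompt tokens / prompt tokens."""
    return cached_prompt_tokens / prompt_tokens


def reuse_rate(cache_read_input_tokens: int, prev_total_tokens: int) -> float:
    """Proposed metric: cache read tokens / total tokens at the end of the previous request."""
    return cache_read_input_tokens / prev_total_tokens


# Previous request: 2000 input + 500 output tokens. Next request adds a 300 token tool result.
prev_total = 2000 + 500
prompt = prev_total + 300

# Only input tokens were retained: 2000 tokens are read from cache.
print(prompt_hit_rate(2000, prompt), reuse_rate(2000, prev_total))

# Output tokens were retained too: all 2500 previous tokens are read from cache.
print(prompt_hit_rate(2500, prompt), reuse_rate(2500, prev_total))
```

Under the proposed metric, full reuse reads as exactly 1.0, so any value below that points to recomputation or an invalidated prefix.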
APIs were designed for chatbot applications.
NOTE: this section is entirely speculative
engineers at these labs are smarter than me so somebody has to know this. Just incentives make it so
Everyone has pretty much standardized on three API formats: OpenAI’s completions and responses, and Anthropic’s messages.
So other inference providers just use the same standards.
So until someone implements this, nobody has to. But as soon as one does, everyone has to, since it’s hard to compete with a >10% discount.
Their internal inference engines probably already do this for their internal inference.
if anthropic doesn’t already do output KV reuse internally then this is quite literally a compute multiplier for them.
note this may not save compute at all because it is almost guaranteed that providers already do this anyway.
It is possible that providers already do this under the hood, but since that would be a PR nightmare, I doubt it.
They need to provide a way for users to request that the output token blocks are retained in the prompt prefix cache.
For providers with implicit caching like OpenAI and Gemini they can make this change entirely on the backend since they already handle caching for their users.
For providers with explicit caching like Anthropic this requires an update to the API.
if people are already using automatic caching, Anthropic could extend their top level cache_control object with an optional key.
# call (proposed API: "cache_output" does not exist today)
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-fable-5",
    max_tokens=1024,
    ####################################
    cache_control={  # existing automatic caching param
        "cache_output": True,  # new, indicates to cache output
        "type": "ephemeral",  # or e.g. "intelligent" to only retain cache if stop_reason == "tool_use"
        "ttl": "5m",
    },
    ####################################
    system="You are a helpful assistant.",
    messages=[
        {
            "role": "user",
            "content": "do something agentic (call tools in a loop for me) please",
        }
    ],
)
# response
{
    "content": ...,
    "usage": {
        "input_tokens": 2048,
        "cache_read_input_tokens": 1800,
        "cache_creation_input_tokens": 248,
        "output_tokens": 503,
        "cache_creation_output_tokens": 503,  # new
    }
}
Again, providers where caching is already priced in[9] don’t need to change anything.
However, Anthropic charges extra for storing the KV cache between requests: an extra 25% of the input token price to keep tokens in cache for up to 5 minutes. Essentially this is a flat per-token cost, paid when a token is retained in the prompt prefix cache between API requests. Currently it applies only to input tokens, but since there is no fundamental difference between caching input and output tokens, it would make sense to have a single price for caching any type of token.
If they continue with the same pricing model, they would likely need to add something akin to a cache_retention_per_1M_tokens price. Let’s assume this would be the same 0.25x input token price it is currently.
So for Fable 5 this would be $10.00 * 0.25 = $2.50 per 1M tokens.
output + cache WRITE tax + cache READ input = LIKELY cost of output tokens
$50.00 + $2.50 + $1.00 = $53.50 / 1M tokens
$62.50 - $53.50 = $9.00 saved per 1M output tokens
$53.50 / $62.50 = 0.856 => 14.4% savings
Not quite as good as the 18.4% from above but still a free ~15% savings.
as we discussed earlier, we expect every assistant message with tool calls to have a follow up api request, with tool results, before the cache expiration.
however eventually the loop stops and we don’t want to pay to cache those final output tokens do we?
it depends … on whether or not the user will send a follow up message/request before the cache retention period expires
so how do we know when to cache?
… we dont
… but the llm does
Notice how whether or not we want to cache depends on what the LLM decides to do.
If the LLM calls a tool, we want to cache. If it does not, then we probably don’t, since we have no guarantee of a subsequent API request that will make use of whatever we just cached.
and it turns out that implementing this is pretty simple: providers can distinguish between these two cases based on the stop_reason for why decode ended. Cache when stop_reason == "tool_use" and don’t cache when stop_reason in ("end_turn", "refusal").
Notice that this doesn’t just have to apply to the proposed cache write output tokens. we can apply this conditional cache retention to all new tokens in this request.
this would be huge because it would allow callers to not request a cache write for any new tokens (input and output) if there aren’t any tool calls.
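A minimal sketch of that rule, as a provider or a client-side cost model might express it. The function names are hypothetical, and treating every stop reason other than "tool_use" as "do not retain" is an assumption made here for simplicity.

```python
# Stop reasons after which a follow-up request with the same prefix is expected.
RETAIN_STOP_REASONS = {"tool_use"}


def should_retain_cache(stop_reason: str) -> bool:
    """Retain new tokens in the prefix cache only if the model asked for a tool."""
    return stop_reason in RETAIN_STOP_REASONS


def cache_write_tokens(new_input_tokens: int, output_tokens: int, stop_reason: str) -> int:
    """Number of new tokens (input and output) to bill as cache writes for this request."""
    if should_retain_cache(stop_reason):
        return new_input_tokens + output_tokens
    return 0


print(cache_write_tokens(248, 503, "tool_use"))  # 751
print(cache_write_tokens(248, 503, "end_turn"))  # 0
```

A real implementation would likely also want an override, for example when the caller knows a user follow-up is coming soon.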
I don’t believe that prompt caching APIs are purposely built to charge you twice for the current most common usage pattern (append-only agentic inference). The issue is that common usage patterns have shifted significantly since the APIs were first designed.
Prompt caching APIs were initially designed for chat applications (e.g. ChatGPT), where they allowed users of the API to reuse the cache for the system message across chats for all users.
[Figure: the system message cache is reused across chats, which could be from the same user or a different user. Legend: cache write; written to cache; cache read; new this step / not cached]
prompt caching works alright for this as well. but tbh we could have seen this issue back then. both SGLang and vLLM did.
[Figure legend: cache write; written to cache; cache read; new this step]
notice that for each request after the first we are writing the Assistant message from the previous turn to the cache.
if you use the cache even once it is worth the price
cache_write = 1.25 x input
cache_read = 0.1 x input
so if you write 1000 tokens to the cache you pay for 1250 tokens then on next api call, with the same prefix, you get a cache hit and read all those 1000 cached tokens at the price of 100 tokens. so you paid for the equivalent of 1350 input tokens.
However, if you don’t cache, you pay regular price for 1000 tokens, and if you then make a request with the same prefix you pay regular price again for 1000 tokens. So by the second request you have already paid more than if you had cached (2000 vs. 1350 input token equivalents), not to mention that caching also improves API response time.
my takeaway from this is:
if you are going to make another llm api call with the same prompt prefix, in the next 5 minutes, you should write that prefix to cache
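The break-even can be checked with a few lines. This is a sketch using the multipliers quoted above (cache write = 1.25x input, cache read = 0.1x input); the function name is made up, and costs are expressed in input token equivalents rather than dollars.

```python
CACHE_WRITE_MULT = 1.25  # cache write price relative to the input price
CACHE_READ_MULT = 0.1    # cache read price relative to the input price


def prefix_cost(prefix_tokens: int, reuses: int, use_cache: bool) -> float:
    """Cost, in input token equivalents, of sending a prefix once and then reusing it.

    reuses: number of later requests that start with the same prefix
            (assumed to arrive before the cache expires).
    """
    if use_cache:
        return prefix_tokens * CACHE_WRITE_MULT + reuses * prefix_tokens * CACHE_READ_MULT
    return prefix_tokens * (1 + reuses)


for reuses in (0, 1, 5):
    cached = prefix_cost(1000, reuses, use_cache=True)
    uncached = prefix_cost(1000, reuses, use_cache=False)
    print(f"reuses={reuses}: cached={cached:.0f} uncached={uncached:.0f}")
```

With zero reuses the cache write is a pure loss (1250 vs. 1000); a single reuse already makes it cheaper.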
In vanilla[11] multi-turn conversation it is unrealistic to predict ex ante whether or not the user will ask a follow up question, so builders on the API either always or never pay the cache write cost, depending on their app’s usage patterns.
Notice how, with the shift from vanilla multi-turn conversations to agent loops, it actually got easier to predict whether or not we will reuse the cache: if the assistant calls a tool in its response, then we know that we will reuse the cache in the very next request. This is why builders now always pay the cache write cost.
def run_agent_loop(llm, execute_tool):
    # Message types, llm and execute_tool are placeholders for your own framework
    messages: list[Message] = [
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="What is the weather in Tokyo?"),
    ]
    while True:
        assistant_message: AssistantMessage = llm.generate(messages)
        # append the full message so its tool calls stay in the context
        messages.append(assistant_message)
        # if there are no tool calls, we are done
        if not assistant_message.tool_calls:
            return assistant_message.content
        # otherwise execute tool calls, append tool results and repeat
        for tool_call in assistant_message.tool_calls:
            tool_result: ToolResultMessage = execute_tool(tool_call)
            messages.append(tool_result)