hunter paulson

LLM API providers are charging you twice for output tokens

Under agentic inference patterns, current APIs charge you twice for output tokens: once at output price when they are generated, and again at cache write price when they are written to the prompt prefix cache on the subsequent API call.

It doesn’t have to be this way. Open source inference engines, SGLang and vLLM, retain the KV cache for both prompts and generations.

LLM API providers’ inference engines are almost surely capable of this under the hood as well. However, current APIs don’t give us a way to request the retention of KVs for output tokens.

There needs to be a way for us to signal, or even pay, them to store output tokens as well.

all apis only cache the prompt (input), not the output.

LLM APIs only allow us to request that computation done for input tokens is retained so that it can be reused on future requests with the same prompt prefix. This is called prompt caching.

This worked fine for chatbots (see appendix), but it is no longer sufficient for current ‘agentic’ inference patterns. This is because agent loops always¹ reuse the output from the previous result on the very next API request.

Here is what this looks like at Anthropic’s Fable 5² API prices:

request 1

system message and tool definitions You are a weather man $12.50 / 1M

user message what is the weather in Tokyo? $12.50 / 1M

result 1

system message and tool definitions You are a weather man $12.50 / 1M

user message what is the weather in Tokyo? $12.50 / 1M

assistant message I’ll use my get_weather tool to get the weather in Tokyo $50.00 / 1M

tool call(s) <tool name=get_weather> <param name=city>Tokyo</param> </tool> $50.00 / 1M

The initial request only retains input tokens in the prompt cache, even though all messages will be a prefix of the next request.

request 2

system message and tool definitions You are a weather man $1.00 / 1M

user message what is the weather in Tokyo? $1.00 / 1M

assistant message I’ll use my get_weather tool to get the weather in Tokyo $12.50 / 1M

tool call(s) <tool name=get_weather> <param name=city>Tokyo</param> </tool> $12.50 / 1M

tool result(s) Cloudy $12.50 / 1M

result 2

system message and tool definitions You are a weather man $1.00 / 1M

user message what is the weather in Tokyo? $1.00 / 1M

assistant message I’ll use my get_weather tool to get the weather in Tokyo $12.50 / 1M

tool call(s) <tool name=get_weather> <param name=city>Tokyo</param> </tool> $12.50 / 1M

tool result(s) Cloudy $12.50 / 1M

assistant message The weather in Tokyo is Cloudy $50.00 / 1M

since the assistant message and tool call weren’t cached after output you have to pay cache write input price for tokens you already paid full output price for

Notice that you pay for assistant tokens, with tool calls, twice.

For every token generated before the last API request in the agent loop, you pay for output tokens at least twice. First you pay for the tokens as they are generated (result 1). Then on the very next request, after you execute the tools and append the tool results to the context, you pay to add the tokens to the ‘prompt prefix cache’ (request 2). After that you pay to read them from cache on every following request before compaction.

how much are we actually paying for output tokens?

~16-25% more than advertised

Naively, if we are paying output price upon generation and then cache write input price on the next request, we can just add up the two costs to get an estimate of how much we are really paying for ‘agentic output tokens’³.

output + cache WRITE input = cost of 'agentic' output tokens
$50.00 + $12.50            = $62.50 / 1M tokens

$62.50 / $50.00 = 1.25 => 25% more than advertised output price

output + cache WRITE input = cost of 'agentic' output tokens
$30.00 + $5.00             = $35.00 / 1M tokens

$35.00 / $30.00 = 1.1667 => 16.67% more than advertised output price

note: we have to pay for every token in every request, so ‘agentic output tokens’ will always have some cost for each request. However, as we will see, that cost should be at cache read pricing, not cache write which is 10-12.5x more expensive.

it doesn’t have to be this way

From the perspective of the prompt prefix cache there is no difference between input and output tokens, it is all just tokens.

Just like they retain the KV cache blocks for input tokens, providers can retain the KV cache blocks created during generation so we no longer have to pay for cache write on the next call.

Let’s take a look at what goes on under the hood with the prompt prefix cache for our two requests in the example above.

request 1:

prefill: system [KV, retained], user [KV, retained]
decode: assistant [KV, not retained], tool call [KV, not retained]

During prefill and decode, KVs are generated for input and output tokens respectively. However, only the KVs for input tokens are retained in the prompt prefix cache between requests.

request 2:

cache lookup/restore: system [KV, cache read], user [KV, cache read]
prefill: assistant [KV, recomputed], tool call [KV, recomputed], result [KV]
decode: assistant [KV]

The assistant and tool call KVs that weren’t retained from the previous request are now recomputed during prefill.

During inference, Key and Value vectors (KVs) are generated for every token. KVs for input tokens are generated all at once during prefill. The K and V for each output token are generated one at a time during autoregressive decode⁴.

After decode is complete (because the LLM completed a tool call), the KVs are retained in blocks⁵ so they can be reused, instead of recomputed, during the subsequent request.

However, only KV blocks for input tokens are retained, meaning that the KVs for output tokens from request 1 must be recomputed during the prefill stage of request 2.

If this sounds redundant, that’s because it is. There is nothing preventing them from retaining the KV blocks for the output tokens too. In fact this is exactly what the open source inference engines SGLang⁶ and vLLM⁷ do.

how it should look

Never prefill tokens you just decoded

Instead of discarding the KVs for output tokens we can just retain their blocks with the rest of the prompt prefix cache at the end of request 1.

Then on our next request (request 2) all of our tokens from request 1 will be in the prompt prefix cache, saving us from recomputing anything during prefill.

request 1:

prefill: system [KV, retained], user [KV, retained]
decode: assistant [KV, retained], tool call [KV, retained]

The KVs generated during decode are retained with the rest of the prompt prefix cache so they can be reused in the next request.

request 2:

cache lookup/restore: system [KV, cache read], user [KV, cache read], assistant [KV, cache read], tool call [KV, cache read]
prefill: result [KV]
decode: assistant [KV]

Now only the KVs for the tool result are computed during prefill, since the KVs for output tokens from the previous request were retained.

Notice how there is no longer any overlap between prefill and decode across requests.
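To make the difference concrete, here is a toy sketch of a prompt prefix cache. It is not how SGLang or vLLM implement caching (they use radix trees and hashed blocks); the cache here is just a set of retained prefixes, each "token" stands in for a whole message, and the function name `prefill_counts` is made up for illustration.

```python
def prefill_counts(requests, retain_output):
    """For each (prompt, output) request, count prompt tokens that must be prefilled.

    The cache is modelled as a set of token-sequence prefixes whose KVs are retained.
    """
    cached = set()
    counts = []
    for prompt, output in requests:
        # find the longest prefix of this prompt that is already cached
        hit = 0
        for n in range(len(prompt), 0, -1):
            if tuple(prompt[:n]) in cached:
                hit = n
                break
        counts.append(len(prompt) - hit)
        # retain KVs for the prompt, and optionally for the generated tokens too
        retained = prompt + output if retain_output else prompt
        for n in range(1, len(retained) + 1):
            cached.add(tuple(retained[:n]))
    return counts


# each list element stands in for all the tokens of one message
request_1 = (["system", "user"], ["assistant", "tool_call"])
request_2 = (["system", "user", "assistant", "tool_call", "result"], ["assistant_2"])

print(prefill_counts([request_1, request_2], retain_output=False))  # [2, 3]
print(prefill_counts([request_1, request_2], retain_output=True))   # [2, 1]
```

With output retention, request 2 only prefills the tool result; without it, the assistant message and tool call are prefilled again.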

how this looks from the API user perspective

If API providers retain output tokens in the prompt prefix cache then we only have to pay cache read price for every subsequent request that contains them.

request 1

system message and tool definitions You are a weather man $12.50 / 1M

user message what is the weather in Tokyo? $12.50 / 1M

result 1

system message and tool definitions You are a weather man $12.50 / 1M

user message what is the weather in Tokyo? $12.50 / 1M

assistant message I’ll use my get_weather tool to get the weather in Tokyo $50.00 / 1M

tool call(s) <tool name=get_weather> <param name=city>Tokyo</param> </tool> $50.00 / 1M

Now the assistant message and tool call are retained in the prompt prefix cache.

request 2

system message and tool definitions You are a weather man $1.00 / 1M

user message what is the weather in Tokyo? $1.00 / 1M

assistant message I’ll use my get_weather tool to get the weather in Tokyo $1.00 / 1M

tool call(s) <tool name=get_weather> <param name=city>Tokyo</param> </tool> $1.00 / 1M

tool result(s) Cloudy $12.50 / 1M

result 2

system message and tool definitions You are a weather man $1.00 / 1M

user message what is the weather in Tokyo? $1.00 / 1M

assistant message I’ll use my get_weather tool to get the weather in Tokyo $1.00 / 1M

tool call(s) <tool name=get_weather> <param name=city>Tokyo</param> </tool> $1.00 / 1M

tool result(s) Cloudy $12.50 / 1M

assistant message The weather in Tokyo is Cloudy $50.00 / 1M

On the next request, the assistant message and tool call are charged at cache read pricing. Only the tool result pays cache write input price.

Notice how now we only pay full price for new input or output tokens the first time they appear. From then on they are always ‘cache read input tokens’.
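The two billing schemes can be compared with a small simulation of an append-only loop. This is a rough sketch: the prices are the per-1M-token figures from the example above, the token counts are invented for illustration, and `loop_cost` is a hypothetical helper. The ideal case assumes retaining output tokens costs nothing extra.

```python
# $ per 1M tokens, taken from the example above
PRICES = {"cache_write": 12.50, "cache_read": 1.00, "output": 50.00}


def loop_cost(steps, retain_output):
    """Total cost in dollars of an append-only loop.

    steps: one (new_input_tokens, output_tokens) pair per API request.
    retain_output: True models the proposed behaviour, False the current one.
    """
    cached = 0   # tokens already in the prompt prefix cache
    pending = 0  # output tokens from the previous request that were not cached
    total = 0.0
    for new_input, output in steps:
        write = new_input + pending  # tokens billed at cache write price
        total += cached * PRICES["cache_read"]
        total += write * PRICES["cache_write"]
        total += output * PRICES["output"]
        cached += write
        if retain_output:
            cached += output
            pending = 0
        else:
            pending = output
    return total / 1_000_000


# illustrative token counts: (system + user, assistant + tool call), (tool result, final answer)
steps = [(1000, 200), (50, 100)]
current = loop_cost(steps, retain_output=False)
ideal = loop_cost(steps, retain_output=True)
print(f"current: ${current:.6f}  ideal: ${ideal:.6f}  difference: ${current - ideal:.6f}")
```

The difference is exactly the 200 output tokens from the first result billed at cache write price instead of cache read price.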

how much money would this save?

~12.9-18.4% on output tokens, depending on the provider.

Now that we understand how prompt prefix caching should work, let’s estimate how much we should ideally be paying.

output + cache WRITE input = CURRENT cost of output tokens
$50.00 + $12.50            = $62.50 / 1M tokens

output + cache READ  input = IDEAL   cost of output tokens
$50.00 + $1.00             = $51.00 / 1M tokens

$62.50 - $51.00 = $11.50 extra per 1M output tokens
$51.00 / $62.50 = 0.816 => 18.4% savings

output + cache WRITE input = CURRENT cost of output tokens
$30.00 + $5.00             = $35.00 / 1M tokens

output + cache READ  input = IDEAL   cost of output tokens
$30.00 + $0.50             = $30.50 / 1M tokens

$35.00 - $30.50 = $4.50 extra per 1M output tokens
$30.50 / $35.00 = 0.871 => 12.9% savings

this improves performance too

Time to first token (TTFT):

No⁸ recomputation means less computation during prefill. Since prefill is compute bound, less computation means users get the first token faster.

Cache Hit Rate:

The SGLang RadixAttention paper defines the cache hit rate as “number of cached prompt tokens / number of prompt tokens”.

Since output tokens are part of the subsequent prompt it is trivial to see that having them cached will improve the cache hit rate, up to its theoretical limit.

however it is impossible to reach 100% under this definition since entirely new tokens are appended to the context each model request, e.g. tool results or new user prompts. instead we should measure cache read input tokens / total tokens at the end of previous request to have higher signal into true cache reuse and more easily see when we invalidate the prompt prefix cache.
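The two metrics can be compared side by side. A minimal sketch, with hypothetical function names and made-up token counts: the previous request ended with 1,200 tokens in context (1,000 prompt + 200 output), and the next prompt appends a 50-token tool result.

```python
def prompt_hit_rate(cache_read_tokens, prompt_tokens):
    """Cached prompt tokens / prompt tokens."""
    return cache_read_tokens / prompt_tokens


def reuse_rate(cache_read_tokens, prev_total_tokens):
    """Cache read tokens / total tokens (input + output) at the end of the previous request."""
    return cache_read_tokens / prev_total_tokens


prev_total = 1000 + 200   # context at the end of the previous request
prompt = prev_total + 50  # next prompt = previous context + new tool result

# without output retention only the 1000 input tokens are read from cache
print(prompt_hit_rate(1000, prompt), reuse_rate(1000, prev_total))
# with output retention all 1200 previous tokens are read from cache
print(prompt_hit_rate(1200, prompt), reuse_rate(1200, prev_total))
```

Under the first definition the best possible score here is 0.96, because the tool result is always new. Under the second, full reuse reads as 1.0, and any lower value points at an invalidated prefix or at tokens that were not retained.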

why doesn’t it look this way?

APIs were designed for chatbot applications.

NOTE: this section is entirely speculative

Engineers at these labs are smarter than me, so somebody has to know this. It is just that incentives keep it this way.

everyone is pretty much standardized on three api formats: OpenAI completions and responses and Anthropic’s messages.

so other inference providers just use the same standard.

so until someone implements this, nobody has to. but as soon as one does, everyone has to, since it’s hard to compete with a >10% discount.

Their internal inference engines probably already do this for their internal inference.

if anthropic doesn’t already do output KV reuse internally then this is quite literally a compute multiplier for them.

Note that this may not save providers any compute at all if they already retain output KVs under the hood.

That is possible, but charging cache write prices for tokens that are already cached would be a PR nightmare, so I doubt it.

how can API providers fix this?

They need to provide a way for users to request that the output token blocks are retained in the prompt prefix cache.

For providers with implicit caching like OpenAI and Gemini they can make this change entirely on the backend since they already handle caching for their users.

For providers with explicit caching like Anthropic this requires an update to the API.

proposed Anthropic api implementation

if people are already using automatic caching, Anthropic could extend their top level cache_control object with an optional key.

# call
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-fable-5",
    max_tokens=1024,
    ####################################
    cache_control={ # existing automatic caching param
        "cache_output": True, # new (proposed, not part of the API today), indicates to cache output
        "type": "ephemeral", # or e.g. "intelligent" to only retain cache if stop_reason == "tool_use"
        "ttl": "5m",
    },
    ####################################
    system="You are a helpful assistant.",
    messages=[
        {
            "role": "user",
            "content": "do something agentic (call tools in a loop for me) please",
        }
    ],
)

# response
{
  "content": ...,
  "usage": {
    "input_tokens": 2048,
    "cache_read_input_tokens": 1800,
    "cache_creation_input_tokens": 248,
    "output_tokens": 503,
    "cache_creation_output_tokens": 503, # new (proposed)
  }
}

proposed new pricing model

Again, providers where caching is already priced in⁹ don’t need to change anything.

However, Anthropic charges extra for storing the KV cache between requests: an additional 25% of the input token price to store tokens in cache for up to 5 minutes. Essentially this is a flat cost paid per token when that token is retained in the prompt prefix cache between API requests. Currently it is only applied to input tokens, but since there is no fundamental difference between caching input and output tokens, it would make sense to have a single price for caching any type of token.

If they continue with the same pricing model, they would likely need to add something akin to a cache_retention_per_1M_tokens price. Let’s assume this would be the same 0.25x input token price it is currently.

So for Fable 5 this would be $10.00 * 0.25 = $2.50 per 1M tokens.

output + cache WRITE tax + cache READ input = LIKELY cost of output tokens
$50.00 + $2.50           + $1.00            = $53.50 / 1M tokens

$62.50 - $53.50 = $9.00 saved per 1M output tokens
$53.50 / $62.50 = 0.856 => 14.4% savings

Not quite as good as the 18.4% from above but still a free ~15% savings.
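The same estimate can be parameterized so it is easy to rerun with other prices. A sketch only: `output_token_costs` is a hypothetical helper, the default multipliers are the 1.25x cache write and 0.1x cache read figures used in this post, and the retention fee is assumed to equal the cache write surcharge (0.25x input), as above.

```python
def output_token_costs(input_price, output_price, write_mult=1.25, read_mult=0.1):
    """Return (current, likely, savings) for 'agentic' output tokens, in $ per 1M tokens.

    current: output price + cache write price on the next request
    likely:  output price + assumed retention fee + cache read price on the next request
    savings: fraction saved by moving from current to likely
    """
    cache_write = input_price * write_mult
    cache_read = input_price * read_mult
    retention_fee = cache_write - input_price  # assumption: same surcharge as input tokens
    current = output_price + cache_write
    likely = output_price + retention_fee + cache_read
    return current, likely, 1 - likely / current


current, likely, savings = output_token_costs(input_price=10.00, output_price=50.00)
print(f"${current:.2f} -> ${likely:.2f} per 1M output tokens ({savings:.1%} savings)")
# $62.50 -> $53.50 per 1M output tokens (14.4% savings)
```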

what about the final assistant message at the end of a turn?

as we discussed earlier, we expect every assistant message with tool calls to have a follow up api request, with tool results, before the cache expiration.

however eventually the loop stops and we don’t want to pay to cache those final output tokens do we?

it depends … on whether or not the user will send a follow up message/request before the cache retention period expires

so how do we know when to cache?

… we dont

… but the llm does

Notice how whether or not we want to cache depends on what the LLM decides to do.

If the llm calls a tool we want to cache. If the llm does not, then we probably don’t, since we have no guarantee of a subsequent api request that will make use of whatever we just cached.

and it turns out that implementing this is pretty simple: providers can distinguish between these two cases based on the stop_reason for why decode ended. Cache when stop_reason == "tool_use" and don’t cache when stop_reason in ("end_turn", "refusal").

Notice that this doesn’t just have to apply to the proposed cache write output tokens. we can apply this conditional cache retention to all new tokens in this request.

this would be huge because it would allow callers to not request a cache write for any new tokens (input and output) if there aren’t any tool calls.
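As a sketch of that rule, the retention decision on the provider side could be as small as the function below. The name `tokens_to_cache` is hypothetical, and treating every stop reason other than "tool_use" as "don't cache" is my assumption, not something any provider documents.

```python
def tokens_to_cache(stop_reason, new_input_tokens, output_tokens):
    """Number of new tokens whose KV blocks should be retained after this request."""
    if stop_reason == "tool_use":
        # a follow-up request carrying the tool results is expected,
        # so retain everything new in this request: input and output
        return new_input_tokens + output_tokens
    # "end_turn", "refusal", and (by assumption) any other stop reason:
    # no guaranteed follow-up, so skip the cache write entirely
    return 0


print(tokens_to_cache("tool_use", 50, 120))  # 170
print(tokens_to_cache("end_turn", 50, 120))  # 0
```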

appendix

evolution of LLM API usage

I don’t believe that prompt caching APIs are purposely built to charge you twice for the current most common usage pattern (append-only agentic inference). The issue is that common usage patterns have shifted significantly since the APIs were first designed.

prompt caching was built for chatbots, not agents¹⁰

prompt caching APIs were initially designed for chat applications (e.g. ChatGPT), where they allowed users of the api to reuse the cache for the system message across chats for all users.

Chat 1:

request 1

system message and tool definitions You are a helpful assistant $12.50 / 1M

user message How many r’s are in “strawberry”? $10.00 / 1M

result 1

system message and tool definitions You are a helpful assistant $12.50 / 1M

user message How many r’s are in “strawberry”? $10.00 / 1M

assistant message There are 3 r’s in “strawberry” $50.00 / 1M

neither the user nor the assistant message is retained in the prompt prefix cache

Chat 2:

could be the same user or a different user

request 2

system message and tool definitions You are a helpful assistant $1.00 / 1M

user message Which is greater, 9.9 or 9.11? $10.00 / 1M

result 2

system message and tool definitions You are a helpful assistant $1.00 / 1M

user message Which is greater, 9.9 or 9.11? $10.00 / 1M

assistant message 9.9 is greater than 9.11 $50.00 / 1M

another chat can reuse the cached prompt prefix (e.g. the shared system message)

multi-turn conversations: what if the user asks a follow up question?

prompt caching works alright for this as well. but tbh we could have seen this issue back then. both SGLang and vLLM did.

request 1

system message and tool definitions You are a helpful assistant. $12.50 / 1M

user message How many r’s are in “strawberry”? $12.50 / 1M

result 1

system message and tool definitions You are a helpful assistant. $12.50 / 1M

user message How many r’s are in “strawberry”? $12.50 / 1M

assistant message There are 3 r’s in “strawberry” $50.00 / 1M

request 2

system message and tool definitions You are a helpful assistant. $1.00 / 1M

user message How many r’s are in “strawberry”? $1.00 / 1M

assistant message There are 3 r’s in “strawberry” $12.50 / 1M

user message Which is greater, 9.9 or 9.11? $12.50 / 1M

result 2

system message and tool definitions You are a helpful assistant. $1.00 / 1M

user message How many r’s are in “strawberry”? $1.00 / 1M

assistant message There are 3 r’s in “strawberry” $12.50 / 1M

user message Which is greater, 9.9 or 9.11? $12.50 / 1M

assistant message 9.9 is greater than 9.11 $50.00 / 1M

since the assistant message wasn’t cached after the first result you have to pay cache write input price for tokens you already paid full output price for

notice that for each request after the first we are writing the Assistant message from the previous turn to the cache.

when should I write to the cache?

if you use the cache even once it is worth the price

cache_write = 1.25 x input
cache_read = 0.1 x input

so if you write 1000 tokens to the cache you pay for 1250 tokens then on next api call, with the same prefix, you get a cache hit and read all those 1000 cached tokens at the price of 100 tokens. so you paid for the equivalent of 1350 input tokens.

however if you don’t cache, you pay regular price for 1000 tokens, then if you make a request with the same prefix you pay regular price again for 1000 tokens, 2000 in total. so by the second request you have already paid more than if you had cached. not to mention that caching also improves api response time.
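The break-even point can be checked with a few lines of Python. A small sketch with a hypothetical helper, using the 1.25x and 0.1x multipliers above and measuring cost in "regular input token equivalents":

```python
def cost_in_input_tokens(prefix_tokens, reuses, use_cache, write_mult=1.25, read_mult=0.1):
    """Cost of sending a prefix once and then reusing it `reuses` more times."""
    if use_cache:
        # one cache write, then a cache read for every reuse
        return prefix_tokens * write_mult + reuses * prefix_tokens * read_mult
    # regular input price on every request
    return prefix_tokens * (1 + reuses)


for reuses in (0, 1, 2):
    cached = cost_in_input_tokens(1000, reuses, use_cache=True)
    uncached = cost_in_input_tokens(1000, reuses, use_cache=False)
    print(f"reuses={reuses}: cached={cached:.0f} uncached={uncached:.0f}")
```

With zero reuses the cache write costs 1250 against 1000, so it is a loss. From the first reuse onward (1350 against 2000) caching is cheaper, and the gap widens with every further request.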

my takeaway from this is:

if you are going to make another llm api call with the same prompt prefix, in the next 5 minutes, you should write that prefix to cache

in vanilla¹¹ multi-turn conversation it is unrealistic to predict ex ante whether or not the user will ask a follow up question, so builders on the api either always or never pay the cache write cost, depending on their app’s usage patterns.

notice how with the shift from vanilla multi-turn conversations to agent loops it actually got easier to predict whether or not we will reuse the cache: if the assistant calls a tool in its response then we know that we will reuse the cache in the very next request. This is why builders now always pay the cache write cost.

minimal append-only agent loop

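from dataclasses import dataclass, field


@dataclass
class Message:
    role: str  # "system", "user", "assistant" or "tool"
    content: str
    tool_calls: list = field(default_factory=list)


def run_agent_loop(llm, execute_tool, messages: list[Message]) -> str:
    # llm.generate(messages) -> Message and execute_tool(tool_call) -> Message
    # are placeholders supplied by the caller
    while True:
        assistant_message = llm.generate(messages)
        # append the whole message (text and tool calls), not just its content,
        # so the next request's prefix matches what was generated
        messages.append(assistant_message)

        # if there are no tool calls, we are done
        if not assistant_message.tool_calls:
            return assistant_message.content

        # otherwise execute tool calls, append tool results and repeat
        for tool_call in assistant_message.tool_calls:
            messages.append(execute_tool(tool_call))


messages = [
    Message("system", "You are a helpful assistant."),
    Message("user", "What is the weather in Tokyo?"),
]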