
Context cache

Cut cost with prefix reuse

Context cache caches the common prefix of overlapping requests to avoid redundant computation. This improves response speed and reduces costs without affecting response quality. Context cache provides three modes to suit different scenarios. Choose based on your requirements for convenience, certainty, and cost:
  • Explicit cache: A cache mode that you need to manually enable. Create a cache for specific content to achieve a guaranteed hit. The cache is valid for 5 minutes. Tokens used to create the cache are billed at 125% of the standard input token price. Subsequent hits are billed at 10% of the standard price.
  • Implicit cache: An automatic mode that requires no extra configuration and cannot be disabled. It is suitable for general scenarios where convenience is the priority. The system automatically identifies and caches the common prefix of requests, but the hit rate is not guaranteed. The cached portion that is hit is billed at 20% of the standard input token price.
  • Session cache: Designed for multi-turn conversation scenarios that use the Responses API. By adding x-dashscope-session-cache: enable to the request header, the server automatically caches the conversation context. The billing rules are the same as for explicit cache: tokens used to create the cache are billed at 125% of the standard input token price, and hits are billed at 10%.
| Item | Explicit cache | Implicit cache | Session cache |
| --- | --- | --- | --- |
| Affects response quality | No impact | No impact | No impact |
| Billing for tokens used to create the cache | 125% of the standard input token price | 100% of the standard input token price | 125% of the standard input token price |
| Billing for cached input tokens that are hit | 10% of the standard input token price | 20% of the standard input token price | 10% of the standard input token price |
| Minimum tokens for caching | 1024 | 256 | 1024 |
| Cache validity period | 5 minutes (resets on hit) | Not guaranteed. The system periodically clears unused cached data. | 5 minutes (resets on hit) |
  • When you use the Chat Completions API or DashScope API, explicit cache and implicit cache are mutually exclusive. A single request can use only one of these modes.
  • When you use the Responses API, if session cache is not enabled, implicit cache is used if the model supports it.
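As a quick way to compare the modes, the multipliers from the table above can be expressed in a few lines. This is only an illustrative sketch: prices are relative to the standard input token price, and the numbers are taken directly from the table.

```python
# Billing multipliers per mode, relative to the standard input token price
# (from the table above; implicit cache has no creation surcharge).
CACHE_PRICING = {
    "explicit": {"creation": 1.25, "hit": 0.10, "min_tokens": 1024},
    "implicit": {"creation": 1.00, "hit": 0.20, "min_tokens": 256},
    "session":  {"creation": 1.25, "hit": 0.10, "min_tokens": 1024},
}

# Relative cost of a 2000-token prefix that hits the cache in each mode:
for mode, p in CACHE_PRICING.items():
    print(f"{mode}: billed like {2000 * p['hit']:.0f} tokens instead of 2000")
```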

Explicit cache

Compared to implicit cache, explicit cache requires you to manually create it and incurs additional overhead. However, it provides a higher cache hit rate and lower access latency.

Usage

Add the "cache_control": {"type": "ephemeral"} marker in the messages array. The system then searches backward from the position of each cache_control marker for up to 20 content blocks and attempts to match against existing cache blocks.
A single request supports a maximum of four cache markers.
  • Cache miss: The system creates a new cache block using the content from the beginning of the messages array to the cache_control marker. The cache block is valid for 5 minutes.
    • Cache creation occurs after the model responds. We recommend that you wait until the creation request is complete before sending subsequent requests.
    • A cache block must contain at least 1024 tokens.
  • Cache hit: The system selects the longest matching prefix as the hit cache block and resets the validity period of that block to 5 minutes.
Example:
Step 1: Send the first request
Send a system message that contains text A with more than 1024 tokens and add a cache marker.
[{"role": "system", "content": [{"type": "text", "text": "A", "cache_control": {"type": "ephemeral"}}]}]
The system creates the first cache block, which is referred to as cache block A.
Step 2: Send the second request
Send a request with the following structure:
[
  {"role": "system", "content": "A"},
  // <other messages>
  {"role": "user","content": [{"type": "text", "text": "B", "cache_control": {"type": "ephemeral"}}]}
]
  • If "other messages" adds no more than 20 content blocks between cache block A and the marker, cache block A is hit, and its validity period is reset to 5 minutes. The system also creates a new cache block covering A, the other messages, and B.
  • If "other messages" adds more than 20 content blocks, cache block A is not hit. The system creates a new cache block from the full context (A + other messages + B).

Supported models

  • Qwen-Max: qwen3-max
  • Qwen-Plus: qwen3.6-plus, qwen3.5-plus, qwen-plus
  • Qwen-Flash: qwen3.5-flash, qwen-flash
  • Qwen-Coder: qwen3-coder-plus, qwen3-coder-flash

Getting started

The following examples show how cache blocks are created and hit.
from openai import OpenAI
import os

client = OpenAI(
  # If the environment variable is not set, replace the following line with: api_key="sk-xxx"
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Mock code repository content. The minimum cacheable prompt length is 1024 tokens.
long_text_content = "<Your Code Here>" * 400

# Function to send the request
def get_completion(user_input):
  messages = [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": long_text_content,
          # Place the cache_control marker here to create a cache block containing all content from the beginning of the messages array to this content block.
          "cache_control": {"type": "ephemeral"},
        }
      ],
    },
    # The question content is different for each request.
    {
      "role": "user",
      "content": user_input,
    },
  ]
  completion = client.chat.completions.create(
    # Select a model that supports explicit cache.
    model="qwen3-coder-plus",
    messages=messages,
  )
  return completion

# First request
first_completion = get_completion("What is the content of this code?")
print(f"First request cache creation tokens: {first_completion.usage.prompt_tokens_details.cache_creation_input_tokens}")
print(f"First request cached hit tokens: {first_completion.usage.prompt_tokens_details.cached_tokens}")
print("=" * 20)
# Second request, the code content is the same, only the question is changed.
second_completion = get_completion("How can this code be optimized?")
print(f"Second request cache creation tokens: {second_completion.usage.prompt_tokens_details.cache_creation_input_tokens}")
print(f"Second request cached hit tokens: {second_completion.usage.prompt_tokens_details.cached_tokens}")
Subsequent requests about the same code repository reuse this cache block, resulting in faster responses and lower costs.
First request cache creation tokens: 1605
First request cached hit tokens: 0
====================
Second request cache creation tokens: 0
Second request cached hit tokens: 1605

Use multiple cache markers for fine-grained control

In complex scenarios, a prompt often consists of multiple parts with different reuse frequencies. You can use multiple cache markers to achieve fine-grained control. For example, the prompt for a smart customer service agent typically includes:
  • System settings: Highly stable and almost never changes.
  • External knowledge: Semi-stable. It is retrieved from a knowledge base or by calling a tool and may remain unchanged during a continuous conversation.
  • Conversation history: Grows dynamically.
  • Current question: Different each time.
If you cache the entire prompt as a single unit, any minor change, such as a change in external knowledge, invalidates the cache. You can set up to four cache markers in a request to create separate cache blocks for different parts of the prompt. This improves the hit rate and provides fine-grained control.
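As a sketch of such a layout (the content strings below are placeholders, not part of the service), a prompt split across three markers might look like this:

```python
# Sketch of a messages array with three cache markers (placeholder content).
# Each marker caches everything from the start of the array up to that block,
# so the stable parts keep hitting the cache even when later parts change.
messages = [
    {
        "role": "system",
        "content": [
            # Marker 1: system settings, which almost never change.
            {"type": "text", "text": "<system settings>",
             "cache_control": {"type": "ephemeral"}},
            # Marker 2: retrieved knowledge, stable within a session.
            {"type": "text", "text": "<external knowledge>",
             "cache_control": {"type": "ephemeral"}},
        ],
    },
    # ... conversation history ...
    {
        "role": "user",
        # Marker 3: caches the context up to and including the latest turn.
        "content": [{"type": "text", "text": "<current question>",
                     "cache_control": {"type": "ephemeral"}}],
    },
]
```

If the external knowledge changes, only the second and third cache blocks are invalidated; the system-settings block can still be hit.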

Billing

Explicit cache affects only the pricing of input tokens. The rules are as follows:
  • Cache creation: Newly created cache content is billed at 125% of the standard input price. If the new cache content contains an existing cache as a prefix, only the incremental portion is billed for creation (new cache tokens minus existing cache tokens). For example, if you have an existing cache A with 1200 tokens and a new request needs to cache content AB with 1500 tokens, the first 1200 tokens are billed as a cache hit (10% of the standard price), and the new 300 tokens are billed for cache creation (125% of the standard price).
    Check the number of tokens used for cache creation in the cache_creation_input_tokens parameter.
  • Cache hit: Billed at 10% of the standard input price.
    Check the number of hit cache tokens in the cached_tokens parameter.
  • Other tokens: Tokens that do not match any cache and are not used to create a cache are billed at the standard price.
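The incremental rule in the example above (existing cache A of 1,200 tokens, new cache AB of 1,500 tokens) can be checked with a few lines of arithmetic. The unit price below is a placeholder; only the ratios come from the billing rules.

```python
# Sketch of the incremental cache-creation billing described above.
price_per_token = 1.0          # placeholder standard input token price

existing_cache_tokens = 1200   # cache A, billed as a hit (10%)
new_cache_tokens = 1500        # cache AB being created

hit_cost = existing_cache_tokens * price_per_token * 0.10
creation_cost = (new_cache_tokens - existing_cache_tokens) * price_per_token * 1.25

total = hit_cost + creation_cost
print(total)  # 120.0 + 375.0 = 495.0
```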

Cacheable content

Only the following message types in the messages array support adding cache markers:
  • System message
  • User message
  • Assistant message
  • Tool message (the result after a tool is executed)
    If a request includes the tools parameter, adding a cache marker in messages also caches the tool descriptions defined in the request.
For example, for a system message, change the content field to an array and add the cache_control field:
{
  "role": "system",
  "content": [
    {
      "type": "text",
      "text": "<Your specified prompt>",
      "cache_control": {
        "type": "ephemeral"
      }
    }
  ]
}
This structure also applies to other message types in the messages array.
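For example, a tool message could carry the marker in the same way. The tool_call_id and result text below are hypothetical placeholders, and per the note above, the tools definitions in such a request would be cached along with the marked messages:

```python
# Hypothetical example: adding a cache marker to a tool result message.
# The ID and result text are placeholders, not real values.
tool_message = {
    "role": "tool",
    "tool_call_id": "call_123",  # placeholder ID from the assistant's tool call
    "content": [
        {
            "type": "text",
            "text": "<tool execution result>",
            "cache_control": {"type": "ephemeral"},
        }
    ],
}
```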

Cache limits

  • The minimum cacheable prompt length is 1024 tokens.
  • The cache uses a backward prefix matching strategy. The system automatically checks the last 20 content blocks. If the content to be matched is separated from the message with the cache_control marker by more than 20 content blocks, the cache is not hit.
  • Setting type to ephemeral is the only supported option. This sets a validity period of 5 minutes.
  • A single request can have a maximum of 4 cache markers.
    If the number of cache markers is greater than four, only the last four cache markers take effect.

Usage examples

from openai import OpenAI
import os

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Mock code repository content
long_text_content = "<Your Code Here>" * 400

# Function to send the request
def get_completion(user_input):
  messages = [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": long_text_content,
          # Place the cache_control marker here to create a cache from the beginning of the prompt to the end of this content block (the mock code repository content).
          "cache_control": {"type": "ephemeral"},
        }
      ],
    },
    {
      "role": "user",
      "content": user_input,
    },
  ]
  completion = client.chat.completions.create(
    # Select a model that supports explicit cache.
    model="qwen3-coder-plus",
    messages=messages,
  )
  return completion

# First request
first_completion = get_completion("What is the content of this code?")
created_cache_tokens = first_completion.usage.prompt_tokens_details.cache_creation_input_tokens
print(f"First request cache creation tokens: {created_cache_tokens}")
hit_cached_tokens = first_completion.usage.prompt_tokens_details.cached_tokens
print(f"First request cached hit tokens: {hit_cached_tokens}")
print(f"First request tokens not hit and not cached: {first_completion.usage.prompt_tokens - created_cache_tokens - hit_cached_tokens}")
print("=" * 20)
# Second request, the code content is the same, only the question is changed.
second_completion = get_completion("What are some areas for optimization in this code?")
created_cache_tokens = second_completion.usage.prompt_tokens_details.cache_creation_input_tokens
print(f"Second request cache creation tokens: {created_cache_tokens}")
hit_cached_tokens = second_completion.usage.prompt_tokens_details.cached_tokens
print(f"Second request cached hit tokens: {hit_cached_tokens}")
print(f"Second request tokens not hit and not cached: {second_completion.usage.prompt_tokens - created_cache_tokens - hit_cached_tokens}")
This example caches the code repository content as a prefix. Subsequent requests ask different questions about the repository.
First request cache creation tokens: 1605
First request cached hit tokens: 0
First request tokens not hit and not cached: 13
====================
Second request cache creation tokens: 0
Second request cached hit tokens: 1605
Second request tokens not hit and not cached: 15
To ensure model performance, the system appends a small number of internal tokens. These tokens are billed at the standard input price. For more information, see the FAQ.
In a typical multi-turn chat scenario, you can add a cache marker to the last content block of the messages array in each request. From the second turn onward, each request hits and refreshes the cache block created in the previous turn, and also creates a new cache block.
from openai import OpenAI
import os

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

system_prompt = "You are a witty person." * 400
messages = [{"role": "system", "content": system_prompt}]

def get_completion(messages):
  completion = client.chat.completions.create(
    model="qwen3-coder-plus",
    messages=messages,
  )
  return completion

while True:
  user_input = input("Please enter: ")
  messages.append({"role": "user", "content": [{"type": "text", "text": user_input, "cache_control": {"type": "ephemeral"}}]})
  completion = get_completion(messages)
  print(f"[AI Response] {completion.choices[0].message.content}")
  messages.append(completion.choices[0].message)
  created_cache_tokens = completion.usage.prompt_tokens_details.cache_creation_input_tokens
  hit_cached_tokens = completion.usage.prompt_tokens_details.cached_tokens
  uncached_tokens = completion.usage.prompt_tokens - created_cache_tokens - hit_cached_tokens
  print(f"[Cache Info] Cache creation tokens: {created_cache_tokens}")
  print(f"[Cache Info] Cached hit tokens: {hit_cached_tokens}")
  print(f"[Cache Info] Tokens not hit and not cached: {uncached_tokens}")
Run the code above and enter questions to communicate with the model. Each question will hit the cache block created in the previous turn.

Implicit cache

Supported models

Snapshot models and models with the -latest suffix are not supported.
Text generation
Visual understanding
  • Qwen-VL: qwen3-vl-plus, qwen3-vl-flash, qwen-vl-max, qwen-vl-plus
Domain specific
  • Role playing: qwen-plus-character, qwen-flash-character, qwen-plus-character-ja

How it works

When you send a request to a model that supports implicit cache, the feature is automatically enabled. The system works as follows:
  1. Find: Upon receiving a request, the system checks the cache for a prefix that matches the content of the request's messages array.
  2. Evaluate:
    • If the cache is hit, the system reuses the cached computation and runs inference only on the remainder of the prompt.
    • If the cache is missed, the system processes the request normally and stores the prefix of the current prompt in the cache for subsequent requests.
The system periodically clears unused cached data, and content with fewer than 256 tokens is not cached. The hit rate is not guaranteed to be 100% — a cache miss may occur even with identical request context.

Increase hit rate

Implicit cache identifies duplicate prefix content across different requests. To increase the hit rate, place static content at the beginning of the prompt and variable content at the end.
  • Text-only: If the system has cached "ABCD", a request for "ABE" can match the "AB" prefix, while a request for "BCD" will not match any cache.
  • Visual understanding:
    • When asking multiple questions about the same image or video: Place the image or video before the text to increase the hit rate.
    • When asking the same question about different images or videos: Place the text before the image or video to increase the hit rate.
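The text-only case can be illustrated with a toy function. This is only a conceptual sketch: the actual service matches on tokens and content blocks, not individual characters.

```python
def common_prefix_len(cached: str, request: str) -> int:
    """Length of the shared prefix between a cached prompt and a new request."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n

# With "ABCD" cached:
print(common_prefix_len("ABCD", "ABE"))  # 2 -> the "AB" prefix can be reused
print(common_prefix_len("ABCD", "BCD"))  # 0 -> no shared prefix, nothing is reused
```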

Billing

There is no additional fee for implicit cache. When a request hits the cache, the matched input tokens are billed as cached tokens at 20% of the standard input token price. Input tokens that do not hit the cache, and all output tokens, are billed at the standard price.
Example: A request contains 10,000 input tokens, of which 5,000 hit the cache:
  • Non-hit tokens (5,000): Billed at 100% of the unit price
  • Hit tokens (5,000): Billed at 20% of the unit price
The total input cost is equivalent to 60% of the uncached cost: (50% x 100%) + (50% x 20%) = 60%.
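The 60% figure can be verified with a quick calculation; the unit price below is a placeholder.

```python
# Blended input cost for implicit cache, using the numbers from the example above.
price = 1.0                 # placeholder standard input token price
input_tokens = 10_000
cached_tokens = 5_000       # portion that hit the cache, billed at 20%

cost = (input_tokens - cached_tokens) * price + cached_tokens * price * 0.20
print(cost)                           # 6000.0
print(cost / (input_tokens * price))  # 0.6 -> 60% of the uncached cost
```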
You can retrieve the number of hit cache tokens from the cached_tokens attribute of the returned result.
Requests submitted through the OpenAI-compatible Batch (file input) method are not eligible for cache discounts.

Cache hit examples

Text generation

Check the number of hit cache tokens in usage.prompt_tokens_details.cached_tokens. This count is included in usage.prompt_tokens (OpenAI compatible) or usage.input_tokens (DashScope).
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "I am a super-large language model developed by Alibaba Cloud. My name is Qwen."
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 3019,
    "completion_tokens": 104,
    "total_tokens": 3123,
    "prompt_tokens_details": {
      "cached_tokens": 2048
    }
  },
  "created": 1735120033,
  "system_fingerprint": null,
  "model": "qwen-plus",
  "id": "chatcmpl-6ada9ed2-7f33-9de2-8bb0-78bd4035025a"
}

Visual understanding

Check hit cache tokens in usage.prompt_tokens_details.cached_tokens (part of usage.prompt_tokens for OpenAI compatible) or usage.cached_tokens (part of usage.input_tokens for DashScope).
Models that currently use usage.cached_tokens will be upgraded to usage.prompt_tokens_details.cached_tokens in the future.
{
  "id": "chatcmpl-3f3bf7d0-b168-9637-a245-dd0f946c700f",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "This image shows a heartwarming scene of a woman and a dog interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, smiling as she interacts with the dog. The dog is a large, light-colored breed with a colorful collar, and its front paw is raised as if to shake hands or give a high-five to the woman. The background is a vast ocean and sky, with sunlight shining from the right side of the frame, adding a warm and peaceful atmosphere to the whole scene.",
        "refusal": null,
        "role": "assistant",
        "audio": null,
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1744956927,
  "model": "qwen3-vl-plus",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 93,
    "prompt_tokens": 1316,
    "total_tokens": 1409,
    "completion_tokens_details": null,
    "prompt_tokens_details": {
      "audio_tokens": null,
      "cached_tokens": 1152
    }
  }
}

Typical scenarios

  1. Q&A based on long text: This applies to scenarios that require multiple requests about a fixed long text, such as a novel, textbook, or legal document.
Message array for the first request:
messages = [{"role": "system","content": "You are a language teacher and you can help students with reading comprehension."},
  {"role": "user","content": "<Article content> What is the author's main idea in this text?"}]
Message array for subsequent requests
messages = [{"role": "system","content": "You are a language teacher and you can help students with reading comprehension."},
  {"role": "user","content": "<Article content> Please analyze the third paragraph of this text."}]
Although the questions differ, they reference the same article. Because the system prompt and article content remain unchanged, each request shares a large overlapping prefix, increasing the likelihood of a cache hit.
  2. Code auto-completion: The model completes code based on the existing context. As the user continues coding, the earlier part of the code remains unchanged, so context cache can cache the preceding code to improve completion speed.
  3. Multi-turn conversation: In a multi-turn conversation, the conversation history from all previous turns is included in the messages array. Each turn's request therefore shares the same prefix as the previous turn, resulting in a high probability of a cache hit.
Message array for the first turn of conversation:
messages=[{"role": "system","content": "You are a helpful assistant."},
  {"role": "user","content": "Who are you?"}]
Message array for the second turn of conversation
messages=[{"role": "system","content": "You are a helpful assistant."},
  {"role": "user","content": "Who are you?"},
  {"role": "assistant","content": "I am Qwen, developed by Alibaba Cloud."},
  {"role": "user","content": "What can you do?"}]
As the number of conversation turns increases, the benefits of caching -- faster inference and lower cost -- become more pronounced.
  4. Role playing or few-shot learning: In these scenarios, you typically include a large amount of information in the prompt to guide the model's output format, which creates a large shared prefix across requests. For example, if you want the model to act as a marketing expert, the system prompt contains a large amount of text. The following are message examples for two requests:
system_prompt = """You are an experienced marketing expert. Please provide detailed marketing suggestions for different products in the following format:

1. Target audience: xxx

2. Main selling points: xxx

3. Marketing channels: xxx
...
12. Long-term development strategy: xxx

Please ensure that your suggestions are specific, actionable, and highly relevant to the product features."""

# User message for the first request, asking about a smartwatch
messages_1=[
  {"role": "system", "content": system_prompt},
  {"role": "user", "content": "Please provide marketing suggestions for a newly launched smartwatch."}
]

# User message for the second request, asking about a laptop. Because the system_prompt is the same, there is a high probability of hitting the cache.
messages_2=[
  {"role": "system", "content": system_prompt},
  {"role": "user", "content": "Please provide marketing suggestions for a newly launched laptop."}
]
With context cache, the system can respond quickly once a cache hit occurs, even if the user frequently changes the type of product they are asking about, such as from a smartwatch to a laptop.
  5. Video understanding: If you ask multiple questions about the same video, placing the video before the text increases the probability of a cache hit. If you ask the same question about different videos, placing the text before the video increases the probability of a cache hit. The following is a message example for two requests about the same video:
# User message for the first request, asking about the content of this video
messages1 = [
  {"role":"system","content":[{"text": "You are a helpful assistant."}]},
  {"role": "user",
    "content": [
      {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250328/eepdcq/phase_change_480p.mov"},
      {"text": "What is the content of this video?"}
    ]
  }
]

# User message for the second request, asking about the video timestamp. Because the question is based on the same video, placing the video before the text gives a high probability of hitting the cache.
messages2 = [
  {"role":"system","content":[{"text": "You are a helpful assistant."}]},
  {"role": "user",
    "content": [
      {"video": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250328/eepdcq/phase_change_480p.mov"},
      {"text": "Please describe the series of events in the video, and output the start time (start_time), end time (end_time), and event (event) in JSON format. Do not output the ```json``` code segment."}
    ]
  }
]

Session cache

Overview

Session cache is a cache mode for multi-turn conversation scenarios that use the Responses API. Unlike explicit cache, which requires you to manually add a cache_control marker, session cache handles the caching logic automatically on the server side. You only need to enable or disable it through an HTTP header and make calls as you would for a normal multi-turn conversation.
When you use previous_response_id for a multi-turn conversation, enabling session cache lets the server automatically cache the conversation context, which reduces inference latency and usage costs.

Usage

Add the following field to the request header to control the session cache:
  • x-dashscope-session-cache: enable: Enables session cache.
  • x-dashscope-session-cache: disable: Disables session cache. If the model supports implicit cache, it is used instead.
When you use an SDK, you can pass this header through the default_headers (Python) or defaultHeaders (Node.js) parameter. When you use curl, pass it with the -H parameter.

Supported models

qwen3-max, qwen3.6-plus, qwen3.5-plus, qwen3.5-flash, qwen-plus, qwen-flash, qwen3-coder-plus, qwen3-coder-flash
Session cache is applicable only to the Responses API (OpenAI Responses) and not to the Chat Completions API.

Code examples

import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/api/v2/apps/protocols/compatible-mode/v1",
  # Enable session cache via default_headers
  default_headers={"x-dashscope-session-cache": "enable"}
)

# Construct a long text exceeding 1024 tokens to ensure cache creation is triggered.
# (If it does not reach 1024 tokens, cache creation will be triggered when the accumulated conversation context exceeds 1024 tokens.)
long_context = "Artificial intelligence is an important branch of computer science, dedicated to the research and development of theories, methods, technologies, and application systems that can simulate, extend, and expand human intelligence." * 50

# First turn of conversation
response1 = client.responses.create(
  model="qwen3.6-plus",
  input=long_context + "\n\nBased on the background knowledge above, please briefly introduce the random forest algorithm in machine learning.",
)
print(f"First turn response: {response1.output_text}")

# Second turn of conversation: Associate the context via previous_response_id. The cache is handled automatically by the server-side.
response2 = client.responses.create(
  model="qwen3.6-plus",
  input="What are the main differences between it and GBDT?",
  previous_response_id=response1.id,
)
print(f"Second turn response: {response2.output_text}")

# Check the cache hit status
usage = response2.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cached hit tokens: {usage.input_tokens_details.cached_tokens}")
The following is an example of a second-turn response with a cache hit. The cached_tokens field shows the number of hit cache tokens:
{
  "id": "145584fd-3dce-4890-99dc-e3896d7f5a42",
  "created_at": 1772440976.0,
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "metadata": null,
  "model": "qwen3.6-plus",
  "object": "response",
  "output": [
    {
      "id": "msg_62a4e323-d78c-46c7-8469-2ad50f8af4b1",
      "type": "reasoning",
      "content": null
    },
    {
      "id": "msg_560e34a6-1bdf-42ae-993e-590b38249146",
      "content": [
        {
          "annotations": [],
          "text": "Although both Random Forest and GBDT (Gradient Boosting Decision Tree) are ensemble algorithms based on decision trees, they have the following main differences:\n\n1.  **Different Ensemble Strategies**\n    *   **Random Forest**: Based on the **Bagging** idea. Each tree is trained independently, with no dependency between them.\n    *   **GBDT**: Based on the **Boosting** idea. The trees have a strong dependency relationship, where the next tree aims to fit the residuals (negative gradient) of the previous tree's prediction.\n\n2.  **Different Training Methods**\n    *   **Random Forest**: Supports **parallel training** because the trees are independent, which usually results in higher computational efficiency.\n    *   **GBDT**: Must be **trained serially** because the next tree depends on the output of the previous one, making it inherently difficult to parallelize (although engineering implementations like XGBoost have made parallel optimizations at the feature level).\n\n3.  **Different Optimization Objectives**\n    *   **Random Forest**: Mainly reduces **variance** by averaging multiple models to prevent overfitting and improve stability.\n    *   **GBDT**: Mainly reduces **bias** by progressively correcting errors to improve the model's fitting ability and accuracy.\n\n4.  **Sensitivity to Outliers**\n    *   **Random Forest**: Relatively robust and not sensitive to outliers.\n    *   **GBDT**: More sensitive to outliers because outliers produce large residuals, which affect the fitting direction of subsequent trees.\n\nIn summary, Random Forest excels in stability and parallel efficiency, while GBDT usually performs better in terms of accuracy but is more complex to tune and slower to train.",
          "type": "output_text"
        }
      ],
      "role": "assistant",
      "status": "completed",
      "type": "message"
    }
  ],
  "parallel_tool_calls": false,
  "temperature": null,
  "tool_choice": "auto",
  "tools": [],
  "top_p": null,
  "status": "completed",
  "usage": {
    "input_tokens": 1524,
    "input_tokens_details": {
      "cached_tokens": 1305
    },
    "output_tokens": 1534,
    "output_tokens_details": {
      "reasoning_tokens": 1187
    },
    "total_tokens": 3058
  }
}
The input_tokens for the second turn of the conversation is 1524, of which the cached_tokens is 1305. This indicates that the context from the first turn was a cache hit, which can effectively reduce inference latency and cost.

Billing

The billing rules for session cache are the same as for explicit cache:
  • Cache creation: Billed at 125% of the standard input token price.
  • Cache hit: Billed at 10% of the standard input token price.
    The number of hit cache tokens can be viewed in the usage.input_tokens_details.cached_tokens parameter.
  • Other tokens: Tokens that are not hit and not used to create a cache are billed at the standard price.

Limitations

  • The minimum cacheable prompt length is 1024 tokens.
  • The cache validity period is 5 minutes and is reset upon a hit.
  • It is applicable only to the Responses API and must be used with the previous_response_id parameter for multi-turn conversations.
  • Session cache is mutually exclusive with explicit cache and implicit cache. When enabled, the other two modes do not take effect.

FAQ

How do I disable implicit cache?

You cannot disable it. Implicit cache is automatically enabled for all supported models because it does not affect response quality and reduces costs while improving response speed when the cache is hit.

Why was the explicit cache not hit after I created it?

There are several possible reasons:
  • The cache was not hit within 5 minutes of creation, and the system cleared the cache block after it expired.
  • If the last content block is separated from the existing cache block by more than 20 content blocks, the cache is not hit. We recommend that you create a new cache block.

Does hitting the explicit cache reset its validity period?

Yes. Each hit resets the validity period of that cache block to 5 minutes.

Is the explicit cache shared between different accounts?

No. Both implicit and explicit cache data are isolated at the account level and are not shared across accounts.

Is explicit cache shared between different models?

No. Cache data is isolated between models and is not shared.

Why doesn't input_tokens equal cache_creation_input_tokens + cached_tokens?

To ensure model output quality, the backend service appends a small number of tokens (usually fewer than 10) after the user-provided prompt. These tokens are placed after the cache_control marker, so they are not counted for cache creation or hits. However, they are included in the total input_tokens.