Cut cost with prefix reuse
Context cache stores and reuses the common prefix of overlapping requests to avoid redundant computation. This improves response speed and reduces costs without affecting response quality.
Context cache provides three modes to suit different scenarios. Choose based on your requirements for convenience, certainty, and cost:
- Explicit cache: A cache mode that you need to manually enable. Create a cache for specific content to achieve a guaranteed hit. The cache is valid for 5 minutes. Tokens used to create the cache are billed at 125% of the standard input token price. Subsequent hits are billed at 10% of the standard price.
- Implicit cache: An automatic mode that requires no extra configuration and cannot be disabled. It is suitable for general scenarios where convenience is the priority. The system automatically identifies and caches the common prefix of requests, but the hit rate is not guaranteed. The cached portion that is hit is billed at 20% of the standard input token price.
- Session cache: Designed for multi-turn conversation scenarios that use the Responses API. When you add `x-dashscope-session-cache: enable` to the request header, the server automatically caches the conversation context. The billing rules are the same as for explicit cache: tokens used to create the cache are billed at 125% of the standard input token price, and hits are billed at 10%.
| Item | Explicit cache | Implicit cache | Session cache |
|---|---|---|---|
| Affects response quality | No impact | No impact | No impact |
| Billing for tokens used to create the cache | 125% of the standard input token price | 100% of the standard input token price | 125% of the standard input token price |
| Billing for cached input tokens that are hit | 10% of the standard input token price | 20% of the standard input token price | 10% of the standard input token price |
| Minimum tokens for caching | 1024 | 256 | 1024 |
| Cache validity period | 5 minutes (resets on hit) | Not guaranteed. The system periodically clears unused cached data. | 5 minutes (resets on hit) |
- When you use the Chat Completions API or DashScope API, explicit cache and implicit cache are mutually exclusive. A single request can use only one of these modes.
- When you use the Responses API, if session cache is not enabled, implicit cache is used if the model supports it.
Explicit cache
Compared to implicit cache, explicit cache requires you to manually create it and incurs additional overhead. However, it provides a higher cache hit rate and lower access latency.
Usage
Add the `"cache_control": {"type": "ephemeral"}` marker in the `messages` array. The system then searches backward from the position of each `cache_control` marker for up to 20 content blocks and attempts to match against existing cache blocks.
A single request supports a maximum of four cache markers.
- Cache miss: The system creates a new cache block using the content from the beginning of the `messages` array to the `cache_control` marker. The cache block is valid for 5 minutes.
  - Cache creation occurs after the model responds. We recommend that you wait until the creation request is complete before sending subsequent requests.
  - A cache block must contain at least 1024 tokens.
- Cache hit: The system selects the longest matching prefix as the hit cache block and resets the validity period of that block to 5 minutes.
1. Send the first request: Send a system message that contains text A with more than 1024 tokens and add a cache marker. The system creates the first cache block, which is referred to as cache block A.
2. Send the second request: Send a request whose `messages` array starts with A, continues with other messages, and ends with content B followed by a new cache marker:
   - If "other messages" contains no more than 20 messages, cache block A is hit, and its validity period is reset to 5 minutes. The system also creates a new cache block based on A, the other messages, and B.
   - If "other messages" contains more than 20 messages, cache block A is not hit. The system creates a new cache block based on the full context (A + other messages + B).
Supported models
Qwen-Max: qwen3-max
Qwen-Plus: qwen3.6-plus, qwen3.5-plus, qwen-plus
Qwen-Flash: qwen3.5-flash, qwen-flash
Qwen-Coder: qwen3-coder-plus, qwen3-coder-flash
Getting started
The following examples show how cache blocks are created and hit.
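As a minimal sketch of this flow (the repository text, model name, and endpoint are placeholders or assumptions), the first request below caches long repository content as the prefix; a later request that keeps the same prefix and changes only the question can hit that block:

```python
import os

REPO_TEXT = "<concatenated source files, at least 1024 tokens>"  # placeholder

def build_messages(question: str) -> list:
    """Stable repository content first, with a cache marker; the question last."""
    return [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": f"You answer questions about this repository:\n{REPO_TEXT}",
                    # Everything from the start of `messages` up to this marker
                    # is created as (or matched against) a cache block.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": question},
    ]

def ask(question: str):
    """Hypothetical call sketch; needs `pip install openai` and a DashScope API key."""
    from openai import OpenAI
    client = OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
    )
    return client.chat.completions.create(model="qwen-plus", messages=build_messages(question))

# Two requests share an identical prefix, so the second can hit the cache
# block created by the first (within its 5-minute validity period).
first = build_messages("What does the main module do?")
second = build_messages("Where is the config parsed?")
assert first[0] == second[0]
```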
Use multiple cache markers for fine-grained control
In complex scenarios, a prompt often consists of multiple parts with different reuse frequencies. You can use multiple cache markers to achieve fine-grained control.
For example, the prompt for a smart customer service agent typically includes:
- System settings: Highly stable and almost never changes.
- External knowledge: Semi-stable. It is retrieved from a knowledge base or by calling a tool and may remain unchanged during a continuous conversation.
- Conversation history: Grows dynamically.
- Current question: Different each time.
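This layering can be sketched with one marker per stable part, ordered from most to least stable (the helper and names are illustrative; a single request allows at most four markers):

```python
def mark(message: dict) -> dict:
    """Return a copy of the message with a cache marker on its last content block."""
    content = message["content"]
    if isinstance(content, str):
        content = [{"type": "text", "text": content}]
    return dict(message, content=content[:-1] + [dict(content[-1], cache_control={"type": "ephemeral"})])

def build_agent_messages(system_settings: str, knowledge: str, history: list, question: str) -> list:
    msgs = [mark({"role": "system", "content": system_settings})]   # marker 1: stable settings
    msgs.append(mark({"role": "user", "content": knowledge}))       # marker 2: semi-stable knowledge
    if history:
        msgs += history[:-1] + [mark(history[-1])]                  # marker 3: end of the growing history
    msgs.append({"role": "user", "content": question})              # current question: left unmarked
    return msgs

msgs = build_agent_messages(
    "You are a customer-service agent.",
    "FAQ entries retrieved from the knowledge base...",
    [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
    "Where is my order?",
)
markers = sum("cache_control" in block for m in msgs if isinstance(m["content"], list) for block in m["content"])
assert markers == 3  # within the four-marker limit
```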
Billing
Explicit cache affects only the pricing of input tokens. The rules are as follows:
- Cache creation: Newly created cache content is billed at 125% of the standard input price. If the new cache content contains an existing cache as a prefix, only the incremental portion is billed for creation (new cache tokens minus existing cache tokens). For example, if you have an existing cache A with 1200 tokens and a new request needs to cache content AB with 1500 tokens, the first 1200 tokens are billed as a cache hit (10% of the standard price), and the new 300 tokens are billed for cache creation (125% of the standard price). Check the number of tokens used for cache creation in the `cache_creation_input_tokens` parameter.
- Cache hit: Billed at 10% of the standard input price. Check the number of hit cache tokens in the `cached_tokens` parameter.
- Other tokens: Tokens that do not match any cache and are not used to create a cache are billed at the standard price.
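Under these rules, the incremental example above works out as follows (the unit price of 1.0 is illustrative):

```python
def explicit_cache_input_cost(existing_cached: int, new_cache_total: int, unit_price: float) -> float:
    """Input-side cost of a request that extends an existing cache block."""
    hit = existing_cached * unit_price * 0.10                           # matched prefix: 10%
    creation = (new_cache_total - existing_cached) * unit_price * 1.25  # incremental part: 125%
    return hit + creation

# Existing cache A = 1200 tokens; the request caches AB = 1500 tokens:
cost = explicit_cache_input_cost(1200, 1500, unit_price=1.0)
assert abs(cost - (120 + 375)) < 1e-9  # 1200 x 10% + 300 x 125%
```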
Cacheable content
Only the following message types in the `messages` array support adding cache markers:
- System message
- User message
- Assistant message
- Tool message (the result after a tool is executed)
If a request includes the `tools` parameter, adding a cache marker in `messages` also caches the tool descriptions defined in the request.
To add a cache marker, set the message's `content` field to an array and add the `cache_control` field. This structure also applies to other message types in the `messages` array.
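For example, a user message with a marker might look like this (the text is a placeholder):

```python
message = {
    "role": "user",
    "content": [  # `content` as an array of blocks instead of a plain string
        {
            "type": "text",
            "text": "<a stable document of at least 1024 tokens>",
            "cache_control": {"type": "ephemeral"},  # the cache marker
        }
    ],
}
assert "cache_control" in message["content"][0]
```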
Cache limits
- The minimum cacheable prompt length is 1024 tokens.
- The cache uses a backward prefix matching strategy. The system automatically checks the last 20 content blocks. If the content to be matched is separated from the message with the `cache_control` marker by more than 20 content blocks, the cache is not hit.
- Setting `type` to `ephemeral` is the only supported option. This sets a validity period of 5 minutes.
- A single request can have a maximum of four cache markers. If more are present, only the last four take effect.
Usage examples
Different questions for a long text
To ensure model performance, the system appends a small number of internal tokens. These tokens are billed at the standard input price. For more information, see the FAQ.
Continuous multi-turn conversation
In a daily multi-turn chat scenario, you can add a cache marker to the last content block of the `messages` array in each request. Starting from the second turn, each request hits and refreshes the cache block created in the previous turn, and also creates a new cache block that covers the latest turn.
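A sketch of such a loop (the marker helper is illustrative; the call needs `pip install openai` and a valid DashScope key, and the endpoint and model name are assumptions):

```python
import os

def with_marker(message: dict) -> dict:
    """Copy a message, attaching a cache marker to its last content block."""
    content = message["content"]
    if isinstance(content, str):
        content = [{"type": "text", "text": content}]
    return dict(message, content=content[:-1] + [dict(content[-1], cache_control={"type": "ephemeral"})])

def chat_loop():
    from openai import OpenAI  # hypothetical call sketch
    client = OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
    )
    history = []
    while True:
        question = input("You: ")
        history.append({"role": "user", "content": question})
        # Mark the last message of this turn; from turn 2 on, the request's
        # prefix matches the block created in the previous turn.
        request = history[:-1] + [with_marker(history[-1])]
        reply = client.chat.completions.create(model="qwen-plus", messages=request)
        answer = reply.choices[0].message.content
        print("Assistant:", answer)
        history.append({"role": "assistant", "content": answer})

# chat_loop()  # uncomment to run interactively with a valid DASHSCOPE_API_KEY
```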
Implicit cache
Supported models
Snapshot models and models with the `-latest` suffix are not supported.
- Qwen-Max: qwen3-max, qwen3-max-preview, qwen-max
- Qwen-Plus: qwen-plus
- Qwen-Flash: qwen-flash
- Qwen-Turbo: qwen-turbo
- Qwen-Coder: qwen3-coder-plus, qwen3-coder-flash
- Qwen-VL: qwen3-vl-plus, qwen3-vl-flash, qwen-vl-max, qwen-vl-plus
- Role playing: qwen-plus-character, qwen-flash-character, qwen-plus-character-ja
How it works
When you send a request to a model that supports implicit cache, the feature is automatically enabled. The system works as follows:
- Find: Upon receiving a request, the system checks the cache for a common prefix of the content in the request's `messages` array based on the prefix matching principle.
- Evaluate:
  - If the cache is hit, the system directly uses the cached result for the subsequent part of the inference process.
  - If the cache is missed, the system processes the request normally and stores the prefix of the current prompt in the cache for subsequent requests to use.
The system periodically clears unused cached data, and content with fewer than 256 tokens is not cached. The hit rate is not guaranteed to be 100%; a cache miss may occur even with identical request context.
Increase hit rate
Implicit cache identifies duplicate prefix content across different requests. To increase the hit rate, place static content at the beginning of the prompt and variable content at the end.
- Text-only: If the system has cached "ABCD", a request for "ABE" can match the "AB" prefix, while a request for "BCD" will not match any cache.
- Visual understanding:
- When asking multiple questions about the same image or video: Place the image or video before the text to increase the hit rate.
- When asking the same question about different images or videos: Place the text before the image or video to increase the hit rate.
Billing
There is no additional fee for implicit cache.
When a request hits the cache, the matched input tokens are billed as `cached_token` at 20% of the `input_token` unit price. Input tokens that do not hit the cache are billed at the standard `input_token` price. Output tokens are billed at the standard price.
Example: A request contains 10,000 input tokens, of which 5,000 hit the cache:
- Non-hit tokens (5,000): Billed at 100% of the unit price
- Hit tokens (5,000): Billed at 20% of the unit price
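This example works out as follows (the unit price of 1.0 is illustrative):

```python
def implicit_cache_input_cost(total_tokens: int, cached_tokens: int, unit_price: float) -> float:
    """Input-side cost when part of the prompt hits the implicit cache."""
    return (total_tokens - cached_tokens) * unit_price + cached_tokens * unit_price * 0.20

cost = implicit_cache_input_cost(10_000, 5_000, unit_price=1.0)
assert abs(cost - (5_000 + 1_000)) < 1e-9  # versus 10,000 with no cache hit
```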
You can retrieve the number of hit cache tokens from the `cached_tokens` attribute of the returned result.
The OpenAI-compatible Batch (file input) method is not eligible for cache discounts.
Cache hit examples
Text generation
Check hit cache tokens in `usage.prompt_tokens_details.cached_tokens` (part of `usage.prompt_tokens` for OpenAI compatible or `usage.input_tokens` for DashScope).
Visual understanding
Check hit cache tokens in `usage.prompt_tokens_details.cached_tokens` (part of `usage.prompt_tokens` for OpenAI compatible) or `usage.cached_tokens` (part of `usage.input_tokens` for DashScope).
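Because the field location differs between the two shapes, a defensive reader helps (the response payloads below are mocked dictionaries, not live API output):

```python
def get_cached_tokens(usage: dict) -> int:
    """Read the hit-cache token count, trying both reported field shapes."""
    details = usage.get("prompt_tokens_details") or {}
    if "cached_tokens" in details:
        return details["cached_tokens"]
    return usage.get("cached_tokens", 0)  # older top-level field

# Mocked usage payloads in both shapes:
assert get_cached_tokens({"prompt_tokens": 10_000, "prompt_tokens_details": {"cached_tokens": 5_000}}) == 5_000
assert get_cached_tokens({"input_tokens": 10_000, "cached_tokens": 5_000}) == 5_000
assert get_cached_tokens({"prompt_tokens": 100}) == 0  # no hit reported
```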
Models that currently use `usage.cached_tokens` will be upgraded to `usage.prompt_tokens_details.cached_tokens` in the future.
Typical scenarios
- Q&A based on long text: Applicable to scenarios that require multiple requests about a fixed long text, such as a novel, textbook, or legal document. Because the system prompt and the text remain unchanged across questions, each request shares a large overlapping prefix.
- Code auto-completion: The model completes code based on the existing context. As the user continues coding, the preceding code remains unchanged, so context cache can cache it to improve completion speed.
- Multi-turn conversation: The conversation history from all previous turns is included in the `messages` array, so each turn's request shares the same prefix as the previous turn, resulting in a high probability of a cache hit.
- Role playing or few-shot learning: You typically include a large amount of information in the prompt to guide the model's output format, such as a long system prompt that makes the model act as a marketing expert. This creates a large shared prefix across different requests.
- Video understanding: If you ask multiple questions about the same video, place the `video` before the `text` to increase the probability of a cache hit. If you ask the same question about different videos, place the `text` before the `video`.
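For two requests about the same video, the message arrays might look like this (the `video_url` content type is an assumption based on the OpenAI-compatible multimodal format, and the URL is hypothetical; check the model's API reference for the exact shape):

```python
# Hypothetical video content block; the URL is a placeholder.
VIDEO_BLOCK = {"type": "video_url", "video_url": {"url": "https://example.com/demo.mp4"}}

def video_question(question: str) -> list:
    # Video first, text last: the video block is the shared, cacheable prefix.
    return [{"role": "user", "content": [VIDEO_BLOCK, {"type": "text", "text": question}]}]

r1 = video_question("What happens at the start?")
r2 = video_question("How does it end?")
assert r1[0]["content"][0] == r2[0]["content"][0]  # identical video prefix
```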
Session cache
Overview
Session cache is a cache mode for multi-turn conversation scenarios that use the Responses API. Unlike explicit cache, which requires you to manually add a `cache_control` marker, session cache handles the caching logic automatically on the server side. You only need to enable or disable it through an HTTP header and make calls as you would for a normal multi-turn conversation.
When you use `previous_response_id` for a multi-turn conversation, enabling session cache allows the server side to automatically cache the conversation context, which reduces inference latency and usage costs.
Usage
Add the following field to the request header to control the session cache:
- `x-dashscope-session-cache: enable`: Enables session cache.
- `x-dashscope-session-cache: disable`: Disables session cache. If the model supports implicit cache, it is used instead.
When you use the OpenAI SDK, pass the header through the `default_headers` (Python) or `defaultHeaders` (Node.js) parameter. When you use curl, pass it with the `-H` option.
Supported models
qwen3-max, qwen3.6-plus, qwen3.5-plus, qwen3.5-flash, qwen-plus, qwen-flash, qwen3-coder-plus, qwen3-coder-flash
Session cache is applicable only to the Responses API (OpenAI Responses) and not to the Chat Completions API.
Code examples
In the response, the `cached_tokens` field shows the number of hit cache tokens. For example, the `input_tokens` for the second turn of the conversation is 1524, of which `cached_tokens` is 1305. This indicates that the context from the first turn was a cache hit, which effectively reduces inference latency and cost.
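A two-turn sketch of this flow (the endpoint and model name are assumptions; running it requires `pip install openai` and a valid key):

```python
import os

SESSION_CACHE_HEADERS = {"x-dashscope-session-cache": "enable"}

def make_client():
    from openai import OpenAI  # hypothetical call sketch
    return OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
        default_headers=SESSION_CACHE_HEADERS,  # every request opts in to session cache
    )

def two_turns():
    client = make_client()
    first = client.responses.create(model="qwen3-max", input="Summarize this contract: <long text>")
    # Chain the turns with previous_response_id so the server can reuse the cached context.
    second = client.responses.create(
        model="qwen3-max",
        input="Now list the termination clauses.",
        previous_response_id=first.id,
    )
    return second.usage.input_tokens_details.cached_tokens  # hit count for turn 2

# two_turns()  # uncomment to run with a valid DASHSCOPE_API_KEY
```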
Billing
The billing rules for session cache are the same as for explicit cache:
- Cache creation: Billed at 125% of the standard input token price.
- Cache hit: Billed at 10% of the standard input token price. The number of hit cache tokens can be viewed in the `usage.input_tokens_details.cached_tokens` parameter.
- Other tokens: Tokens that are not hit and not used to create a cache are billed at the standard price.
Limitations
- The minimum cacheable prompt length is 1024 tokens.
- The cache validity period is 5 minutes and is reset upon a hit.
- It is applicable only to the Responses API and must be used with the `previous_response_id` parameter for multi-turn conversations.
- Session cache is mutually exclusive with explicit cache and implicit cache. When enabled, the other two modes do not take effect.
FAQ
How do I disable implicit cache?
You cannot disable it. Implicit cache is automatically enabled for all supported models because it does not affect response quality and reduces costs while improving response speed when the cache is hit.
Why was the explicit cache not hit after I created it?
There are several possible reasons:
- The cache was not hit within 5 minutes of creation, and the system cleared the cache block after it expired.
- The last `content` block is separated from the existing cache block by more than 20 `content` blocks, so the cache is not hit. We recommend that you create a new cache block.
Does hitting the explicit cache reset its validity period?
Yes. Each hit resets the validity period of that cache block to 5 minutes.
Is the explicit cache shared between different accounts?
No. Both implicit and explicit cache data are isolated at the account level and are not shared across accounts.
Is explicit cache shared between different models?
No. Cache data is isolated between models and is not shared.
Why doesn't `input_tokens` equal `cache_creation_input_tokens` + `cached_tokens`?
To ensure model output quality, the backend service appends a small number of tokens (usually fewer than 10) after the user-provided prompt. These tokens are placed after the `cache_control` marker, so they are not counted for cache creation or hits. However, they are included in the total `input_tokens`.