Explicit cache guarantees deterministic cache hits for identical input content by adding cache markers to your requests, significantly reducing cost and latency.
This topic describes how to use explicit cache and its best practices. By adding cache markers to your requests, explicit cache guarantees deterministic cache hits for identical input content, significantly reducing cost and latency.
The following Agent and coding tools connect to Qwen Cloud through the Anthropic protocol and natively support explicit cache. Configure them following their respective documentation, and they will automatically leverage explicit cache to optimize context management.
The following example demonstrates the basic workflow: the first request creates a cache, and the second request hits it.
Expected output:
The first request creates a cache block. The second request hits the cache because the system prompt content is identical. Cached tokens are billed at only 10% of the standard input price.
Check the
Characteristics:
Starting from the second turn, each request hits the cache from the previous turn (the conversation history), while creating a new cache that includes the current turn. The more turns in the conversation, the greater the savings.
Characteristics:
How multi-marker caching works:
Characteristics:
Characteristics:
For the list of models that support explicit cache, see Context cache.
When to use explicit cache
- You need guaranteed cache hits: Explicit cache delivers 100% deterministic hits regardless of backend resource scheduling. If your application requires stable content reuse, explicit cache is the right choice.
- You frequently reuse the same prompt: When identical or highly consistent prompts are submitted repeatedly, explicit cache significantly reduces costs. Creating the cache incurs only a 25% surcharge over the standard input price, while each subsequent hit saves 90%. A single hit is enough to break even.
- You manage long contexts in production Agents: In Agent applications, common mechanisms like compression, recap, and system reminders cause the context to change continuously. Explicit cache lets you pin and reuse key context segments so they remain cached even as the surrounding context evolves.
Agent and coding tools
The following Agent and coding tools connect to Qwen Cloud through the Anthropic protocol and natively support explicit cache. Configure them following their respective documentation, and they will automatically leverage explicit cache to optimize context management.
- Claude Code
- Open Code
- OpenClaw
- Hermes
Claude Code v2.x and later automatically includes Set the Anthropic protocol endpoint:
cache_control markers in requests (system, env, and most recent user message). No additional configuration is needed after connecting to Qwen Cloud's Anthropic-compatible endpoint.ConfigurationCreate or edit ~/.claude/settings.json (Windows: C:\Users\<username>\.claude\settings.json) with the appropriate plan settings. Alternatively, connect via environment variables:- Token Plan (Team):
https://token-plan.ap-southeast-1.maas.aliyuncs.com/apps/anthropic - Coding Plan:
https://coding-intl.dashscope.aliyuncs.com/apps/anthropic - Pay-as-you-go:
https://dashscope-intl.aliyuncs.com/apps/anthropic
API integration
Key points
- Add
"cache_control": {"type": "ephemeral"}to the message content you want to cache. All content from the beginning of the messages array up to that marker will be cached as a block. - Cached content must be at least 1024 tokens.
- A single request supports up to 4 cache markers.
- Cache TTL is 5 minutes, automatically renewed on each hit.
- Tool definitions are part of the system prompt for caching purposes. If the tools change, the cache will not hit.
Quick start
The following example demonstrates the basic workflow: the first request creates a cache, and the second request hits it.
- OpenAI compatible
- Anthropic compatible
Verify cache status
Check the usage field in the response to confirm cache behavior:
cache_creation_input_tokens: Number of tokens for which a new cache was created. A value greater than 0 means a new cache block was created.cached_tokens(OpenAI compatible) orcache_read_input_tokens(Anthropic compatible): Number of tokens that hit cache. A value greater than 0 means the cache was successfully hit.
Best practices by scenario
Multi-turn conversations
Characteristics:
- Users interact with the model over multiple turns, each request carrying the full conversation history
- Typical use cases: customer service, knowledge Q&A, code assistants
cache_control marker to the last message in each request. Each turn hits the cache created by the previous turn (the conversation history), while creating a new cache that includes the current turn for the next round.
Example:
Production Agent (multiple cache markers)
Characteristics:
- Long multi-turn conversations comprising: system prompt + skills/tools definitions + project context + user messages / tool calls
- Different sections change at different frequencies
- Typical use cases: AI coding assistants (Claude Code, OpenClaw), RAG-based Q&A systems
- System prompt — one marker (rarely changes)
- Skills/tools definitions — one marker (may change in combination)
- Project context — one marker (may switch or compress)
- User messages / tool calls — one marker (grows each turn)
- User continues asking about the same product: Persona, tools, and knowledge base are all unchanged, hitting the cache at marker 2 (longest prefix match) for maximum savings.
- More conversation turns: Earlier content (persona + tools + knowledge base + history) hits the previous turn's cache; only the new content requires a fresh cache.
Arrange content from most stable to least stable: place content that changes least at the beginning (e.g., system persona) and content that changes most at the end (e.g., current conversation) to maximize cache hit rates.
Batch processing (task completion)
Characteristics:
- Single-turn requests, no context memory needed
- Fixed long system prompt (task instructions) + variable user input (data to process)
- Typical use cases: text classification, intent recognition, data extraction, content moderation
cache_control marker only on the system prompt. All subsequent requests hit the cache as long as the system prompt remains unchanged.
Function Calling with cached tool definitions
Characteristics:
- Using Function Calling with a long list of tool definitions
- Tool definitions remain unchanged across requests
tools parameter content is part of the system prompt for caching. Ensure tool definitions are exactly identical across requests (same order, same field order, same structure), and add a cache_control marker to the message content.
Keys to maximizing Function Calling cache hits:
- Consistent tool order: Keep the same ordering of tools in the tools array.
- Consistent field order: Keep JSON field ordering the same within each tool definition.
- Consistent structure: Do not add, remove, or reorder fields between requests, even if they are optional or empty.
Important notes
-
Content format requirement: When adding
cache_control, the content field must be in array form. String-form content does not support cache markers. -
Cache marker granularity: Qwen3.5 and later models only support message-level cache breakpoints. Placing multiple
cache_controlmarkers within a single message's content array does not create separate breakpoints — the system only stores cache at the last marker position within that message and cannot truncate-match at intermediate content blocks. Additionally, multiple system messages are merged internally into a single segment and cannot serve as separate breakpoints. To create multiple independent breakpoints, distributecache_controlmarkers across messages with different roles (e.g., one on system, one on user). Models prior to Qwen3.5 support content-level (intra-message) breakpoints. -
Mutually exclusive with implicit cache: A request can only use one caching mode. If the request contains a
cache_controlmarker, explicit cache is used; otherwise, the system automatically uses implicit cache.