Manage chat context
The Qwen API is stateless and does not save conversation history. To implement multi-turn conversations, you must pass the conversation history in each request. You can also use strategies such as truncation, summarization, and retrieval to manage context efficiently and reduce token consumption.
This topic describes how to implement multi-turn conversations using the OpenAI-compatible Chat Completions API or the DashScope API. The Responses API provides a more convenient alternative; see OpenAI compatible - Responses.
For more Conversations API operations (update conversation, delete conversation, delete messages, and so on), see Conversations.
If a call fails, see Error messages.
How it works
To implement a multi-turn conversation, you must maintain a messages array. In each round, append the user's latest question and the model's response to this array. Then, use the updated array as the input for the next request.
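This bookkeeping can be sketched in a few lines of Python. Here, call_model is a hypothetical stand-in for the real chat API call; a working version would send the messages array to the endpoint and return the assistant's text:

```python
def call_model(messages):
    # Placeholder: a real implementation would send `messages` to the
    # chat endpoint and return the assistant's reply text.
    return f"(reply to: {messages[-1]['content']})"

# The history starts with the system prompt.
messages = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(question):
    """Append the user turn, call the model, then append the assistant turn."""
    messages.append({"role": "user", "content": question})
    answer = call_model(messages)
    messages.append({"role": "assistant", "content": answer})
    return answer

ask("Hello!")           # round 1: messages now holds 3 items
ask("And a follow-up")  # round 2: messages now holds 5 items
```

Because the full array is resent each round, the request payload grows linearly with the conversation length.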
Example of how the messages array changes during a conversation:
1. First round: Add the user's question to the messages array.
2. Second round: Add the model's response and the user's latest question to the messages array.

Getting started
- Responses API
- OpenAI compatible
- DashScope
The Responses API simplifies multi-turn conversations. Pass previous_response_id to link context automatically; no manual message history is needed. For advanced session management, see Using conversations.
Use the response id (UUID format, such as f0dbb153-117f-9bbf-8176-5284b47f3xxx) as previous_response_id. Do not use a message id from the output array (such as msg_56c860c4-3ad8-4a96-8553-d2f94c259xxx). The response id expires after 7 days.
For multimodal models
- This section applies to multimodal models such as Qwen3-VL and Qwen3.5. For Qwen-Omni, see Non-Realtime.
- Qwen3-Omni-Captioner is designed for single-turn tasks and does not support multi-turn conversations.
- Construction of user messages: User messages for multimodal models can contain multimodal information, such as images and audio, in addition to text.
- DashScope SDK interface: When you use the DashScope Python SDK, call the MultiModalConversation interface. When you use the DashScope Java SDK, call the MultiModalConversation class.
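For instance, a multimodal user message in a multi-turn history might look like the following sketch (OpenAI-compatible content format; the image URL is a placeholder):

```python
# A multi-turn history where the first user message mixes an image and
# text. Later turns can refer back to the image through the history.
messages = [
    {"role": "user", "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/cat.jpg"}},  # placeholder URL
        {"type": "text", "text": "What is in this image?"},
    ]},
    {"role": "assistant", "content": "A cat sitting on a windowsill."},
    # The follow-up relies on the image already present in the history.
    {"role": "user", "content": "What color is it?"},
]
```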
- OpenAI compatible
- DashScope
For thinking models
Thinking models return two fields: reasoning_content (the thinking process) and content (the response). When you update the messages array, retain only the content field and ignore the reasoning_content field.
For more information about thinking models, see Thinking and Vision. For multi-turn conversations with Qwen3-Omni-Flash (thinking mode), see Non-Realtime.
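A minimal helper that keeps only content when building history might look like this sketch; the message dict below is illustrative, not a real SDK object:

```python
def to_history_message(model_message):
    """Drop reasoning_content before appending to the messages array."""
    return {"role": "assistant", "content": model_message["content"]}

# Illustrative response shape from a thinking model.
raw = {
    "role": "assistant",
    "reasoning_content": "Let me think step by step...",  # thinking process
    "content": "The answer is 42.",                       # final response
}

history_entry = to_history_message(raw)
```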
- OpenAI compatible
- DashScope
Using conversations
previous_response_id works well for simple chained conversations. For server-side session management, cross-device continuity, or manual message control, use the Conversations API with the conversation parameter.
Create a conversation and chat
First, create a conversation using the Conversations API, then pass the conversation parameter and instructions (system prompt) to responses.create. The server manages context automatically.
Add messages to a conversation
You can manually add message items to a conversation (such as supplementary user messages or external knowledge).
View conversation history
List all message items in a conversation to view the complete dialogue history.
Important notes
- ID validity: Response id and message items in a conversation are valid for 7 days. The conversation itself has no expiration, but expired items no longer participate in context.
- Correct ID source: Use the response top-level id, not the id of messages inside the output array.
- Cross-turn context: Each time you pass previous_response_id, the system automatically links the full context from the initial conversation to the current turn.
- Mutual exclusivity: previous_response_id and conversation cannot be used together. Otherwise, you receive the error [400] INVALID_REQUEST: Mutually exclusive parameters: Ensure you are only providing one of: previous_response_id or conversation.
- Conversation message expiry: The conversation itself has no expiry and can be used continuously. However, message items within it expire after 7 days and no longer appear in the conversation context. To avoid losing system instructions due to expiry, pass them through the instructions parameter rather than as items when you create a conversation.
Which approach to choose?
| Approach | Best for |
|---|---|
| previous_response_id | Simple chained multi-turn conversations without creating a separate session |
| conversation | Server-side session management, cross-device continuity, or manual message add/delete |
Going live
Multi-turn conversations can consume many tokens and may exceed the model's context limit. Use these strategies to manage context and control costs.
1. Context management
The messages array grows with each round and may exceed the token limit.
1.1. Context truncation
When the conversation history becomes too long, keep only the most recent N rounds of conversation. This method is simple to implement but results in the loss of earlier conversation information.
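A truncation helper can be sketched as follows, assuming the system prompt should always survive trimming (each round is one user/assistant pair):

```python
def truncate_history(messages, max_rounds=3):
    """Keep the system prompt plus only the last `max_rounds` rounds."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # Each round contributes one user and one assistant message.
    return system + turns[-2 * max_rounds:]

# Build a toy 5-round history.
history = [{"role": "system", "content": "You are helpful."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = truncate_history(history, max_rounds=3)  # rounds 0 and 1 are dropped
```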
1.2. Rolling summary
To dynamically compress the conversation history and control the context length without losing core information, summarize the context as the conversation progresses:
a. When the conversation history reaches a certain length, such as 70% of the maximum context length, extract an earlier part of the history, such as the first half. Then, make a separate API call to the model to generate a "memory summary" of this part.
b. When you construct the next request, replace the lengthy conversation history with the "memory summary" and append the most recent conversation rounds.
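The two steps above can be sketched like this. The summarize helper is a hypothetical stub standing in for a separate API call, and the token counter is a deliberately crude assumption (one token per character):

```python
def summarize(messages):
    # Placeholder: a real implementation makes a separate API call asking
    # the model to produce a "memory summary" of these messages.
    return f"Summary of {len(messages)} earlier messages."

def compress_history(messages, max_tokens, count_tokens):
    """When history exceeds ~70% of the limit, replace the older half with a summary."""
    if count_tokens(messages) < 0.7 * max_tokens:
        return messages  # still short enough; no compression needed
    half = len(messages) // 2
    summary = summarize(messages[:half])
    # The summary replaces the older half; recent rounds are kept verbatim.
    return [{"role": "system", "content": summary}] + messages[half:]

# Toy token counter: 1 token per character of content (an assumption).
def toy_count(msgs):
    return sum(len(m["content"]) for m in msgs)

history = [
    {"role": "user", "content": "x" * 50},
    {"role": "assistant", "content": "y" * 50},
    {"role": "user", "content": "z" * 50},
    {"role": "assistant", "content": "w" * 50},
]

compressed = compress_history(history, max_tokens=100, count_tokens=toy_count)
```

In production you would use the model's tokenizer (or the usage field from previous responses) instead of a character count.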
1.3. Vectorized retrieval
A rolling summary can cause some information loss. To allow the model to recall relevant information from a large volume of conversation history, switch from linear context passing to on-demand retrieval:
a. After each conversation round, store the conversation in a vector database.
b. When a user asks a question, retrieve relevant conversation records based on similarity.
c. Combine the retrieved conversation records with the most recent user input and send the combined content to the model.
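The three steps above can be sketched end to end. This uses a toy word-overlap similarity and an in-memory list in place of a real embedding model and vector database (both are assumptions for illustration):

```python
def embed(text):
    return set(text.lower().split())  # toy "embedding": a bag of words

def similarity(a, b):
    return len(a & b) / max(len(a | b), 1)  # Jaccard similarity

store = []  # stand-in for a vector database

def remember(round_text):
    """Step a: store each conversation round with its embedding."""
    store.append((embed(round_text), round_text))

def retrieve(query, top_k=2):
    """Step b: fetch the stored rounds most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda rec: similarity(q, rec[0]), reverse=True)
    return [text for _, text in ranked[:top_k]]

def build_input(query):
    """Step c: combine retrieved records with the latest user input."""
    context = "\n".join(retrieve(query))
    return f"Relevant history:\n{context}\n\nUser question: {query}"

remember("user asked about shipping costs to Berlin")
remember("user asked about the weather")
remember("assistant explained shipping takes five days")

prompt = build_input("how long does shipping take")
```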
2. Cost control
Input tokens increase with each round, raising costs.
2.1. Reduce input tokens
Use the context management strategies described previously to reduce input tokens and lower costs.
2.2. Use models that support context cache
The messages array is repeatedly processed and billed. Context cache (available for select Qwen models including Qwen-Max, Qwen-Plus, Qwen-Flash, and Qwen-Coder) reduces costs and improves response speed.
The context cache feature is enabled automatically. No code changes are required.