Common questions about streaming, context management, context caching, structured output, thinking mode, function calling, and batch.
Streaming
APIs: OpenAI compatible, DashScope
Why does streaming stop mid-response?
Check your proxy or infrastructure:
- Nginx buffering — Nginx buffers SSE by default. Add proxy_buffering off to your server config.
- Request timeout — If the timeout is shorter than the response time, the connection closes early. Increase it.
- Firewall closing idle connections — Some firewalls close connections that appear idle during long pauses between tokens.
- data: [DONE] never arrives — Ensure your client reads the stream until the connection closes or the sentinel is received.
How do I accumulate streaming chunks into a full response?
OpenAI compatible: Set stream=True. Each chunk has choices[0].delta.content. Concatenate all non-null content values. Get token usage from the last chunk by also setting stream_options={"include_usage": True}.
DashScope: Set incremental_output=True. Each chunk contains only the new tokens (not the full response so far). Concatenate them in order. Token usage is available on every chunk.
For models that think before responding (QwQ, Qwen3 with enable_thinking=True), chunks arrive in two phases: first reasoning_content, then content. Accumulate each separately.
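A minimal sketch of this accumulation for the OpenAI-compatible stream; accumulate is a hypothetical helper name, and the chunk shape assumed below follows the OpenAI Python SDK's streaming objects. Pass it the iterator returned by client.chat.completions.create(..., stream=True, stream_options={"include_usage": True}).

```python
def accumulate(chunks):
    """Fold a streamed response into (reasoning, answer, usage).

    Thinking models emit reasoning_content deltas first, then content
    deltas; with include_usage, the final chunk carries token usage and
    an empty choices list.
    """
    reasoning, answer, usage = [], [], None
    for chunk in chunks:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage          # final chunk: usage only
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        if getattr(delta, "reasoning_content", None):
            reasoning.append(delta.reasoning_content)
        if getattr(delta, "content", None):
            answer.append(delta.content)
    return "".join(reasoning), "".join(answer), usage
```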
Which models require streaming?
QwQ and QVQ are streaming-only. Calling them without stream=True fails or returns an empty response. Most other Qwen3 models work in both modes. When using function calling with qwen3-omni-flash, streaming is required only for function calling — other uses support both modes.
Context and multi-turn
APIs: Multi-turn conversations, Context caching
How does multi-turn conversation work?
The API is stateless — it does not store conversation history. You maintain the messages array yourself. After each round, append both the assistant's reply and the next user message, then send the full array with the next request.
The Responses API (/v2/apps/...) offers a shortcut: pass previous_response_id to link turns automatically without managing the array manually. Response IDs expire after 7 days.
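The manual history loop might look like this sketch; ask is a hypothetical helper, and client is assumed to be an initialized OpenAI-compatible client:

```python
def ask(client, model, messages, user_text):
    """Append the user turn, call the model, append and return the reply.

    The API stores nothing between calls, so the caller owns `messages`
    and passes the full array on every request.
    """
    messages.append({"role": "user", "content": user_text})
    reply = client.chat.completions.create(model=model, messages=messages)
    assistant = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant})
    return assistant
```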
How do I control conversation length and avoid token overflow?
Common strategies:
- Truncation — drop the oldest messages when total tokens approach the context limit.
- Summarization — summarize older turns into a single system or user message.
- Retrieval — store turns in a vector store and retrieve only the relevant ones per request.
Check usage.prompt_tokens in each response to track growth. Context limits vary by model — see Model selection.
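The truncation strategy can be sketched as a pure helper; count_tokens here is a caller-supplied estimator (a real tokenizer, or a rough characters-per-token heuristic), not an API feature:

```python
def truncate_history(messages, max_tokens, count_tokens):
    """Drop the oldest non-system turns until the estimated total fits.

    Keeps all system messages, then removes turns front-to-back until
    the estimate is within max_tokens.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    total = lambda ms: sum(count_tokens(m["content"]) for m in ms)
    while turns and total(system + turns) > max_tokens:
        turns.pop(0)                      # drop the oldest turn first
    return system + turns
```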
What is context caching and when should I use it?
Context caching stores a prefix of your prompt so repeated calls re-use it at a reduced token price. There are three modes:
| Mode | How to enable | Cache creation price | Cache hit price | Minimum tokens |
|---|---|---|---|---|
| Implicit | Automatic — no config needed | Standard input price | 20% of input price | 256 |
| Explicit | Add cache_control: {type: ephemeral} marker | 125% of input price | 10% of input price | 1024 |
| Session | Responses API + x-dashscope-session-cache: enable header | 125% of input price | 10% of input price | 1024 |
Cache hits are reported in usage.prompt_tokens_details.cached_tokens. The cache is per-account, per-model — not shared across accounts or models.
Does the API maintain conversation memory automatically?
No. The API is stateless — there is no built-in memory. You maintain the messages array yourself. For persistent memory across sessions, store conversation history externally and include relevant history in each request.
Structured output
API: Structured output
How do I get JSON output reliably?
Set response_format={"type": "json_object"} and include the word "JSON" in your prompt (for example, "Return your answer as a JSON object with keys ..."). The model is then constrained to output valid JSON.
Do not set max_tokens — a truncated response produces invalid JSON.
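A sketch of building such a request; json_request is a hypothetical helper, and the model name you pass to it is illustrative:

```python
def json_request(model, prompt):
    """Build kwargs for a JSON-mode chat completion call.

    The prompt must mention "JSON" for json_object mode to apply, and no
    max_tokens is set, since a truncated response would be invalid JSON.
    """
    if "json" not in prompt.lower():
        raise ValueError('json_object mode requires the word "JSON" in the prompt')
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_object"},
    }
```

Use it as client.chat.completions.create(**json_request(...)) and parse the reply with json.loads.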
Why does the model output extra text before or after the JSON?
This happens when the prompt does not clearly instruct the model to output only JSON, or when response_format is not set. Add response_format={"type": "json_object"} and phrase the prompt so the expected output is JSON only.
If you need structured output from a thinking model (which does not support response_format), collect the full response text, then call json.loads() on it. If that raises a JSONDecodeError, send the raw text to a fast model (for example qwen3.5-flash) with response_format={"type": "json_object"} asking it to repair the JSON.
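That fallback might look like this sketch; parse_or_repair is a hypothetical helper, and the repair prompt wording is an assumption:

```python
import json

def parse_or_repair(raw_text, client, repair_model="qwen3.5-flash"):
    """Parse model output as JSON; on failure, ask a fast model to repair it.

    The repair call uses json_object mode, which constrains the fast
    model to emit valid JSON.
    """
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        fixed = client.chat.completions.create(
            model=repair_model,
            messages=[{
                "role": "user",
                "content": "Repair this into valid JSON and return only the "
                           "JSON object:\n" + raw_text,
            }],
            response_format={"type": "json_object"},
        )
        return json.loads(fixed.choices[0].message.content)
```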
Thinking mode
API: Thinking mode
What is thinking mode?
Thinking mode causes the model to reason step-by-step before producing its final answer. The reasoning appears in reasoning_content; the answer appears in content. This improves accuracy on complex tasks such as math, coding, and multi-step reasoning, at the cost of more tokens and higher latency.
There are two types of thinking models:
- Hybrid models — you toggle thinking on or off. Qwen3.5 series has thinking on by default; Qwen3, Qwen3-VL, and Qwen3-Omni have it off by default. Toggle with enable_thinking=True/False in extra_body (OpenAI compatible) or as a direct parameter (DashScope).
- Thinking-only models — QwQ and -thinking variants always think; the flag is not available.

You can also prepend /think or /no_think to a user message to toggle per-turn without changing the request parameters. This works for hybrid models only — thinking-only models (QwQ and -thinking variants) cannot disable thinking.
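Toggling thinking on the OpenAI-compatible endpoint can be sketched as a request builder; thinking_request is a hypothetical helper:

```python
def thinking_request(model, user_text, think=True):
    """Build kwargs toggling thinking for a hybrid model.

    On the OpenAI-compatible endpoint, enable_thinking travels in
    extra_body; thinking-only models (QwQ, -thinking variants) do not
    accept the flag.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "extra_body": {"enable_thinking": think},
    }
```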
How does thinking affect billing?
Thinking tokens (reasoning_content) are billed at the output token price. Set thinking_budget to a positive integer token count to cap reasoning; lower values reduce cost but may reduce quality on hard problems.
Batch processing applies the 50% discount to both thinking and answer tokens. See Pricing.
Can I hide thinking from the end user?
Yes. The reasoning_content field is separate from content. You can log or discard reasoning_content on the server and send only content to end users.
When function calling with thinking enabled, you must include reasoning_content in subsequent assistant messages sent back to the model — omitting it degrades accuracy. This is an internal concern that does not affect what the user sees.
Function calling
API: Function calling
How do I call local functions sequentially via function calling?
The standard loop:
- Send user message + tool definitions.
- Model returns a tool_calls response — no final answer yet.
- Execute the named function with the provided arguments.
- Append the tool result as a role: "tool" message.
- Call the model again. Repeat until no tool_calls in the response.
- The final response (no tool_calls) is the answer.
Set parallel_tool_calls=True only when tasks are independent. For dependent tasks (tool A's input depends on tool B's output), run tool calls serially: after each tool result, send it back to the model and wait for the next tool_calls response before executing the next tool.
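The full loop might look like this sketch; run_tools, the impls mapping (tool name to local callable), and the round cap are assumptions of the sketch, not part of the API:

```python
import json

def run_tools(client, model, messages, tools, impls, max_rounds=8):
    """Drive the function-calling loop until the model stops requesting tools."""
    for _ in range(max_rounds):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=tools)
        msg = resp.choices[0].message
        if not msg.tool_calls:               # no tool request: final answer
            return msg.content
        messages.append(msg)                 # keep the assistant turn with tool_calls
        for call in msg.tool_calls:          # serial execution; order may matter
            result = impls[call.function.name](**json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    raise RuntimeError("tool loop did not converge")
```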
What is the difference between function calling and built-in tools?
Function calling (tools parameter) lets you define custom functions that run in your application. The model returns a structured instruction; your code executes the function and returns the result.
Built-in tools (web search, code interpreter, image search) are provided by the platform. Pass them as tool definitions using their specific type values — no custom execution code needed. See Web search, Code interpreter.
Tool descriptions count as input tokens and are billed as part of the prompt.
Can I call two local functions sequentially via function calling?
Yes — this is the standard loop. After receiving a tool_calls response, execute the function, send the result back as a role: "tool" message, and call the model again. The model returns the next tool_calls response. Repeat until no tool_calls remain. See the function calling loop above.
Batch
API: Batch API
When should I use batch vs. real-time requests?
Use batch for large offline workloads — document processing, dataset annotation, evaluation pipelines — where a few hours of latency is acceptable. Batch requests are billed at 50% of the real-time price.
Use real-time requests for interactive use cases where users wait for the response.
What are the batch size limits?
| Limit | Value |
|---|---|
| Requests per file | 50,000 |
| File size | 500 MB |
| Per-line size | 6 MB |
| Models per file | 1 (all requests must use the same model) |
| Completion window | 24 h – 336 h (14 days) |
Supported models: qwen-max, qwen-plus, qwen-flash, qwen-turbo.
How do I retrieve batch results?
- Poll the batch status every 1–2 minutes using client.batches.retrieve(batch_id) (Python OpenAI SDK) or the equivalent DashScope endpoint.
- Status lifecycle: validating → in_progress → finalizing → completed.
- When completed, download with client.files.content(output_file_id).
- Download the error file (error_file_id) separately for any failed requests.
- Match results to the original requests by custom_id.
The batch discount does not stack with context cache or other discounts. Only successful requests are billed.
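Polling might look like this sketch; wait_for_batch is a hypothetical helper with an injectable sleep for testability, and the failure states in the terminal set are an assumption:

```python
import time

def wait_for_batch(client, batch_id, poll_seconds=90, sleep=time.sleep):
    """Poll until the batch reaches a terminal status, then return it."""
    terminal = {"completed", "failed", "expired", "cancelled"}
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in terminal:
            return batch
        sleep(poll_seconds)          # recommended: every 1-2 minutes
```

Once the returned batch is completed, download client.files.content(batch.output_file_id) and, if present, the error file via batch.error_file_id.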
Models
How many languages do the Qwen models support?
Qwen3.5 and Qwen3-VL series models support 33 languages, including Chinese, English, Japanese, Korean, Arabic, Spanish, French, Portuguese, German, Italian, Russian, Vietnamese, Thai, and Indonesian. See Model selection for model-specific details.
How are large language model parameters stored?
Open-source models can be downloaded from ModelScope. Model structure is defined in JSON configuration files, with learned weights stored as vector data files. Use Python libraries such as transformers to load and parse them.