
Text generation FAQ

Common questions about streaming, context management, context caching, structured output, thinking mode, function calling, and batch.

Streaming

APIs: OpenAI compatible, DashScope

Why does streaming stop mid-response?

Check your proxy or infrastructure:
  • Nginx buffering — Nginx buffers SSE by default. Add proxy_buffering off to your server config.
  • Request timeout — If the timeout is shorter than the response time, the connection closes early. Increase it.
  • Firewall closing idle connections — Some firewalls close connections that appear idle during long pauses between tokens.
  • data: [DONE] never arrives — Ensure your client reads the stream until the connection closes or the sentinel is received.

How do I accumulate streaming chunks into a full response?

  • OpenAI compatible — Set stream=True. Each chunk has choices[0].delta.content; concatenate all non-null content values. To get token usage from the last chunk, also set stream_options={"include_usage": True}.
  • DashScope — Set incremental_output=True. Each chunk contains only the new tokens (not the full response so far); concatenate them in order. Token usage is available on every chunk.
  • Thinking models — For models that think before responding (QwQ, Qwen3 with enable_thinking=True), chunks arrive in two phases: first reasoning_content, then content. Accumulate each separately.
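
The accumulation logic described above can be sketched as a small helper. This is a minimal sketch that models streaming chunks as plain dicts rather than SDK objects; the field names (choices, delta, content, reasoning_content, usage) follow the description above.

```python
def accumulate(chunks):
    """Concatenate streaming deltas into reasoning text, answer text, and usage."""
    reasoning, answer, usage = [], [], None
    for chunk in chunks:
        if chunk.get("usage"):  # present on the final chunk with include_usage
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("reasoning_content"):  # thinking phase, if any
                reasoning.append(delta["reasoning_content"])
            if delta.get("content"):            # answer phase
                answer.append(delta["content"])
    return "".join(reasoning), "".join(answer), usage
```

With a real SDK stream you would iterate the response object directly; the two-phase accumulation is the same.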

Which models require streaming?

QwQ and QVQ are streaming-only. Calling them without stream=True fails or returns an empty response. Most other Qwen3 models work in both modes. For qwen3-omni-flash, streaming is required only when using function calling; all other uses support both modes.

Context and multi-turn

APIs: Multi-turn conversations, Context caching

How does multi-turn conversation work?

The API is stateless — it does not store conversation history. You maintain the messages array yourself. After each round, append both the assistant's reply and the next user message, then send the full array with the next request. The Responses API (/v2/apps/...) offers a shortcut: pass previous_response_id to link turns automatically without managing the array manually. Response IDs expire after 7 days.
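
Maintaining the messages array yourself can be as simple as the following sketch (the class name is illustrative, not part of any SDK):

```python
class Conversation:
    """Maintain the messages array for a stateless chat API across turns."""

    def __init__(self, system_prompt=None):
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def add_user(self, text):
        """Append the next user message; returns the full array to send."""
        self.messages.append({"role": "user", "content": text})
        return self.messages

    def add_assistant(self, text):
        """Append the assistant's reply after each round, per the loop above."""
        self.messages.append({"role": "assistant", "content": text})
```

After each API call, pass the assistant's reply to add_assistant, then send the array returned by the next add_user call.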

How do I control conversation length and avoid token overflow?

Common strategies:
  • Truncation — drop the oldest messages when total tokens approach the context limit.
  • Summarization — summarize older turns into a single system or user message.
  • Retrieval — store turns in a vector store and retrieve only the relevant ones per request.
Monitor usage.prompt_tokens in each response to track growth. Context limits vary by model — see Model selection.
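
The truncation strategy above can be sketched as a pure function. The default token estimator (roughly four characters per token, plus per-message overhead) is a crude assumption for illustration; in practice, track usage.prompt_tokens from real responses.

```python
def truncate(messages, max_tokens, estimate=lambda m: len(m["content"]) // 4 + 4):
    """Drop the oldest non-system messages until the token estimate fits."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(map(estimate, system + rest)) > max_tokens:
        rest.pop(0)  # drop the oldest turn first; system prompt is preserved
    return system + rest
```

Note that dropping an assistant message without its paired user message (or vice versa) is usually harmless but can be refined to drop whole turns.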

What is context caching and when should I use it?

Context caching stores a prefix of your prompt so repeated calls re-use it at a reduced token price. There are three modes:
Mode     | How to enable                                              | Cache creation price | Cache hit price    | Minimum tokens
Implicit | Automatic — no config needed                               | Standard input price | 20% of input price | 256
Explicit | Add cache_control: {type: ephemeral} marker                | 125% of input price  | 10% of input price | 1024
Session  | Responses API + x-dashscope-session-cache: enable header   | 125% of input price  | 10% of input price | 1024
Implicit caching applies automatically when the model detects a repeated prefix. Explicit and Session modes charge a cache creation fee when the prefix is written, then bill subsequent hits at the cache hit price. Explicit caching is useful when you control a large, stable system prompt or document that is prepended to every request. Session caching is for Responses API workflows. Cache hits appear in usage.prompt_tokens_details.cached_tokens. Cache is per-account, per-model — not shared across accounts or models.
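
To verify that caching is working, read usage.prompt_tokens_details.cached_tokens as described above. A minimal sketch, modeling the usage object as a dict:

```python
def cache_hit_ratio(usage):
    """Fraction of prompt tokens served from the context cache."""
    prompt = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details", {})
    cached = details.get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0
```

A ratio near zero on repeated requests with a stable prefix suggests the prefix is below the minimum token threshold or is changing between calls.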

Does the API maintain conversation memory automatically?

No. The API is stateless — there is no built-in memory. You maintain the messages array yourself. For persistent memory across sessions, store conversation history externally and include relevant history in each request.

Structured output

API: Structured output

How do I get JSON output reliably?

Set response_format={"type": "json_object"} and include the word "JSON" in your prompt (for example, "Return your answer as a JSON object with keys ..."). The model is then constrained to output valid JSON. Avoid setting a low max_tokens; a truncated response produces invalid JSON.
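
Both requirements (the response_format parameter and the word "JSON" in the prompt) can be enforced in a small request builder. This is a sketch; the helper name and default model are illustrative.

```python
def json_mode_request(prompt, model="qwen-plus"):
    """Build chat-completion kwargs for JSON mode; the prompt must mention JSON."""
    if "json" not in prompt.lower():
        raise ValueError("prompt must mention JSON when using json_object mode")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_object"},
    }
```

Pass the returned dict as keyword arguments to your client's chat-completion call.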

Why does the model output extra text before or after the JSON?

This happens when the prompt does not clearly instruct the model to output only JSON, or when response_format is not set. Add response_format={"type": "json_object"} and phrase the prompt so the expected output is JSON only. If you need structured output from a thinking model (which does not support response_format), collect the full response text, then call json.loads() on it. If that raises a JSONDecodeError, send the raw text to a fast model (for example qwen3.5-flash) with response_format={"type": "json_object"} asking it to repair the JSON.
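
The parse-then-repair fallback described above can be sketched as follows. The repair callable is an assumption standing in for a call to a fast model in JSON mode:

```python
import json

def parse_or_repair(text, repair=None):
    """Parse model output as JSON; on failure, ask a repair model to fix it."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        if repair is None:
            raise
        # repair() would call a fast model with response_format json_object,
        # asking it to return the corrected JSON only.
        return json.loads(repair(text))
```

Keeping the repair step behind a callable makes the fallback easy to unit-test without network access.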

Thinking mode

API: Thinking mode

What is thinking mode?

Thinking mode causes the model to reason step-by-step before producing its final answer. The reasoning appears in reasoning_content; the answer appears in content. This improves accuracy on complex tasks such as math, coding, and multi-step reasoning, at the cost of more tokens and higher latency. There are two types of thinking models:
  • Hybrid models — you toggle thinking on or off. Qwen3.5 series has thinking on by default; Qwen3, Qwen3-VL, and Qwen3-Omni have it off by default. Toggle with enable_thinking=True/False in extra_body (OpenAI compatible) or as a direct parameter (DashScope).
  • Thinking-only models — QwQ and -thinking variants always think; the flag is not available.
Use /think or /no_think at the start of a user message to toggle per-turn without changing the request parameters. This works for hybrid models only — thinking-only models (QwQ and -thinking variants) cannot disable thinking.
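
For hybrid models via the OpenAI-compatible API, the toggle lands in extra_body as described above. A minimal sketch (the helper name and default model are illustrative):

```python
def thinking_request(messages, model="qwen-plus", think=True):
    """OpenAI-compatible request kwargs toggling hybrid-model thinking."""
    return {
        "model": model,
        "messages": messages,
        "stream": True,  # thinking output is typically consumed as a stream
        "extra_body": {"enable_thinking": think},
    }
```

With DashScope, enable_thinking is a direct parameter instead of an extra_body entry.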

How does thinking affect billing?

Thinking tokens (reasoning_content) are billed at the output token price. Set thinking_budget to a positive integer token count to cap reasoning; lower values reduce cost but may reduce quality on hard problems. Batch processing applies the 50% discount to both thinking and answer tokens. See Pricing.

Can I hide thinking from the end user?

Yes. The reasoning_content field is separate from content. You can log or discard reasoning_content on the server and send only content to end users. When function calling with thinking enabled, you must include reasoning_content in subsequent assistant messages sent back to the model — omitting it degrades accuracy. This is an internal concern that does not affect what the user sees.

Function calling

API: Function calling

How do I call local functions sequentially via function calling?

The standard loop:
  1. Send user message + tool definitions.
  2. Model returns a tool_calls response — no final answer yet.
  3. Execute the named function with the provided arguments.
  4. Append the tool result as a role: "tool" message.
  5. Call the model again. Repeat until no tool_calls in the response.
  6. The final response (no tool_calls) is the answer.
Use parallel_tool_calls=True only when tasks are independent. For dependent tasks (tool A's input depends on tool B's output), run tool calls serially: after each tool result, send it back to the model and wait for the next tool_calls response before executing the next tool.
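
The loop above can be sketched with the model call injected as a callable, which keeps the control flow testable. The message shapes follow the tool_calls structure described above; call_model stands in for your SDK's chat-completion call.

```python
import json

def run_tool_loop(call_model, tools, functions, messages):
    """Run the standard function-calling loop until a final answer arrives."""
    while True:
        msg = call_model(messages=messages, tools=tools)
        messages.append(msg)
        calls = msg.get("tool_calls")
        if not calls:
            return msg["content"]  # no tool_calls: this is the final answer
        for call in calls:  # serial execution, safe for dependent tools
            fn = functions[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(fn(**args)),
            })
```

Because each tool result is appended before the next model call, tool A can depend on tool B's output across iterations.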

What is the difference between function calling and built-in tools?

Function calling (tools parameter) lets you define custom functions that run in your application. The model returns a structured instruction; your code executes the function and returns the result. Built-in tools (web search, code interpreter, image search) are provided by the platform. Pass them as tool definitions using their specific type values — no custom execution code needed. See Web search, Code interpreter. Tool descriptions count as input tokens and are billed as part of the prompt.

Can I call two local functions sequentially via function calling?

Yes — this is the standard loop. After receiving a tool_calls response, execute the function, send the result back as a role: "tool" message, and call the model again. The model returns the next tool_calls response. Repeat until no tool_calls remain. See the function calling loop above.

Batch

API: Batch API

When should I use batch vs. real-time requests?

Use batch for large offline workloads — document processing, dataset annotation, evaluation pipelines — where a few hours of latency is acceptable. Batch requests are billed at 50% of the real-time price. Use real-time requests for interactive use cases where users wait for the response.

What are the batch size limits?

Limit             | Value
Requests per file | 50,000
File size         | 500 MB
Per-line size     | 6 MB
Models per file   | 1 (all requests must use the same model)
Completion window | 24 h – 336 h (14 days)
Supported models: qwen-max, qwen-plus, qwen-flash, qwen-turbo.

How do I retrieve batch results?

  1. Poll the batch status every 1–2 minutes using client.batches.retrieve(batch_id) (Python OpenAI SDK) or the equivalent DashScope endpoint.
  2. Status lifecycle: validating → in_progress → finalizing → completed.
  3. When completed, download with client.files.content(output_file_id).
  4. Download the error file (error_file_id) separately for any failed requests.
Results are in JSONL. Each line maps back to your original request via custom_id. The batch discount does not stack with context cache or other discounts. Only successful requests are billed.
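
The polling step can be sketched with the retrieval call injected, which makes the loop testable without network access. retrieve stands in for client.batches.retrieve or the DashScope equivalent; the in-flight status names follow the lifecycle above.

```python
import time

def wait_for_batch(retrieve, batch_id, poll_seconds=90, sleep=time.sleep):
    """Poll batch status every poll_seconds until it leaves the in-flight states."""
    in_flight = {"validating", "in_progress", "finalizing"}
    while True:
        batch = retrieve(batch_id)
        if batch["status"] not in in_flight:
            return batch  # completed, or a terminal failure state
        sleep(poll_seconds)
```

Once the returned status is completed, download the output file and the error file as described above.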

Models

How many languages do the Qwen models support?

Qwen3.5 and Qwen3-VL series models support 33 languages, including Chinese, English, Japanese, Korean, Arabic, Spanish, French, Portuguese, German, Italian, Russian, Vietnamese, Thai, and Indonesian. See Model selection for model-specific details.

How are large language model parameters stored?

Open-source models can be downloaded from ModelScope. Model structure is defined in JSON configuration files, and the learned weights are stored in binary weight files. Use Python libraries such as transformers to load and parse them.

Can the models integrate with structured databases like MySQL or Hive?

Not through the standard API. For database-integrated workflows, use function calling to let the model generate SQL, execute it in your application, and return the results to the model.

API and SDK

How do I view error code information?

See Error Messages for the full list of error codes, descriptions, and recommended solutions.

How do I install the SDK?

Qwen Cloud supports DashScope SDKs (Python, Java) and OpenAI SDKs (Python, Node.js, Java, Go). See Install SDK.

Rate limits and performance

Is text generation speed the same for all users?

No. Generation speed varies by current service load and request concurrency. Speed is not user-configurable.

How long should I wait after hitting a rate limit?

Wait time depends on your rate limit tier (RPM/RPS). For example, a 120 RPM limit averages 2 requests per second; once 2 requests have been sent within the current second, a third request in that second is throttled. Wait approximately the remainder of the second before retrying, and use exponential backoff for robustness.
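
Exponential backoff with jitter can be sketched as follows; the base and cap values are illustrative defaults, not prescribed by the API.

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: random delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

On each throttled request, sleep for backoff_delay(attempt) and increment attempt; reset attempt to zero after a success. The jitter spreads retries out so that many clients do not retry in lockstep.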