
Streaming output

Receive text output token by token as it is generated.

Enable streaming

  • OpenAI Chat Completions
  • OpenAI Responses API
  • DashScope
Usage statistics are not returned by default. Set stream_options={"include_usage": True} to receive token counts, which arrive in the final chunk only.
import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv("DASHSCOPE_API_KEY"), base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1")

completion = client.chat.completions.create(
  model="qwen3.6-plus",
  messages=[{"role": "user", "content": "Hi"}],
  stream=True,                              # ← enable streaming
  stream_options={"include_usage": True},    # ← usage in last chunk only
)
for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
  elif chunk.usage:                         # ← last chunk: usage only
    print(f"\nTokens: {chunk.usage.total_tokens}")

Event format

Each SSE event is a data: line containing a JSON chunk. The final data: [DONE] signals the end of the stream.
data: {"choices":[{"delta":{"content":"I am"},...,"finish_reason":null}],...}
data: {"choices":[{"delta":{"content":" Qwen"},...,"finish_reason":null}],...}
data: {"choices":[{"delta":{"content":""},...,"finish_reason":"stop"}],...}
data: [DONE]
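Clients that consume the raw SSE stream without an SDK must strip the data: prefix, detect the [DONE] sentinel, and JSON-decode each chunk themselves. A minimal sketch (the helper name parse_sse_line is our own, not part of any SDK):

```python
import json

def parse_sse_line(line: str):
    """Parse one SSE line: return a dict for a JSON chunk, the string
    "DONE" for the end-of-stream marker, or None for non-data lines."""
    if not line.startswith("data:"):
        return None                       # comments, blank keep-alive lines, etc.
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return "DONE"                     # end of stream
    return json.loads(payload)            # a regular chunk

# Reassemble the text deltas from events shaped like the samples above
events = [
    'data: {"choices":[{"delta":{"content":"I am"},"finish_reason":null}]}',
    'data: {"choices":[{"delta":{"content":" Qwen"},"finish_reason":null}]}',
    'data: {"choices":[{"delta":{"content":""},"finish_reason":"stop"}]}',
    "data: [DONE]",
]
text = ""
for line in events:
    chunk = parse_sse_line(line)
    if chunk in (None, "DONE"):
        continue
    text += chunk["choices"][0]["delta"]["content"]
print(text)  # → I am Qwen
```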

Streaming with thinking mode

Two-phase streaming: thinking first, then the answer.
  • OpenAI Chat Completions
  • OpenAI Responses API
  • DashScope
for chunk in completion:
  delta = chunk.choices[0].delta
  if hasattr(delta, "reasoning_content") and delta.reasoning_content:
    print(delta.reasoning_content, end="", flush=True)  # ← phase 1: thinking
  if hasattr(delta, "content") and delta.content:
    print(delta.content, end="", flush=True)             # ← phase 2: answer
→ Full config: Reasoning | Qwen3-Omni: Audio and video
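The phase transition is the only extra state the loop above needs: remember whether the answer has started so a separator is printed exactly once. This sketch exercises that logic offline with fabricated delta objects (not real API chunks):

```python
from types import SimpleNamespace

# Fabricated deltas simulating a two-phase stream (not real API output)
deltas = [
    SimpleNamespace(reasoning_content="Let me think.", content=None),
    SimpleNamespace(reasoning_content=None, content="Hello"),
    SimpleNamespace(reasoning_content=None, content="!"),
]

answer_started = False
out = []
for delta in deltas:
    if getattr(delta, "reasoning_content", None):
        out.append(delta.reasoning_content)      # phase 1: thinking
    if getattr(delta, "content", None):
        if not answer_started:
            out.append("\n--- answer ---\n")     # separator, printed once
            answer_started = True
        out.append(delta.content)                # phase 2: answer
print("".join(out))
```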

Streaming with tool calls

When streaming function calling responses, tool call arguments arrive as incremental deltas that must be concatenated before JSON parsing.
  • Chat Completions
  • Responses API
Each chunk's delta may contain tool_calls[i].function.arguments — a partial JSON string. Accumulate the fragments per tool call index, then parse the complete string with json.loads() once the stream ends.
import json

tool_args = {}  # tool call index → accumulated argument string
for chunk in completion:
  delta = chunk.choices[0].delta
  if delta.tool_calls:
    for tc in delta.tool_calls:
      tool_args.setdefault(tc.index, "")
      tool_args[tc.index] += tc.function.arguments or ""
# After stream ends:
for idx, args_str in tool_args.items():
  parsed = json.loads(args_str)
With thinking mode enabled, the stream delivers three phases: thinking tokens, then tool call deltas, then (after you send tool results) the final answer.
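The accumulate-then-parse step can be verified offline; the fragments below are simulated, not real API output:

```python
import json

# Simulated argument fragments for a single tool call at index 0
fragments = ['{"loca', 'tion": "Par', 'is", "unit": "celsius"}']

tool_args = {}
for frag in fragments:
    tool_args.setdefault(0, "")
    tool_args[0] += frag           # concatenate partial JSON strings

parsed = json.loads(tool_args[0])  # parse only after the stream is complete
print(parsed["location"])          # → Paris
```

Parsing any individual fragment would raise json.JSONDecodeError, which is why concatenation must finish first.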

Notes

  • Nginx proxy: set proxy_buffering off, or SSE events will be buffered and delivered in bursts instead of incrementally
  • High concurrency: size connection pool, monitor file descriptors
  • Web frontend: use ReadableStream + TextDecoderStream
  • Quality: streaming does not affect response quality
  • Streaming-only models: QwQ and QVQ only support streaming output. Non-streaming calls to these models will fail or return empty content.
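For the Nginx note above, a minimal sketch of a location block that disables buffering for a streaming endpoint (the path, upstream name, and timeout are assumptions, not values from this guide):

```
location /v1/ {
    proxy_pass http://backend;        # assumed upstream
    proxy_http_version 1.1;
    proxy_set_header Connection "";   # keep-alive to upstream
    proxy_buffering off;              # deliver SSE events as they arrive
    proxy_cache off;
    proxy_read_timeout 3600s;         # tolerate long-lived streams
}
```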