Translate audio and video through the OpenAI-compatible chat completions endpoint. Both streaming and non-streaming calls are supported.
The DashScope interface is not supported.
Supported models
- qwen3-livetranslate-flash
- qwen3-livetranslate-flash-2025-12-01
Prerequisites
- Get an API key
- Set the API key as an environment variable
- Install the OpenAI SDK (Python or Node.js)
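To make the examples below work without pasting the key into code, export it in the shell that runs them (a minimal sketch for macOS/Linux; the key value is a placeholder):

```shell
# Set the API key for the current shell session (placeholder value)
export DASHSCOPE_API_KEY="your-api-key"
# Confirm it is set
echo "$DASHSCOPE_API_KEY"
```

On Windows, use `set DASHSCOPE_API_KEY=...` (cmd) or `$env:DASHSCOPE_API_KEY = "..."` (PowerShell) instead.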
Endpoints
| SDK base_url | HTTP endpoint |
|---|---|
| https://dashscope-intl.aliyuncs.com/compatible-mode/v1 | POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions |
Getting started
These examples translate an audio file and stream back text and audio.
Python

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-livetranslate-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
                        "format": "wav",
                    },
                }
            ],
        }
    ],
    modalities=["text", "audio"],
    audio={"voice": "Cherry", "format": "wav"},
    stream=True,
    stream_options={"include_usage": True},
    # Non-standard OpenAI parameter: passed via extra_body in the Python SDK
    extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

for chunk in completion:
    print(chunk)
```
Node.js

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});

async function main() {
  const completion = await client.chat.completions.create({
    model: "qwen3-livetranslate-flash",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "input_audio",
            input_audio: {
              data: "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
              format: "wav",
            },
          },
        ],
      },
    ],
    modalities: ["text", "audio"],
    audio: { voice: "Cherry", format: "wav" },
    stream: true,
    stream_options: { include_usage: true },
    // Non-standard OpenAI parameter: passed at the top level in Node.js
    translation_options: { source_lang: "zh", target_lang: "en" },
  });

  for await (const chunk of completion) {
    console.log(JSON.stringify(chunk));
  }
}

main();
```
curl

```shell
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-livetranslate-flash",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "input_audio",
            "input_audio": {
              "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
              "format": "wav"
            }
          }
        ]
      }
    ],
    "modalities": ["text", "audio"],
    "audio": {
      "voice": "Cherry",
      "format": "wav"
    },
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "translation_options": {
      "source_lang": "zh",
      "target_lang": "en"
    }
  }'
```
To translate video, set the content type to video_url. All other parameters stay the same.
```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
                },
            }
        ],
    }
]
```
Request body
Required parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | Model name: qwen3-livetranslate-flash or qwen3-livetranslate-flash-2025-12-01. |
| messages | array | A single user message. See Message object. |
| stream | boolean | Whether to stream the response. Use true for real-time translation progress, false for simpler integration. |
| translation_options | object | Translation settings. Non-standard OpenAI parameter: pass it in extra_body for the Python SDK, or at the top level for Node.js/HTTP. See Translation options. |
Optional parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| modalities | array | ["text"] | Output mode. Use ["text", "audio"] for text and audio, or ["text"] for text only. |
| audio | object | - | Audio output settings. Required when modalities includes "audio". See Audio output options. |
| stream_options | object | - | Streaming settings. See Stream options. |
| max_tokens | integer | Model maximum | Maximum tokens to generate. Output is truncated if exceeded. |
| seed | integer | - | Random seed for reproducible output. Range: [0, 2^31-1]. |
Sampling parameters
Keep these at their defaults for best translation accuracy.
| Parameter | Type | Default | Range | Notes |
|---|---|---|---|---|
| temperature | float | 0.000001 | [0, 2) | Controls output diversity. |
| top_p | float | 0.8 | (0, 1.0] | Nucleus sampling threshold. |
| presence_penalty | float | 0 | [-2.0, 2.0] | Reduces repetition when positive. |
| top_k | integer | 1 | >= 0 | Candidate set size. Disabled when None or > 100 (top_p takes effect instead). Non-standard OpenAI parameter: use extra_body in the Python SDK. |
| repetition_penalty | float | 1.05 | > 0 | Penalizes repeated sequences. Non-standard OpenAI parameter: use extra_body in the Python SDK. |
Message object
The messages array must contain exactly one user message.
content array items:
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | input_audio for audio, video_url for video. |
| input_audio | object | When type is input_audio | Audio input. See below. |
| video_url | object | When type is video_url | Video input. See below. |
input_audio object:
| Field | Type | Required | Description |
|---|---|---|---|
| data | string | Yes | Audio file URL or Base64 data URL. For local files, see Send a Base64-encoded local file. |
| format | string | Yes | Audio format, such as mp3 or wav. |
video_url object:
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | Video file URL. |
Translation options
| Field | Type | Required | Description |
|---|---|---|---|
| source_lang | string | No | Source language code (for example, zh). See Supported languages. If omitted, the model detects it automatically. |
| target_lang | string | Yes | Target language code (for example, en). See Supported languages. |
In the Python SDK, pass translation_options through extra_body:

```python
extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}}
```
Audio output options
Required when modalities is ["text", "audio"].
| Field | Type | Required | Description |
|---|---|---|---|
| voice | string | Yes | Output voice. |
| format | string | Yes | Output audio format. Only wav is supported. |
Stream options
| Field | Type | Default | Description |
|---|---|---|---|
| include_usage | boolean | false | When true, the final chunk includes token usage details. |
Response
The API streams chat.completion.chunk objects in three categories: text, audio, and token usage.
Text chunk
Contains incremental translated text in choices[0].delta.content:
```json
{
  "id": "chatcmpl-c22a54b8-40cc-4a1d-988b-f84cdf86868f",
  "choices": [
    {
      "delta": {
        "content": " of",
        "role": null,
        "audio": null
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1764755440,
  "model": "qwen3-livetranslate-flash",
  "object": "chat.completion.chunk"
}
```
Audio chunk
Contains incremental Base64-encoded audio in choices[0].delta.audio.data:
```json
{
  "id": "chatcmpl-c22a54b8-40cc-4a1d-988b-f84cdf86868f",
  "choices": [
    {
      "delta": {
        "content": null,
        "role": null,
        "audio": {
          "data": "///+//7////+////////////AAAAAAAAAAABA......",
          "expires_at": 1764755440,
          "id": "audio_c22a54b8-40cc-4a1d-988b-f84cdf86868f"
        }
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1764755440,
  "model": "qwen3-livetranslate-flash",
  "object": "chat.completion.chunk"
}
```
Token usage chunk
Sent last when include_usage is true. The choices array is empty. usage holds the token breakdown:
```json
{
  "id": "chatcmpl-c22a54b8-40cc-4a1d-988b-f84cdf86868f",
  "choices": [],
  "created": 1764755440,
  "model": "qwen3-livetranslate-flash",
  "object": "chat.completion.chunk",
  "usage": {
    "completion_tokens": 242,
    "prompt_tokens": 415,
    "total_tokens": 657,
    "completion_tokens_details": {
      "accepted_prediction_tokens": null,
      "audio_tokens": 191,
      "reasoning_tokens": null,
      "rejected_prediction_tokens": null,
      "text_tokens": 51
    },
    "prompt_tokens_details": {
      "audio_tokens": 415,
      "cached_tokens": null,
      "text_tokens": 0,
      "video_tokens": null
    }
  }
}
```
For video input, prompt_tokens_details.audio_tokens includes audio tokens from the video. video_tokens reports the video-specific count.
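The three chunk categories can be merged in a single pass over the stream. A minimal sketch, operating on chunks parsed into dicts (as when reading the SSE stream directly); it assumes the audio chunks concatenate into one Base64 string that is decoded once at the end:

```python
import base64

def collect_stream(chunks):
    """Merge text, audio, and usage chunks from a streamed translation.

    `chunks` is an iterable of chat.completion.chunk objects as dicts.
    Returns (translated_text, audio_bytes, usage) where usage is None
    unless include_usage was set.
    """
    text_parts, audio_b64, usage = [], [], None
    for chunk in chunks:
        if not chunk.get("choices"):            # final chunk: usage only
            usage = chunk.get("usage")
            continue
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):                # text chunk
            text_parts.append(delta["content"])
        audio = delta.get("audio")
        if audio and audio.get("data"):         # audio chunk (Base64 data)
            audio_b64.append(audio["data"])
    audio_bytes = base64.b64decode("".join(audio_b64)) if audio_b64 else b""
    return "".join(text_parts), audio_bytes, usage
```

With the Python SDK's streamed objects, the same fields are exposed as attributes (chunk.choices, delta.content, delta.audio) rather than dict keys.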
Response fields
| Field | Type | Description |
|---|---|---|
| id | string | Request identifier. Same across all chunks. |
| choices | array | Generated content. Empty in the final usage chunk. |
| choices[].delta.content | string | Incremental translated text. null in audio chunks. |
| choices[].delta.audio | object | Incremental audio data. null in text chunks. |
| choices[].delta.audio.data | string | Base64-encoded audio segment. |
| choices[].delta.audio.id | string | Output audio identifier. |
| choices[].delta.audio.expires_at | integer | Timestamp when the request was created. |
| choices[].delta.role | string | Message role. Present only in the first chunk. |
| choices[].finish_reason | string | stop when complete, length when truncated by max_tokens, null while in progress. |
| choices[].index | integer | Always 0. |
| created | integer | Request Unix timestamp. Same across all chunks. |
| model | string | Model used. |
| object | string | Always chat.completion.chunk. |
| usage | object | Token usage. Present only in the final chunk when include_usage is true. |
| usage.prompt_tokens | integer | Total input tokens. |
| usage.completion_tokens | integer | Total output tokens. |
| usage.total_tokens | integer | Sum of prompt_tokens and completion_tokens. |
| usage.completion_tokens_details.audio_tokens | integer | Output audio tokens. |
| usage.completion_tokens_details.text_tokens | integer | Output text tokens. |
| usage.prompt_tokens_details.audio_tokens | integer | Input audio tokens. For video input, includes audio tokens from the video. |
| usage.prompt_tokens_details.text_tokens | integer | Input text tokens. Always 0. |
| usage.prompt_tokens_details.video_tokens | integer | Input video tokens. Present only for video input. |
Fields fixed to null
These fields exist for OpenAI compatibility but always return null:
reasoning_content, function_call, refusal, tool_calls, logprobs, service_tier, system_fingerprint
Send a Base64-encoded local file
To translate a local audio file, encode it as Base64 and pass it as a data URL to input_audio.data. The format is data:audio/<format>;base64,<base64_data> (for example, data:audio/wav;base64,UklGRiQAAABXQVZFZm10...).
Supported audio formats: WAV, MP3, FLAC, AAC, OGG, OPUS, M4A, WMA, AMR. Sample rate: 8kHz-48kHz.
```python
import base64

# Read the local audio file and encode it as Base64
with open("local_audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "data": f"data:audio/wav;base64,{audio_base64}",
                    "format": "wav",
                },
            }
        ],
    }
]
```
For full examples in Python, Node.js, and curl, see Send a Base64-encoded local file in the developer guide.
Usage notes
- Streaming and non-streaming. Both streaming (stream: true) and non-streaming (stream: false) calls are supported. Use streaming for real-time translation progress; use non-streaming for simpler integration.
- Single message. The messages array accepts exactly one user message.
- Non-standard parameters. translation_options, top_k, and repetition_penalty are not in the standard OpenAI API. In the Python SDK, pass them in extra_body. In Node.js or HTTP, include them at the top level.
- Sampling defaults. Defaults are tuned for translation accuracy. Changing temperature, top_p, top_k, presence_penalty, or repetition_penalty may reduce quality.
- Output audio format. Only wav is supported.
- Auto language detection. If source_lang is omitted, the model detects the input language automatically.
References