Translation

Audio and video file translation

18-language translation

Model details

Model | Version | Context window | Max input | Max output
qwen3-livetranslate-flash | Stable | 53,248 tokens | 49,152 tokens | 4,096 tokens
qwen3-livetranslate-flash-2025-12-01 | Snapshot | 53,248 tokens | 49,152 tokens | 4,096 tokens
qwen3-livetranslate-flash currently has the same capabilities as qwen3-livetranslate-flash-2025-12-01.

Getting started

Prerequisites

  1. Get an API key.
  2. Set it as an environment variable.
  3. (Optional) If you use the OpenAI SDK, install the SDK.
All examples use the OpenAI-compatible streaming API with translation_options to set the source and target languages. The default input is audio. To translate a video file instead, uncomment the video input block in each example.
Specifying source_lang improves translation accuracy. Omitting it enables automatic language detection.
  • Python
  • Node.js
  • curl
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# --- Audio input ---
messages = [
  {
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "input_audio": {
          "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
          "format": "wav",
        },
      }
    ],
  }
]

# --- Video input (uncomment to use) ---
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "video_url",
#                 "video_url": {
#                     "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
#                 },
#             }
#         ],
#     },
# ]

completion = client.chat.completions.create(
  model="qwen3-livetranslate-flash",
  messages=messages,
  modalities=["text", "audio"],
  audio={"voice": "Cherry", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
  # translation_options is not a standard OpenAI parameter; pass it through extra_body
  extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

for chunk in completion:
  print(chunk)
These examples use a public file URL.

Send a Base64-encoded local file

To translate a local audio file, read it and encode it as Base64. Pass the data as a data URI with the format data:audio/<format>;base64,<base64_data> (for example, data:audio/wav;base64,UklGRiQAAABXQVZFZm10...).
Supported audio formats: WAV, MP3, FLAC, AAC, OGG, OPUS, M4A, WMA, AMR. Supported sample rates: 8 kHz to 48 kHz.
  • Python
  • Node.js
  • curl
import os
import base64
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Read and encode a local audio file
with open("local_audio.wav", "rb") as f:
  audio_base64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
  model="qwen3-livetranslate-flash",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": f"data:audio/wav;base64,{audio_base64}",
            "format": "wav",
          },
        }
      ],
    }
  ],
  modalities=["text", "audio"],
  audio={"voice": "Cherry", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
  extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

for chunk in completion:
  print(chunk)

Request parameters

Input

The messages array must contain exactly one message with role set to user. The content field holds the audio or video to translate:
  • Audio: Set type to input_audio. Provide the file URL or a data URI (for example, data:audio/wav;base64,<base64_data>) in input_audio.data, and specify the format (for example, wav) in input_audio.format. See Send a Base64-encoded local file for details.
  • Video: Set type to video_url. Provide the file URL in video_url.url.
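The two payload shapes above can be sketched as a small helper. `build_messages` is a hypothetical convenience function for illustration, not part of any SDK:

```python
def build_messages(source, media_type="audio", audio_format="wav"):
    """Build the single-turn messages payload for audio or video input.

    source: a file URL or, for audio, a data URI.
    media_type: "audio" or "video" (the two input types described above).
    """
    if media_type == "audio":
        item = {"type": "input_audio",
                "input_audio": {"data": source, "format": audio_format}}
    elif media_type == "video":
        item = {"type": "video_url", "video_url": {"url": source}}
    else:
        raise ValueError(f"unsupported media_type: {media_type}")
    # Exactly one user message, per the single-turn constraint
    return [{"role": "user", "content": [item]}]
```

Pass the result directly as `messages=` in `client.chat.completions.create(...)`.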

Translation options

Specify the source and target languages in the translation_options parameter:
"translation_options": {"source_lang": "zh", "target_lang": "en"}
In the Python SDK, translation_options is not a standard OpenAI parameter. Pass it through extra_body:
extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}}

Output modality

Control the output format with the modalities parameter:
modalities value | Output
["text"] | Translated text only
["text", "audio"] | Translated text and Base64-encoded synthesized audio
When the output includes audio, set the voice in the audio parameter. See Supported voices for available options.
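For text-only output, the request differs from the earlier examples only in these fields. The sketch below collects the non-message arguments as a plain dict; this is an illustrative pattern under the parameters documented above, not an SDK requirement:

```python
# Keyword arguments for a text-only translation request.
# With modalities=["text"], the `audio` voice parameter is omitted entirely.
text_only_kwargs = {
    "model": "qwen3-livetranslate-flash",
    "modalities": ["text"],
    "stream": True,
    "stream_options": {"include_usage": True},
    "extra_body": {"translation_options": {"source_lang": "zh", "target_lang": "en"}},
}
# completion = client.chat.completions.create(messages=messages, **text_only_kwargs)
```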

Constraints

  • Single-turn only: The model handles one translation per request. Multi-turn conversations are not supported.
  • No system message: The system role is not supported.

Parse the response

Each streaming chunk object contains:
  • Text: chunk.choices[0].delta.content
  • Audio: chunk.choices[0].delta.audio["data"] (Base64-encoded, 24 kHz sample rate)
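A defensive way to pull both fields out of each chunk is sketched below. `extract_delta` is a hypothetical helper; the attribute layout it assumes matches the OpenAI SDK chunk objects used in the examples above:

```python
def extract_delta(chunk):
    """Return (text, audio_b64) from a streaming chunk; either may be None."""
    if not chunk.choices:
        return None, None  # the final usage-only chunk has no choices
    delta = chunk.choices[0].delta
    text = getattr(delta, "content", None)
    audio = getattr(delta, "audio", None)
    # Audio deltas carry Base64 data under the "data" key
    audio_b64 = audio.get("data") if isinstance(audio, dict) else None
    return text, audio_b64
```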

Save audio to a file

Concatenate all Base64 audio fragments from the stream, then decode and save the result after the stream completes.
  • Python
  • Node.js
import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

messages = [
  {
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "input_audio": {
          "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
          "format": "wav",
        },
      }
    ],
  }
]

completion = client.chat.completions.create(
  model="qwen3-livetranslate-flash",
  messages=messages,
  modalities=["text", "audio"],
  audio={"voice": "Cherry", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
  extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Concatenate Base64 fragments, then decode after the stream completes
audio_string = ""
for chunk in completion:
  if chunk.choices:
    if hasattr(chunk.choices[0].delta, "audio"):
      try:
        # Audio chunks carry Base64 data; transcript chunks carry translated text instead
        audio_string += chunk.choices[0].delta.audio["data"]
      except KeyError:
        print(chunk.choices[0].delta.audio["transcript"])
  else:
    print(chunk.usage)

wav_bytes = base64.b64decode(audio_string)
audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
sf.write("output.wav", audio_np, samplerate=24000)

Real-time playback

Decode each Base64 fragment as it arrives and play it directly. This approach requires platform-specific audio libraries.
  • Python
  • Node.js
Install pyaudio first:
Platform | Installation
macOS | brew install portaudio && pip install pyaudio
Ubuntu / Debian | sudo apt-get install python-pyaudio python3-pyaudio or pip install pyaudio
CentOS | sudo yum install -y portaudio portaudio-devel && pip install pyaudio
Windows | python -m pip install pyaudio
import os
from openai import OpenAI
import base64
import numpy as np
import pyaudio
import time

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

messages = [
  {
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "input_audio": {
          "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
          "format": "wav",
        },
      }
    ],
  }
]

completion = client.chat.completions.create(
  model="qwen3-livetranslate-flash",
  messages=messages,
  modalities=["text", "audio"],
  audio={"voice": "Cherry", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
  extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Initialize PyAudio for real-time playback
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

for chunk in completion:
  if chunk.choices:
    if hasattr(chunk.choices[0].delta, "audio"):
      try:
        # Audio chunks carry Base64 data; transcript chunks carry translated text instead
        audio_data = chunk.choices[0].delta.audio["data"]
        wav_bytes = base64.b64decode(audio_data)
        audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
        stream.write(audio_np.tobytes())
      except KeyError:
        print(chunk.choices[0].delta.audio["transcript"])

time.sleep(0.8)  # let buffered audio finish playing before closing the stream
stream.stop_stream()
stream.close()
p.terminate()

Billing

  • Audio
  • Video
Audio token consumption depends on the audio characteristics (such as sample rate). To see actual token usage, set stream_options.include_usage to true and check the usage field in the response.
Audio shorter than 1 second is billed as 1 second.
For token pricing, see Choose models.
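The usage numbers arrive on the final streaming chunk, which has an empty choices list. A hypothetical helper to pick it out of a finished stream might look like:

```python
def find_usage(chunks):
    """Return the usage object from an iterable of streaming chunks, or None.

    Assumes stream_options={"include_usage": True} was set on the request.
    """
    usage = None
    for chunk in chunks:
        # The usage-only chunk has no choices
        if not chunk.choices and getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
    return usage
```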

Supported languages

The following language codes can be used for source and target languages. Some target languages support text output only.
Language code | Language | Supported output
en | English | Audio, text
zh | Chinese | Audio, text
ru | Russian | Audio, text
fr | French | Audio, text
de | German | Audio, text
pt | Portuguese | Audio, text
es | Spanish | Audio, text
it | Italian | Audio, text
id | Indonesian | Text
ko | Korean | Audio, text
ja | Japanese | Audio, text
vi | Vietnamese | Text
th | Thai | Text
ar | Arabic | Text
yue | Cantonese | Audio, text
hi | Hindi | Text
el | Greek | Text
tr | Turkish | Text

Supported voices

Set the voice parameter when the output includes synthesized audio.
Voice name | voice parameter | Description | Supported languages
Cherry | Cherry | A cheerful, friendly, and genuine young woman. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Ethan | Ethan | Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish | Nofish | A designer who has difficulty pronouncing retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai-Jada | Jada | A bustling and energetic Shanghai lady. | Chinese
Beijing-Dylan | Dylan | A young man who grew up in the hutongs of Beijing. | Chinese
Sichuan-Sunny | Sunny | A sweet girl from Sichuan. | Chinese
Tianjin-Peter | Peter | A voice in the style of a Tianjin crosstalk performer (the supporting role). | Chinese
Cantonese-Kiki | Kiki | A sweet best friend from Hong Kong. | Cantonese
Sichuan-Eric | Eric | A man from Chengdu, Sichuan, who is unconventional and stands out from the crowd. | Chinese

Alternative: Use Qwen-Omni

You can also use Qwen-Omni (qwen3-omni-flash) with a translation prompt to translate audio and video files.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3-omni-flash",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
            "format": "wav",
          },
        },
        {"type": "text", "text": "Translate this audio from Chinese to English."},
      ],
    }
  ],
  modalities=["text"],
  stream=True,
)

for chunk in completion:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")
For full Qwen-Omni capabilities including multimodal conversation, see Audio and video file understanding.

FAQ

When I input a video file, what content is translated?

The model translates the audio track from the video. Visual information serves as context to improve translation accuracy. For example, if the audio says "This is a mask":
  • When the video shows a medical mask, the model translates it as "This is a medical mask."
  • When the video shows a masquerade mask, the model translates it as "This is a masquerade mask."

API reference

For full input and output parameter details, see Audio and video translation - Qwen.