Translation

Audio and video file translation

18-language translation

Model details

Model | Version | Context window | Max input | Max output
qwen3-livetranslate-flash | Stable | 53,248 tokens | 49,152 tokens | 4,096 tokens
qwen3-livetranslate-flash-2025-12-01 | Snapshot | 53,248 tokens | 49,152 tokens | 4,096 tokens
qwen3-livetranslate-flash currently has the same capabilities as qwen3-livetranslate-flash-2025-12-01.

Getting started

Prerequisites

  1. Get an API key.
  2. Set it as an environment variable.
  3. (Optional) If you use the OpenAI SDK, install the SDK.
All examples use the OpenAI-compatible streaming API with translation_options to set the source and target languages. The default input is audio. To translate a video file instead, uncomment the video input block in each example.
Specifying source_lang improves translation accuracy. Omitting it enables automatic language detection.
  • Python
  • Node.js
  • curl
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# --- Audio input ---
messages = [
  {
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "input_audio": {
          "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
          "format": "wav",
        },
      }
    ],
  }
]

# --- Video input (uncomment to use) ---
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "video_url",
#                 "video_url": {
#                     "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
#                 },
#             }
#         ],
#     },
# ]

completion = client.chat.completions.create(
  model="qwen3-livetranslate-flash",
  messages=messages,
  modalities=["text", "audio"],
  audio={"voice": "Cherry", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
  # translation_options is not a standard OpenAI parameter; pass it through extra_body
  extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

for chunk in completion:
  print(chunk)
These examples use a public file URL.

Send a Base64-encoded local file

To translate a local audio file, read it and encode it as Base64. Pass the data as a data URI with the format data:audio/<format>;base64,<base64_data> (for example, data:audio/wav;base64,UklGRiQAAABXQVZFZm10...).
Supported audio formats: WAV, MP3, FLAC, AAC, OGG, OPUS, M4A, WMA, AMR. Supported sample rates: 8 kHz to 48 kHz.
  • Python
  • Node.js
  • curl
import os
import base64
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Read and encode a local audio file
with open("local_audio.wav", "rb") as f:
  audio_base64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
  model="qwen3-livetranslate-flash",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": f"data:audio/wav;base64,{audio_base64}",
            "format": "wav",
          },
        }
      ],
    }
  ],
  modalities=["text", "audio"],
  audio={"voice": "Cherry", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
  extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

for chunk in completion:
  print(chunk)

Request parameters

Input

The messages array must contain exactly one message with role set to user. The content field holds the audio or video to translate:
  • Audio: Set type to input_audio. Provide the file URL or a data URI (for example, data:audio/wav;base64,<base64_data>) in input_audio.data, and specify the format (for example, wav) in input_audio.format. See Send a Base64-encoded local file for details.
  • Video: Set type to video_url. Provide the file URL in video_url.url.
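The two payload shapes above can be sketched as a small helper. `build_messages` is a hypothetical convenience function for illustration, not part of any SDK:

```python
def build_messages(source, media_type="audio", audio_format="wav"):
    """Build the single-turn messages payload for audio or video input.

    source: a file URL or, for audio, a data URI.
    media_type: "audio" or "video" (the two input types described above).
    """
    if media_type == "audio":
        item = {"type": "input_audio",
                "input_audio": {"data": source, "format": audio_format}}
    elif media_type == "video":
        item = {"type": "video_url", "video_url": {"url": source}}
    else:
        raise ValueError(f"unsupported media_type: {media_type}")
    # Exactly one user message, per the single-turn constraint
    return [{"role": "user", "content": [item]}]
```

Pass the result directly as `messages=` in `client.chat.completions.create(...)`.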

Translation options

Specify the source and target languages in the translation_options parameter:
"translation_options": {"source_lang": "zh", "target_lang": "en"}
In the Python SDK, translation_options is not a standard OpenAI parameter. Pass it through extra_body:
extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}}

Output modality

Control the output format with the modalities parameter:
modalities value | Output
["text"] | Translated text only
["text", "audio"] | Translated text and Base64-encoded synthesized audio
When the output includes audio, set the voice in the audio parameter. See Supported voices for available options.
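For text-only output, the request differs from the earlier examples only in these fields. The sketch below collects the non-message arguments as a plain dict; this is an illustrative pattern under the parameters documented above, not an SDK requirement:

```python
# Keyword arguments for a text-only translation request.
# With modalities=["text"], the `audio` voice parameter is omitted entirely.
text_only_kwargs = {
    "model": "qwen3-livetranslate-flash",
    "modalities": ["text"],
    "stream": True,
    "stream_options": {"include_usage": True},
    "extra_body": {"translation_options": {"source_lang": "zh", "target_lang": "en"}},
}
# completion = client.chat.completions.create(messages=messages, **text_only_kwargs)
```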

Constraints

  • Single-turn only: The model handles one translation per request. Multi-turn conversations are not supported.
  • No system message: The system role is not supported.

Parse the response

Each streaming chunk object contains:
  • Text: chunk.choices[0].delta.content
  • Audio: chunk.choices[0].delta.audio["data"] (Base64-encoded, 24 kHz sample rate)
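A defensive way to pull both fields out of each chunk is sketched below. `extract_delta` is a hypothetical helper; the attribute layout it assumes matches the OpenAI SDK chunk objects used in the examples above:

```python
def extract_delta(chunk):
    """Return (text, audio_b64) from a streaming chunk; either may be None."""
    if not chunk.choices:
        return None, None  # the final usage-only chunk has no choices
    delta = chunk.choices[0].delta
    text = getattr(delta, "content", None)
    audio = getattr(delta, "audio", None)
    # Audio deltas carry Base64 data under the "data" key
    audio_b64 = audio.get("data") if isinstance(audio, dict) else None
    return text, audio_b64
```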

Save audio to a file

Concatenate all Base64 audio fragments from the stream, then decode and save the result after the stream completes.
  • Python
  • Node.js
import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

messages = [
  {
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "input_audio": {
          "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
          "format": "wav",
        },
      }
    ],
  }
]

completion = client.chat.completions.create(
  model="qwen3-livetranslate-flash",
  messages=messages,
  modalities=["text", "audio"],
  audio={"voice": "Cherry", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
  extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Concatenate Base64 fragments, then decode after the stream completes
audio_string = ""
for chunk in completion:
  if chunk.choices:
    if hasattr(chunk.choices[0].delta, "audio"):
      try:
        # Audio chunks carry Base64 data; transcript chunks carry translated text instead
        audio_string += chunk.choices[0].delta.audio["data"]
      except KeyError:
        print(chunk.choices[0].delta.audio["transcript"])
  else:
    print(chunk.usage)

wav_bytes = base64.b64decode(audio_string)
audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
sf.write("output.wav", audio_np, samplerate=24000)

Real-time playback

Decode each Base64 fragment as it arrives and play it directly. This approach requires platform-specific audio libraries.
  • Python
  • Node.js
Install pyaudio first:
Platform | Installation
macOS | brew install portaudio && pip install pyaudio
Ubuntu / Debian | sudo apt-get install python-pyaudio python3-pyaudio or pip install pyaudio
CentOS | sudo yum install -y portaudio portaudio-devel && pip install pyaudio
Windows | python -m pip install pyaudio
import os
from openai import OpenAI
import base64
import numpy as np
import pyaudio
import time

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

messages = [
  {
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "input_audio": {
          "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
          "format": "wav",
        },
      }
    ],
  }
]

completion = client.chat.completions.create(
  model="qwen3-livetranslate-flash",
  messages=messages,
  modalities=["text", "audio"],
  audio={"voice": "Cherry", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
  extra_body={"translation_options": {"source_lang": "zh", "target_lang": "en"}},
)

# Initialize PyAudio for real-time playback
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

for chunk in completion:
  if chunk.choices:
    if hasattr(chunk.choices[0].delta, "audio"):
      try:
        # Audio chunks carry Base64 data; transcript chunks carry translated text instead
        audio_data = chunk.choices[0].delta.audio["data"]
        wav_bytes = base64.b64decode(audio_data)
        audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
        stream.write(audio_np.tobytes())
      except KeyError:
        print(chunk.choices[0].delta.audio["transcript"])

time.sleep(0.8)  # let buffered audio finish playing before closing the stream
stream.stop_stream()
stream.close()
p.terminate()

Billing

  • Audio
  • Video
Audio token consumption depends on the audio characteristics (such as sample rate). To see actual token usage, set stream_options.include_usage to true and check the usage field in the response.
Audio shorter than 1 second is billed as 1 second.
For token pricing, see Choose models.
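The usage numbers arrive on the final streaming chunk, which has an empty choices list. A hypothetical helper to pick it out of a finished stream might look like:

```python
def find_usage(chunks):
    """Return the usage object from an iterable of streaming chunks, or None.

    Assumes stream_options={"include_usage": True} was set on the request.
    """
    usage = None
    for chunk in chunks:
        # The usage-only chunk has no choices
        if not chunk.choices and getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
    return usage
```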

Supported languages

The following language codes can be used for source and target languages. Some target languages support text output only.
Language code | Language | Supported output
en | English | Audio, text
zh | Chinese | Audio, text
ru | Russian | Audio, text
fr | French | Audio, text
de | German | Audio, text
pt | Portuguese | Audio, text
es | Spanish | Audio, text
it | Italian | Audio, text
id | Indonesian | Text
ko | Korean | Audio, text
ja | Japanese | Audio, text
vi | Vietnamese | Text
th | Thai | Text
ar | Arabic | Text
yue | Cantonese | Audio, text
hi | Hindi | Text
el | Greek | Text
tr | Turkish | Text

Supported voices

Set the voice parameter when the output includes synthesized audio.
Voice name | voice parameter | Description | Supported languages
Cherry | Cherry | A cheerful, friendly, and genuine young woman. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Ethan | Ethan | Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish | Nofish | A designer who has difficulty pronouncing retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai-Jada | Jada | A bustling and energetic Shanghai lady. | Chinese
Beijing-Dylan | Dylan | A young man who grew up in the hutongs of Beijing. | Chinese
Sichuan-Sunny | Sunny | A sweet girl from Sichuan. | Chinese
Tianjin-Peter | Peter | A voice in the style of a Tianjin crosstalk performer (the supporting role). | Chinese
Cantonese-Kiki | Kiki | A sweet best friend from Hong Kong. | Cantonese
Sichuan-Eric | Eric | A man from Chengdu, Sichuan, who is unconventional and stands out from the crowd. | Chinese

Alternative: Use Qwen-Omni

You can also use Qwen-Omni (qwen3-omni-flash) with a translation prompt to translate audio and video files.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3-omni-flash",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
            "format": "wav",
          },
        },
        {"type": "text", "text": "Translate this audio from Chinese to English."},
      ],
    }
  ],
  modalities=["text"],
  stream=True,
)

for chunk in completion:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")
For full Qwen-Omni capabilities including multimodal conversation, see Audio and video file understanding.

FAQ

When I input a video file, what content is translated?

The model translates the audio track from the video. Visual information serves as context to improve translation accuracy. For example, if the audio says "This is a mask":
  • When the video shows a medical mask, the model translates it as "This is a medical mask."
  • When the video shows a masquerade mask, the model translates it as "This is a masquerade mask."

API reference

For full input and output parameter details, see Audio and video translation - Qwen.