Audio and video file understanding

Getting started

Prerequisites

Get an API key and set it as an environment variable.
Qwen-Omni supports only OpenAI-compatible calls, so install the OpenAI SDK: version 1.52.0 or later for Python, or version 4.68.0 or later for Node.js. The Python example below also imports numpy and soundfile.

This example sends a text prompt to the Qwen-Omni API and returns a streaming response that includes both text and audio.

import os
import base64
import soundfile as sf
import numpy as np
from openai import OpenAI

# 1. Initialize the client
client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),  # Make sure the environment variable is configured
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# 2. Initiate the request
try:
  completion = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    messages=[{"role": "user", "content": "Who are you?"}],
    modalities=["text", "audio"],  # Specify text and audio output
    audio={"voice": "Tina", "format": "wav"},
    stream=True,  # Must be set to True
    stream_options={"include_usage": True},
  )

  # 3. Process the streaming response and decode the audio
  print("Model response:")
  audio_base64_string = ""
  for chunk in completion:
    # Process the text part
    if chunk.choices and chunk.choices[0].delta.content:
      print(chunk.choices[0].delta.content, end="")

    # Collect the audio part
    if chunk.choices and hasattr(chunk.choices[0].delta, "audio") and chunk.choices[0].delta.audio:
      audio_base64_string += chunk.choices[0].delta.audio.get("data", "")

  # 4. Save the audio file
  if audio_base64_string:
    wav_bytes = base64.b64decode(audio_base64_string)
    audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
    sf.write("audio_assistant.wav", audio_np, samplerate=24000)
    print("\nAudio file saved to: audio_assistant.wav")

except Exception as e:
  print(f"Request failed: {e}")

Response

After you run the Python or Node.js code, the text response is returned and an audio file named audio_assistant.wav is saved in the same directory as your code file.

Model response:
I am a large language model developed by Alibaba Cloud. My name is Qwen. How can I help you?

Running the HTTP code returns text and Base64-encoded audio data directly in the audio field.

data: {"choices":[{"delta":{"content":"I"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1757647879,"system_fingerprint":null,"model":"qwen3.5-omni-plus","id":"chatcmpl-a68eca3b-c67e-4666-a72f-73c0b4919860"}
data: {"choices":[{"delta":{"content":" am"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1757647879,"system_fingerprint":null,"model":"qwen3.5-omni-plus","id":"chatcmpl-a68eca3b-c67e-4666-a72f-73c0b4919860"}
......
data: {"choices":[{"delta":{"audio":{"data":"/v8AAAAAAAAAAAAAAA...","expires_at":1757647879,"id":"audio_a68eca3b-c67e-4666-a72f-73c0b4919860"}},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1757647879,"system_fingerprint":null,"model":"qwen3.5-omni-plus","id":"chatcmpl-a68eca3b-c67e-4666-a72f-73c0b4919860"}
data: {"choices":[{"finish_reason":"stop","delta":{"content":""},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1764763585,"system_fingerprint":null,"model":"qwen3.5-omni-plus","id":"chatcmpl-e8c82e9e-073e-4289-a786-a20eb444ac9c"}
data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":207,"completion_tokens":103,"total_tokens":310,"completion_tokens_details":{"audio_tokens":83,"text_tokens":20},"prompt_tokens_details":{"text_tokens":207}},"created":1757940330,"system_fingerprint":null,"model":"qwen3.5-omni-plus","id":"chatcmpl-9cdd5a26-f9e9-4eff-9dcc-93a878165afc"}
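If you call the HTTP endpoint directly, each `data:` line carries one JSON chunk like those shown above. The following is a minimal sketch of how such lines could be folded into text, audio fragments, and usage. The function name `parse_sse_lines` is hypothetical, the sample lines are shortened versions of the output above, and the `[DONE]` sentinel is an assumption based on common OpenAI-compatible streams rather than something this page shows.

```python
import json


def parse_sse_lines(lines):
    """Collect text, Base64 audio fragments, and usage from `data:` lines."""
    text_parts, audio_parts, usage = [], [], None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and anything that is not an event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        event = json.loads(payload)
        if event.get("usage"):
            usage = event["usage"]  # only the final chunk carries usage
        for choice in event.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text_parts.append(delta["content"])
            audio = delta.get("audio") or {}
            if audio.get("data"):
                audio_parts.append(audio["data"])
    return "".join(text_parts), audio_parts, usage


# Shortened stand-ins for the response lines shown above
sample = [
    'data: {"choices":[{"delta":{"content":"I"},"index":0}],"usage":null}',
    'data: {"choices":[{"delta":{"content":" am"},"index":0}],"usage":null}',
    'data: {"choices":[{"delta":{"audio":{"data":"AAAA"}},"index":0}],"usage":null}',
    'data: {"choices":[],"usage":{"prompt_tokens":207,"completion_tokens":103,"total_tokens":310}}',
]
text, audio_parts, usage = parse_sse_lines(sample)
print(text, len(audio_parts), usage["total_tokens"])
```

The audio fragments are returned undecoded so that the caller can choose between decoding at the end or playing them as they arrive.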

Supported languages

Qwen3.5-Omni supported languages

Input languages (74 languages): Chinese, English, German, French, Italian, Czech, Indonesian, Thai, Korean, Polish, Japanese, Vietnamese, Finnish, Portuguese, Spanish, Dutch, Russian, Malay, Catalan, Swedish, Turkish, Ukrainian, Romanian, Slovak, Danish, Icelandic, Norwegian (Bokmal), Macedonian, Greek, Hungarian, Galician, Filipino, Croatian, Bosnian, Slovenian, Bulgarian, Kazakh, Belarusian, Latvian, Estonian, Azerbaijani, Uyghur, Swahili, Hindi, Esperanto, Kyrgyz, Tajik, Cebuano, Afrikaans, Arabic, Lithuanian, Javanese, Bengali, Persian, Hebrew, Punjabi, Gujarati, Mongolian, Asturian, Kannada, Marathi, Interlingua, Malayalam, Maltese, Norwegian Nynorsk, Telugu, Urdu, Georgian, Basque, Tamil, Odia, Serbian, Maori

Input dialects (39 dialects): Northeastern Mandarin, Guizhou dialect, Cantonese, Henan dialect, Hong Kong Cantonese, Shanghainese, Shaanxi dialect, Tianjin dialect, Taiwanese Hokkien, Yunnan dialect, Anhui dialect, Fujian dialect, Gansu dialect, Guangdong dialect, Hubei dialect, Hunan dialect, Jiangxi dialect, Shandong dialect, Shanxi dialect, Sichuan dialect, Guangxi dialect, Hainan dialect, Chongqing dialect, Changsha dialect, Hangzhou dialect, Hefei dialect, Yinchuan dialect, Zhengzhou dialect, Shenyang dialect, Wenzhou dialect, Wuhan dialect, Kunming dialect, Taiyuan dialect, Nanchang dialect, Jinan dialect, Lanzhou dialect, Nanjing dialect, Hakka, Southern Min

Output audio languages (29 languages): Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

Output audio dialects (7 dialects): Sichuan dialect, Beijing dialect, Tianjin dialect, Nanjing dialect, Shaanxi dialect, Cantonese, Southern Min

Qwen3-Omni-Flash supported languages

Input/Output languages (11 languages): Chinese, English, German, French, Italian, Thai, Korean, Japanese, Russian, Spanish, Portuguese

Input/Output dialects (8 dialects): Sichuan dialect, Shanghainese, Cantonese, Southern Min, Shaanxi dialect, Nanjing dialect, Tianjin dialect, Beijing dialect

For model names, context, pricing, and snapshot versions, see Model list. For concurrent request limits, see Rate limits.

Model performance

Audio and video content analysis

Input	Output
Generate a comprehensive, timestamped description of this video.	00:00.000 - 00:02.500 A rain-soaked city street fills the widescreen frame. Long-exposure photography creates streaks of red and blue car lights on wet pavement. A solitary man in a dark knee-length trench coat and light shirt with a tie walks toward the camera along the right sidewalk. 00:12.300 - 00:14.533 The camera tilts down. Inside a flashlight beam, a yellowed, aged single sheet lies in a shallow puddle, edges charred and corners curled. The man kneels, extending gloved fingers toward the document; ripples spread outward as his hand nears. (Only partial timestamps are shown.)


Multidialect understanding

Input	Output
Describe the vocal characteristics, emotional state, and full lyrics for singers in different Chinese dialects.	This audio features rap performances by singers from various regions of China. Hangzhou dialect (Wu) -- The male singer's voice is soft and gentle, with the distinctive smoothness and warmth of Wu dialect. His intonation rises and falls gently, and his rhythm is light and quick, creating a relaxed storytelling effect. Emotional state: Relaxed, comfortable, and full of everyday life. He describes Hangzhou's slow-paced lifestyle and urban changes, expressing affection for his hometown. (Only partial results are shown.)


Lyric caption generation

Input	Output
Transcribe the song lyrics and provide timestamps for each line.	[00:00:12,680 --> 00:00:16,960] Cat thread sways past moonlight on trees. [00:00:18,400 --> 00:00:22,800] Radiators hum 1998 chart hits. [00:00:24,160 --> 00:00:28,080] Time parts the mist-like heat waves. [00:00:28,920 --> 00:00:33,000] Neon from the screen shines on my nose bridge. ... [00:04:09,000 --> 00:04:10,020] (End) (Only partial results are shown.)

Audio-video programming


Usage

Streaming output

All requests to Qwen-Omni must set stream=True.

Model configuration

Configure parameters, prompts, and audio-video lengths based on your use case to balance cost, speed, and quality.

Audio-video understanding

Use case	Recommended video length	Recommended prompt	Recommended max_pixels value
Fast review, low cost	≤60 minutes	Simple prompt within 50 words	230,400
Content extraction (long video segmentation)	≤60 minutes	Simple prompt within 50 words	921,600 to 2,073,600
Standard analysis (short video tagging)	≤4 minutes	Use the structured prompt below	921,600 to 2,073,600
Fine-grained analysis (multiple speakers/complex scenes)	≤2 minutes	Use the structured prompt below	2,073,600
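As a reading aid for the table above, the three max_pixels values equal the pixel counts of common 16:9 frame sizes. The short sketch below only verifies that arithmetic; the labels 360p, 720p, and 1080p are added here for illustration, and nothing in it describes how the service resizes frames.

```python
# Common 16:9 frame sizes as (width, height); labels are illustrative only.
FRAME_SIZES = {
    "360p": (640, 360),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
}


def pixel_count(width, height):
    """Total number of pixels in one frame."""
    return width * height


for name, (width, height) in FRAME_SIZES.items():
    print(f"{name}: {width} x {height} = {pixel_count(width, height):,} pixels")
```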

Recommended structured prompt for audio-video understanding

Provide a detailed description of the video.
It should explicitly include three sections: 
1. A structured chronological storyline of **every noticeable audio and visual details**
2. A structured list of all visible text. For each text element, include start timestamp, end timestamp, the exact text content, the appearance characteristics. If no text appears, explicitly state so.
3. A structured speech-to-text transcription, include speaker（Corresponding to the character or voice‑over in Section 1, including their accent and tone）, exact spoken content, start timestamp, end timestamp, and speaking state (prosody, emotion, and style). If no speech appears, explicitly state so.
Aside from these three required sections, you are free to organize any additional content in any way you find helpful. This additional content can include global information about the entire video or localized information about specific moments. You may choose the topic of this extra content freely.
Output Format:
```
## Storyline
<xx:xx.xxx> - <xx:xx.xxx>
<an unstructured long paragraph in natural language describing what happened during this period, blending both audio and video details.>
<xx:xx.xxx> - <xx:xx.xxx>
<an unstructured long paragraph in natural language describing what happened during this period, blending both audio and video details.>
<xx:xx.xxx> - <xx:xx.xxx>
<an unstructured long paragraph in natural language describing what happened during this period, blending both audio and video details.>
...
## Visible Text
<xx:xx.xxx> - <xx:xx.xxx>
"<element>": <appearance>
"<element>": <appearance>
<xx:xx.xxx> - <xx:xx.xxx>
"<element>": <appearance>
"<element>": <appearance>
"<element>": <appearance>
<xx:xx.xxx> - <xx:xx.xxx>
"<element>": <appearance>
...
## Speakers and Transcript
Speaker profiles:
<speaker> - <profile>
<speaker> - <profile>
<speaker> - <profile>
...
<xx:xx.xxx> - <xx:xx.xxx>
Speaker: <speaker>
State: <description>
Content: "<content>"
<xx:xx.xxx> - <xx:xx.xxx>
Speaker: <speaker>
State: <description>
Content: "<content>"
<xx:xx.xxx> - <xx:xx.xxx>
Speaker: <speaker>
State: <description>
Content: "<content>"
...
## <another section>
<paragraphs>
## <another section>
<paragraphs>
...
```

For fine-grained descriptions of long videos, segment them first.
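One simple way to plan that segmentation is to compute fixed-length windows first and cut the media afterwards. The sketch below assumes you already know the duration in seconds; `segment_bounds` is a hypothetical helper, and the actual cutting is left to whichever media tool you use.

```python
def segment_bounds(duration_s, max_segment_s):
    """Return (start, end) pairs in seconds that cover 0..duration_s."""
    if duration_s <= 0 or max_segment_s <= 0:
        raise ValueError("duration_s and max_segment_s must be positive")
    duration_s = float(duration_s)
    bounds = []
    start = 0.0
    while start < duration_s:
        # The last window is shortened so it ends exactly at the duration.
        end = min(start + max_segment_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


# A 5-minute video split into windows of at most 2 minutes
print(segment_bounds(300, 120))
```

Fixed windows can cut through a sentence or a scene, so you may prefer to move each boundary to a nearby pause or shot change.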

Audio understanding

Balance cost and quality by controlling audio length and prompt complexity.

Use case	Recommended audio length	Recommended prompt
Fast review, low cost	≤60 minutes	Simple prompt within 50 words
Content extraction (segment long audio)	≤60 minutes	Simple prompt within 50 words
Standard analysis (audio tagging)	≤2 minutes	Use the structured prompt below
Fine-grained analysis (multiple speakers/complex scenes)	≤1 minute	Use the structured prompt below

Recommended structured prompt for audio understanding

Provide a detailed description of the audio.

It should explicitly include two sections: 

1. A structured chronological storyline of **every noticeable audio details**
2. A structured speech-to-text transcription, include speaker（Corresponding to the character or voice‑over in Section 1, including their accent and tone）, exact spoken content, start timestamp, end timestamp, and speaking state (prosody, emotion, and style). If no speech appears, explicitly state so.

Aside from these two required components, you are free to organize any additional content in any way you find helpful. This additional content can include global information about the entire audio or localized information about specific moments. You may choose the topic of this extra content freely.

Output Format:

```
## Storyline

<xx:xx.xxx> - <xx:xx.xxx>
<an unstructured long paragraph in natural language describing what happened during this period, blending both audio details.>

<xx:xx.xxx> - <xx:xx.xxx>
<an unstructured long paragraph in natural language describing what happened during this period, blending both audio details.>

<xx:xx.xxx> - <xx:xx.xxx>
<an unstructured long paragraph in natural language describing what happened during this period, blending both audio details.>

...

...

## Speakers and Transcript

Speaker profiles:
<speaker> - <profile>
<speaker> - <profile>
<speaker> - <profile>
...

<xx:xx.xxx> - <xx:xx.xxx>
Speaker: <speaker>
State: <description>
Content: "<content>"

<xx:xx.xxx> - <xx:xx.xxx>
Speaker: <speaker>
State: <description>
Content: "<content>"

<xx:xx.xxx> - <xx:xx.xxx>
Speaker: <speaker>
State: <description>
Content: "<content>"

...

## <another section>

<paragraphs>

## <another section>

<paragraphs>

...
```

For fine-grained descriptions of long audio, segment it first.

Thinking mode

For enable/disable, streaming output, and thinking_budget, see Thinking.

Qwen3-Omni-Flash is a hybrid thinking model; enable_thinking defaults to false. Qwen-Omni-Turbo does not support thinking. In thinking mode, set modalities to ["text"], because audio output is not supported when thinking is enabled.
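The constraint between thinking and output modalities can be enforced before a request is sent. The following is a sketch under stated assumptions: `thinking_request_kwargs` is a hypothetical helper, the model ID `qwen3-omni-flash` is assumed from the model name above, and passing enable_thinking through extra_body mirrors how this page passes enable_search. See Thinking for the authoritative parameters.

```python
def thinking_request_kwargs(messages, modalities=("text",), enable_thinking=True,
                            model="qwen3-omni-flash"):
    """Build keyword arguments for client.chat.completions.create."""
    modalities = list(modalities)
    if enable_thinking and "audio" in modalities:
        # Audio output is not supported when thinking is enabled.
        raise ValueError("use modalities=['text'] when thinking is enabled")
    return {
        "model": model,  # assumed model ID
        "messages": messages,
        "modalities": modalities,
        "stream": True,  # required for all Qwen-Omni requests
        "stream_options": {"include_usage": True},
        # Assumption: non-standard parameters travel in extra_body.
        "extra_body": {"enable_thinking": enable_thinking},
    }


kwargs = thinking_request_kwargs([{"role": "user", "content": "Who are you?"}])
print(kwargs["modalities"], kwargs["extra_body"])
```

The returned dictionary could then be unpacked into the call, for example `client.chat.completions.create(**kwargs)`.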

Web search

The Qwen3.5-Omni series supports web search to retrieve real-time information and perform reasoning. Enable web search using the enable_search parameter and set search_strategy to agent.

import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

try:
  completion = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    messages=[{
      "role": "user",
      "content": "Please look up today's date and day of the week, and tell me what major holidays fall on this date."
    }],
    stream=True,
    stream_options={"include_usage": True},
    extra_body={
      "enable_search": True,
      "search_options": {
        "search_strategy": "agent"
      }
    }
  )

  print("Model response (includes real-time information):")
  for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
      print(chunk.choices[0].delta.content, end="")
  print()

except Exception as e:
  print(f"Request failed: {e}")

Web search is supported only in the Qwen3.5-Omni series. The search_strategy parameter only accepts agent.
See Pricing for billing information related to the agent strategy.

Multimodal input

Video and text input

You can input video as an image list or as a video file. If you input a video file, the model can also understand the audio in the video. The following sample code uses a video URL from the internet as an example. To input a local video, see Input Base64-encoded local files. Streaming output is required for all calls.

Video file format (can understand audio in the video)

Number of files:
- Qwen3.5-Omni series: Up to 512 files using public URLs; up to 250 files using Base64 encoding.
- Qwen3-Omni-Flash and Qwen-Omni-Turbo series: Only one file allowed.
File size:
- Qwen3.5-Omni: Up to 2 GB, up to 1 hour duration.
- Qwen3-Omni-Flash: Up to 256 MB, up to 150 seconds duration.
- Qwen-Omni-Turbo: Up to 150 MB, up to 40 seconds duration.
File formats: MP4, AVI, MKV, MOV, FLV, WMV, etc.
Visual and audio information in video files are billed separately.

import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "video_url",
          "video_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
          },
        },
        {"type": "text", "text": "What is the video about?"},
      ],
    },
  ],
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)
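Because the size and duration limits differ by model, a pre-flight check can catch an oversized file before it is uploaded. This is a minimal sketch: the family keys and the `check_video` helper are hypothetical names, and reading GB and MB as binary units (1024-based) is an assumption, since the limits above do not say which convention applies.

```python
# (max size in bytes, max duration in seconds), taken from the limits above.
# Assumption: GB and MB are interpreted as binary units.
VIDEO_LIMITS = {
    "qwen3.5-omni": (2 * 1024**3, 3600),
    "qwen3-omni-flash": (256 * 1024**2, 150),
    "qwen-omni-turbo": (150 * 1024**2, 40),
}


def check_video(family, size_bytes, duration_s):
    """Return a list of limit violations; an empty list means the file fits."""
    max_bytes, max_seconds = VIDEO_LIMITS[family]
    problems = []
    if size_bytes > max_bytes:
        problems.append(f"size {size_bytes} B exceeds {max_bytes} B")
    if duration_s > max_seconds:
        problems.append(f"duration {duration_s} s exceeds {max_seconds} s")
    return problems


# A 50 MB clip that runs for 200 seconds is too long for Qwen3-Omni-Flash
print(check_video("qwen3-omni-flash", 50 * 1024**2, 200))
```

For a local file, `os.path.getsize(path)` gives the size in bytes; the duration has to come from a media library or tool of your choice.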

Image list format

Number of images

Qwen3.5-Omni: Minimum 2 images, maximum 2048 images.
Qwen3-Omni-Flash: Minimum 2 images, maximum 128 images.
Qwen-Omni-Turbo: Minimum 4 images, maximum 80 images.

import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "video",
          "video": [
            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg",
          ],
        },
        {"type": "text", "text": "Describe the process shown in this video"},
      ],
    }
  ],
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)
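When a video yields more frames than a model accepts, one option is to keep an evenly spaced subset. The sketch below illustrates that idea using the minimum and maximum image counts listed above; `sample_frames` and the family keys are hypothetical, and even spacing is only one possible selection strategy.

```python
# (minimum, maximum) number of images per image list, from the limits above.
IMAGE_LIST_LIMITS = {
    "qwen3.5-omni": (2, 2048),
    "qwen3-omni-flash": (2, 128),
    "qwen-omni-turbo": (4, 80),
}


def sample_frames(frames, family):
    """Return at most the allowed number of frames, evenly spaced."""
    low, high = IMAGE_LIST_LIMITS[family]
    if len(frames) < low:
        raise ValueError(f"{family} needs at least {low} images, got {len(frames)}")
    if len(frames) <= high:
        return list(frames)
    # Spread `high` indices across the list, keeping the first and last frame.
    step = (len(frames) - 1) / (high - 1)
    return [frames[round(i * step)] for i in range(high)]


frame_urls = [f"frame_{i:03d}.jpg" for i in range(200)]  # placeholder names
picked = sample_frames(frame_urls, "qwen-omni-turbo")
print(len(picked), picked[0], picked[-1])
```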

Audio and text input

Number of files:
- Qwen3.5-Omni series: Up to 2048 files using public URLs; up to 250 files using Base64 encoding.
- Qwen3-Omni-Flash and Qwen-Omni-Turbo series: Only one file allowed.
File size:
- Qwen3.5-Omni: Up to 2 GB, up to 3 hours duration.
- Qwen3-Omni-Flash: Up to 100 MB, up to 20 minutes duration.
- Qwen-Omni-Turbo: Up to 10 MB, up to 3 minutes duration.
File formats: Supports major formats such as AMR, WAV, 3GP, 3GPP, AAC, and MP3.

To input a local audio file, see Input Base64-encoded local files. Streaming output is required for all calls.

import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
            "format": "wav",
          },
        },
        {"type": "text", "text": "What is this audio about"},
      ],
    },
  ],
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)

Image and text input

Qwen-Omni models support multiple image inputs. The requirements for input images are as follows:

Number of images:
- When passed as a public URL: up to 2048 images per request.
- When passed as Base64-encoded strings: up to 250 images per request.
In addition to these per-request caps, the total tokens from all images and all text must be less than the model's maximum input length.
Image size:
- Qwen3.5-Omni series: Each image file must be 20 MB or less.
- Qwen3-Omni-Flash and Qwen-Omni-Turbo series: Each image file must be 10 MB or less.
The width and height of the image must both be greater than 10 pixels. The aspect ratio must not exceed 200:1 or 1:200.
For supported image types, see Visual and video understanding.
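The dimension rules above can be checked locally before sending an image. The following sketch covers only the width, height, and aspect-ratio rules; `check_image_dimensions` is a hypothetical helper, and treating a ratio of exactly 200:1 as allowed is an assumption based on the wording "must not exceed".

```python
def check_image_dimensions(width, height):
    """Return True if the image meets the size and aspect-ratio rules."""
    # Both sides must be greater than 10 pixels.
    if width <= 10 or height <= 10:
        return False
    # The longer side may be at most 200 times the shorter side.
    return max(width, height) <= 200 * min(width, height)


for size in [(1920, 1080), (10, 500), (11, 2200), (11, 2201)]:
    print(size, check_image_dimensions(*size))
```

With Pillow installed, `Image.open(path).size` returns the `(width, height)` pair that this function expects.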

The following sample code uses an image URL from the internet as an example. To input a local image, see Input Base64-encoded local files. Streaming output is required for all calls.

import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          },
        },
        {"type": "text", "text": "What scene is depicted in the image?"},
      ],
    },
  ],
  # Set the output data modality. Two are currently supported: ["text","audio"] and ["text"].
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  # stream must be set to True to avoid errors.
  stream=True,
  stream_options={
    "include_usage": True
  }
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)

Multi-turn conversation

When you use the multi-turn conversation feature of Qwen-Omni models, note the following:

Assistant message: Assistant messages in the messages array support only text data.
User message: A user message can contain text and data from only one other modality. In a multi-turn conversation, you can use different modalities in separate user messages.

import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3",
            "format": "mp3",
          },
        },
        {"type": "text", "text": "What is this audio about"},
      ],
    },
    {
      "role": "assistant",
      "content": [{"type": "text", "text": "This audio says: Welcome to Qwen Cloud"}],
    },
    {
      "role": "user",
      "content": [{"type": "text", "text": "Can you tell me about this company?"}],
    },
  ],
  # Set the output data modality. Two are currently supported: ["text","audio"] and ["text"].
  modalities=["text"],
  # stream must be set to True to avoid errors.
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)
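The two message rules described in this section can be checked before a conversation is sent. This is a sketch rather than an official validator: `check_messages` is a hypothetical helper, and the mapping from content types to modalities is inferred from the content types used in the samples on this page.

```python
# Content "type" values seen on this page, mapped to the modality they carry.
MODALITY_BY_TYPE = {
    "text": "text",
    "image_url": "image",
    "input_audio": "audio",
    "video_url": "video",
    "video": "video",
}


def check_messages(messages):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    for index, message in enumerate(messages):
        content = message.get("content")
        parts = content if isinstance(content, list) else []  # plain strings are text
        modalities = {MODALITY_BY_TYPE.get(p.get("type"), p.get("type")) for p in parts}
        modalities.discard("text")
        if message.get("role") == "assistant" and modalities:
            problems.append(f"message {index}: assistant content must be text only")
        if message.get("role") == "user" and len(modalities) > 1:
            problems.append(f"message {index}: more than one non-text modality")
    return problems


conversation = [
    {"role": "user", "content": [
        {"type": "input_audio", "input_audio": {"data": "https://example.com/a.mp3", "format": "mp3"}},
        {"type": "text", "text": "What is this audio about"},
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": "It is a greeting."}]},
    {"role": "user", "content": "Tell me more."},
]
print(check_messages(conversation))
```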

Parse Base64-encoded audio data output

The audio output from Qwen-Omni models is Base64-encoded data delivered in a stream. You can use a string variable to accumulate the Base64 data from each fragment as it arrives. After the stream is complete, decode the final string to create the audio file. Alternatively, decode and play each fragment in real time as it is received.

# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[{"role": "user", "content": "Who are you"}],
  # Set the output data modality. Two are currently supported: ["text","audio"] and ["text"].
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  # stream must be set to True to avoid errors.
  stream=True,
  stream_options={"include_usage": True},
)

# Method 1: Decode after the generation is complete
audio_string = ""
for chunk in completion:
  if not chunk.choices:
    # The final chunk has no choices and carries only usage statistics
    print(chunk.usage)
    continue
  delta = chunk.choices[0].delta
  audio = getattr(delta, "audio", None)
  if audio and audio.get("data"):
    # Accumulate the Base64-encoded audio fragment
    audio_string += audio["data"]
  elif delta.content:
    print(delta.content, end="")

# Interpret the decoded bytes as 16-bit samples and save them at 24 kHz
wav_bytes = base64.b64decode(audio_string)
audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
sf.write("audio_assistant_py.wav", audio_np, samplerate=24000)

# Method 2: Decode while generating (comment out the code for Method 1 to use Method 2)
# # Initialize PyAudio
# import pyaudio
# import time
# p = pyaudio.PyAudio()
# # Create an audio stream
# stream = p.open(format=pyaudio.paInt16,
#                 channels=1,
#                 rate=24000,
#                 output=True)

# for chunk in completion:
#     if chunk.choices:
#         if hasattr(chunk.choices[0].delta, "audio"):
#             try:
#                 audio_string = chunk.choices[0].delta.audio["data"]
#                 wav_bytes = base64.b64decode(audio_string)
#                 audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
#                 # Play the audio data directly
#                 stream.write(audio_np.tobytes())
#             except Exception as e:
#                 print(chunk.choices[0].delta.content)

# time.sleep(0.8)
# # Clean up resources
# stream.stop_stream()
# stream.close()
# p.terminate()
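If you prefer to avoid the numpy and soundfile dependencies, the fragments can also be decoded one at a time and written with the standard library. The sketch below assumes, as the samples above suggest, that each fragment is independently decodable Base64 holding 16-bit mono PCM at 24 kHz; `pcm_fragments_to_wav` is a hypothetical helper, and the two fragments are fabricated stand-ins for `chunk.choices[0].delta.audio["data"]`.

```python
import base64
import io
import wave


def pcm_fragments_to_wav(fragments, sample_rate=24000):
    """Decode Base64 fragments one by one and wrap the PCM bytes in a WAV file."""
    pcm = bytearray()
    for fragment in fragments:
        # Decoding per fragment keeps working even if a fragment ends in padding.
        pcm.extend(base64.b64decode(fragment))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono, as in the PyAudio example
        wav_file.setsampwidth(2)  # 16-bit samples, matching np.int16
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))
    return buffer.getvalue()


# Fabricated fragments: three 16-bit samples split across two chunks
fragments = [
    base64.b64encode(b"\x00\x00\x01\x00").decode("ascii"),
    base64.b64encode(b"\x02\x00").decode("ascii"),
]
wav_bytes = pcm_fragments_to_wav(fragments)
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    print(check.getnframes(), check.getframerate())
```

Writing `wav_bytes` to a file opened in binary mode produces a playable WAV file.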

Input Base64-encoded local files

Images

This example uses the locally saved file eagle.png.

import os
from openai import OpenAI
import base64

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)


# Read a local file and return its contents as a Base64 string
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode("utf-8")


base64_image = encode_image("eagle.png")

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {"url": f"data:image/png;base64,{base64_image}"},
        },
        {"type": "text", "text": "What scene is depicted in the image?"},
      ],
    },
  ],
  # Set the output data modality. Two are currently supported: ["text","audio"] and ["text"].
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  # stream must be set to True to avoid errors.
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)

Audio

This example uses the locally saved file welcome.mp3.

import os
from openai import OpenAI
import base64

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)


def encode_audio(audio_path):
  with open(audio_path, "rb") as audio_file:
    return base64.b64encode(audio_file.read()).decode("utf-8")


base64_audio = encode_audio("welcome.mp3")

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": f"data:;base64,{base64_audio}",
            "format": "mp3",
          },
        },
        {"type": "text", "text": "What is this audio about"},
      ],
    },
  ],
  # Set the output data modality. Two are currently supported: ["text","audio"] and ["text"].
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  # stream must be set to True to avoid errors.
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)

Video file

This example uses the locally saved file spring_mountain.mp4.

import os
from openai import OpenAI
import base64

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Read a local file and return its contents as a Base64 string
def encode_video(video_path):
  with open(video_path, "rb") as video_file:
    return base64.b64encode(video_file.read()).decode("utf-8")


base64_video = encode_video("spring_mountain.mp4")

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "video_url",
          "video_url": {"url": f"data:;base64,{base64_video}"},
        },
        {"type": "text", "text": "What is she singing?"},
      ],
    },
  ],
  # Set the output data modality. Supported modalities are ["text", "audio"] and ["text"].
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  # stream must be set to True. Otherwise, an error occurs.
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)

Image list (as video)

This example uses the locally saved files football1.jpg, football2.jpg, football3.jpg, and football4.jpg.

import os
from openai import OpenAI
import base64

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)


# Read a local file and return its contents as a Base64 string
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode("utf-8")


base64_image_1 = encode_image("football1.jpg")
base64_image_2 = encode_image("football2.jpg")
base64_image_3 = encode_image("football3.jpg")
base64_image_4 = encode_image("football4.jpg")

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "video",
          "video": [
            f"data:image/jpeg;base64,{base64_image_1}",
            f"data:image/jpeg;base64,{base64_image_2}",
            f"data:image/jpeg;base64,{base64_image_3}",
            f"data:image/jpeg;base64,{base64_image_4}",
          ],
        },
        {"type": "text", "text": "Describe the procedure in this video."},
      ],
    }
  ],
  # Set the output data modality. Supported modalities are ["text", "audio"] and ["text"].
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  # stream must be set to True. Otherwise, an error occurs.
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)
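
The loops above print each raw delta. The following is a minimal sketch of how the streamed pieces could be separated instead, assuming (as in the getting-started example) that text arrives in delta.content and Base64 audio arrives in delta.audio["data"]. The names collect_stream and fake_chunk are hypothetical helpers, and the fake chunks stand in for a real response so the sketch runs without an API call.

```python
import base64
from types import SimpleNamespace


def collect_stream(chunks):
  """Split a streamed completion into text, decoded audio bytes, and usage."""
  text_parts, audio_b64_parts, usage = [], [], None
  for chunk in chunks:
    if not chunk.choices:
      # With include_usage enabled, the last chunk has no choices and carries usage.
      usage = getattr(chunk, "usage", None)
      continue
    delta = chunk.choices[0].delta
    if getattr(delta, "content", None):
      text_parts.append(delta.content)
    audio = getattr(delta, "audio", None)  # assumed to be a dict, as in the first example
    if audio and audio.get("data"):
      audio_b64_parts.append(audio["data"])
  # Join the Base64 fragments first, then decode once.
  audio_bytes = base64.b64decode("".join(audio_b64_parts))
  return "".join(text_parts), audio_bytes, usage


def fake_chunk(content=None, audio=None):
  """Hypothetical stand-in for one streamed chunk (illustration only)."""
  delta = SimpleNamespace(content=content, audio=audio)
  return SimpleNamespace(choices=[SimpleNamespace(delta=delta)], usage=None)


fake_stream = [
  fake_chunk(content="Hel"),
  fake_chunk(content="lo", audio={"data": base64.b64encode(b"abc").decode()}),
  fake_chunk(audio={"data": base64.b64encode(b"def").decode()}),
  SimpleNamespace(choices=[], usage={"total_tokens": 42}),
]
text, audio_bytes, usage = collect_stream(fake_stream)
print(text, len(audio_bytes), usage)
```

With a real request, pass the completion object returned by client.chat.completions.create in place of fake_stream.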

API reference

For the input and output parameters of Qwen-Omni, see Chat completions API.

Billing and rate limits

Billing rules

Qwen-Omni is billed based on the number of tokens for different modalities, such as audio, image, and video. See Pricing for pricing details.

Rules for converting audio, images, and videos to tokens

Audio

Qwen3.5-Omni series: Input audio total tokens = Audio duration (in seconds) x 7; Output audio total tokens = Audio duration (in seconds) x 12.5
Qwen3-Omni-Flash: Total tokens for both input and output audio = Audio duration (in seconds) x 12.5
Qwen-Omni-Turbo: Total tokens for both input and output audio = Audio duration (in seconds) x 25

If the audio duration is less than 1 second, it is calculated as 1 second.

Images

Qwen3.5-Omni series and Qwen3-Omni-Flash: 1 token per 32 x 32 pixels.
Qwen-Omni-Turbo: 1 token per 28 x 28 pixels.

Qwen3.5-Omni series requires a minimum of 24 tokens per image, while other models require a minimum of 4 tokens. The default maximum is 1280 tokens. Qwen3.5-Omni series supports the vl_high_resolution_images parameter to increase the resolution limit to 16384 tokens (not supported by Qwen-Omni-Turbo or Qwen3-Omni-Flash). Use the following code to estimate the total tokens for an image:

import math
from PIL import Image  # pip install Pillow

# ============ Model parameter configuration (modify as needed) ============

# Image factor: 32 for Qwen3.5-Omni series and Qwen3-Omni-Flash; 28 for Qwen-Omni-Turbo
IMAGE_FACTOR = 32

# Token minimum: 24 for Qwen3.5-Omni series; 4 for Qwen-Omni-Turbo and Qwen3-Omni-Flash
MIN_TOKENS = 24

# High-resolution mode (only supported by Qwen3.5-Omni series, not by Qwen-Omni-Turbo or Qwen3-Omni-Flash)
# True  → Token maximum 16384
# False → Token maximum 1280 (default)
VL_HIGH_RESOLUTION_IMAGES = False

# ============ Pixel range (auto-calculated from above) ============

MIN_PIXELS = MIN_TOKENS * IMAGE_FACTOR * IMAGE_FACTOR
MAX_PIXELS = (16384 if VL_HIGH_RESOLUTION_IMAGES else 1280) * IMAGE_FACTOR * IMAGE_FACTOR


def smart_resize(height, width, factor=IMAGE_FACTOR,
                 min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
  """Align image dimensions to factor multiples and scale to [min_pixels, max_pixels] range."""
  h_bar = max(factor, round(height / factor) * factor)
  w_bar = max(factor, round(width / factor) * factor)

  if h_bar * w_bar > max_pixels:
    beta = math.sqrt((height * width) / max_pixels)
    h_bar = math.floor(height / beta / factor) * factor
    w_bar = math.floor(width / beta / factor) * factor
  elif h_bar * w_bar < min_pixels:
    beta = math.sqrt(min_pixels / (height * width))
    h_bar = math.ceil(height * beta / factor) * factor
    w_bar = math.ceil(width * beta / factor) * factor

  return h_bar, w_bar


def token_calculate(image_path=''):
  if len(image_path) > 0:
    image = Image.open(image_path)
    height = image.height
    width = image.width
    print(f"Original size: {width}x{height}")
    resized_h, resized_w = smart_resize(height, width)
    token = int(resized_h * resized_w / (IMAGE_FACTOR * IMAGE_FACTOR)) + 2
    print(f"Resized: {resized_w}x{resized_h}, Tokens: {token}")
    return token
  else:
    raise ValueError("Image path cannot be empty. Provide a valid image file path")

if __name__ == "__main__":
  token = token_calculate(image_path="xxx/test.jpg")
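
As a worked example of the arithmetic, the sketch below applies the same resize-and-count steps to known dimensions, so it needs neither Pillow nor an image file. The function names resize_to_budget and image_tokens are hypothetical, and the defaults assume the Qwen3.5-Omni values stated above (factor 32, minimum 24 tokens, maximum 1280 tokens).

```python
import math


def resize_to_budget(height, width, factor, min_tokens, max_tokens):
  """Align dimensions to multiples of factor and fit them into the token budget."""
  min_pixels = min_tokens * factor * factor
  max_pixels = max_tokens * factor * factor
  h_bar = max(factor, round(height / factor) * factor)
  w_bar = max(factor, round(width / factor) * factor)
  if h_bar * w_bar > max_pixels:
    # Too large: scale down, rounding each side down to a multiple of factor.
    beta = math.sqrt((height * width) / max_pixels)
    h_bar = math.floor(height / beta / factor) * factor
    w_bar = math.floor(width / beta / factor) * factor
  elif h_bar * w_bar < min_pixels:
    # Too small: scale up, rounding each side up to a multiple of factor.
    beta = math.sqrt(min_pixels / (height * width))
    h_bar = math.ceil(height * beta / factor) * factor
    w_bar = math.ceil(width * beta / factor) * factor
  return h_bar, w_bar


def image_tokens(height, width, factor=32, min_tokens=24, max_tokens=1280):
  """One token per factor x factor patch, plus 2 tokens as in the script above."""
  h_bar, w_bar = resize_to_budget(height, width, factor, min_tokens, max_tokens)
  return (h_bar // factor) * (w_bar // factor) + 2


tokens_default = image_tokens(1080, 1920)                    # default 1280-token cap
tokens_high_res = image_tokens(1080, 1920, max_tokens=16384) # high-resolution cap
tokens_tiny = image_tokens(64, 64)                           # scaled up to the minimum
print(tokens_default, tokens_high_res, tokens_tiny)
```

For a 1920x1080 image, the default cap resizes the image to 1504x832, giving 47 x 26 + 2 = 1224 tokens. With the 16384-token cap the image is only aligned to 1920x1088, giving 60 x 34 + 2 = 2042 tokens.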

Video

Video files generate two types of tokens: video_tokens (visual) and audio_tokens (audio).

video_tokens

The number of video tokens depends on how many frames are sampled and on the resized dimensions of each frame. The following script performs the calculation:

# Before use, install: pip install opencv-python
import math
import os
import logging
import cv2

# Fixed parameters
FRAME_FACTOR = 2

# For Qwen3.5-Omni and Qwen3-Omni-Flash, IMAGE_FACTOR is 32
IMAGE_FACTOR = 32

# For Qwen-Omni-Turbo, IMAGE_FACTOR is 28
# IMAGE_FACTOR = 28

# Maximum allowed aspect ratio (long side / short side) of video frames
MAX_RATIO = 200

# Lower limit for video frame pixels. For Qwen3.5-Omni and Qwen3-Omni-Flash: 128 * 32 * 32
VIDEO_MIN_PIXELS = 128 * 32 * 32
# For Qwen-Omni-Turbo
# VIDEO_MIN_PIXELS = 128 * 28 * 28

# Upper limit for video frame pixels. For Qwen3.5-Omni and Qwen3-Omni-Flash: 768 * 32 * 32
VIDEO_MAX_PIXELS = 768 * 32 * 32
# For Qwen-Omni-Turbo:
# VIDEO_MAX_PIXELS = 768 * 28 * 28

FPS = 2
# Minimum number of extracted frames
FPS_MIN_FRAMES = 4

# Maximum number of extracted frames
# Maximum number of extracted frames for Qwen3.5-Omni and Qwen3-Omni-Flash: 128
# Maximum number of extracted frames for Qwen-Omni-Turbo: 80
FPS_MAX_FRAMES = 128

# Total pixel budget for video input. For Qwen3.5-Omni and Qwen3-Omni-Flash: 16384 * 32 * 32
VIDEO_TOTAL_PIXELS = 16384 * 32 * 32
# For Qwen-Omni-Turbo:
# VIDEO_TOTAL_PIXELS = 16384 * 28 * 28

def round_by_factor(number, factor):
  return round(number / factor) * factor

def ceil_by_factor(number, factor):
  return math.ceil(number / factor) * factor

def floor_by_factor(number, factor):
  return math.floor(number / factor) * factor

def get_video(video_path):
  cap = cv2.VideoCapture(video_path)
  frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
  frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
  total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
  video_fps = cap.get(cv2.CAP_PROP_FPS)
  cap.release()
  return frame_height, frame_width, total_frames, video_fps

def smart_nframes(total_frames, video_fps):
  min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
  max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
  duration = total_frames / video_fps if video_fps != 0 else 0
  if duration - int(duration) > (1 / FPS):
    total_frames = math.ceil(duration * video_fps)
  else:
    total_frames = math.ceil(int(duration) * video_fps)
  nframes = total_frames / video_fps * FPS
  nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
  if not (FRAME_FACTOR <= nframes <= total_frames):
    raise ValueError(f"nframes should be in the interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
  return nframes

def smart_resize(height, width, nframes, factor=IMAGE_FACTOR):
  min_pixels = VIDEO_MIN_PIXELS
  total_pixels = VIDEO_TOTAL_PIXELS
  max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))
  if max(height, width) / min(height, width) > MAX_RATIO:
    raise ValueError(f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}")
  h_bar = max(factor, round_by_factor(height, factor))
  w_bar = max(factor, round_by_factor(width, factor))
  if h_bar * w_bar > max_pixels:
    beta = math.sqrt((height * width) / max_pixels)
    h_bar = floor_by_factor(height / beta, factor)
    w_bar = floor_by_factor(width / beta, factor)
  elif h_bar * w_bar < min_pixels:
    beta = math.sqrt(min_pixels / (height * width))
    h_bar = ceil_by_factor(height * beta, factor)
    w_bar = ceil_by_factor(width * beta, factor)
  return h_bar, w_bar

def video_token_calculate(video_path):
  height, width, total_frames, video_fps = get_video(video_path)
  nframes = smart_nframes(total_frames, video_fps)
  resized_height, resized_width = smart_resize(height, width, nframes)
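  # Divide by IMAGE_FACTOR (not a hard-coded 32) so the result also holds for Qwen-Omni-Turbo (factor 28)
  video_token = int(math.ceil(nframes / FPS) * resized_height / IMAGE_FACTOR * resized_width / IMAGE_FACTOR)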
  video_token += 2  # Visual marks
  return video_token

if __name__ == "__main__":
  video_path = "spring_mountain.mp4"  # Your video path
  video_token = video_token_calculate(video_path)
  print("video_tokens:", video_token)
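
To make the steps easier to follow, here is a compact sketch that runs the same calculation on known video metadata instead of reading a file with OpenCV. It assumes the Qwen3.5-Omni and Qwen3-Omni-Flash constants from the script above. The names sampled_frames, frame_size, and video_tokens are hypothetical, and the aspect-ratio and range checks are omitted for brevity.

```python
import math

FRAME_FACTOR = 2  # frame counts are aligned to multiples of 2
FPS = 2           # target sampling rate, in frames per second


def sampled_frames(total_frames, video_fps, min_frames=4, max_frames=128):
  """Number of frames sampled from the video."""
  lo = math.ceil(min_frames / FRAME_FACTOR) * FRAME_FACTOR
  hi = math.floor(min(max_frames, total_frames) / FRAME_FACTOR) * FRAME_FACTOR
  duration = total_frames / video_fps
  # A trailing partial second counts only if it is longer than one sampling interval.
  if duration - int(duration) > 1 / FPS:
    usable = math.ceil(duration * video_fps)
  else:
    usable = math.ceil(int(duration) * video_fps)
  wanted = usable / video_fps * FPS
  return int(min(max(wanted, lo), hi, usable))


def frame_size(height, width, nframes, factor=32):
  """Resized (height, width) of each sampled frame, aligned to factor."""
  min_pixels = 128 * factor * factor
  per_frame_budget = 16384 * factor * factor / nframes * FRAME_FACTOR
  max_pixels = max(min(768 * factor * factor, per_frame_budget), int(min_pixels * 1.05))
  h_bar = max(factor, round(height / factor) * factor)
  w_bar = max(factor, round(width / factor) * factor)
  if h_bar * w_bar > max_pixels:
    beta = math.sqrt((height * width) / max_pixels)
    h_bar = math.floor(height / beta / factor) * factor
    w_bar = math.floor(width / beta / factor) * factor
  elif h_bar * w_bar < min_pixels:
    beta = math.sqrt(min_pixels / (height * width))
    h_bar = math.ceil(height * beta / factor) * factor
    w_bar = math.ceil(width * beta / factor) * factor
  return h_bar, w_bar


def video_tokens(height, width, total_frames, video_fps, factor=32):
  """Visual tokens for the video, including the 2 marker tokens."""
  nframes = sampled_frames(total_frames, video_fps)
  h_bar, w_bar = frame_size(height, width, nframes, factor)
  return int(math.ceil(nframes / FPS) * h_bar / factor * w_bar / factor) + 2


example_tokens = video_tokens(720, 1280, total_frames=300, video_fps=30)
print("video_tokens:", example_tokens)
```

For a 10-second 1280x720 clip at 30 fps, this samples 20 frames and resizes each to 1152x640, so the result is 10 x 20 x 36 + 2 = 7202 tokens.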

audio_tokens
- Qwen3.5-Omni series: Input audio total tokens = Audio duration (in seconds) x 7; Output audio total tokens = Audio duration (in seconds) x 12.5
- Qwen3-Omni-Flash: Total tokens for both input and output audio = Audio duration (in seconds) x 12.5
- Qwen-Omni-Turbo: Total tokens for both input and output audio = Audio duration (in seconds) x 25
- If the audio duration is less than 1 second, it is calculated as 1 second.
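
The rules above can be sketched as a small estimator. The dictionary keys are hypothetical labels for the three model families, and the function returns the unrounded product because the rounding behavior is not specified here.

```python
# Tokens per second of audio, taken from the rules above.
# The family keys are illustrative labels, not API model names.
AUDIO_TOKENS_PER_SECOND = {
  "qwen3.5-omni": {"input": 7, "output": 12.5},
  "qwen3-omni-flash": {"input": 12.5, "output": 12.5},
  "qwen-omni-turbo": {"input": 25, "output": 25},
}


def audio_tokens(duration_seconds, family, direction="input"):
  """Estimate audio tokens; clips shorter than 1 second count as 1 second."""
  rate = AUDIO_TOKENS_PER_SECOND[family][direction]
  billed_seconds = max(duration_seconds, 1)
  return billed_seconds * rate


print(audio_tokens(10, "qwen3.5-omni"))            # 10 s of input audio
print(audio_tokens(4, "qwen3.5-omni", "output"))   # 4 s of output audio
print(audio_tokens(0.4, "qwen-omni-turbo"))        # billed as 1 s
```

For a video input, the estimated total is the video_tokens value plus the audio tokens for the duration of its audio track.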

Free quota

For more information about how to claim, query, and use your free quota, see Free quota for new users.

Rate limits

For model rate limit rules and FAQ, see Rate limits.

Error codes

If a call fails, see Error messages.

Voice list

To use a voice, set the voice request parameter to the corresponding value in the voice parameter column of the tables below.
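
Note that for dialect voices the displayed voice name differs from the value to send. A minimal sketch of the lookup follows, using four entries copied from the qwen3.5-omni table. The names VOICE_PARAMETER and audio_options are hypothetical, and the returned dictionary has the same shape as the audio argument used in the request examples.

```python
# Voice name -> value of the voice parameter (a small excerpt of the table).
VOICE_PARAMETER = {
  "Tina": "Tina",
  "Sichuan - Sunny": "Sunny",
  "Beijing - Dylan": "Dylan",
  "Cantonese - Rocky": "Rocky",
}


def audio_options(voice_name, audio_format="wav"):
  """Build the audio request argument for a voice chosen by its display name."""
  try:
    voice = VOICE_PARAMETER[voice_name]
  except KeyError:
    raise ValueError(f"Unknown voice name: {voice_name!r}") from None
  return {"voice": voice, "format": audio_format}


print(audio_options("Sichuan - Sunny"))
```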

qwen3.5-omni

Voice name	`voice` parameter	Description	Supported languages
Tina	Tina	A voice like warm milk tea -- sweet and cozy, yet sharp when solving problems	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Cindy	Cindy	A sweet-talking young woman from Taiwan	Chinese (Taiwanese accent), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Liora Mira	Liora Mira	A gentle voice that weaves warmth into everyday life	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Sunnybobi	Sunnybobi	A cheerful, socially awkward neighbor girl	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Raymond	Raymond	A clear-voiced, takeout-loving homebody	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Ethan	Ethan	Standard Mandarin with a slight northern accent. Bright, warm, energetic, and youthful	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Theo Calm	Theo Calm	Conveys understanding in silence and healing through words	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Serena	Serena	A gentle young woman	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Harvey	Harvey	A voice that carries the weight of time -- deep, mellow, and scented with coffee and old books	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Maia	Maia	A blend of intellect and gentleness	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Evan	Evan	A college student -- youthful and endearing	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Qiao	Qiao	Not just cute -- she's sweet on the surface and full of personality underneath	Chinese (Taiwanese accent), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Momo	Momo	Playful and mischievous -- here to cheer you up	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Wil	Wil	A young man from Shenzhen who speaks with a Hong Kong-Taiwan accent	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Angel	Angel	Slightly Taiwanese-accented -- and very sweet	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Li Cassian	Li Cassian	Speaks with restraint -- three parts silence, seven parts reading the room	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Mia	Mia	A lifestyle artist who shares slow-living aesthetics through a soothing voice	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Joyner	Joyner	Funny, exaggerated, and down-to-earth	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Gold	Gold	A West Coast Black rapper	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Katerina	Katerina	A mature, commanding voice with rich rhythm and resonance	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Ryan	Ryan	High-energy delivery with strong dramatic presence -- realism meets intensity	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Jennifer	Jennifer	A premium, cinematic-quality American female voice	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Aiden	Aiden	An American young man skilled in cooking	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Mione	Mione	A mature, intelligent British neighbor girl	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Sichuan - Sunny	Sunny	A sweet Sichuan girl who warms your heart	Chinese (Sichuan dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Beijing - Dylan	Dylan	A youth raised in Beijing's hutongs	Chinese (Beijing dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Sichuan - Eric	Eric	A lively Chengdu man from Sichuan	Chinese (Sichuan dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Tianjin - Peter	Peter	A Tianjin-style xiangsheng performer -- professional foil	Chinese (Tianjin dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Joseph Chen	Joseph Chen	A longtime overseas Chinese from Southeast Asia with a warm, nostalgic voice	Chinese (Hokkien), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Shaanxi - Marcus	Marcus	Broad face, few words, sincere heart, deep voice -- the true flavor of Shaanxi	Chinese (Shaanxi dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Nanjing - Li	Li	A grumpy uncle	Chinese (Nanjing dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Cantonese - Rocky	Rocky	A witty and humorous online chat companion	Chinese (Cantonese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Sohee	Sohee	A warm, cheerful, emotionally expressive Korean unnie	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Lenn	Lenn	Rational at core, rebellious in detail -- a German youth who wears suits and listens to post-punk	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Ono Anna	Ono Anna	A clever, playful childhood friend	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Sonrisa	Sonrisa	A warm, outgoing Latin American woman	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Bodega	Bodega	A warm, enthusiastic Spanish man	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Emilien	Emilien	A romantic French big brother	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Andre	Andre	A magnetic, natural, and steady male voice	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Radio Gol	Radio Gol	A passionate football commentator who narrates games with poetic flair	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Alek	Alek	Cold like the Russian spirit -- yet warm as wool beneath a coat	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Rizky	Rizky	A young Indonesian man with a distinctive voice	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Roya	Roya	A sporty girl with a free-spirited heart	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Arda	Arda	Neither high nor low -- clean, crisp, and gently warm	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Hana	Hana	A mature Vietnamese woman who loves dogs	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Dolce	Dolce	A laid-back Italian man	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Jakub	Jakub	A charismatic, artistic young man from a Polish town	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Griet	Griet	A mature, artistic Dutch woman	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Eliska	Eliska	Every word carries Central European craftsmanship and warmth	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Marina	Marina	A girl raised in a multicultural city	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Siiri	Siiri	Reserved and gentle -- with a calm, lake-like speaking pace	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Ingrid	Ingrid	A woman from rural Norway	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Sigga	Sigga	An intellectual young woman from an Icelandic town	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Bea	Bea	A sweet Filipino woman who loves coffee	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Chloe	Chloe	A Malaysian office worker	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

qwen3-omni-flash-2025-12-01

Voice name	`voice` parameter	Description	Supported languages
Cherry	Cherry	A sunny, positive, friendly, and natural young woman	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Serena	Serena	A gentle young woman	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Ethan	Ethan	Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Chelsie	Chelsie	A two-dimensional virtual girlfriend	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Momo	Momo	Playful and mischievous, cheering you up	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Vivian	Vivian	Confident, cute, and slightly feisty	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Moon	Moon	Effortlessly cool Moon White	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Maia	Maia	A blend of intellect and gentleness	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Kai	Kai	A soothing audio spa for your ears	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish	Nofish	A designer who cannot pronounce retroflex sounds	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Bella	Bella	A little girl who drinks but never throws punches when drunk	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Jennifer	Jennifer	A premium, cinematic-quality American English female voice	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Ryan	Ryan	Full of rhythm, bursting with dramatic flair, balancing authenticity and tension	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Katerina	Katerina	A mature-woman voice with rich, memorable rhythm	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Aiden	Aiden	An American English young man skilled in cooking	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Mia	Mia	Gentle as spring water, obedient as fresh snow	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Mochi	Mochi	A clever, quick-witted young adult -- childlike innocence remains, yet wisdom shines through	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Bellona	Bellona	A powerful, clear voice that brings characters to life -- so stirring it makes your blood boil	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Vincent	Vincent	A uniquely raspy, smoky voice -- just one line evokes armies and heroic tales	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Bunny	Bunny	A little girl overflowing with cuteness	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Neil	Neil	A flat baseline intonation with precise, clear pronunciation -- the most professional news anchor	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Elias	Elias	Maintains academic rigor while using storytelling techniques to turn complex knowledge into digestible learning modules	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Arthur	Arthur	A simple, earthy voice steeped in time and tobacco smoke -- slowly unfolding village stories and curiosities	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nini	Nini	A soft, clingy voice like sweet rice cakes	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Ebona	Ebona	Her whisper is like a rusty key slowly turning in the darkest corner of your mind	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Seren	Seren	A gentle, soothing voice to help you fall asleep faster. Good night, sweet dreams	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Pip	Pip	A playful, mischievous boy full of childlike wonder	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Stella	Stella	Normally a cloyingly sweet, dazed teenage-girl voice -- but when shouting battle cries, she instantly radiates unwavering love and justice	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Bodega	Bodega	A passionate Spanish man	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Sonrisa	Sonrisa	A cheerful, outgoing Latin American woman	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Alek	Alek	Cold like the Russian spirit, yet warm like wool coat lining	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Dolce	Dolce	A laid-back Italian man	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Sohee	Sohee	A warm, cheerful, emotionally expressive Korean unnie	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Lenn	Lenn	Rational at heart, rebellious in detail -- a German youth who wears suits and listens to post-punk	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Emilien	Emilien	A romantic French big brother	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Andre	Andre	A magnetic, natural, and steady male voice	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai - Jada	Jada	A fast-paced, energetic Shanghai auntie	Shanghainese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Beijing - Dylan	Dylan	A young man raised in Beijing's hutongs	Beijing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Sichuan - Sunny	Sunny	A Sichuan girl sweet enough to melt your heart	Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nanjing - Li	Li	A patient yoga teacher	Nanjing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shaanxi - Marcus	Marcus	A man with a broad face, few words, a sincere heart, and deep roots	Shaanxi dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Southern Min - Roy	Roy	A humorous, straightforward, lively Taiwanese man	Southern Min, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Tianjin - Peter	Peter	A Tianjin-style crosstalk performer and professional food critic	Tianjin dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Cantonese - Rocky	Rocky	A humorous, witty man providing live commentary	Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Cantonese - Kiki	Kiki	A sweet Hong Kong girl best friend	Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Sichuan - Eric	Eric	A Sichuanese man from Chengdu who stands out in every crowd	Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

qwen3-omni-flash and qwen3-omni-flash-2025-09-15

Voice name	`voice` parameter	Description	Supported languages
Cherry	Cherry	A sunny, positive, friendly, and natural young woman	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Ethan	Ethan	Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish	Nofish	A designer who cannot pronounce retroflex sounds	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Jennifer	Jennifer	A premium, cinematic-quality American English female voice	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Ryan	Ryan	Full of rhythm, bursting with dramatic flair, balancing authenticity and tension	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Katerina	Katerina	A mature female voice with a rich, memorable rhythm	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Elias	Elias	Maintains academic rigor while using storytelling techniques to turn complex knowledge into digestible learning modules	Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai - Jada	Jada	A fast-paced, energetic Shanghai auntie	Shanghainese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Beijing - Dylan	Dylan	A young man raised in Beijing's hutongs	Beijing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Sichuan - Sunny	Sunny	A Sichuan girl sweet enough to melt your heart	Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nanjing - Li	Li	A patient yoga teacher	Nanjing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shaanxi - Marcus	Marcus	A man with a broad face, few words, a sincere heart, and deep roots	Shaanxi dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Southern Min - Roy	Roy	A humorous, straightforward, lively Taiwanese man	Southern Min, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Tianjin - Peter	Peter	A Tianjin-style crosstalk performer and professional food critic	Tianjin dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Cantonese - Rocky	Rocky	A humorous, witty man providing live commentary	Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Cantonese - Kiki	Kiki	A sweet Hong Kong girl best friend	Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Sichuan - Eric	Eric	A Sichuanese man from Chengdu who stands out in every crowd	Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Qwen-Omni-Turbo

Voice name	`voice` parameter	Description	Supported languages
Cherry	Cherry	A sunny, positive, friendly, and natural young woman	Chinese, English
Serena	Serena	A gentle young woman	Chinese, English
Ethan	Ethan	Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant	Chinese, English
Chelsie	Chelsie	An anime-style virtual girlfriend	Chinese, English

Open-source Qwen-Omni models

Voice name	`voice` parameter	Description	Supported languages
Ethan	Ethan	Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant	Chinese, English
Chelsie	Chelsie	An anime-style virtual girlfriend	Chinese, English
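
The tables above show that the value accepted by the `voice` parameter depends on the model. As a minimal sketch, a request can be checked against these tables before it is sent. The helper `build_audio_param` and the lookup table `VOICES_BY_MODEL` are hypothetical names, not part of the SDK. The key `qwen-omni-turbo` is an assumed model ID for the Qwen-Omni-Turbo table, and the exact-match comparison assumes that voice names are case-sensitive.

```python
# Voice names are copied from the `voice` parameter column of the tables above.
QWEN3_OMNI_FLASH_VOICES = frozenset({
    "Cherry", "Ethan", "Nofish", "Jennifer", "Ryan", "Katerina", "Elias",
    "Jada", "Dylan", "Sunny", "Li", "Marcus", "Roy", "Peter", "Rocky",
    "Kiki", "Eric",
})

# Hypothetical lookup table: model ID -> allowed voices.
VOICES_BY_MODEL = {
    "qwen3-omni-flash": QWEN3_OMNI_FLASH_VOICES,
    "qwen3-omni-flash-2025-09-15": QWEN3_OMNI_FLASH_VOICES,
    # Assumed model ID for the Qwen-Omni-Turbo table; confirm before use.
    "qwen-omni-turbo": frozenset({"Cherry", "Serena", "Ethan", "Chelsie"}),
}


def build_audio_param(model: str, voice: str, audio_format: str = "wav") -> dict:
    """Return the `audio` argument for a request, or raise if the voice
    is not listed for the given model (hypothetical helper)."""
    allowed = VOICES_BY_MODEL.get(model)
    if allowed is None:
        raise KeyError(f"No voice table recorded for model {model!r}")
    if voice not in allowed:
        raise ValueError(
            f"Voice {voice!r} is not listed for {model!r}; "
            f"choose one of {sorted(allowed)}"
        )
    return {"voice": voice, "format": audio_format}


audio_param = build_audio_param("qwen3-omni-flash", "Kiki")
print(audio_param)  # {'voice': 'Kiki', 'format': 'wav'}
```

The returned dictionary can be passed as the `audio` argument of `client.chat.completions.create`, in the same position as in the Getting started example.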
