Audio and video file understanding

Text + image/audio input

Getting started

This example sends a text prompt to the Qwen-Omni API and returns a streaming response that includes both text and audio.
import os
import base64
import soundfile as sf
import numpy as np
from openai import OpenAI

# 1. Initialize the client
client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),  # Make sure the environment variable is configured
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# 2. Initiate the request
try:
  completion = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    messages=[{"role": "user", "content": "Who are you?"}],
    modalities=["text", "audio"],  # Specify text and audio output
    audio={"voice": "Tina", "format": "wav"},
    stream=True,  # Must be set to True
    stream_options={"include_usage": True},
  )

  # 3. Process the streaming response and decode the audio
  print("Model response:")
  audio_base64_string = ""
  for chunk in completion:
    # Process the text part
    if chunk.choices and chunk.choices[0].delta.content:
      print(chunk.choices[0].delta.content, end="")

    # Collect the audio part
    if chunk.choices and hasattr(chunk.choices[0].delta, "audio") and chunk.choices[0].delta.audio:
      audio_base64_string += chunk.choices[0].delta.audio.get("data", "")

  # 4. Save the audio file
  if audio_base64_string:
    wav_bytes = base64.b64decode(audio_base64_string)
    audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
    sf.write("audio_assistant.wav", audio_np, samplerate=24000)
    print("\nAudio file saved to: audio_assistant.wav")

except Exception as e:
  print(f"Request failed: {e}")
After you run the code, the text response is printed and an audio file named audio_assistant.wav is saved in the current working directory.
Model response:
I am a large language model developed by Alibaba Cloud. My name is Qwen. How can I help you?
Calling the API over HTTP returns the text in the content field and the Base64-encoded audio data in the audio field of each streamed chunk.
data: {"choices":[{"delta":{"content":"I"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1757647879,"system_fingerprint":null,"model":"qwen3.5-omni-plus","id":"chatcmpl-a68eca3b-c67e-4666-a72f-73c0b4919860"}
data: {"choices":[{"delta":{"content":" am"},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1757647879,"system_fingerprint":null,"model":"qwen3.5-omni-plus","id":"chatcmpl-a68eca3b-c67e-4666-a72f-73c0b4919860"}
......
data: {"choices":[{"delta":{"audio":{"data":"/v8AAAAAAAAAAAAAAA...","expires_at":1757647879,"id":"audio_a68eca3b-c67e-4666-a72f-73c0b4919860"}},"finish_reason":null,"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1757647879,"system_fingerprint":null,"model":"qwen3.5-omni-plus","id":"chatcmpl-a68eca3b-c67e-4666-a72f-73c0b4919860"}
data: {"choices":[{"finish_reason":"stop","delta":{"content":""},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1764763585,"system_fingerprint":null,"model":"qwen3.5-omni-plus","id":"chatcmpl-e8c82e9e-073e-4289-a786-a20eb444ac9c"}
data: {"choices":[],"object":"chat.completion.chunk","usage":{"prompt_tokens":207,"completion_tokens":103,"total_tokens":310,"completion_tokens_details":{"audio_tokens":83,"text_tokens":20},"prompt_tokens_details":{"text_tokens":207}},"created":1757940330,"system_fingerprint":null,"model":"qwen3.5-omni-plus","id":"chatcmpl-9cdd5a26-f9e9-4eff-9dcc-93a878165afc"}
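The data: lines above can be folded into the final text, the Base64 audio string, and the usage totals. The following is a minimal sketch, assuming each event arrives as a single `data: {json}` line as in the sample. The function name collect_stream is hypothetical, and the `[DONE]` end-of-stream sentinel is an assumption based on OpenAI-compatible streams (it does not appear in the sample output).

```python
import json


def collect_stream(lines):
  """Fold raw `data: {...}` lines into (text, audio_base64, usage)."""
  text_parts, audio_parts, usage = [], [], None
  for line in lines:
    line = line.strip()
    if not line.startswith("data:"):
      continue  # Skip blank lines and anything that is not an event.
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":  # Assumed end-of-stream sentinel.
      break
    chunk = json.loads(payload)
    if chunk.get("usage"):
      usage = chunk["usage"]  # Only the final chunk carries usage.
    for choice in chunk.get("choices", []):
      delta = choice.get("delta", {})
      if delta.get("content"):
        text_parts.append(delta["content"])
      audio = delta.get("audio") or {}
      if audio.get("data"):
        audio_parts.append(audio["data"])
  return "".join(text_parts), "".join(audio_parts), usage


# Shortened stand-ins for the chunks shown above.
sample = [
  'data: {"choices":[{"delta":{"content":"I"},"index":0}],"usage":null}',
  'data: {"choices":[{"delta":{"content":" am"},"index":0}],"usage":null}',
  'data: {"choices":[{"delta":{"audio":{"data":"AAAA"}},"index":0}],"usage":null}',
  'data: {"choices":[],"usage":{"total_tokens":310}}',
]
text, audio_b64, usage = collect_stream(sample)
print(text, len(audio_b64), usage)
```

The accumulated audio_b64 string can then be decoded in the same way as in the Python example.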

Model selection

The Qwen3.5-Omni series models are currently in invitational preview. Model calls are free for a limited time. This does not include tool calling fees. For tool calling billing, see Pricing.
  • Qwen3.5-Omni series: Best for long video analysis, meeting summaries, caption generation, content moderation, and audio-video interaction.
    • Input limits: Up to 3 hours of audio or up to 1 hour of video
    • Audio control: Supports adjusting volume, speaking rate, and emotion via instructions
    • Visual capability: Matches Qwen3.5's level. Understands images, speech, sound effects, and other multimodal information
  • Qwen3-Omni-Flash series: Best for short video analysis and cost-sensitive scenarios.
    • Input limits: Audio-video input under 150 seconds
    • Thinking mode: The only Qwen-Omni series that supports thinking mode
  • Qwen-Omni-Turbo series: This series is no longer updated and has limited features. Migrate to the Qwen3.5-Omni or Qwen3-Omni-Flash series.
Series | Audio-video description | Deep thinking | Web search | Input languages | Output audio languages | Voices
Qwen3.5-Omni | Strong | Not supported | Supported | 113 types (74 languages + 39 dialects) | 36 types (29 languages + 7 dialects) | 55
Qwen3-Omni-Flash | Weaker | Supported | Not supported | 19 types (11 languages + 8 dialects) | 19 types (11 languages + 8 dialects) | 17-49 (varies by version)
Qwen-Omni-Turbo (No longer updated) | None | Not supported | Not supported | Chinese, English | Chinese, English | 4
Qwen3.5-Omni:
Input languages (74 languages): Chinese, English, German, French, Italian, Czech, Indonesian, Thai, Korean, Polish, Japanese, Vietnamese, Finnish, Portuguese, Spanish, Dutch, Russian, Malay, Catalan, Swedish, Turkish, Ukrainian, Romanian, Slovak, Danish, Icelandic, Norwegian (Bokmal), Macedonian, Greek, Hungarian, Galician, Filipino, Croatian, Bosnian, Slovenian, Bulgarian, Kazakh, Belarusian, Latvian, Estonian, Azerbaijani, Uyghur, Swahili, Hindi, Esperanto, Kyrgyz, Tajik, Cebuano, Afrikaans, Arabic, Lithuanian, Javanese, Bengali, Persian, Hebrew, Punjabi, Gujarati, Mongolian, Asturian, Kannada, Marathi, Interlingua, Malayalam, Maltese, Norwegian Nynorsk, Telugu, Urdu, Georgian, Basque, Tamil, Odia, Serbian, Maori
Input dialects (39 dialects): Northeastern Mandarin, Guizhou dialect, Cantonese, Henan dialect, Hong Kong Cantonese, Shanghainese, Shaanxi dialect, Tianjin dialect, Taiwanese Hokkien, Yunnan dialect, Anhui dialect, Fujian dialect, Gansu dialect, Guangdong dialect, Hubei dialect, Hunan dialect, Jiangxi dialect, Shandong dialect, Shanxi dialect, Sichuan dialect, Guangxi dialect, Hainan dialect, Chongqing dialect, Changsha dialect, Hangzhou dialect, Hefei dialect, Yinchuan dialect, Zhengzhou dialect, Shenyang dialect, Wenzhou dialect, Wuhan dialect, Kunming dialect, Taiyuan dialect, Nanchang dialect, Jinan dialect, Lanzhou dialect, Nanjing dialect, Hakka, Southern Min
Output audio languages (29 languages): Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Output audio dialects (7 dialects): Sichuan dialect, Beijing dialect, Tianjin dialect, Nanjing dialect, Shaanxi dialect, Cantonese, Southern Min
Qwen3-Omni-Flash:
Input/Output languages (11 languages): Chinese, English, German, French, Italian, Thai, Korean, Japanese, Russian, Spanish, Portuguese
Input/Output dialects (8 dialects): Sichuan dialect, Shanghainese, Cantonese, Southern Min, Shaanxi dialect, Nanjing dialect, Tianjin dialect, Beijing dialect
For model names, context, pricing, and snapshot versions, see Model list. For concurrent request limits, see Rate limits.

Model performance

Audio and video content analysis

Input: Generate a comprehensive, timestamped description of this video.
Output:
00:00.000 - 00:02.500 A rain-soaked city street fills the widescreen frame. Long-exposure photography creates streaks of red and blue car lights on wet pavement. A solitary man in a dark knee-length trench coat and light shirt with a tie walks toward the camera along the right sidewalk.
00:12.300 - 00:14.533 The camera tilts down. Inside a flashlight beam, a yellowed, aged single sheet lies in a shallow puddle, edges charred and corners curled. The man kneels, extending gloved fingers toward the document; ripples spread outward as his hand nears.
Only partial timestamps are shown.

Multidialect understanding

Input: Describe the vocal characteristics, emotional state, and full lyrics for singers in different Chinese dialects.
Output: This audio features rap performances by singers from various regions of China. Hangzhou dialect (Wu) -- The male singer's voice is soft and gentle, with the distinctive smoothness and warmth of Wu dialect. His intonation rises and falls gently, and his rhythm is light and quick, creating a relaxed storytelling effect. Emotional state: Relaxed, comfortable, and full of everyday life. He describes Hangzhou's slow-paced lifestyle and urban changes, expressing affection for his hometown.
Only partial results are shown.

Lyric caption generation

Input: Transcribe the song lyrics and provide timestamps for each line.
Output:
[00:00:12,680 --> 00:00:16,960] Cat thread sways past moonlight on trees.
[00:00:18,400 --> 00:00:22,800] Radiators hum 1998 chart hits.
[00:00:24,160 --> 00:00:28,080] Time parts the mist-like heat waves.
[00:00:28,920 --> 00:00:33,000] Neon from the screen shines on my nose bridge.
...
[00:04:09,000 --> 00:04:10,020] (End)
Only partial results are shown.

Audio-video programming


Usage

Streaming output

All requests to Qwen-Omni must set stream=True.

Model configuration

Configure parameters, prompts, and audio-video lengths based on your use case to balance cost, speed, and quality.
  • Audio-video understanding
  • Audio understanding
Use case | Recommended video length | Recommended prompt | Recommended max_pixels value
Fast review, low cost | ≤60 minutes | Simple prompt within 50 words | 230,400
Content extraction (long video segmentation) | ≤60 minutes | Simple prompt within 50 words | 921,600 to 2,073,600
Standard analysis (short video tagging) | ≤4 minutes | Use the structured prompt below | 921,600 to 2,073,600
Fine-grained analysis (multiple speakers/complex scenes) | ≤2 minutes | Use the structured prompt below | 2,073,600
For fine-grained descriptions of long videos, segment them first.

Thinking mode

For enable/disable, streaming output, and thinking_budget, see Thinking.
Qwen3-Omni-Flash is a hybrid thinking model (enable_thinking defaults to false). Qwen-Omni-Turbo does not support thinking. In thinking mode, set modalities to ["text"], because audio output is not supported when thinking is enabled.

Web search

The Qwen3.5-Omni series supports web search to retrieve real-time information and reason over it. To enable web search, set the enable_search parameter to true and set search_strategy to agent inside search_options.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

try:
  completion = client.chat.completions.create(
    model="qwen3.5-omni-plus",
    messages=[{
      "role": "user",
      "content": "Please look up today's date and day of the week, and tell me what major holidays fall on this date."
    }],
    stream=True,
    stream_options={"include_usage": True},
    extra_body={
      "enable_search": True,
      "search_options": {
        "search_strategy": "agent"
      }
    }
  )

  print("Model response (includes real-time information):")
  for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
      print(chunk.choices[0].delta.content, end="")
  print()

except Exception as e:
  print(f"Request failed: {e}")
  • Web search is supported only in the Qwen3.5-Omni series. The search_strategy parameter only accepts agent.
  • See Pricing for billing information related to the agent strategy.

Multimodal input

Video and text input

You can input video as an image list or as a video file. If you input a video file, the model can also understand the audio in the video. The following sample code uses a video URL from the internet as an example. To input a local video, see Input Base64-encoded local files. Streaming output is required for all calls.

Video file format (can understand audio in the video)

  • Number of files:
    • Qwen3.5-Omni series: Up to 512 files using public URLs; up to 250 files using Base64 encoding.
    • Qwen3-Omni-Flash and Qwen-Omni-Turbo series: Only one file allowed.
  • File size:
    • Qwen3.5-Omni: Up to 2 GB, up to 1 hour duration.
    • Qwen3-Omni-Flash: Up to 256 MB, up to 150 seconds duration.
    • Qwen-Omni-Turbo: Up to 150 MB, up to 40 seconds duration.
  • File formats: MP4, AVI, MKV, MOV, FLV, WMV, etc.
  • The visual and audio content of a video file is billed separately.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "video_url",
          "video_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
          },
        },
        {"type": "text", "text": "What is the video about?"},
      ],
    },
  ],
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)
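The size and duration limits listed at the start of this section can be checked locally before a request is sent. The following is a minimal sketch, not part of the API: the function check_video and the dictionary keys are hypothetical labels (not model IDs), the caller supplies the file size and duration, and the limits are assumed to use binary units (1 MB = 1024 x 1024 bytes).

```python
MB = 1024 * 1024  # Assumption: binary megabytes.
GB = 1024 * MB

# (max size in bytes, max duration in seconds), copied from the limits above.
VIDEO_LIMITS = {
  "qwen3.5-omni": (2 * GB, 60 * 60),
  "qwen3-omni-flash": (256 * MB, 150),
  "qwen-omni-turbo": (150 * MB, 40),
}


def check_video(series, size_bytes, duration_seconds):
  """Return a list of problems; an empty list means the file is within limits."""
  max_bytes, max_seconds = VIDEO_LIMITS[series]
  problems = []
  if size_bytes > max_bytes:
    problems.append(f"file is {size_bytes} bytes, limit is {max_bytes}")
  if duration_seconds > max_seconds:
    problems.append(f"duration is {duration_seconds}s, limit is {max_seconds}s")
  return problems


print(check_video("qwen3-omni-flash", 300 * MB, 120))
print(check_video("qwen3.5-omni", 500 * MB, 45 * 60))
```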

Image list format

Number of images
  • Qwen3.5-Omni: Minimum 2 images, maximum 2048 images.
  • Qwen3-Omni-Flash: Minimum 2 images, maximum 128 images.
  • Qwen-Omni-Turbo: Minimum 4 images, maximum 80 images.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "video",
          "video": [
            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/xzsgiz/football1.jpg",
            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/tdescd/football2.jpg",
            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/zefdja/football3.jpg",
            "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/aedbqh/football4.jpg",
          ],
        },
        {"type": "text", "text": "Describe the process shown in this video"},
      ],
    }
  ],
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)
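When a frame sequence is longer than the maximum image count for a series, one option is to keep evenly spaced frames. The following is a sketch of that idea only. The function sample_frames and the dictionary keys are hypothetical, and even spacing is an illustrative choice rather than a documented requirement.

```python
# (minimum, maximum) image counts, copied from the list above.
IMAGE_LIST_LIMITS = {
  "qwen3.5-omni": (2, 2048),
  "qwen3-omni-flash": (2, 128),
  "qwen-omni-turbo": (4, 80),
}


def sample_frames(frames, series):
  """Pick evenly spaced frames so the list fits the limits of the series."""
  minimum, maximum = IMAGE_LIST_LIMITS[series]
  if len(frames) < minimum:
    raise ValueError(f"{series} needs at least {minimum} images, got {len(frames)}")
  if len(frames) <= maximum:
    return list(frames)
  # Spread `maximum` indices across the full range, keeping the first and last frame.
  step = (len(frames) - 1) / (maximum - 1)
  return [frames[round(i * step)] for i in range(maximum)]


frame_urls = [f"frame_{i:04d}.jpg" for i in range(200)]  # Placeholder names.
picked = sample_frames(frame_urls, "qwen-omni-turbo")
print(len(picked), picked[0], picked[-1])
```

The returned list can be passed as the value of the video field in the request above.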

Audio and text input

  • Number of files:
    • Qwen3.5-Omni series: Up to 2048 files using public URLs; up to 250 files using Base64 encoding.
    • Qwen3-Omni-Flash and Qwen-Omni-Turbo series: Only one file allowed.
  • File size:
    • Qwen3.5-Omni: Up to 2 GB, up to 3 hours duration.
    • Qwen3-Omni-Flash: Up to 100 MB, up to 20 minutes duration.
    • Qwen-Omni-Turbo: Up to 10 MB, up to 3 minutes duration.
  • File formats: Supports major formats such as AMR, WAV, 3GP, 3GPP, AAC, and MP3.
To input a local audio file, see Input Base64-encoded local files. Streaming output is required for all calls.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav",
            "format": "wav",
          },
        },
        {"type": "text", "text": "What is this audio about"},
      ],
    },
  ],
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)

Image and text input

Qwen-Omni models support multiple image inputs. The requirements for input images are as follows:
  • Number of images:
    • When passed as a public URL: up to 2048 images per request.
    • When passed as Base64-encoded strings: up to 250 images per request.
    In addition to these per-request caps, the total tokens from all images and all text must be less than the model's maximum input length.
  • Image size:
    • Qwen3.5-Omni series: Each image file must be 20 MB or less.
    • Qwen3-Omni-Flash and Qwen-Omni-Turbo series: Each image file must be 10 MB or less.
  • The width and height of the image must both be greater than 10 pixels. The aspect ratio must not exceed 200:1 or 1:200.
  • For supported image types, see Visual and video understanding.
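The dimension, aspect ratio, and file size rules above can be expressed as a small local check. This is a sketch under stated assumptions: check_image is a hypothetical helper, the caller supplies the width, height, and file size, and MB is assumed to mean 1024 x 1024 bytes.

```python
MB = 1024 * 1024  # Assumption: binary megabytes.


def check_image(width, height, size_bytes, max_mb=20):
  """Validate one image against the constraints listed above.

  max_mb is 20 for the Qwen3.5 series and 10 for
  Qwen3-Omni-Flash and Qwen-Omni-Turbo.
  """
  problems = []
  if width <= 10 or height <= 10:
    problems.append("width and height must both be greater than 10 pixels")
  elif max(width, height) / min(width, height) > 200:
    problems.append("aspect ratio must not exceed 200:1 or 1:200")
  if size_bytes > max_mb * MB:
    problems.append(f"file must be {max_mb} MB or less")
  return problems


print(check_image(1920, 1080, 5 * MB))
print(check_image(4000, 11, 15 * MB, max_mb=10))
```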
The following sample code uses an image URL from the internet as an example. To input a local image, see Input Base64-encoded local files. Streaming output is required for all calls.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          },
        },
        {"type": "text", "text": "What scene is depicted in the image?"},
      ],
    },
  ],
  # Set the output data modality. Two are currently supported: ["text","audio"] and ["text"].
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  # stream must be set to True to avoid errors.
  stream=True,
  stream_options={
    "include_usage": True
  }
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)

Multi-turn conversation

When you use the multi-turn conversation feature of Qwen-Omni models, note the following:
  • Assistant message: Assistant messages in the messages array support only text data.
  • User message: A user message can contain text and data from only one other modality. In a multi-turn conversation, you can use different modalities in separate user messages.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3",
            "format": "mp3",
          },
        },
        {"type": "text", "text": "What is this audio about"},
      ],
    },
    {
      "role": "assistant",
      "content": [{"type": "text", "text": "This audio says: Welcome to Qwen Cloud"}],
    },
    {
      "role": "user",
      "content": [{"type": "text", "text": "Can you tell me about this company?"}],
    },
  ],
  # Set the output data modality. Two are currently supported: ["text","audio"] and ["text"].
  modalities=["text"],
  # stream must be set to True to avoid errors.
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)
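The two rules at the start of this section (assistant messages are text only, and a user message carries at most one modality besides text) can be checked before a request is sent. The following is a sketch: validate_messages is a hypothetical helper, and it treats each distinct content type (such as input_audio or image_url) as one modality, which is a simplification.

```python
def validate_messages(messages):
  """Check the multi-turn rules and return a list of problems."""
  problems = []
  for i, message in enumerate(messages):
    content = message["content"]
    # A plain string is treated as text-only content.
    parts = [{"type": "text"}] if isinstance(content, str) else content
    non_text = {part["type"] for part in parts if part["type"] != "text"}
    if message["role"] == "assistant" and non_text:
      problems.append(f"message {i}: assistant content must be text only")
    elif message["role"] == "user" and len(non_text) > 1:
      problems.append(f"message {i}: mixes {sorted(non_text)}; use one modality besides text")
  return problems


history = [
  {"role": "user", "content": [
    {"type": "input_audio", "input_audio": {"data": "...", "format": "mp3"}},
    {"type": "text", "text": "What is this audio about"},
  ]},
  {"role": "assistant", "content": [{"type": "text", "text": "..."}]},
  {"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "..."}},
    {"type": "input_audio", "input_audio": {"data": "...", "format": "mp3"}},
  ]},
]
print(validate_messages(history))
```

Here only the last message is reported, because it combines an image and audio in a single user turn.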

Parse Base64-encoded audio data output

The audio output from Qwen-Omni models is Base64-encoded data delivered in a stream. You can use a string variable to accumulate the Base64 data from each fragment as it arrives. After the stream is complete, decode the final string to create the audio file. Alternatively, decode and play each fragment in real time as it is received.
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import os
from openai import OpenAI
import base64
import numpy as np
import soundfile as sf

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[{"role": "user", "content": "Who are you"}],
  # Set the output data modality. Two are currently supported: ["text","audio"] and ["text"].
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  # stream must be set to True to avoid errors.
  stream=True,
  stream_options={"include_usage": True},
)

# Method 1: Decode after the generation is complete
audio_string = ""
for chunk in completion:
  if chunk.choices:
    delta = chunk.choices[0].delta
    # The audio field is missing or None on chunks that carry only text.
    audio = getattr(delta, "audio", None)
    if audio and audio.get("data"):
      audio_string += audio["data"]
    elif delta.content:
      print(delta.content, end="")
  else:
    print(chunk.usage)

wav_bytes = base64.b64decode(audio_string)
audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
sf.write("audio_assistant_py.wav", audio_np, samplerate=24000)

# Method 2: Decode while generating (comment out the code for Method 1 to use Method 2)
# # Initialize PyAudio
# import pyaudio
# import time
# p = pyaudio.PyAudio()
# # Create an audio stream
# stream = p.open(format=pyaudio.paInt16,
#                 channels=1,
#                 rate=24000,
#                 output=True)

# for chunk in completion:
#     if chunk.choices:
#         if hasattr(chunk.choices[0].delta, "audio"):
#             try:
#                 audio_string = chunk.choices[0].delta.audio["data"]
#                 wav_bytes = base64.b64decode(audio_string)
#                 audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
#                 # Play the audio data directly
#                 stream.write(audio_np.tobytes())
#             except Exception as e:
#                 print(chunk.choices[0].delta.content)

# time.sleep(0.8)
# # Clean up resources
# stream.stop_stream()
# stream.close()
# p.terminate()
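If NumPy and soundfile are not available, the accumulated Base64 string can also be written with the standard library wave module. This sketch assumes the decoded payload is headerless 16-bit little-endian mono PCM at 24 kHz, which is inferred from the np.int16 dtype, the 24000 sample rate, and the single channel used in the code above. The helper name save_pcm16_wav and the output path are hypothetical.

```python
import base64
import os
import tempfile
import wave


def save_pcm16_wav(audio_base64, path, sample_rate=24000):
  """Decode Base64 PCM16 audio and write it as a mono WAV file."""
  pcm_bytes = base64.b64decode(audio_base64)
  with wave.open(path, "wb") as wav_file:
    wav_file.setnchannels(1)  # Mono, as in the PyAudio setup above.
    wav_file.setsampwidth(2)  # 2 bytes per sample (int16).
    wav_file.setframerate(sample_rate)
    wav_file.writeframes(pcm_bytes)
  return len(pcm_bytes) // 2  # Number of samples written.


# Stand-in for audio_string: one second of silence.
silence = base64.b64encode(b"\x00\x00" * 24000).decode("ascii")
out_path = os.path.join(tempfile.gettempdir(), "audio_assistant_wave.wav")
samples = save_pcm16_wav(silence, out_path)
print(samples / 24000, "seconds written to", out_path)
```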

Input Base64-encoded local files

  • Images
  • Audio
  • Video file
  • Image list (as video)
This example uses the locally saved file eagle.png.
import os
from openai import OpenAI
import base64

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)


#  Base64 encoding format
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode("utf-8")


base64_image = encode_image("eagle.png")

completion = client.chat.completions.create(
  model="qwen3.5-omni-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {"url": f"data:image/png;base64,{base64_image}"},
        },
        {"type": "text", "text": "What scene is depicted in the image?"},
      ],
    },
  ],
  # Set the output data modality. Two are currently supported: ["text","audio"] and ["text"].
  modalities=["text", "audio"],
  audio={"voice": "Tina", "format": "wav"},
  # stream must be set to True to avoid errors.
  stream=True,
  stream_options={"include_usage": True},
)

for chunk in completion:
  if chunk.choices:
    print(chunk.choices[0].delta)
  else:
    print(chunk.usage)
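The example above hardcodes image/png in the data URL. For other image types, the MIME type can be guessed from the file extension. The following is a sketch for images only: to_data_url is a hypothetical helper, and the data URL prefixes for audio and video files are not shown in this section, so they are deliberately not covered.

```python
import base64
import mimetypes


def to_data_url(path):
  """Read a local image and return a `data:<mime>;base64,...` URL."""
  mime_type, _ = mimetypes.guess_type(path)
  if mime_type is None or not mime_type.startswith("image/"):
    raise ValueError(f"cannot determine an image MIME type for {path!r}")
  with open(path, "rb") as image_file:
    encoded = base64.b64encode(image_file.read()).decode("utf-8")
  return f"data:{mime_type};base64,{encoded}"


# Usage in a request, replacing the f-string in the example above:
#   "image_url": {"url": to_data_url("eagle.png")}
```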

API reference

For the input and output parameters of Qwen-Omni, see Chat completions API.

Billing and rate limits

Billing rules
Qwen-Omni is billed based on the number of tokens for different modalities, such as audio, image, and video. See Pricing for pricing details.
Audio
  • Qwen3.5-Omni: Total tokens = Audio duration (in seconds) x 7
  • Qwen3-Omni-Flash: Total tokens = Audio duration (in seconds) x 12.5
  • Qwen-Omni-Turbo: Total tokens = Audio duration (in seconds) x 25. If the audio duration is less than 1 second, it is calculated as 1 second.
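The per-second rates above can be turned into a quick estimator. This is a sketch under stated assumptions: estimate_audio_tokens and the dictionary keys are hypothetical, the 1-second floor is applied to every series, and no rounding is applied because the rounding rule for fractional results is not stated here.

```python
# Tokens per second of audio, copied from the list above.
AUDIO_TOKENS_PER_SECOND = {
  "qwen3.5-omni": 7,
  "qwen3-omni-flash": 12.5,
  "qwen-omni-turbo": 25,
}


def estimate_audio_tokens(series, duration_seconds):
  """Estimate audio tokens; clips shorter than 1 second count as 1 second."""
  # Assumption: the 1-second floor applies to every series.
  billed_seconds = max(duration_seconds, 1)
  return billed_seconds * AUDIO_TOKENS_PER_SECOND[series]


for series in AUDIO_TOKENS_PER_SECOND:
  print(series, estimate_audio_tokens(series, 60))
```

For example, one minute of audio comes to 60 x 7 = 420 tokens on Qwen3.5-Omni and 60 x 25 = 1500 tokens on Qwen-Omni-Turbo.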
Images
  • Qwen3.5-Omni and Qwen3-Omni-Flash: 1 token per 32 x 32 pixels.
  • Qwen-Omni-Turbo: 1 token per 28 x 28 pixels.
Each image requires a minimum of 4 tokens and supports a maximum of 1280 tokens. You can use the following code to estimate the total tokens for an image by providing its path:
import math
# Install the Pillow library: pip install Pillow
from PIL import Image

# For Qwen-Omni-Turbo, the factor is 28.
# factor = 28
# For Qwen3.5-Omni and Qwen3-Omni-Flash, the factor is 32.
factor = 32

def token_calculate(image_path=''):
  """
  param image_path: The path of the image.
  return: The number of tokens for a single image.
  """
  if len(image_path) > 0:
    # Open the specified image file.
    image = Image.open(image_path)
    # Get the original dimensions of the image.
    height = image.height
    width = image.width
    print(f"Image dimensions before scaling: Height={height}, Width={width}")
    # Adjust the height to be a multiple of the factor.
    h_bar = round(height / factor) * factor
    # Adjust the width to be a multiple of the factor.
    w_bar = round(width / factor) * factor
    # Lower limit for image tokens: 4 tokens.
    min_pixels = 4 * factor * factor
    # Upper limit for image tokens: 1280 tokens.
    max_pixels = 1280 * factor * factor
    # Scale the image to adjust the total number of pixels to be within the [min_pixels, max_pixels] range.
    if h_bar * w_bar > max_pixels:
      # Calculate the scaling factor beta so that the total pixels of the scaled image do not exceed max_pixels.
      beta = math.sqrt((height * width) / max_pixels)
      # Recalculate the adjusted height to ensure it is a multiple of the factor.
      h_bar = math.floor(height / beta / factor) * factor
      # Recalculate the adjusted width to ensure it is a multiple of the factor.
      w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
      # Calculate the scaling factor beta so that the total pixels of the scaled image are not less than min_pixels.
      beta = math.sqrt(min_pixels / (height * width))
      # Recalculate the adjusted height to ensure it is a multiple of the factor.
      h_bar = math.ceil(height * beta / factor) * factor
      # Recalculate the adjusted width to ensure it is a multiple of the factor.
      w_bar = math.ceil(width * beta / factor) * factor
    print(f"Image dimensions after scaling: Height={h_bar}, Width={w_bar}")
    # Calculate the number of tokens for the image: total pixels / (factor * factor), plus 2.
    token = int((h_bar * w_bar) / (factor * factor)) + 2
    print(f"Number of tokens after scaling: {token}")
    return token
  else:
    raise ValueError("Image path cannot be empty. Provide a valid image file path.")

if __name__ == "__main__":
  token = token_calculate(image_path="xxx/test.jpg")
Video
Video files generate two types of tokens: video_tokens (visual) and audio_tokens (audio).
  • video_tokens
The calculation procedure is complex. For more information, see the following code:
# Before use, install: pip install opencv-python
import math
import os
import logging
import cv2

# Fixed parameters
FRAME_FACTOR = 2

# For Qwen3.5-Omni and Qwen3-Omni-Flash, IMAGE_FACTOR is 32
IMAGE_FACTOR = 32

# For Qwen-Omni-Turbo, IMAGE_FACTOR is 28
# IMAGE_FACTOR = 28

# Aspect ratio of video frames
MAX_RATIO = 200

# Lower limit for video frame pixels. For Qwen3.5-Omni and Qwen3-Omni-Flash: 128 * 32 * 32
VIDEO_MIN_PIXELS = 128 * 32 * 32
# For Qwen-Omni-Turbo
# VIDEO_MIN_PIXELS = 128 * 28 * 28

# Upper limit for video frame pixels. For Qwen3.5-Omni and Qwen3-Omni-Flash: 768 * 32 * 32
VIDEO_MAX_PIXELS = 768 * 32 * 32
# For Qwen-Omni-Turbo:
# VIDEO_MAX_PIXELS = 768 * 28 * 28

FPS = 2
# Minimum number of extracted frames
FPS_MIN_FRAMES = 4

# Maximum number of extracted frames
# Maximum number of extracted frames for Qwen3.5-Omni and Qwen3-Omni-Flash: 128
# Maximum number of extracted frames for Qwen-Omni-Turbo: 80
FPS_MAX_FRAMES = 128

# Maximum pixel value for video input. For Qwen3.5-Omni and Qwen3-Omni-Flash: 16384 * 32 * 32
VIDEO_TOTAL_PIXELS = 16384 * 32 * 32
# For Qwen-Omni-Turbo:
# VIDEO_TOTAL_PIXELS = 16384 * 28 * 28

def round_by_factor(number, factor):
  return round(number / factor) * factor

def ceil_by_factor(number, factor):
  return math.ceil(number / factor) * factor

def floor_by_factor(number, factor):
  return math.floor(number / factor) * factor

def get_video(video_path):
  cap = cv2.VideoCapture(video_path)
  frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
  frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
  total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
  video_fps = cap.get(cv2.CAP_PROP_FPS)
  cap.release()
  return frame_height, frame_width, total_frames, video_fps

def smart_nframes(total_frames, video_fps):
  min_frames = ceil_by_factor(FPS_MIN_FRAMES, FRAME_FACTOR)
  max_frames = floor_by_factor(min(FPS_MAX_FRAMES, total_frames), FRAME_FACTOR)
  duration = total_frames / video_fps if video_fps != 0 else 0
  if duration - int(duration) > (1 / FPS):
    total_frames = math.ceil(duration * video_fps)
  else:
    total_frames = math.ceil(int(duration) * video_fps)
  nframes = total_frames / video_fps * FPS
  nframes = int(min(min(max(nframes, min_frames), max_frames), total_frames))
  if not (FRAME_FACTOR <= nframes <= total_frames):
    raise ValueError(f"nframes should be in the interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
  return nframes

def smart_resize(height, width, nframes, factor=IMAGE_FACTOR):
  min_pixels = VIDEO_MIN_PIXELS
  total_pixels = VIDEO_TOTAL_PIXELS
  max_pixels = max(min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR), int(min_pixels * 1.05))
  if max(height, width) / min(height, width) > MAX_RATIO:
    raise ValueError(f"absolute aspect ratio must be smaller than {MAX_RATIO}, got {max(height, width) / min(height, width)}")
  h_bar = max(factor, round_by_factor(height, factor))
  w_bar = max(factor, round_by_factor(width, factor))
  if h_bar * w_bar > max_pixels:
    beta = math.sqrt((height * width) / max_pixels)
    h_bar = floor_by_factor(height / beta, factor)
    w_bar = floor_by_factor(width / beta, factor)
  elif h_bar * w_bar < min_pixels:
    beta = math.sqrt(min_pixels / (height * width))
    h_bar = ceil_by_factor(height * beta, factor)
    w_bar = ceil_by_factor(width * beta, factor)
  return h_bar, w_bar

def video_token_calculate(video_path):
  height, width, total_frames, video_fps = get_video(video_path)
  nframes = smart_nframes(total_frames, video_fps)
  resized_height, resized_width = smart_resize(height, width, nframes)
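  # Divide by IMAGE_FACTOR (not a hardcoded 32) so the Qwen-Omni-Turbo setting of 28 also works.
  video_token = int(math.ceil(nframes / FPS) * resized_height / IMAGE_FACTOR * resized_width / IMAGE_FACTOR)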
  video_token += 2  # Visual marks
  return video_token

if __name__ == "__main__":
  video_path = "spring_mountain.mp4"  # Your video path
  video_token = video_token_calculate(video_path)
  print("video_tokens:", video_token)
  • audio_tokens
    • Qwen3.5-Omni: Total tokens = Audio duration (in seconds) x 7
    • Qwen3-Omni-Flash: Total tokens = Audio duration (in seconds) x 12.5
    • Qwen-Omni-Turbo: Total tokens = Audio duration (in seconds) x 25
    • If the audio duration is less than 1 second, it is calculated as 1 second.
Free quota
For more information about how to claim, query, and use your free quota, see Free quota for new users.
Rate limits
For model rate limit rules and FAQ, see Rate limits.

Error codes

If a call fails, see Error messages.

Voice list

To use a voice, set the voice request parameter to the corresponding value in the voice parameter column of the tables below.

qwen3.5-omni

Voice name | voice parameter | Description | Supported languages
Tina | Tina | A voice like warm milk tea -- sweet and cozy, yet sharp when solving problems | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Cindy | Cindy | A sweet-talking young woman from Taiwan | Chinese (Taiwanese accent), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Liora Mira | Liora Mira | A gentle voice that weaves warmth into everyday life | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Sunnybobi | Sunnybobi | A cheerful, socially awkward neighbor girl | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Raymond | Raymond | A clear-voiced, takeout-loving homebody | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Ethan | Ethan | Standard Mandarin with a slight northern accent. Bright, warm, energetic, and youthful | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Theo Calm | Theo Calm | Conveys understanding in silence and healing through words | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Serena | Serena | A gentle young woman | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Harvey | Harvey | A voice that carries the weight of time -- deep, mellow, and scented with coffee and old books | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Maia | Maia | A blend of intellect and gentleness | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Evan | Evan | A college student -- youthful and endearing | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Qiao | Qiao | Not just cute -- she's sweet on the surface and full of personality underneath | Chinese (Taiwanese accent), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Momo | Momo | Playful and mischievous -- here to cheer you up | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Wil | Wil | A young man from Shenzhen who speaks with a Hong Kong-Taiwan accent | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Angel | Angel | Slightly Taiwanese-accented -- and very sweet | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Li Cassian | Li Cassian | Speaks with restraint -- three parts silence, seven parts reading the room | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Mia | Mia | A lifestyle artist who shares slow-living aesthetics through a soothing voice | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Joyner | Joyner | Funny, exaggerated, and down-to-earth | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Gold | Gold | A West Coast Black rapper | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Katerina | Katerina | A mature, commanding voice with rich rhythm and resonance | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Ryan | Ryan | High-energy delivery with strong dramatic presence -- realism meets intensity | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Jennifer | Jennifer | A premium, cinematic-quality American female voice | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Aiden | Aiden | An American young man skilled in cooking | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Mione | Mione | A mature, intelligent British neighbor girl | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Sichuan - Sunny | Sunny | A sweet Sichuan girl who warms your heart | Chinese (Sichuan dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Beijing - Dylan | Dylan | A youth raised in Beijing's hutongs | Chinese (Beijing dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Sichuan - Eric | Eric | A lively Chengdu man from Sichuan | Chinese (Sichuan dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Tianjin - Peter | Peter | A Tianjin-style xiangsheng performer -- professional foil | Chinese (Tianjin dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Joseph Chen | Joseph Chen | A longtime overseas Chinese from Southeast Asia with a warm, nostalgic voice | Chinese (Hokkien), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Shaanxi - Marcus | Marcus | Broad face, few words, sincere heart, deep voice -- the true flavor of Shaanxi | Chinese (Shaanxi dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Nanjing - Li | Li | A grumpy uncle | Chinese (Nanjing dialect), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Cantonese - Rocky | Rocky | A witty and humorous online chat companion | Chinese (Cantonese), English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Sohee | Sohee | A warm, cheerful, emotionally expressive Korean unnie | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Lenn | Lenn | Rational at core, rebellious in detail -- a German youth who wears suits and listens to post-punk | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Ono Anna | Ono Anna | A clever, playful childhood friend | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Sonrisa | Sonrisa | A warm, outgoing Latin American woman | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Bodega | Bodega | A warm, enthusiastic Spanish man | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Emilien | Emilien | A romantic French big brother | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Andre | Andre | A magnetic, natural, and steady male voice | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Radio Gol | Radio Gol | A passionate football commentator who narrates games with poetic flair | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Alek | Alek | Cold like the Russian spirit -- yet warm as wool beneath a coat | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Rizky | Rizky | A young Indonesian man with a distinctive voice | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Roya | Roya | A sporty girl with a free-spirited heart | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Arda | Arda | Neither high nor low -- clean, crisp, and gently warm | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Hana | Hana | A mature Vietnamese woman who loves dogs | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Dolce | Dolce | A laid-back Italian man | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Jakub | Jakub | A charismatic, artistic young man from a Polish town | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Griet | Griet | A mature, artistic Dutch woman | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Eliska | Eliska | Every word carries Central European craftsmanship and warmth | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Marina | Marina | A girl raised in a multicultural city | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Siiri | Siiri | Reserved and gentle -- with a calm, lake-like speaking pace | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Ingrid | Ingrid | A woman from rural Norway | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Sigga | Sigga | An intellectual young woman from an Icelandic town | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Bea | Bea | A sweet Filipino woman who loves coffee | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian
Chloe | Chloe | A Malaysian office worker | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Tagalog, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian

qwen3-omni-flash-2025-12-01

Voice name | voice parameter | Description | Supported languages
Cherry | Cherry | A sunny, positive, friendly, and natural young woman | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Serena | Serena | A gentle young woman | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Ethan | Ethan | Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Chelsie | Chelsie | A two-dimensional virtual girlfriend | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Momo | Momo | Playful and mischievous, cheering you up | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Vivian | Vivian | Confident, cute, and slightly feisty | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Moon | Moon | Effortlessly cool Moon White | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Maia | Maia | A blend of intellect and gentleness | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Kai | Kai | A soothing audio spa for your ears | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish | Nofish | A designer who cannot pronounce retroflex sounds | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Bella | Bella | A little girl who drinks but never throws punches when drunk | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Jennifer | Jennifer | A premium, cinematic-quality American English female voice | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Ryan | Ryan | Full of rhythm, bursting with dramatic flair, balancing authenticity and tension | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Katerina | Katerina | A mature-woman voice with rich, memorable rhythm | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Aiden | Aiden | An American English young man skilled in cooking | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Mia | Mia | Gentle as spring water, obedient as fresh snow | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Mochi | Mochi | A clever, quick-witted young adult -- childlike innocence remains, yet wisdom shines through | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Bellona | Bellona | A powerful, clear voice that brings characters to life -- so stirring it makes your blood boil | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Vincent | Vincent | A uniquely raspy, smoky voice -- just one line evokes armies and heroic tales | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Bunny | Bunny | A little girl overflowing with cuteness | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Neil | Neil | A flat baseline intonation with precise, clear pronunciation -- the most professional news anchor | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Elias | Elias | Maintains academic rigor while using storytelling techniques to turn complex knowledge into digestible learning modules | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Arthur | Arthur | A simple, earthy voice steeped in time and tobacco smoke -- slowly unfolding village stories and curiosities | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nini | Nini | A soft, clingy voice like sweet rice cakes | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Ebona | Ebona | Her whisper is like a rusty key slowly turning in the darkest corner of your mind | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Seren | Seren | A gentle, soothing voice to help you fall asleep faster. Good night, sweet dreams | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Pip | Pip | A playful, mischievous boy full of childlike wonder | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Stella | Stella | Normally a cloyingly sweet, dazed teenage-girl voice -- but when shouting battle cries, she instantly radiates unwavering love and justice | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Bodega | Bodega | A passionate Spanish man | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Sonrisa | Sonrisa | A cheerful, outgoing Latin American woman | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Alek | Alek | Cold like the Russian spirit, yet warm like wool coat lining | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Dolce | Dolce | A laid-back Italian man | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Sohee | Sohee | A warm, cheerful, emotionally expressive Korean unnie | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Lenn | Lenn | Rational at heart, rebellious in detail -- a German youth who wears suits and listens to post-punk | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Emilien | Emilien | A romantic French big brother | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Andre | Andre | A magnetic, natural, and steady male voice | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai - Jada | Jada | A fast-paced, energetic Shanghai auntie | Shanghainese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Beijing - Dylan | Dylan | A young man raised in Beijing's hutongs | Beijing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Sichuan - Sunny | Sunny | A Sichuan girl sweet enough to melt your heart | Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nanjing - Li | Li | A patient yoga teacher | Nanjing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shaanxi - Marcus | Marcus | A man with a broad face, few words, a sincere heart, and deep roots | Shaanxi dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Southern Min - Roy | Roy | A humorous, straightforward, lively Taiwanese man | Southern Min, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Tianjin - Peter | Peter | A Tianjin-style crosstalk performer and professional food critic | Tianjin dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Cantonese - Rocky | Rocky | A humorous, witty man providing live commentary | Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Cantonese - Kiki | Kiki | A sweet Hong Kong girl best friend | Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Sichuan - Eric | Eric | A Sichuanese man from Chengdu who stands out in every crowd | Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

qwen3-omni-flash and qwen3-omni-flash-2025-09-15

Voice name | voice parameter | Description | Supported languages
Cherry | Cherry | A sunny, positive, friendly, and natural young woman | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Ethan | Ethan | Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish | Nofish | A designer who cannot pronounce retroflex sounds | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Jennifer | Jennifer | A premium, cinematic-quality American English female voice | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Ryan | Ryan | Full of rhythm, bursting with dramatic flair, balancing authenticity and tension | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Katerina | Katerina | A mature-woman voice with rich, memorable rhythm | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Elias | Elias | Maintains academic rigor while using storytelling techniques to turn complex knowledge into digestible learning modules | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai - Jada | Jada | A fast-paced, energetic Shanghai auntie | Shanghainese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Beijing - Dylan | Dylan | A young man raised in Beijing's hutongs | Beijing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Sichuan - Sunny | Sunny | A Sichuan girl sweet enough to melt your heart | Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nanjing - Li | Li | A patient yoga teacher | Nanjing dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shaanxi - Marcus | Marcus | A man with a broad face, few words, a sincere heart, and deep roots | Shaanxi dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Southern Min - Roy | Roy | A humorous, straightforward, lively Taiwanese man | Southern Min, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Tianjin - Peter | Peter | A Tianjin-style crosstalk performer and professional food critic | Tianjin dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Cantonese - Rocky | Rocky | A humorous, witty man providing live commentary | Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Cantonese - Kiki | Kiki | A sweet Hong Kong girl best friend | Cantonese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Sichuan - Eric | Eric | A Sichuanese man from Chengdu who stands out in every crowd | Sichuan dialect, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Qwen-Omni-Turbo

Voice name | voice parameter | Description | Supported languages
Cherry | Cherry | A sunny, positive, friendly, and natural young woman | Chinese, English
Serena | Serena | A gentle young woman | Chinese, English
Ethan | Ethan | Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant | Chinese, English
Chelsie | Chelsie | A two-dimensional virtual girlfriend | Chinese, English

Open-source Qwen-Omni models

Voice name | voice parameter | Description | Supported languages
Ethan | Ethan | Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant | Chinese, English
Chelsie | Chelsie | A two-dimensional virtual girlfriend | Chinese, English