
Audio understanding

Describe complex audio

Qwen3-Omni-Captioner is an open-source model built on Qwen3-Omni. It generates descriptions for complex audio, including speech, ambient sounds, music, and sound effects, without requiring prompts. The model can identify speaker emotions, musical elements like style and instruments, and sensitive information.

Availability

Model: qwen3-omni-30b-a3b-captioner
Context window: 65,536 tokens
Max input: 32,768 tokens
Max output: 32,768 tokens
Input cost: $3.81
Output cost: $3.06
Free quota: 1 million tokens, valid for 90 days after activating Qwen Cloud
Token conversion rule for audio: Total tokens = Audio duration (in seconds) × 12.5. If the audio duration is less than one second, it is counted as one second.
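The conversion rule above can be sketched as a small helper. Note that the rule only states the multiplier and the sub-second minimum; rounding fractional results up is an assumption of this sketch:

```python
import math

def audio_tokens(duration_seconds: float) -> int:
    """Estimate billed audio tokens: duration (s) x 12.5.
    Any clip shorter than one second is counted as one second.
    Rounding fractional results up is an assumption."""
    seconds = max(duration_seconds, 1.0)  # sub-second clips count as 1 s
    return math.ceil(seconds * 12.5)

print(audio_tokens(10.0))  # a 10-second clip: 125 tokens
print(audio_tokens(0.4))   # counted the same as a 1-second clip
```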

Getting started

Prerequisites: Qwen3-Omni-Captioner supports API calls only; online testing is not available for this model. The following examples analyze online audio specified by a URL. For local files, see Pass local file. For file requirements, see Limitations.
The following example uses the OpenAI-compatible API; the DashScope message format is shown in How it works.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3-omni-30b-a3b-captioner",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ]
)
print(completion.choices[0].message.content)
Sample response:
{
  "choices": [
    {
      "message": {
        "content": "The audio clip is a brief, low-fidelity recording-approximately six seconds long-captured in a small, reverberant indoor space, likely a home office or bedroom. It opens with a rapid, metallic, rhythmic hammering sound, repeating every 0.5 to 0.6 seconds, with each strike slightly uneven and accompanied by a short echo. This sound dominates the left side of the stereo field and is close to the microphone, suggesting the hammering is occurring nearby and slightly to the left.\n\nOverlaid with the hammering, a single male voice speaks in Mandarin Chinese, his tone clearly one of frustration and exasperation. He says, \"Oh, with this, how am I supposed to work quietly?\" His speech is clear despite the poor audio quality, and is delivered in a standard, unaccented Mandarin, indicative of a native speaker from northern or central China.\n\nThe voice is more distant and centered in the stereo field, with more room reverberation than the hammering. The emotional content is palpable: his voice rises slightly at the end, turning the phrase into a rhetorical complaint, underscoring his irritation. No other voices, music, or ambient sounds are present; the only non-speech sounds are the hammering and the faint hiss of the recording device.\n\nThe combination of the environmental sound, the speaker's language, and his tone strongly suggests a scenario of home office disruption-perhaps someone working from home is being disturbed by renovation or repair work happening nearby. The recording ends abruptly, mid-hammer, further emphasizing the spontaneous and candid nature of the capture.\n\nIn summary, the audio is a realistic, low-fidelity snapshot of a Mandarin-speaking man, likely in China, expressing frustration at being unable to work in peace due to nearby construction or repair activity, captured in a personal, indoor setting.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 160,
    "completion_tokens": 387,
    "total_tokens": 547,
    "prompt_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "completion_tokens_details": {
      "text_tokens": 387
    }
  },
  "created": 1758002134,
  "system_fingerprint": null,
  "model": "qwen3-omni-30b-a3b-captioner",
  "id": "chatcmpl-f4155bf9-b860-49d6-8ee2-092da7359097"
}

How it works

  • Single-turn interaction: The model does not support multi-turn conversation. Each request is an independent analysis task.
  • Fixed task: The model's core task is to generate audio descriptions, in English only. You cannot change its behavior with instructions such as a system message, for example to control the output format or content focus.
  • Audio input only: The model accepts only audio as input; no text prompt is needed. The format of the messages parameter is fixed.
OpenAI compatible:
messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ]
DashScope:
messages = [
  {
    "role": "user",
    "content": [
      {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
    ]
  }
]
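The two fixed formats differ only in how the audio source is wrapped. A small helper (hypothetical, not part of either SDK) makes the difference explicit:

```python
def captioner_messages(audio_url: str, api: str = "openai") -> list:
    """Build the fixed single-turn messages payload for Qwen3-Omni-Captioner.

    api: "openai" for the OpenAI-compatible endpoint,
         "dashscope" for the DashScope SDK.
    """
    if api == "openai":
        content = [{"type": "input_audio", "input_audio": {"data": audio_url}}]
    elif api == "dashscope":
        content = [{"audio": audio_url}]
    else:
        raise ValueError("api must be 'openai' or 'dashscope'")
    return [{"role": "user", "content": content}]

msgs = captioner_messages("https://example.com/clip.wav", api="dashscope")
print(msgs[0]["content"][0]["audio"])  # the bare audio URL, DashScope style
```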

Streaming output

For general streaming concepts (SSE protocol, how to enable streaming, billing, and token usage), see Streaming output. This section covers only the streaming behavior specific to audio understanding.
To enable streaming, add stream: true to your call. The streaming behavior is identical to standard text streaming; only the input message format (audio instead of text) differs. Use the same message format shown in Getting started and add the streaming parameters:
completion = client.chat.completions.create(
  model="qwen3-omni-30b-a3b-captioner",
  messages=[{
    "role": "user",
    "content": [{"type": "input_audio", "input_audio": {"data": "<audio-url>"}}]
  }],
  stream=True,
  stream_options={"include_usage": True},
)
for chunk in completion:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")

Pass local file (Base64 encoding or file path)

The model provides two methods to pass a local file:
  • File path (recommended for more stable transmission)
  • Base64 encoding
To pass a file path, provide it directly to the model. This method is supported only by the DashScope Python and Java SDKs, not by HTTP. Refer to the following table to specify the file path based on your programming language and operating system.

Specify the file path

| System         | SDK        | Input file path                      | Example                      |
|----------------|------------|--------------------------------------|------------------------------|
| Linux or macOS | Python SDK | file://<absolute_path_of_the_file>   | file:///home/images/test.mp3 |
| Linux or macOS | Java SDK   | file://<absolute_path_of_the_file>   | file:///home/images/test.mp3 |
| Windows        | Python SDK | file://<absolute_path_of_the_file>   | file://D:/images/test.mp3    |
| Windows        | Java SDK   | file:///<absolute_path_of_the_file>  | file:///D:/images/test.mp3   |
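On Linux or macOS, Python's standard pathlib can build the required URI from an absolute path. (On Windows, Path.as_uri() produces the file:/// form shown in the Java SDK row, so for the Windows Python SDK build the file:// string manually.)

```python
from pathlib import Path

# POSIX: as_uri() yields the file:///... form expected by the SDK.
audio_uri = Path("/home/images/test.mp3").as_uri()
print(audio_uri)  # file:///home/images/test.mp3
```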
Limits:
  • We recommend passing the file path directly for greater stability. You can also use Base64 encoding for files smaller than 1 MB.
  • When passing a file path directly, the audio file must be smaller than 10 MB.
  • When passing a file using Base64 encoding, the encoded string must be smaller than 10 MB. Base64 encoding increases the data size.
Code sample: pass by file path
Passing a file path is supported only by the DashScope Python and Java SDKs, not by HTTP.
import dashscope
import os

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
# The full path of the local file must be prefixed with file:// to ensure a valid path, for example: file:///home/images/test.mp3
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
  {
    "role": "user",
    # Pass the file path prefixed with file:// in the audio parameter.
    "content": [{"audio": audio_file_path}],
  }
]

response = dashscope.MultiModalConversation.call(
  # If you have not configured the environment variable, replace the following line with your API key: api_key="sk-xxx"
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  model="qwen3-omni-30b-a3b-captioner",
  messages=messages)

print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

API reference

For the input and output parameters of Qwen3-Omni-Captioner, see Chat completions API.

Error codes

If a call fails, see Error messages.

FAQ

How to compress an audio file to the required size?

  • Online tools: You can use online tools such as Compresss to compress audio files.
  • Code implementation: You can use the FFmpeg tool. For more information about its usage, see the official FFmpeg website.
# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: input.mp3
# -b:a: Sets the audio bitrate.
#   Common values: 64k (low quality, for voice and low-bandwidth streaming),
#   128k (medium quality, for general audio and podcasts),
#   192k (high quality, for music and broadcasting).
#   A higher bitrate results in better audio quality and a larger file.
# -ar: Sets the audio sample rate, which is the number of samples per second.
#   Common values: 8000 Hz, 22050 Hz, 44100 Hz (standard sample rate).
#   A higher sample rate results in a larger file.
# -ac: Sets the number of audio channels. Common values: 1 (mono), 2 (stereo). Mono files are smaller.
# -y: Overwrites the output file if it exists (no value needed).
# output.mp3: Specifies the output file path.

ffmpeg -i input.mp3 -b:a 128k -ar 44100 -ac 1 output.mp3 -y
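Output size is driven almost entirely by bitrate: size in bytes ≈ bitrate (bits/s) × duration (s) / 8. A quick estimate helps pick a bitrate that keeps a long recording under the 10 MB local-file limit:

```python
def estimated_size_mb(bitrate_kbps: int, duration_seconds: float) -> float:
    """Approximate compressed audio size: bitrate x duration / 8 bits per byte."""
    return bitrate_kbps * 1000 * duration_seconds / 8 / 1_000_000

# A 40-minute clip (the maximum duration) at 128 kbps:
print(round(estimated_size_mb(128, 40 * 60), 1))  # 38.4 MB -- over the 10 MB limit
# The same clip at 32 kbps mono:
print(round(estimated_size_mb(32, 40 * 60), 1))   # 9.6 MB -- fits
```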

Limitations

The model has the following limits for audio files:
  1. Duration: Less than or equal to 40 minutes.
  2. Number of files: Only one audio file is supported per request.
  3. File formats: Supported formats include AMR, WAV (CodecID: GSM_MS), WAV (PCM), 3GP, 3GPP, AAC, and MP3.
  4. File input methods: Publicly accessible audio URL, Base64 encoding, or local file path.
  5. File size:
    • Public URL: No more than 1 GB.
    • File path: The audio file must be smaller than 10 MB.
    • Base64 encoding: The encoded Base64 string must be smaller than 10 MB. For more information, see Pass local file.
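The limits above can be bundled into a pre-flight check before uploading. The format list and thresholds come from this section; the helper itself is hypothetical:

```python
import os

SUPPORTED_EXTENSIONS = {".amr", ".wav", ".3gp", ".3gpp", ".aac", ".mp3"}
MAX_DURATION_SECONDS = 40 * 60
MAX_LOCAL_BYTES = 10 * 1024 * 1024   # file path, or the encoded Base64 string
MAX_URL_BYTES = 1 * 1024 ** 3        # publicly accessible URL

def check_audio(filename: str, size_bytes: int, duration_seconds: float,
                via_url: bool = False) -> None:
    """Raise ValueError if the file violates the documented limits."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or filename}")
    if duration_seconds > MAX_DURATION_SECONDS:
        raise ValueError("audio longer than 40 minutes")
    limit = MAX_URL_BYTES if via_url else MAX_LOCAL_BYTES
    if size_bytes > limit:
        raise ValueError("file exceeds the size limit for this input method")

check_audio("meeting.mp3", size_bytes=5_000_000, duration_seconds=1800)
print("ok")
```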

Alternative: Use Qwen-Omni

You can also use Qwen-Omni (qwen3-omni-flash) with a prompt for audio understanding. Unlike Qwen3-Omni-Captioner which generates descriptions without prompts, Qwen-Omni allows you to ask specific questions about the audio.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3-omni-flash",
  messages=[{
    "role": "user",
    "content": [
      {"type": "input_audio", "input_audio": {"data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav"}},
      {"type": "text", "text": "What is being said in this audio? Describe the speaker's emotion."}
    ]
  }],
  modalities=["text"],
  stream=True,
)

for chunk in completion:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")
For full Qwen-Omni capabilities including multimodal conversation with audio output, see Audio and video file understanding.