
Audio understanding

Describe complex audio

Qwen3-Omni-Captioner is an open-source model built on Qwen3-Omni. It generates descriptions for complex audio, including speech, ambient sounds, music, and sound effects, without requiring prompts. The model can identify speaker emotions, musical elements like style and instruments, and sensitive information.

Availability

Model: qwen3-omni-30b-a3b-captioner
Context window: 65,536 tokens
Max input: 32,768 tokens
Max output: 32,768 tokens
Input cost: $3.81
Output cost: $3.06
Free quota: 1 million tokens, valid for 90 days after activating Qwen Cloud
Token conversion rule for audio: Total tokens = Audio duration (in seconds) × 12.5. If the audio duration is less than one second, it is counted as one second.
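The conversion rule above can be sketched as a small helper. Note that the rule only states the multiplier and the sub-second minimum; rounding fractional results up is an assumption of this sketch:

```python
import math

def audio_tokens(duration_seconds: float) -> int:
    """Estimate billed audio tokens: duration (s) x 12.5.
    Any clip shorter than one second is counted as one second.
    Rounding fractional results up is an assumption."""
    seconds = max(duration_seconds, 1.0)  # sub-second clips count as 1 s
    return math.ceil(seconds * 12.5)

print(audio_tokens(10.0))  # a 10-second clip: 125 tokens
print(audio_tokens(0.4))   # counted the same as a 1-second clip
```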

Getting started

Prerequisites: Qwen3-Omni-Captioner supports API calls only; online testing is not available for this model. The following examples analyze online audio specified by a URL. For local files, see Pass local file. For file requirements, see Limitations.
The following example uses the OpenAI-compatible API; the DashScope message format is shown in How it works.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3-omni-30b-a3b-captioner",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ]
)
print(completion.choices[0].message.content)
Sample response:
{
  "choices": [
    {
      "message": {
        "content": "The audio clip is a brief, low-fidelity recording-approximately six seconds long-captured in a small, reverberant indoor space, likely a home office or bedroom. It opens with a rapid, metallic, rhythmic hammering sound, repeating every 0.5 to 0.6 seconds, with each strike slightly uneven and accompanied by a short echo. This sound dominates the left side of the stereo field and is close to the microphone, suggesting the hammering is occurring nearby and slightly to the left.\n\nOverlaid with the hammering, a single male voice speaks in Mandarin Chinese, his tone clearly one of frustration and exasperation. He says, \"Oh, with this, how am I supposed to work quietly?\" His speech is clear despite the poor audio quality, and is delivered in a standard, unaccented Mandarin, indicative of a native speaker from northern or central China.\n\nThe voice is more distant and centered in the stereo field, with more room reverberation than the hammering. The emotional content is palpable: his voice rises slightly at the end, turning the phrase into a rhetorical complaint, underscoring his irritation. No other voices, music, or ambient sounds are present; the only non-speech sounds are the hammering and the faint hiss of the recording device.\n\nThe combination of the environmental sound, the speaker's language, and his tone strongly suggests a scenario of home office disruption-perhaps someone working from home is being disturbed by renovation or repair work happening nearby. The recording ends abruptly, mid-hammer, further emphasizing the spontaneous and candid nature of the capture.\n\nIn summary, the audio is a realistic, low-fidelity snapshot of a Mandarin-speaking man, likely in China, expressing frustration at being unable to work in peace due to nearby construction or repair activity, captured in a personal, indoor setting.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 160,
    "completion_tokens": 387,
    "total_tokens": 547,
    "prompt_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "completion_tokens_details": {
      "text_tokens": 387
    }
  },
  "created": 1758002134,
  "system_fingerprint": null,
  "model": "qwen3-omni-30b-a3b-captioner",
  "id": "chatcmpl-f4155bf9-b860-49d6-8ee2-092da7359097"
}

How it works

  • Single-turn interaction: The model does not support multi-turn conversation. Each request is an independent analysis task.
  • Fixed task: The model's core task is to generate audio descriptions, in English only. You cannot change its behavior with instructions such as a system message, for example to control the output format or content focus.
  • Audio input only: The model accepts only audio as input; no text prompt is needed. The format of the messages parameter is fixed.
OpenAI compatible:
messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ]
DashScope:
messages = [
  {
    "role": "user",
    "content": [
      {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
    ]
  }
]
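The two fixed formats differ only in how the audio source is wrapped. A small helper (hypothetical, not part of either SDK) makes the difference explicit:

```python
def captioner_messages(audio_url: str, api: str = "openai") -> list:
    """Build the fixed single-turn messages payload for Qwen3-Omni-Captioner.

    api: "openai" for the OpenAI-compatible endpoint,
         "dashscope" for the DashScope SDK.
    """
    if api == "openai":
        content = [{"type": "input_audio", "input_audio": {"data": audio_url}}]
    elif api == "dashscope":
        content = [{"audio": audio_url}]
    else:
        raise ValueError("api must be 'openai' or 'dashscope'")
    return [{"role": "user", "content": content}]

msgs = captioner_messages("https://example.com/clip.wav", api="dashscope")
print(msgs[0]["content"][0]["audio"])  # the bare audio URL, DashScope style
```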

Streaming output

For general streaming concepts (SSE protocol, how to enable streaming, billing, and token usage), see Streaming output. This section covers only the streaming behavior specific to audio understanding.
To enable streaming, add stream: true to your call. The streaming behavior is identical to standard text streaming; only the input message format (audio instead of text) differs. Use the same message format shown in Getting started and add the streaming parameters:
completion = client.chat.completions.create(
  model="qwen3-omni-30b-a3b-captioner",
  messages=[{
    "role": "user",
    "content": [{"type": "input_audio", "input_audio": {"data": "<audio-url>"}}]
  }],
  stream=True,
  stream_options={"include_usage": True},
)
for chunk in completion:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")

Pass local file (Base64 encoding or file path)

The model provides two methods to pass a local file:
  • File path (recommended for more stable transmission)
  • Base64 encoding
To pass a file path, provide it directly to the model. This method is supported only by the DashScope Python and Java SDKs, not by HTTP. Refer to the following table to specify the file path based on your programming language and operating system.

Specify the file path

| System         | SDK        | Input file path                      | Example                      |
|----------------|------------|--------------------------------------|------------------------------|
| Linux or macOS | Python SDK | file://<absolute_path_of_the_file>   | file:///home/images/test.mp3 |
| Linux or macOS | Java SDK   | file://<absolute_path_of_the_file>   | file:///home/images/test.mp3 |
| Windows        | Python SDK | file://<absolute_path_of_the_file>   | file://D:/images/test.mp3    |
| Windows        | Java SDK   | file:///<absolute_path_of_the_file>  | file:///D:/images/test.mp3   |
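On Linux or macOS, Python's standard pathlib can build the required URI from an absolute path. (On Windows, Path.as_uri() produces the file:/// form shown in the Java SDK row, so for the Windows Python SDK build the file:// string manually.)

```python
from pathlib import Path

# POSIX: as_uri() yields the file:///... form expected by the SDK.
audio_uri = Path("/home/images/test.mp3").as_uri()
print(audio_uri)  # file:///home/images/test.mp3
```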
Limits:
  • We recommend passing the file path directly for greater stability. You can also use Base64 encoding for files smaller than 1 MB.
  • When passing a file path directly, the audio file must be smaller than 10 MB.
  • When passing a file using Base64 encoding, the encoded string must be smaller than 10 MB. Base64 encoding increases the data size.
Code sample: pass by file path
Passing a file path is supported only by the DashScope Python and Java SDKs, not by HTTP.
import dashscope
import os

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
# The full path of the local file must be prefixed with file:// to ensure a valid path, for example: file:///home/images/test.mp3
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
  {
    "role": "user",
    # Pass the file path prefixed with file:// in the audio parameter.
    "content": [{"audio": audio_file_path}],
  }
]

response = dashscope.MultiModalConversation.call(
  # If you have not configured the environment variable, replace the following line with your API key: api_key="sk-xxx"
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  model="qwen3-omni-30b-a3b-captioner",
  messages=messages)

print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

API reference

For the input and output parameters of Qwen3-Omni-Captioner, see Chat completions API.

Error codes

If a call fails, see Error messages.

FAQ

How to compress an audio file to the required size?

  • Online tools: You can use online tools such as Compresss to compress audio files.
  • Code implementation: You can use the FFmpeg tool. For more information about its usage, see the official FFmpeg website.
# Basic conversion command (universal template)
# -i: Specifies the input file path. Example: input.mp3
# -b:a: Sets the audio bitrate.
#   Common values: 64k (low quality, for voice and low-bandwidth streaming),
#   128k (medium quality, for general audio and podcasts),
#   192k (high quality, for music and broadcasting).
#   A higher bitrate results in better audio quality and a larger file.
# -ar: Sets the audio sample rate, which is the number of samples per second.
#   Common values: 8000 Hz, 22050 Hz, 44100 Hz (standard sample rate).
#   A higher sample rate results in a larger file.
# -ac: Sets the number of audio channels. Common values: 1 (mono), 2 (stereo). Mono files are smaller.
# -y: Overwrites the output file if it exists (no value needed).
# output.mp3: Specifies the output file path.

ffmpeg -i input.mp3 -b:a 128k -ar 44100 -ac 1 output.mp3 -y
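Output size is driven almost entirely by bitrate: size in bytes ≈ bitrate (bits/s) × duration (s) / 8. A quick estimate helps pick a bitrate that keeps a long recording under the 10 MB local-file limit:

```python
def estimated_size_mb(bitrate_kbps: int, duration_seconds: float) -> float:
    """Approximate compressed audio size: bitrate x duration / 8 bits per byte."""
    return bitrate_kbps * 1000 * duration_seconds / 8 / 1_000_000

# A 40-minute clip (the maximum duration) at 128 kbps:
print(round(estimated_size_mb(128, 40 * 60), 1))  # 38.4 MB -- over the 10 MB limit
# The same clip at 32 kbps mono:
print(round(estimated_size_mb(32, 40 * 60), 1))   # 9.6 MB -- fits
```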

Limitations

The model has the following limits for audio files:
  1. Duration: Less than or equal to 40 minutes.
  2. Number of files: Only one audio file is supported per request.
  3. File formats: Supported formats include AMR, WAV (CodecID: GSM_MS), WAV (PCM), 3GP, 3GPP, AAC, and MP3.
  4. File input methods: Publicly accessible audio URL, Base64 encoding, or local file path.
  5. File size:
    • Public URL: No more than 1 GB.
    • File path: The audio file must be smaller than 10 MB.
    • Base64 encoding: The encoded Base64 string must be smaller than 10 MB. For more information, see Pass local file.
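The limits above can be bundled into a pre-flight check before uploading. The format list and thresholds come from this section; the helper itself is hypothetical:

```python
import os

SUPPORTED_EXTENSIONS = {".amr", ".wav", ".3gp", ".3gpp", ".aac", ".mp3"}
MAX_DURATION_SECONDS = 40 * 60
MAX_LOCAL_BYTES = 10 * 1024 * 1024   # file path, or the encoded Base64 string
MAX_URL_BYTES = 1 * 1024 ** 3        # publicly accessible URL

def check_audio(filename: str, size_bytes: int, duration_seconds: float,
                via_url: bool = False) -> None:
    """Raise ValueError if the file violates the documented limits."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or filename}")
    if duration_seconds > MAX_DURATION_SECONDS:
        raise ValueError("audio longer than 40 minutes")
    limit = MAX_URL_BYTES if via_url else MAX_LOCAL_BYTES
    if size_bytes > limit:
        raise ValueError("file exceeds the size limit for this input method")

check_audio("meeting.mp3", size_bytes=5_000_000, duration_seconds=1800)
print("ok")
```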

Alternative: Use Qwen-Omni

You can also use Qwen-Omni (qwen3-omni-flash) with a prompt for audio understanding. Unlike Qwen3-Omni-Captioner which generates descriptions without prompts, Qwen-Omni allows you to ask specific questions about the audio.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3-omni-flash",
  messages=[{
    "role": "user",
    "content": [
      {"type": "input_audio", "input_audio": {"data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav"}},
      {"type": "text", "text": "What is being said in this audio? Describe the speaker's emotion."}
    ]
  }],
  modalities=["text"],
  stream=True,
)

for chunk in completion:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")
For full Qwen-Omni capabilities including multimodal conversation with audio output, see Audio and video file understanding.