Describe complex audio
Qwen3-Omni-Captioner is an open-source model built on Qwen3-Omni. It generates descriptions for complex audio, including speech, ambient sounds, music, and sound effects, without requiring prompts. The model can identify speaker emotions, musical elements like style and instruments, and sensitive information.
Prerequisites
To enable streaming, add
The model provides two methods to upload a local file:
Limits:
For the input and output parameters of Qwen3-Omni-Captioner, see Chat completions API.
If a call fails, see Error messages.
The model has the following limits for audio files:
You can also use Qwen-Omni (
Availability
| Model | Context window | Max input | Max output | Input cost | Output cost | Free quota (Note) |
|---|---|---|---|---|---|---|
| qwen3-omni-30b-a3b-captioner | 65,536 | 32,768 | 32,768 | $3.81 | $3.06 | 1 million tokens. Valid for 90 days after activating Qwen Cloud |
Token conversion rule for audio:
Total tokens = Audio duration (in seconds) × 12.5. If the audio duration is less than one second, it is counted as one second.Getting started
Prerequisites
- Get an API key and export it as an environment variable.
- If you use an SDK to make calls, install the latest version of the SDK.
- OpenAI compatible
- DashScope
Full JSON response
Full JSON response
How it works
- Single-turn interaction: The model does not support multi-turn conversation. Each request is an independent analysis task.
- Fixed task: The model's core task is to generate audio descriptions in English only. You cannot use instructions, such as a system message, to change its behavior, such as controlling the output format or content focus.
- Audio input only: The model accepts only audio as input. You do not need to pass text prompts. The format of the
messageparameter is fixed.
Example message format
Example message format
OpenAI compatible:DashScope:
Streaming output
For general streaming concepts (SSE protocol, how to enable streaming, billing, and token usage), see Streaming output. This section covers only the streaming behavior specific to audio understanding.
stream: true to your call. The streaming behavior is identical to standard text streaming — only the input message format (audio instead of text) differs. Use the same message format shown in Getting started and add the streaming parameters:
Pass local file (Base64 encoding or file path)
The model provides two methods to upload a local file:
- Upload using Base64 encoding
- Direct file path (Recommended for more stable transmission)
- Pass by file path
- Pass by Base64 encoding
Pass the file path directly to the model. This method is supported only by the DashScope Python and Java SDKs, not by HTTP. Refer to the following table to specify the file path based on your programming language and operating system.
Specify the file path
Specify the file path
| System | SDK | Input file path | Example |
|---|---|---|---|
| Linux or macOS | Python SDK | file://<absolute_path_of_the_file> | file:///home/images/test.mp3 |
| Linux or macOS | Java SDK | file://<absolute_path_of_the_file> | file:///home/images/test.mp3 |
| Windows | Python SDK | file://<absolute_path_of_the_file> | file://D:/images/test.mp3 |
| Windows | Java SDK | file:///<absolute_path_of_the_file> | file:///D:/images/test.mp3 |
- We recommend passing the file path directly for greater stability. You can also use Base64 encoding for files smaller than 1 MB.
- When passing a file path directly, the audio file must be smaller than 10 MB.
- When passing a file using Base64 encoding, the encoded string must be smaller than 10 MB. Base64 encoding increases the data size.
- Pass by file path
- Pass by Base64 encoding
Passing a file path is supported only by the DashScope Python and Java SDKs, not by HTTP.
API reference
For the input and output parameters of Qwen3-Omni-Captioner, see Chat completions API.
Error codes
If a call fails, see Error messages.
FAQ
How to compress an audio file to the required size?
How to compress an audio file to the required size?
- Online tools: You can use online tools such as Compresss to compress audio files.
- Code implementation: You can use the FFmpeg tool. For more information about its usage, see the official FFmpeg website.
Limitations
The model has the following limits for audio files:
- Duration: Less than or equal to 40 minutes.
- Number of files: Only one audio file is supported per request.
- File formats: Supported formats include AMR, WAV (CodecID: GSM_MS), WAV (PCM), 3GP, 3GPP, AAC, and MP3.
- File input methods: Publicly accessible audio URL, Base64 encoding, or local file path.
- File size:
- Public URL: No more than 1 GB.
- File path: The audio file must be smaller than 10 MB.
- Base64 encoding: The encoded Base64 string must be smaller than 10 MB. For more information, see Pass local file.
To compress a file, see How to compress an audio file to the required size?
Alternative: Use Qwen-Omni
You can also use Qwen-Omni (qwen3-omni-flash) with a prompt for audio understanding. Unlike Qwen3-Omni-Captioner which generates descriptions without prompts, Qwen-Omni allows you to ask specific questions about the audio.
For full Qwen-Omni capabilities including multimodal conversation with audio output, see Audio and video file understanding.