Convert files to text
The Fun-ASR audio file recognition models convert recorded audio into text. They support single-file and batch transcription, ideal for use cases that do not require real-time results, such as meeting transcription, post-call analytics, and caption generation.
Qwen Cloud also offers Qwen-ASR for recognition with enhanced semantic understanding and Qwen-Omni for prompt-based transcription with contextual understanding.
Supported model: Only
To produce the corrected result, include any of the following in the context:
For model availability, supported languages, and feature comparison, see Speech-to-text models.
Core features
- Multilingual recognition: Recognizes Chinese (including multiple dialects), English, Japanese, Korean, German, French, Russian, and 30+ other languages.
- Format compatibility: Accepts any sample rate and supports major audio and video formats, including AAC, WAV, and MP3.
- Long audio file processing: Handles asynchronous transcription for a single audio file up to 12 hours long and 2 GB in size. If speaker diarization is enabled, audio longer than 2 hours is not recommended.
- Singing voice recognition: Transcribes entire songs, even with background music (BGM). Only the fun-asr and fun-asr-2025-11-07 models support this feature.
- Recognition features: Configurable features include speaker diarization, sensitive word filtering, sentence-level and word-level timestamps, and hotword enhancement.
Supported models
- Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25, fun-asr is recommended), fun-asr-mtl-2025-08-25 (snapshot)
- Fun-ASR-Flash: fun-asr-flash-2026-06-15 (snapshot). Supports synchronous calls (up to 5 minutes) and context enhancement for improved accuracy on proper nouns.
Getting started
- Fun-ASR
- Qwen-ASR
- Qwen-Omni
Model availability
| Model | Version | Unit price | Free quota (Note) |
|---|---|---|---|
| fun-asr Currently, fun-asr-2025-11-07 | Stable | $0.000035/second | 36,000 seconds (10 hours) Valid for 90 days |
| fun-asr-2025-11-07 Improved far-field VAD over fun-asr-2025-08-25 for higher accuracy | Snapshot | $0.000035/second | 36,000 seconds (10 hours) Valid for 90 days |
| fun-asr-2025-08-25 | Snapshot | $0.000035/second | 36,000 seconds (10 hours) Valid for 90 days |
| fun-asr-mtl Currently, fun-asr-mtl-2025-08-25 | Stable | $0.000035/second | 36,000 seconds (10 hours) Valid for 90 days |
| fun-asr-mtl-2025-08-25 | Snapshot | $0.000035/second | 36,000 seconds (10 hours) Valid for 90 days |
| fun-asr-flash-2026-06-15 Supports synchronous calls (up to 5 minutes) and context enhancement | Snapshot | $0.000035/second | 36,000 seconds (10 hours) Valid for 90 days |
-
Supported languages:
- fun-asr, fun-asr-2025-11-07, fun-asr-mtl, and fun-asr-mtl-2025-08-25: 30 languages
- fun-asr-2025-08-25: Mandarin and English.
- Sample rates supported: Any
- Audio formats supported: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv
Make your first call
Get an API key and set it as an environment variable. To use the SDK, install it.Because audio and video files are often large, file transfer and speech recognition can take a long time. The file recognition API uses asynchronous invocation to submit tasks. After the file recognition is complete, you must use the query API to retrieve the speech recognition results.Async submit and sync wait
Submit a task and block until done.The complete recognition result is printed to the console in JSON format. The result includes the transcribed text and the start and end times of the text in the audio or video file, specified in milliseconds.
The complete recognition result is printed to the console in JSON format. The result includes the transcribed text and the start and end times of the text in the audio or video file, specified in milliseconds.
First resultSecond result
Async submit and async query
Submit a task and poll for results instead of blocking.RESTful API
Use any HTTP library to submit tasks and poll for results. This Python sample demonstrates the workflow:Synchronous calls (fun-asr-flash-2026-06-15)
fun-asr-flash-2026-06-15 supports synchronous calls for audio files up to 5 minutes long. Results can be returned in streaming or non-streaming mode.Context enhancement
Supported model: Only fun-asr-flash-2026-06-15 supports context enhancement.
Use case: Designed for scenarios that combine ASR with a large language model. Passing previous conversation context (LLM replies and earlier recognition results) into the ASR model significantly improves transcription accuracy for proper nouns such as names, locations, and product terms — more flexible than traditional hotwords.
Usage: Pass the conversation history through input.messages. Use the assistant role for the LLM's previous replies and the user role with input_text type for earlier recognition results. Context pairs must appear before the current audio message.
Supported text types include (but are not limited to):
- Hotword lists in various delimiter formats (for example: hotword1, hotword2, hotword3, hotword4)
- Free-form paragraphs or passages of any length
- Mixed content: any combination of word lists and paragraphs
- Irrelevant or meaningless text, including gibberish. The model tolerates irrelevant content well, and recognition quality rarely degrades because of it.
| Without context enhancement | With context enhancement |
|---|---|
| Without context enhancement, some investment bank names are misrecognized. For example, "Bird Rock" should be "Bulge Bracket". Result: "What insider jargon do you know in investment banking? First, the nine top foreign investment banks, Bird Rock, BB..." | With context enhancement, the investment bank names are recognized correctly. Result: "What insider jargon do you know in investment banking? First, the nine top foreign investment banks, Bulge Bracket, BB..." |
- A word list:
- List 1:
- List 2:
- List 3:
- Natural language:
- Natural language with distracting content: some text is unrelated to the audio, such as the names in the example below.
Compare models
| Feature | Fun-ASR |
|---|---|
| Supported languages | Varies by model: fun-asr, fun-asr-2025-11-07, fun-asr-mtl, fun-asr-mtl-2025-08-25: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, Jin; also supports accents from Central Plains, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong, and Taiwan, including official dialects from regions such as Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hindi, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish. fun-asr-2025-08-25: Chinese (Mandarin), English |
| Supported audio formats | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv |
| Sample rate | Any |
| Sound channels | Any |
| Input format | Publicly accessible URLs of files to be recognized. Up to 100 audio files are supported. |
| Audio size/duration | Each audio file must be no larger than 2 GB and no longer than 12 hours. |
| Emotion recognition | Not supported |
| Timestamp | Supported (always on) |
| Punctuation prediction | Supported (always on) |
| Hotwords | Supported. The hotword feature is supported only in the primary workspace and is not available in sub-workspaces. |
| ITN | Supported (always on) |
| Singing voice recognition | Supported (fun-asr and fun-asr-2025-11-07 only) |
| Noise rejection | Supported (always on) |
| Sensitive word filtering | Supported (filters content from the Qwen Cloud sensitive word list by default) |
| Speaker diarization | Supported (off by default, can be enabled) |
| Filler word filtering | Not supported |
| VAD | Supported (always on) |
| Rate limiting (RPS) | Job submission API: 10, Task query API: 20 |
| Connection types | DashScope: Java/Python SDK, RESTful API |
| Pricing | International: $0.000035/second |
API reference
- Fun-ASR
- Qwen-ASR
- Qwen-Omni
FAQ
- Fun-ASR
- Qwen-ASR
- Qwen-Omni
How can I improve recognition accuracy?
Several factors affect accuracy. Review each and apply the corresponding optimization.Key factors:- Sound quality: Recording device quality, sample rate, and ambient noise directly affect clarity. High-quality audio input is essential.
- Speaker characteristics: Variations in pitch, speech rate, accent, and dialect increase recognition difficulty, especially for rare dialects or heavy accents.
- Language and vocabulary: Mixed languages, technical terms, or slang increase recognition difficulty. Configure hotwords to improve accuracy for domain-specific terms.
- Contextual understanding: Insufficient context can cause semantic ambiguity, especially in situations where surrounding context is needed for correct recognition.
- Optimize audio quality: Use high-performance microphones at the recommended sample rate. Minimize ambient noise and echo.
- Adapt to the speaker: For audio with strong accents or dialects, select a model that supports those specific dialects.
- Configure hotwords: Set hotwords for technical terms, proper nouns, and other specific words. For more information, see Customize hotwords.
- Preserve context: Avoid splitting audio into excessively short clips.