File transcription REST
User guide: For model details and selection tips, see Audio file recognition - Fun-ASR/Paraformer.
This service has two APIs: task submission and task query. Submit a task first, then poll the query API until it completes.
Use any HTTP library to submit tasks and poll for results. This Python sample demonstrates the workflow:
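The following is a minimal sketch of the submit-then-poll workflow using only the Python standard library. Several details are assumptions rather than facts stated on this page: the `Authorization: Bearer` header, the request layout with `file_urls` nested under `input`, and the response layout with `task_id` and `task_status` nested under `output`. The function names and the `opener` parameter (which lets you swap in a test double) are hypothetical.

```python
import json
import time
import urllib.request

SUBMIT_URL = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/asr/transcription"
QUERY_URL = "https://dashscope-intl.aliyuncs.com/api/v1/tasks/{task_id}"


def _call(request, opener):
    """Send a request and decode the JSON response body."""
    with opener(request) as response:
        return json.loads(response.read().decode("utf-8"))


def submit_task(api_key, model, file_urls, opener=urllib.request.urlopen):
    """Submit a transcription task and return its task ID."""
    # Assumed body layout: file_urls nested under "input".
    body = {"model": model, "input": {"file_urls": list(file_urls)}}
    request = urllib.request.Request(
        SUBMIT_URL,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
            "X-DashScope-Async": "enable",  # required by the submission API
        },
    )
    # Assumed response layout: task_id nested under "output".
    return _call(request, opener)["output"]["task_id"]


def wait_for_task(api_key, task_id, opener=urllib.request.urlopen,
                  interval=2.0, timeout=3600.0):
    """Poll the query API until the task is no longer PENDING or RUNNING."""
    deadline = time.monotonic() + timeout
    while True:
        request = urllib.request.Request(
            QUERY_URL.format(task_id=task_id),
            headers={"Authorization": f"Bearer {api_key}"},
        )
        output = _call(request, opener)["output"]
        if output["task_status"] not in ("PENDING", "RUNNING"):
            return output
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish in {timeout} s")
        time.sleep(interval)
```

A typical call would be `task_id = submit_task(api_key, "fun-asr", ["https://your-domain.com/file.mp3"])` followed by `output = wait_for_task(api_key, task_id)`. The polling interval and timeout are arbitrary defaults; tune them to your file lengths.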
Prerequisites
Sign in to Qwen Cloud and create an API key. To avoid security risks, export the API key as an environment variable instead of hard-coding it.
To grant temporary access or restrict sensitive operations, use a temporary token. Temporary tokens expire in 60 seconds, which reduces the risk of leakage. Replace the API key in your code with the temporary token.
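As an illustration of keeping the key out of source code, the sketch below reads it from an environment variable and builds the request headers. The variable name `DASHSCOPE_API_KEY`, the helper name `auth_headers`, and the `Bearer` scheme are assumptions, not values confirmed by this page.

```python
import os


def auth_headers(environ=os.environ):
    """Build request headers from an API key (or temporary token) in the environment."""
    api_key = environ.get("DASHSCOPE_API_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("Set DASHSCOPE_API_KEY before calling the API")
    return {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
    }
```

Because a temporary token replaces the API key one-for-one, the same helper would work with a token exported under the same variable.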
Model availability
| Model | Version | Unit price | Free quota (Note) |
|---|---|---|---|
| fun-asr (currently fun-asr-2025-11-07) | Stable | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-2025-11-07 (improved far-field VAD over fun-asr-2025-08-25 for higher accuracy) | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-2025-08-25 | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-mtl (currently fun-asr-mtl-2025-08-25) | Stable | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-mtl-2025-08-25 | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
- Supported languages:
  - fun-asr and fun-asr-2025-11-07: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from the Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong-Taiwan regions, including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. Also supports English and Japanese.
  - fun-asr-2025-08-25: Mandarin and English.
  - fun-asr-mtl and fun-asr-mtl-2025-08-25: Mandarin, Cantonese, English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Hindi, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish.
- Sample rates supported: Any
- Audio formats supported: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv
Limitations
This service does not accept local file uploads or Base64 audio. You must provide a publicly accessible file URL over HTTP or HTTPS, for example https://your-domain.com/file.mp3.
Specify the URL with the file_urls parameter. A single request supports up to 100 URLs.
- Audio formats: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv. Many audio format variants exist, so the API cannot guarantee that every file decodes correctly. Test your files to verify results.
- Audio sample rate: Any
- File size and duration: Max 2 GB, max 12 hours. For files exceeding these limits, pre-process them first. See Preprocess audio files with FFmpeg.
- Batch size: Up to 100 file URLs per request.
- Supported languages: Vary by model. See Model availability.
- Frontend calls: You cannot call the API from the frontend. Use a backend proxy.
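The URL-related limits above can be checked locally before submitting. The sketch below is a hypothetical pre-flight helper (`check_file_urls` is not part of the API); it only verifies the scheme and the batch size, and cannot tell whether a URL is actually publicly reachable or within the size and duration limits.

```python
from urllib.parse import urlparse

MAX_URLS_PER_REQUEST = 100  # batch limit stated in the Limitations list


def check_file_urls(file_urls):
    """Return a list of problems; an empty list means the local checks passed."""
    problems = []
    if not file_urls:
        problems.append("file_urls is empty")
    if len(file_urls) > MAX_URLS_PER_REQUEST:
        problems.append(
            f"{len(file_urls)} URLs exceed the limit of {MAX_URLS_PER_REQUEST}"
        )
    for url in file_urls:
        parsed = urlparse(url)
        # Local paths and Base64 data URIs are rejected: only HTTP/HTTPS is accepted.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"not an HTTP/HTTPS URL: {url}")
    return problems
```

For more than 100 files, split the list into chunks of at most 100 and submit one task per chunk.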
Task submission API
Basic information
| Item | Description |
|---|---|
| Description | Submits a speech recognition task. |
| URL | https://dashscope-intl.aliyuncs.com/api/v1/services/audio/asr/transcription |
| Request method | POST |
| Request headers | See below |
| Message body | See below |
The X-DashScope-Async: enable header is required.
Request parameters
| Parameter | Type | Default value | Required | Description |
|---|---|---|---|---|
| model | string | - | Yes | The model name. See Model availability. |
| file_urls | array[string] | - | Yes | A list of audio or video file URLs (HTTP/HTTPS). Up to 100 URLs per request. |
| vocabulary_id | string | - | No | The hotword ID. Applies the hotwords to this task. Disabled by default. See Customize hotwords. |
| channel_id | array[integer] | [0] | No | Audio track indexes to recognize in a multi-track file. Starts from 0. For example, [0] recognizes the first track, [0, 1] recognizes both. Defaults to the first track. |
| special_word_filter | string | - | No | Configures sensitive word handling. See Sensitive word filter details. |
| diarization_enabled | boolean | false | No | Enables speaker diarization. Single-channel audio only. When enabled, results include speaker_id to distinguish speakers. See Recognition results. |
| speaker_count | integer | - | No | A reference value for the number of speakers (2 to 100). Takes effect only when diarization_enabled is true. The algorithm tries to output this number of speakers but cannot guarantee it. Defaults to automatic detection. |
| language_hints | array[string] | ["zh", "en"] | No | Language codes for recognition. If unset, the model detects the language automatically. See Supported languages. |
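To show how the parameters in the table combine, here is a sketch that assembles a message body. The nesting is an assumption: it places `file_urls` under `input` and the optional fields under `parameters`, a layout this page does not spell out, so verify it against a real request sample. The helper name `build_request_body` is hypothetical.

```python
def build_request_body(model, file_urls, **parameters):
    """Assemble a task submission body, omitting optional fields left as None."""
    # Assumed layout: required file_urls under "input", options under "parameters".
    body = {"model": model, "input": {"file_urls": list(file_urls)}}
    options = {name: value for name, value in parameters.items() if value is not None}

    if "speaker_count" in options:
        # speaker_count takes effect only when diarization_enabled is true.
        if not options.get("diarization_enabled"):
            raise ValueError("speaker_count requires diarization_enabled=True")
        if not 2 <= options["speaker_count"] <= 100:
            raise ValueError("speaker_count must be between 2 and 100")

    if options:
        body["parameters"] = options
    return body
```

For example, `build_request_body("fun-asr", urls, diarization_enabled=True, speaker_count=2)` yields a body with both diarization options set, while calling it with no keyword arguments omits `parameters` entirely.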
Each audio track in channel_id is billed separately. For example, [0, 1] on one file incurs two charges.
Sensitive word filter details
If special_word_filter is not set, the built-in filter replaces matched words with asterisks (*) of equal length.
If set, you can use these policies:
- Replace with *: Replaces matched words with asterisks of the same length.
- Filter out: Removes matched words from the result.
Field descriptions:
- filter_with_signed
  - Type: object. Required: No.
  - Matched words are replaced with asterisks of the same length.
  - Example: "Help me test this piece of code" becomes "Help me **** this piece of code".
  - Internal field: word_list, a string array of words to replace.
- filter_with_empty
  - Type: object. Required: No.
  - Matched words are removed from the result.
  - Example: "Is the game about to start?" becomes "Is the game about to ?".
  - Internal field: word_list, a string array of words to remove.
- system_reserved_filter
  - Type: Boolean. Required: No. Default: true.
  - Enables the system's preset sensitive word rules. When true, words matching the Qwen Cloud sensitive word list are replaced with asterisks of the same length.
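The sketch below builds a value for special_word_filter from the three fields described above. Because the parameter is typed as a string while its fields are objects, the sketch assumes the value is a JSON-encoded string; that encoding is an inference, not something this page states. The helper name and its arguments are hypothetical.

```python
import json


def build_special_word_filter(replace_words=(), remove_words=(), use_system_filter=True):
    """Return a special_word_filter value as a JSON string (assumed encoding)."""
    config = {"system_reserved_filter": use_system_filter}
    if replace_words:
        # Matched words become asterisks of the same length.
        config["filter_with_signed"] = {"word_list": list(replace_words)}
    if remove_words:
        # Matched words are dropped from the result.
        config["filter_with_empty"] = {"word_list": list(remove_words)}
    return json.dumps(config, ensure_ascii=False)
```

With `replace_words=["test"]`, the first example above ("Help me test this piece of code") would be expected to come back with "test" masked as four asterisks.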
Supported languages
Supported language codes by model:
- fun-asr, fun-asr-2025-11-07: zh (Chinese), en (English), ja (Japanese)
- fun-asr-2025-08-25: zh (Chinese), en (English)
- fun-asr-mtl, fun-asr-mtl-2025-08-25: zh (Chinese), en (English), ja (Japanese), ko (Korean), vi (Vietnamese), id (Indonesian), th (Thai), ms (Malay), tl (Filipino), ar (Arabic), hi (Hindi), bg (Bulgarian), hr (Croatian), cs (Czech), da (Danish), nl (Dutch), et (Estonian), fi (Finnish), el (Greek), hu (Hungarian), ga (Irish), lv (Latvian), lt (Lithuanian), mt (Maltese), pl (Polish), pt (Portuguese), ro (Romanian), sk (Slovak), sl (Slovenian), sv (Swedish)
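A client can catch unsupported language_hints before submitting. The lookup table below simply transcribes the code lists above; the helper name `unsupported_hints` is hypothetical, and the table would need updating whenever the service adds models or languages.

```python
_FUN_ASR = {"zh", "en", "ja"}
_FUN_ASR_MTL = {
    "zh", "en", "ja", "ko", "vi", "id", "th", "ms", "tl", "ar",
    "hi", "bg", "hr", "cs", "da", "nl", "et", "fi", "el", "hu",
    "ga", "lv", "lt", "mt", "pl", "pt", "ro", "sk", "sl", "sv",
}

SUPPORTED_LANGUAGE_CODES = {
    "fun-asr": _FUN_ASR,
    "fun-asr-2025-11-07": _FUN_ASR,
    "fun-asr-2025-08-25": {"zh", "en"},
    "fun-asr-mtl": _FUN_ASR_MTL,
    "fun-asr-mtl-2025-08-25": _FUN_ASR_MTL,
}


def unsupported_hints(model, language_hints):
    """Return the hints that the given model does not list as supported."""
    supported = SUPPORTED_LANGUAGE_CODES[model]  # KeyError for unknown models
    return [code for code in language_hints if code not in supported]
```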
Response parameters
| Parameter | Type | Description |
|---|---|---|
| task_status | string | Task status: PENDING, RUNNING, SUCCEEDED, or FAILED. |
| task_id | string | The task ID. Use it with the task query API to check results. |
| request_id | string | The request ID. |
Task query API
Basic information
| Item | Description |
|---|---|
| Description | Queries the status and results of a speech recognition task. |
| URL | https://dashscope-intl.aliyuncs.com/api/v1/tasks/{task_id} |
| Request method | GET |
| Request headers | See below |
| Message body | None |
Request parameters
| Parameter | Type | Default value | Required | Description |
|---|---|---|---|---|
| task_id | string | - | Yes | The task ID returned by the task submission API. |
Response parameters
For tasks with multiple subtasks, the overall task_status is SUCCEEDED if any subtask succeeds. Check subtask_status for each file's individual result.
The code field contains the error code, and the message field contains the error message. These fields appear only on errors.
| Parameter | Type | Description |
|---|---|---|
| task_id | string | The task ID. |
| task_status | string | The task status. |
| subtask_status | string | The subtask status. |
| file_url | string | The URL of the processed file. |
| transcription_url | string | The link to the recognition result. Valid for 24 hours. After expiry, you cannot query the task or download the result. The result is a JSON file you can download or read via HTTP. See Recognition results. |
| submit_time | string | The time the task was submitted. |
| scheduled_time | string | The time the task was scheduled. |
| end_time | string | The time the task ended. |
| task_metrics | object | Task metrics: TOTAL, SUCCEEDED, and FAILED counts. |
| usage | object | Usage information. duration is the total duration in seconds. |
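Because the overall status can be SUCCEEDED while individual files fail, it helps to separate the subtasks explicitly. The sketch below assumes the per-file entries arrive as a list under `output.results`; that container name is an assumption, while the per-entry fields (`file_url`, `subtask_status`, `transcription_url`, `code`, `message`) come from the table above. The helper name is hypothetical.

```python
def split_results(query_response):
    """Separate subtasks into downloadable results and failures."""
    output = query_response["output"]  # assumed top-level wrapper
    succeeded, failed = [], []
    for item in output.get("results", []):  # "results" is an assumed field name
        if item.get("subtask_status") == "SUCCEEDED":
            succeeded.append((item["file_url"], item["transcription_url"]))
        else:
            # code and message appear only on errors, so read them defensively.
            failed.append((item["file_url"], item.get("code"), item.get("message")))
    return succeeded, failed
```

Download each transcription_url promptly, since the link is valid for 24 hours.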
Description of recognition results
The recognition result is a JSON file.
The speaker_id field appears only when speaker diarization is enabled.
| Parameter | Type | Description |
|---|---|---|
| audio_format | string | The audio format of the source file. |
| channels | array[integer] | The audio track indexes. Returns [0] for single-track, [0, 1] for dual-track, etc. |
| original_sampling_rate | integer | The sample rate (Hz). |
| original_duration_in_milliseconds | integer | The original audio duration (ms). |
| channel_id | integer | The transcribed track index, starting from 0. |
| content_duration_in_milliseconds | integer | The duration of speech content in the track (ms). |
| text | string | The transcription text (paragraph-level or word-level, depending on context). |
| sentences | array | Sentence-level transcription results. |
| words | array | Word-level transcription results. |
| begin_time | integer | The start timestamp (ms). |
| end_time | integer | The end timestamp (ms). |
| speaker_id | integer | The speaker index, starting from 0. Appears only when diarization is enabled. |
| punctuation | string | The predicted punctuation after the word, if any. |
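As a sketch of how the downloaded result might be consumed, the code below turns sentence-level entries into timestamped lines. The nesting is assumed: a top-level `transcripts` list with one entry per track, each holding `channel_id` and `sentences`. Only the field names inside each entry are taken from the table above; both function names are hypothetical.

```python
def format_ms(ms):
    """Format a millisecond timestamp as MM:SS.mmm."""
    minutes, rest = divmod(ms, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def sentence_lines(result):
    """Return one printable line per sentence in a parsed recognition result."""
    lines = []
    for transcript in result.get("transcripts", []):  # assumed container key
        for sentence in transcript.get("sentences", []):
            # speaker_id is present only when diarization was enabled.
            speaker = sentence.get("speaker_id")
            prefix = f"speaker {speaker}: " if speaker is not None else ""
            span = f"{format_ms(sentence['begin_time'])} - {format_ms(sentence['end_time'])}"
            lines.append(f"[{span}] ch{transcript['channel_id']} {prefix}{sentence['text']}")
    return lines
```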
Billing is based on speech segments only, not total file duration. Non-speech segments are not billed. Because speech detection uses an AI model, billed duration may differ slightly from expected content.
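For a rough cost estimate, multiply the speech duration of each transcribed track by the unit price from the Model availability table. The sketch below assumes that `content_duration_in_milliseconds` approximates the billed speech duration and ignores any rounding the service applies, so treat the output as an estimate only. The helper name is hypothetical.

```python
from decimal import Decimal

UNIT_PRICE_PER_SECOND = Decimal("0.000035")  # USD, from the Model availability table


def estimate_cost(track_speech_ms, free_seconds_left=0):
    """Estimate the charge in USD for the given per-track speech durations (ms)."""
    # Each transcribed track is billed separately, so sum across all tracks.
    billable_seconds = sum(Decimal(ms) / 1000 for ms in track_speech_ms)
    # Subtract any remaining free quota, never going below zero.
    billable_seconds = max(billable_seconds - Decimal(free_seconds_left), Decimal(0))
    return billable_seconds * UNIT_PRICE_PER_SECOND
```

For instance, two tracks with 30 minutes of speech each total 3,600 billable seconds, which comes to about $0.126 at the listed price.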