File transcription Python
User guide: For model details and recommendations, see Audio file recognition - Fun-ASR/Paraformer.
Prerequisites
- Sign in to Qwen Cloud and create an API key. Set the API key as an environment variable.
For temporary access to third-party apps, use a temporary token. Tokens expire in 60 seconds, limiting leakage risk.
Model availability
| Model | Version | Unit price | Free quota |
|---|---|---|---|
| fun-asr (currently fun-asr-2025-11-07) | Stable | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-2025-11-07 (improved far-field VAD over fun-asr-2025-08-25 for higher accuracy) | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-2025-08-25 | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-mtl (currently fun-asr-mtl-2025-08-25) | Stable | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-mtl-2025-08-25 | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
- Supported languages:
  - fun-asr, fun-asr-2025-11-07, fun-asr-mtl, and fun-asr-mtl-2025-08-25: Chinese, English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hindi, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, and Swedish. Chinese covers Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin, plus Mandarin accents from the Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong-Taiwan regions (including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia).
  - fun-asr-2025-08-25: Mandarin and English.
- Sample rates supported: Any
- Audio formats supported: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv
Limitations
Files must be at public URLs (HTTP/HTTPS, such as https://your-domain.com/file.mp3). Local files and Base64 encoding are not supported.
Pass URLs with the file_urls parameter. Up to 100 URLs per request.
- Audio formats: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv. Not all format variants are tested, so test your own files to verify results.
- Audio sample rate: Any
- File size and duration: Max 2 GB and 12 hours. For larger files, see Audio trimming.
- Batch processing: Up to 100 URLs per request.
- Languages: fun-asr, fun-asr-2025-11-07, fun-asr-mtl, and fun-asr-mtl-2025-08-25 support Chinese and 29 other languages. fun-asr-2025-08-25 supports Mandarin Chinese and English only. See Supported languages.
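
The URL limits above can be checked on the client before a task is submitted. The following is a minimal sketch, not part of the SDK: check_file_urls is a hypothetical helper name, and the file-extension test is only a heuristic, since a URL path does not have to end in the file's real format. File size and duration cannot be checked from the URL alone.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats and batch limit copied from the Limitations list above.
SUPPORTED_FORMATS = {
    "aac", "amr", "avi", "flac", "flv", "m4a", "mkv", "mov", "mp3",
    "mp4", "mpeg", "ogg", "opus", "wav", "webm", "wma", "wmv",
}
MAX_URLS_PER_REQUEST = 100


def check_file_urls(urls):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    if not urls:
        problems.append("file_urls is empty")
    if len(urls) > MAX_URLS_PER_REQUEST:
        problems.append(
            f"{len(urls)} URLs given; the limit is {MAX_URLS_PER_REQUEST} per request"
        )
    for url in urls:
        parsed = urlparse(url)
        # Local paths and Base64 payloads have no http/https scheme.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{url}: not an HTTP/HTTPS URL")
            continue
        # Heuristic only: URLs without an extension are let through.
        ext = PurePosixPath(parsed.path).suffix.lstrip(".").lower()
        if ext and ext not in SUPPORTED_FORMATS:
            problems.append(f"{url}: extension '{ext}' is not a supported format")
    return problems
```

An empty return value only means that none of these checks failed; the service remains the authority on whether a file is accepted.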
Request parameters
Pass these parameters to async_call on the Transcription class.
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| model | str | - | Yes | Model ID. See Model availability. |
| file_urls | list[str] | - | Yes | Audio/video file URLs (HTTP/HTTPS). Up to 100 per request. |
| vocabulary_id | str | - | No | Hotword vocabulary ID for this task. Disabled by default. See Customize hotwords. |
| channel_id | list[int] | [0] | No | Audio track indexes to recognize (0-based). [0] = first track, [0, 1] = first and second. Each track is billed separately. |
| special_word_filter | str | - | No | Sensitive word filter config. See Sensitive word filter. |
| diarization_enabled | bool | False | No | Enable speaker diarization (single-channel only). Results include speaker_id. See Recognition result. |
| speaker_count | int | - | No | Expected speaker count (2-100). Applies only when diarization_enabled is True. Auto-detected when unset. Guides the algorithm but does not guarantee an exact count. |
| language_hints | list[str] | ["zh", "en"] | No | Language codes. Leave unset for auto-detection. See Supported languages. |
| speech_noise_threshold | float | - | No | Speech noise threshold. |
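
As an illustration of how the parameters in this table fit together, here is a minimal sketch that assembles keyword arguments and enforces the documented constraints locally. build_transcription_kwargs is a hypothetical helper, not an SDK function, and it covers only some of the parameters.

```python
from typing import List, Optional, Sequence


def build_transcription_kwargs(
    model: str,
    file_urls: Sequence[str],
    *,
    diarization_enabled: bool = False,
    speaker_count: Optional[int] = None,
    channel_id: Optional[List[int]] = None,
    language_hints: Optional[List[str]] = None,
    vocabulary_id: Optional[str] = None,
) -> dict:
    """Assemble keyword arguments for async_call, checking documented limits first."""
    if not 1 <= len(file_urls) <= 100:
        raise ValueError("file_urls must contain between 1 and 100 URLs")
    if speaker_count is not None:
        if not diarization_enabled:
            raise ValueError("speaker_count only applies when diarization_enabled is True")
        if not 2 <= speaker_count <= 100:
            raise ValueError("speaker_count must be between 2 and 100")

    kwargs = {"model": model, "file_urls": list(file_urls)}
    if diarization_enabled:
        kwargs["diarization_enabled"] = True
    # Optional parameters are passed only when set, so service defaults apply otherwise.
    optional = {
        "speaker_count": speaker_count,
        "channel_id": channel_id,
        "language_hints": language_hints,
        "vocabulary_id": vocabulary_id,
    }
    kwargs.update({name: value for name, value in optional.items() if value is not None})
    return kwargs
```

The resulting dict could then be unpacked into the call, for example Transcription.async_call(**kwargs).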
Sensitive word filter
By default, words on the Qwen Cloud sensitive word list are replaced with asterisks (*).
With special_word_filter, you can:
- Replace with *: Matched words become asterisks.
- Filter out: Matched words are removed.

Fields:
- filter_with_signed (object, optional): Words to replace with *.
  - word_list: Words to replace.
  - Example: "Help me test this code" becomes "Help me **** this code".
- filter_with_empty (object, optional): Words to remove.
  - word_list: Words to remove.
  - Example: "Is the game about to start?" becomes "Is the game about to?".
- system_reserved_filter (boolean, optional, default: true): Enable system filtering. When true, words on the Qwen Cloud sensitive word list are replaced with *.
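
Because special_word_filter is typed as str in the request parameters, one plausible way to supply it is as a JSON-encoded object built from the fields above. The sketch below works under that assumption; build_special_word_filter is a hypothetical helper name, and the exact string format the service expects should be confirmed against the user guide.

```python
import json


def build_special_word_filter(replace_words=(), remove_words=(), system_reserved_filter=True):
    """Serialize the filter fields described above into a JSON string.

    Assumption: the service accepts the configuration as JSON text.
    """
    config = {"system_reserved_filter": system_reserved_filter}
    if replace_words:
        # Words in this list are replaced with asterisks.
        config["filter_with_signed"] = {"word_list": list(replace_words)}
    if remove_words:
        # Words in this list are dropped from the transcript.
        config["filter_with_empty"] = {"word_list": list(remove_words)}
    # ensure_ascii=False keeps non-ASCII words (for example Chinese) readable.
    return json.dumps(config, ensure_ascii=False)


# Mirrors the two examples above: mask "test", remove "start".
filter_config = build_special_word_filter(replace_words=["test"], remove_words=["start"])
print(filter_config)
```

The string would then be passed as special_word_filter=filter_config when submitting the task.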
Supported languages
Language codes by model:
- fun-asr, fun-asr-2025-11-07, fun-asr-mtl, fun-asr-mtl-2025-08-25: zh: Chinese, en: English, ja: Japanese, ko: Korean, vi: Vietnamese, id: Indonesian, th: Thai, ms: Malay, tl: Filipino, ar: Arabic, bg: Bulgarian, hr: Croatian, cs: Czech, da: Danish, nl: Dutch, et: Estonian, fi: Finnish, el: Greek, hi: Hindi, hu: Hungarian, ga: Irish, lv: Latvian, lt: Lithuanian, mt: Maltese, pl: Polish, pt: Portuguese, ro: Romanian, sk: Slovak, sl: Slovenian, sv: Swedish
- fun-asr-2025-08-25: zh: Chinese, en: English
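
Since the accepted codes differ by model, a client may want to check language_hints before sending them. The following is a small sketch built from the lists above; unsupported_language_hints is a hypothetical helper, and the mapping would need updating whenever the model list changes.

```python
# Codes accepted by the multilingual models, copied from the list above.
MULTILINGUAL_CODES = frozenset(
    "zh en ja ko vi id th ms tl ar bg hr cs da nl et fi el hi hu "
    "ga lv lt mt pl pt ro sk sl sv".split()
)

LANGUAGE_CODES_BY_MODEL = {
    "fun-asr": MULTILINGUAL_CODES,
    "fun-asr-2025-11-07": MULTILINGUAL_CODES,
    "fun-asr-mtl": MULTILINGUAL_CODES,
    "fun-asr-mtl-2025-08-25": MULTILINGUAL_CODES,
    "fun-asr-2025-08-25": frozenset({"zh", "en"}),
}


def unsupported_language_hints(model, hints):
    """Return the hints that the given model does not list as supported."""
    supported = LANGUAGE_CODES_BY_MODEL[model]  # KeyError for an unknown model ID
    return [code for code in hints if code not in supported]
```

For example, unsupported_language_hints("fun-asr-2025-08-25", ["zh", "ja"]) reports "ja", while the same hints pass for fun-asr.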
Response results
TranscriptionResponse
TranscriptionResponse contains task info (task_id, task_status) and results in output. See TranscriptionOutput.
| Parameter | Description |
|---|---|
| status_code | HTTP status code. |
| code | Ignore top-level code. Check output.results[].code for errors. |
| message | Ignore top-level message. Check output.results[].message for errors. |
| task_id | Task ID. |
| task_status | Task status: PENDING, RUNNING, SUCCEEDED, FAILED. If any subtask succeeds, the task is SUCCEEDED. Check subtask_status for individual results. |
| results | Subtask results. |
| subtask_status | Subtask status: PENDING, RUNNING, SUCCEEDED, FAILED. |
| file_url | Audio file URL. |
| transcription_url | Result URL (JSON file). Download or read via HTTP. See Recognition result. |
TranscriptionOutput
TranscriptionOutput is the output property of TranscriptionResponse.
| Parameter | Description |
|---|---|
| code | Error code. |
| message | Error message. |
| task_id | Task ID. |
| task_status | Task status: PENDING, RUNNING, SUCCEEDED, FAILED. If any subtask succeeds, the task is SUCCEEDED. Check subtask_status for individual results. |
| results | Subtask results. |
| subtask_status | Subtask status: PENDING, RUNNING, SUCCEEDED, FAILED. |
| file_url | Audio file URL. |
| transcription_url | Result URL (JSON file). Download or read via HTTP. See Recognition result. |
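
Because a task is reported as SUCCEEDED as soon as any subtask succeeds, each entry in results has to be inspected individually. The sketch below assumes each entry behaves like a dict with the fields listed in the table; split_results and load_transcription are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def split_results(results):
    """Separate finished subtasks into successes and failures.

    Returns (succeeded, failed): succeeded holds (file_url, transcription_url)
    pairs, failed holds (file_url, code, message) triples.
    """
    succeeded, failed = [], []
    for item in results:
        status = item.get("subtask_status")
        if status == "SUCCEEDED":
            succeeded.append((item["file_url"], item["transcription_url"]))
        elif status == "FAILED":
            # Per-file errors live on the subtask, not on the top-level response.
            failed.append((item.get("file_url"), item.get("code"), item.get("message")))
    return succeeded, failed


def load_transcription(transcription_url, timeout_s=30.0):
    """Download and parse the JSON result file for one succeeded subtask."""
    with urlopen(transcription_url, timeout=timeout_s) as response:
        return json.load(response)
```

A typical use would be to call split_results(response.output["results"]) once the task has finished, then load_transcription on each succeeded URL.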
Recognition result
Results are JSON files.
speaker_id appears only when speaker diarization is enabled.

| Parameter | Type | Description |
|---|---|---|
| audio_format | string | Audio format. |
| channels | array[integer] | Track indexes. [0] = single-track, [0, 1] = dual-track. |
| original_sampling_rate | integer | Sample rate (Hz). |
| original_duration_in_milliseconds | integer | Audio duration (ms). |
| channel_id | integer | Track index (0-based). |
| content_duration_in_milliseconds | integer | Speech duration (ms). Only speech is transcribed and billed. Non-speech is excluded. Speech duration is usually shorter than audio duration. |
| transcript | string | Paragraph-level text. |
| sentences | array | Sentence-level results. |
| words | array | Word-level results. |
| begin_time | integer | Start time (ms). |
| end_time | integer | End time (ms). |
| text | string | Transcription text. |
| speaker_id | integer | Speaker index (0-based). Only present when diarization is enabled. |
| punctuation | string | Predicted punctuation after the word. |
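
To show how these fields might be consumed, here is a minimal sketch that turns a parsed result into a readable timeline. The nesting is an assumption, because the sample result is not reproduced here: a top-level transcripts array with one entry per track, each holding a sentences array. The key name transcripts, the helper format_sentences, and the sample data are all hypothetical.

```python
def format_sentences(result):
    """Return one line per sentence with timing and a speaker or channel label."""
    lines = []
    for track in result.get("transcripts", []):  # assumed container key
        for sentence in track.get("sentences", []):
            # speaker_id is present only when diarization is enabled.
            speaker = sentence.get("speaker_id")
            if speaker is not None:
                label = f"speaker {speaker}"
            else:
                label = f"channel {track.get('channel_id')}"
            start_s = sentence["begin_time"] / 1000  # ms to seconds
            end_s = sentence["end_time"] / 1000
            lines.append(f"[{start_s:.2f}s-{end_s:.2f}s] {label}: {sentence['text']}")
    return lines


# Hypothetical sample using the assumed structure.
sample_result = {
    "transcripts": [
        {
            "channel_id": 0,
            "sentences": [
                {"begin_time": 0, "end_time": 1500, "text": "Hello.", "speaker_id": 0},
                {"begin_time": 1500, "end_time": 3250, "text": "Hi there.", "speaker_id": 1},
            ],
        }
    ]
}

for line in format_sentences(sample_result):
    print(line)
# [0.00s-1.50s] speaker 0: Hello.
# [1.50s-3.25s] speaker 1: Hi there.
```

If the real result file uses different key names, only the two lookups marked as assumed need to change.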
Transcription class
Import with from dashscope.audio.asr import Transcription.
| Method | Signature | Description |
|---|---|---|
| async_call | @classmethod def async_call(cls, model: str, file_urls: List[str], phrase_id: str = None, api_key: str = None, workspace: str = None, **kwargs) -> TranscriptionResponse | Submit a recognition task. |
| wait | @classmethod def wait(cls, task: Union[str, TranscriptionResponse], api_key: str = None, workspace: str = None, **kwargs) -> TranscriptionResponse | Block until done (SUCCEEDED or FAILED). Returns a TranscriptionResponse. |
| fetch | @classmethod def fetch(cls, task: Union[str, TranscriptionResponse], api_key: str = None, workspace: str = None, **kwargs) -> TranscriptionResponse | Query task status. Returns a TranscriptionResponse. |
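
The three methods can be combined into a submit-then-poll flow. The sketch below is one possible arrangement, not the official sample: poll_until_done and transcribe are hypothetical helpers, the interval and timeout values are arbitrary, and it assumes the response exposes output.task_id and output.task_status as attributes. Transcription.wait offers the same blocking behavior without a custom loop; a manual loop around fetch is mainly useful when you need your own timeout or logging.

```python
import time
from http import HTTPStatus

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}


def poll_until_done(fetch, task_id, interval_s=2.0, timeout_s=600.0, sleep=time.sleep):
    """Call fetch(task=task_id) until the task reaches a terminal status.

    fetch is passed in (for example Transcription.fetch) so the loop can be
    exercised without network access.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        response = fetch(task=task_id)
        if response.output.task_status in TERMINAL_STATUSES:
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout_s} seconds")
        sleep(interval_s)


def transcribe(model, file_urls, **options):
    """Submit a task with async_call, then poll with fetch until it finishes."""
    # Imported here so poll_until_done stays usable without the SDK installed.
    from dashscope.audio.asr import Transcription

    submitted = Transcription.async_call(model=model, file_urls=file_urls, **options)
    if submitted.status_code != HTTPStatus.OK:
        raise RuntimeError(f"submission failed with HTTP status {submitted.status_code}")
    return poll_until_done(Transcription.fetch, submitted.output.task_id)
```

A call such as transcribe("fun-asr", ["https://your-domain.com/file.mp3"]) would return the final TranscriptionResponse, whose per-file results still need to be checked individually.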