
Audio file transcription

Convert files to text

The Fun-ASR audio file recognition models convert recorded audio into text. They support single-file and batch transcription, ideal for use cases that do not require real-time results, such as meeting transcription, post-call analytics, and caption generation. Qwen Cloud also offers Qwen-ASR for recognition with enhanced semantic understanding and Qwen-Omni for prompt-based transcription with contextual understanding.
For model availability, supported languages, and feature comparison, see Speech-to-text models.

Core features

  • Multilingual recognition: Recognizes Chinese (including multiple dialects), English, Japanese, Korean, German, French, Russian, and 30+ other languages.
  • Format compatibility: Accepts any sample rate and supports major audio and video formats, including AAC, WAV, and MP3.
  • Long audio file processing: Handles asynchronous transcription for a single audio file up to 12 hours long and 2 GB in size. If speaker diarization is enabled, audio longer than 2 hours is not recommended.
  • Singing voice recognition: Transcribes entire songs, even with background music (BGM). Only the fun-asr and fun-asr-2025-11-07 models support this feature.
  • Recognition features: Configurable features include speaker diarization, sensitive word filtering, sentence-level and word-level timestamps, and hotword enhancement.

Supported models

  • Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25; fun-asr is recommended instead), fun-asr-mtl-2025-08-25 (snapshot)
  • Fun-ASR-Flash: fun-asr-flash-2026-06-15 (snapshot). Supports synchronous calls (up to 5 minutes) and context enhancement for improved accuracy on proper nouns.

Getting started


Model availability

| Model | Version | Unit price | Free quota |
|---|---|---|---|
| fun-asr (currently fun-asr-2025-11-07) | Stable | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-2025-11-07 (improved far-field VAD over fun-asr-2025-08-25 for higher accuracy) | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-2025-08-25 | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-mtl (currently fun-asr-mtl-2025-08-25) | Stable | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-mtl-2025-08-25 | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-flash-2026-06-15 (supports synchronous calls up to 5 minutes and context enhancement) | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
  • Supported languages:
    • fun-asr, fun-asr-2025-11-07, fun-asr-mtl, and fun-asr-mtl-2025-08-25: 30 languages
    • fun-asr-2025-08-25: Mandarin and English.
  • Sample rates supported: Any
  • Audio formats supported: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv
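
The format list above can be turned into a quick pre-submission check. The following is a minimal sketch, not part of the SDK: the helper names `has_supported_extension` and `split_by_support` are hypothetical, and the check relies only on the file extension in the URL path, which is a heuristic (a URL without an extension may still point to a valid file).

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats listed on this page as supported for file transcription.
SUPPORTED_FORMATS = {
    "aac", "amr", "avi", "flac", "flv", "m4a", "mkv", "mov", "mp3",
    "mp4", "mpeg", "ogg", "opus", "wav", "webm", "wma", "wmv",
}


def has_supported_extension(file_url: str) -> bool:
    """Return True if the URL path ends in one of the listed extensions."""
    path = urlparse(file_url).path  # drops any query string or fragment
    suffix = PurePosixPath(path).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_FORMATS


def split_by_support(file_urls):
    """Partition URLs into (accepted, rejected) by file extension."""
    accepted = [u for u in file_urls if has_supported_extension(u)]
    rejected = [u for u in file_urls if not has_supported_extension(u)]
    return accepted, rejected
```

Running `split_by_support` on a batch before calling the API lets you report unsupported files up front instead of waiting for a failed subtask.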

Make your first call

Get an API key and set it as an environment variable. To use the SDK, install it.

Because audio and video files are often large, file transfer and speech recognition can take a long time, so the file recognition API submits tasks asynchronously. After recognition is complete, use the query API to retrieve the results.

Async submit and sync wait

Submit a task and block until done.
from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# If you have not configured an environment variable, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
  model='fun-asr',
  file_urls=['https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/bjgrbu/hello_world_female_en.wav',
      'https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/rlrbee/hello_world_male_en.wav'],
  language_hints=['zh', 'en']  # language_hints is an optional parameter that specifies the language code of the audio to be recognized. For the value range, see the API reference.
)

transcription_response = Transcription.wait(task=task_response.output.task_id)

if transcription_response.status_code == HTTPStatus.OK:
  for transcription in transcription_response.output['results']:
    if transcription['subtask_status'] == 'SUCCEEDED':
      url = transcription['transcription_url']
      result = json.loads(request.urlopen(url).read().decode('utf8'))
      print(json.dumps(result, indent=4, ensure_ascii=False))
    else:
      print('transcription failed!')
      print(transcription)
else:
  print('Error: ', transcription_response.output.message)
First result
{
  "file_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/bjgrbu/hello_world_female_en.wav",
  "properties": {
    "audio_format": "pcm_s16le",
    "channels": [
      0
    ],
    "original_sampling_rate": 24000,
    "original_duration_in_milliseconds": 3280
  },
  "transcripts": [
    {
      "channel_id": 0,
      "content_duration_in_milliseconds": 3000,
      "text": "Hello world, this is Alibaba Speech Lab. ",
      "sentences": [
        {
          "begin_time": 240,
          "end_time": 3240,
          "text": "Hello world, this is Alibaba Speech Lab. ",
          "sentence_id": 1,
          "words": [
            {
              "begin_time": 240,
              "end_time": 640,
              "text": "Hello",
              "punctuation": ""
            },
            {
              "begin_time": 640,
              "end_time": 960,
              "text": " world",
              "punctuation": ","
            },
            {
              "begin_time": 1280,
              "end_time": 1480,
              "text": " this",
              "punctuation": ""
            },
            {
              "begin_time": 1480,
              "end_time": 1840,
              "text": " is",
              "punctuation": ""
            },
            {
              "begin_time": 1840,
              "end_time": 2520,
              "text": " Alibaba",
              "punctuation": ""
            },
            {
              "begin_time": 2520,
              "end_time": 2920,
              "text": " Speech",
              "punctuation": ""
            },
            {
              "begin_time": 2920,
              "end_time": 3240,
              "text": " Lab",
              "punctuation": ". "
            }
          ]
        }
      ]
    }
  ]
}
Second result
{
  "file_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/rlrbee/hello_world_male_en.wav",
  "properties": {
    "audio_format": "pcm_s16le",
    "channels": [
      0
    ],
    "original_sampling_rate": 24000,
    "original_duration_in_milliseconds": 4000
  },
  "transcripts": [
    {
      "channel_id": 0,
      "content_duration_in_milliseconds": 3160,
      "text": "Hello world, this is Alibaba Speech Lab. ",
      "sentences": [
        {
          "begin_time": 800,
          "end_time": 3960,
          "text": "Hello world, this is Alibaba Speech Lab. ",
          "sentence_id": 1,
          "words": [
            {
              "begin_time": 800,
              "end_time": 1200,
              "text": "Hello",
              "punctuation": ""
            },
            {
              "begin_time": 1200,
              "end_time": 1640,
              "text": " world",
              "punctuation": ","
            },
            {
              "begin_time": 1880,
              "end_time": 2120,
              "text": " this",
              "punctuation": ""
            },
            {
              "begin_time": 2120,
              "end_time": 2560,
              "text": " is",
              "punctuation": ""
            },
            {
              "begin_time": 2560,
              "end_time": 3360,
              "text": " Alibaba",
              "punctuation": ""
            },
            {
              "begin_time": 3360,
              "end_time": 3720,
              "text": " Speech",
              "punctuation": ""
            },
            {
              "begin_time": 3720,
              "end_time": 3960,
              "text": " Lab",
              "punctuation": ". "
            }
          ]
        }
      ]
    }
  ]
}
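
Since caption generation is one of the stated use cases, the sentence-level timestamps in these results can be converted into subtitle cues. The following is a minimal sketch, assuming that `begin_time` and `end_time` are in milliseconds (consistent with the `*_in_milliseconds` fields above, but not stated explicitly on this page). The function names are hypothetical, and SRT is only one possible target format.

```python
def ms_to_srt_time(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def result_to_srt(result: dict) -> str:
    """Build one SRT cue per sentence from a downloaded transcription result."""
    cues = []
    index = 1
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            start = ms_to_srt_time(sentence["begin_time"])
            end = ms_to_srt_time(sentence["end_time"])
            text = sentence["text"].strip()
            cues.append(f"{index}\n{start} --> {end}\n{text}\n")
            index += 1
    return "\n".join(cues)


# Trimmed copy of the first result shown above.
sample = {
    "transcripts": [{
        "channel_id": 0,
        "sentences": [{
            "begin_time": 240,
            "end_time": 3240,
            "text": "Hello world, this is Alibaba Speech Lab. ",
            "sentence_id": 1,
        }],
    }]
}
print(result_to_srt(sample))
# 1
# 00:00:00,240 --> 00:00:03,240
# Hello world, this is Alibaba Speech Lab.
```

For finer-grained captions, the same approach can be applied to the `words` array, grouping words into cues of a chosen maximum length.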

Async submit and async query

Submit a task and poll for results instead of blocking.
from http import HTTPStatus
from dashscope.audio.asr import Transcription
import dashscope
import os
import json
import time

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# If you have not configured an environment variable, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

transcribe_response = Transcription.async_call(
  model='fun-asr',
  file_urls=['https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/bjgrbu/hello_world_female_en.wav',
      'https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/rlrbee/hello_world_male_en.wav']
)

# Poll until the task reaches a terminal state. Pause between queries so the loop stays within the task query rate limit.
while True:
  if transcribe_response.output.task_status in ('SUCCEEDED', 'FAILED'):
    break
  time.sleep(1)
  transcribe_response = Transcription.fetch(task=transcribe_response.output.task_id)

if transcribe_response.status_code == HTTPStatus.OK:
  print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
  print('transcription done!')

RESTful API

Use any HTTP library to submit tasks and poll for results. This Python sample demonstrates the workflow:
import requests
import json
import os
import time

# If you have not configured environment variables, replace the following line with your API key: api_key = "sk-xxx"
api_key = os.getenv("DASHSCOPE_API_KEY")
file_urls = [
  "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/bjgrbu/hello_world_female_en.wav",
  "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/rlrbee/hello_world_male_en.wav",
]

region = "dashscope-intl.aliyuncs.com"

# Submit a file transcription task, including a list of file URLs to be transcribed
def submit_task(apikey, file_urls) -> str:

  headers = {
    "Authorization": f"Bearer {apikey}",
    "Content-Type": "application/json",
    "X-DashScope-Async": "enable",
  }
  data = {
    "model": "fun-asr",
    "input": {"file_urls": file_urls},
    "parameters": {
      "channel_id": [0],
      # "vocabulary_id": "vocab-Xxxx", # Optional, hotword ID.
    },
  }
  # URL of the audio file transcription service
  service_url = (
    f"https://{region}/api/v1/services/audio/asr/transcription"
  )
  response = requests.post(
    service_url, headers=headers, data=json.dumps(data)
  )

  # Return the task ID on success; otherwise print the error response
  if response.status_code == 200:
    return response.json()["output"]["task_id"]
  else:
    print("task failed!")
    print(response.json())
    return None


# Poll the task status in a loop until the task succeeds or fails
def wait_for_complete(task_id):
  headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
    "X-DashScope-Async": "enable",
  }

  pending = True
  while pending:
    # URL of the task status query service
    service_url = f"https://{region}/api/v1/tasks/{task_id}"
    response = requests.post(
      service_url, headers=headers
    )
    if response.status_code == 200:
      status = response.json()['output']['task_status']
      if status == 'SUCCEEDED':
        print("task succeeded!")
        pending = False
        return response.json()['output']['results']
      elif status == 'RUNNING' or status == 'PENDING':
        pass
      else:
        print("task failed!")
        pending = False
    else:
      print("query failed!")
      pending = False
    print(response.json())
    time.sleep(0.1)


task_id = submit_task(apikey=api_key, file_urls=file_urls)
print("task_id: ", task_id)
result = wait_for_complete(task_id)
print("transcription result: ", result)
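
For long recordings, a polling loop with a growing delay and an overall timeout is usually kinder to the rate limit than a fixed, short interval. The following is a minimal sketch of that idea, not an SDK feature: `poll_until_done` and its parameters are hypothetical, and it treats any status other than `PENDING` or `RUNNING` as terminal, mirroring the REST sample above.

```python
import time

# Statuses that mean the task is still in progress (as in the REST sample above).
ACTIVE_STATES = {"PENDING", "RUNNING"}


def poll_until_done(fetch_status, initial_delay=0.5, max_delay=8.0,
                    timeout=600.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the task leaves the active states.

    fetch_status is any zero-argument callable that returns the current
    task_status string. The delay doubles after each attempt, capped at
    max_delay. Raises TimeoutError if the task is still active at the deadline.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = fetch_status()
        if status not in ACTIVE_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"task still {status} after {timeout} seconds")
        sleep(delay)
        delay = min(delay * 2, max_delay)


# Usage with the SDK calls shown earlier (task_id comes from async_call):
# final = poll_until_done(
#     lambda: Transcription.fetch(task=task_id).output.task_status)
```

Passing `sleep` and `clock` as parameters keeps the helper easy to test without real waiting.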

Synchronous calls (fun-asr-flash-2026-06-15)

fun-asr-flash-2026-06-15 supports synchronous calls for audio files up to 5 minutes long. Results can be returned in streaming or non-streaming mode.
curl --location --request POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --header "Content-Type: application/json" \
  --header "X-DashScope-SSE: disable" \
  --data '{
  "model": "fun-asr-flash-2026-06-15",
  "input": {
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "input_audio",
            "input_audio": {
              "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
            }
          }
        ]
      }
    ]
  },
  "parameters": {
    "format": "wav",
    "sample_rate": "16000"
  }
}'
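
The same request can be sent from Python. The following is a minimal sketch that mirrors the curl command above using only the standard library. The function names are hypothetical, and the response is returned as parsed JSON without assuming its field layout, which this page does not show.

```python
import json
from urllib import request

SYNC_URL = ("https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/"
            "multimodal-generation/generation")


def build_sync_payload(audio_url: str) -> dict:
    """Build the same JSON body as the curl example above."""
    return {
        "model": "fun-asr-flash-2026-06-15",
        "input": {
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "input_audio",
                         "input_audio": {"data": audio_url}}
                    ],
                }
            ]
        },
        "parameters": {"format": "wav", "sample_rate": "16000"},
    }


def build_sync_headers(api_key: str) -> dict:
    """Headers from the curl example; SSE disabled means non-streaming output."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-DashScope-SSE": "disable",
    }


def transcribe_sync(audio_url: str, api_key: str) -> dict:
    """POST the request and return the parsed JSON response."""
    body = json.dumps(build_sync_payload(audio_url)).encode("utf-8")
    req = request.Request(SYNC_URL, data=body,
                          headers=build_sync_headers(api_key), method="POST")
    with request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (requires a valid key in the DASHSCOPE_API_KEY environment variable):
# import os
# print(transcribe_sync("https://dashscope.oss-cn-beijing.aliyuncs.com/samples/"
#                       "audio/paraformer/hello_world_female2.wav",
#                       os.environ["DASHSCOPE_API_KEY"]))
```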

Context enhancement

Supported model: Only fun-asr-flash-2026-06-15 supports context enhancement.

Use case: Designed for scenarios that combine ASR with a large language model. Passing previous conversation context (LLM replies and earlier recognition results) into the ASR model significantly improves transcription accuracy for proper nouns such as names, locations, and product terms, and is more flexible than traditional hotwords.

Usage: Pass the conversation history through input.messages. Use the assistant role for the LLM's previous replies and the user role with input_text type for earlier recognition results. Context pairs must appear before the current audio message.

Supported text types include (but are not limited to):
  • Hotword lists in various delimiter formats (for example: hotword1, hotword2, hotword3, hotword4)
  • Free-form paragraphs or passages of any length
  • Mixed content: any combination of word lists and paragraphs
  • Irrelevant or meaningless text, including gibberish. The model tolerates irrelevant content well, and recognition quality rarely degrades because of it.
Example: Consider an audio clip whose correct transcription is: "What insider jargon do you know in investment banking? First, the nine top foreign investment banks, Bulge Bracket, BB..."
Without context enhancement: Some investment bank names are misrecognized. For example, "Bulge Bracket" is transcribed as "Bird Rock".
Result: "What insider jargon do you know in investment banking? First, the nine top foreign investment banks, Bird Rock, BB..."

With context enhancement: The investment bank names are recognized correctly.
Result: "What insider jargon do you know in investment banking? First, the nine top foreign investment banks, Bulge Bracket, BB..."

To produce the corrected result, include any of the following in the context:
  • A word list:
    • List 1: Bulge Bracket, Boutique, Middle Market, domestic securities firms
    • List 2: Bulge Bracket Boutique Middle Market domestic securities firms
    • List 3: ['Bulge Bracket', 'Boutique', 'Middle Market', 'domestic securities firms']
  • Natural language:
Investment bank classification: a quick guide.
Recently a few friends in Australia asked me what investment banks really are. Here is a quick primer. For students studying abroad, investment banks fall into four broad categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms.
Bulge Bracket banks: the nine top investment banks we often refer to, including Goldman Sachs, Morgan Stanley, and so on. They are large in both business scope and scale.
Boutique banks: relatively small in size but highly focused in their service areas. Firms such as Lazard and Evercore have deep expertise in specific fields.
Middle Market banks: serve mid-sized companies with M&A, IPO, and similar services. Though smaller than the bulge brackets, they hold strong positions in specific markets.
Domestic securities firms: as the Chinese market has risen, domestic firms play an increasingly important role internationally.
There are also further breakdowns by position and business line you can find in related charts. Hopefully this helps you understand investment banks and prepare for your career.
  • Natural language with distracting content: some text is unrelated to the audio, such as the names in the example below.
Investment bank classification: a quick guide.
Recently a few friends in Australia asked me what investment banks really are. Here is a quick primer. For students studying abroad, investment banks fall into four broad categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms.
Bulge Bracket banks: the nine top investment banks we often refer to, including Goldman Sachs, Morgan Stanley, and so on. They are large in both business scope and scale.
Boutique banks: relatively small in size but highly focused in their service areas. Firms such as Lazard and Evercore have deep expertise in specific fields.
Middle Market banks: serve mid-sized companies with M&A, IPO, and similar services. Though smaller than the bulge brackets, they hold strong positions in specific markets.
Domestic securities firms: as the Chinese market has risen, domestic firms play an increasingly important role internationally.
There are also further breakdowns by position and business line you can find in related charts. Hopefully this helps you understand investment banks and prepare for your career.
Wang Haoxuan, Li Zihan, Zhang Jingxing, Liu Xinyi, Chen Junjie, Yang Siyuan, Zhao Yutong, Huang Zhiqiang, Zhou Zimo, Wu Yajing, Xu Ruoxi, Sun Haoran, Hu Jinyu, Zhu Chenxi, Guo Wenbo, He Jingshu, Gao Yuhang, Lin Yifei,
Zheng Xiaoyan, Liang Bowen, Luo Jiaqi, Song Mingzhe, Xie Wanting, Tang Ziqian, Han Mengyao, Feng Yiran, Cao Qinxue, Deng Zirui, Xiao Wangshu, Xu Jiashu,
Cheng Yinuo, Yuan Zhiruo, Peng Haoyu, Dong Simiao, Fan Jingyu, Su Zijin, Lyu Wenxuan, Jiang Shihan, Ding Muchen,
Wei Shuyao, Ren Tianyou, Jiang Yichen, Hua Qingyu, Shen Xinghe, Fu Jinyu, Yao Xingchen, Zhong Lingyu, Yan Licheng, Jin Ruoshui, Tao Ranting, Qi Shaoshang, Xue Zhilan, Zou Yunfan, Xiong Ziang, Bai Wenfeng, Yi Qianfan
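
The ordering rule described under Usage (context first, current audio last) can be sketched as a small message builder. This is an illustration under stated assumptions, not the confirmed request schema: the audio item copies the shape from the synchronous curl example, but the key that carries the text inside an `input_text` item (`TEXT_KEY` below) and the use of the same item shape for assistant messages are assumptions that this page does not confirm. Check the API reference before relying on them. The helper names and the example audio URL are hypothetical.

```python
# ASSUMPTION: the key that carries the text inside an input_text item.
TEXT_KEY = "text"


def text_message(role: str, text: str) -> dict:
    """Build one context message (ASSUMED shape; see the note above)."""
    if role not in ("user", "assistant"):
        raise ValueError(f"unsupported role: {role}")
    return {"role": role,
            "content": [{"type": "input_text", TEXT_KEY: text}]}


def audio_message(audio_url: str) -> dict:
    """Build the audio message, same shape as in the synchronous curl example."""
    return {"role": "user",
            "content": [{"type": "input_audio",
                         "input_audio": {"data": audio_url}}]}


def build_messages(context, audio_url: str) -> list:
    """Place context messages (oldest first) before the current audio message.

    context is an iterable of (role, text) pairs.
    """
    messages = [text_message(role, text) for role, text in context]
    messages.append(audio_message(audio_url))
    return messages


context = [
    ("user", "What kinds of investment banks are there?"),
    ("assistant", "Investment banks fall into four broad categories: "
                  "Bulge Bracket, Boutique, Middle Market, and domestic "
                  "securities firms."),
]
# Hypothetical audio URL, for illustration only.
messages = build_messages(context, "https://example.com/audio/clip.wav")
```

The resulting list would go into `input.messages` of the request body, in place of the single audio message used in the synchronous example.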

Compare models

| Feature | Fun-ASR |
|---|---|
| Supported languages | Varies by model. fun-asr, fun-asr-2025-11-07, fun-asr-mtl, fun-asr-mtl-2025-08-25: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, Jin; also supports accents from Central Plains, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong, and Taiwan, including official dialects from regions such as Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hindi, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish. fun-asr-2025-08-25: Chinese (Mandarin), English |
| Supported audio formats | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv |
| Sample rate | Any |
| Sound channels | Any |
| Input format | Publicly accessible URLs of files to be recognized. Up to 100 audio files are supported. |
| Audio size/duration | Each audio file must be no larger than 2 GB and no longer than 12 hours. |
| Emotion recognition | Not supported |
| Timestamp | Supported (always on) |
| Punctuation prediction | Supported (always on) |
| Hotwords | Supported. The hotword feature is supported only in the primary workspace and is not available in sub-workspaces. |
| ITN | Supported (always on) |
| Singing voice recognition | Supported (fun-asr and fun-asr-2025-11-07 only) |
| Noise rejection | Supported (always on) |
| Sensitive word filtering | Supported (filters content from the Qwen Cloud sensitive word list by default) |
| Speaker diarization | Supported (off by default, can be enabled) |
| Filler word filtering | Not supported |
| VAD | Supported (always on) |
| Rate limiting (RPS) | Job submission API: 10; task query API: 20 |
| Connection types | DashScope: Java/Python SDK, RESTful API |
| Pricing | International: $0.000035/second |
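
As a worked example of the per-second price, the following sketch estimates the cost of a transcription job. It assumes billing is strictly proportional to audio duration in seconds and that any remaining free quota is deducted first. This page does not describe how billing rounds durations or treats multiple channels, so treat the result as an approximation. The function name is hypothetical.

```python
PRICE_PER_SECOND_USD = 0.000035  # international price listed above
FREE_QUOTA_SECONDS = 36_000      # 10 hours, valid for 90 days


def estimate_cost_usd(audio_seconds: float, free_seconds_left: float = 0) -> float:
    """Estimate the cost after deducting any remaining free quota (ASSUMED rule)."""
    billable = max(0, audio_seconds - free_seconds_left)
    return round(billable * PRICE_PER_SECOND_USD, 6)


print(estimate_cost_usd(3600))                       # 1 hour of audio: 0.126
print(estimate_cost_usd(12 * 3600))                  # 12-hour maximum file: 1.512
print(estimate_cost_usd(3600, FREE_QUOTA_SECONDS))   # fully covered by free quota: 0.0
```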

API reference

FAQ


How can I improve recognition accuracy?

Several factors affect accuracy. Review each and apply the corresponding optimization.

Key factors:
  1. Sound quality: Recording device quality, sample rate, and ambient noise directly affect clarity. High-quality audio input is essential.
  2. Speaker characteristics: Variations in pitch, speech rate, accent, and dialect increase recognition difficulty, especially for rare dialects or heavy accents.
  3. Language and vocabulary: Mixed languages, technical terms, or slang increase recognition difficulty. Configure hotwords to improve accuracy for domain-specific terms.
  4. Contextual understanding: Insufficient context can cause semantic ambiguity, especially in situations where surrounding context is needed for correct recognition.
Optimization methods:
  1. Optimize audio quality: Use high-performance microphones at the recommended sample rate. Minimize ambient noise and echo.
  2. Adapt to the speaker: For audio with strong accents or dialects, select a model that supports those specific dialects.
  3. Configure hotwords: Set hotwords for technical terms, proper nouns, and other specific words. For more information, see Customize hotwords.
  4. Preserve context: Avoid splitting audio into excessively short clips.