Speech-to-text

Audio file transcription

Convert files to text

The Fun-ASR audio file recognition models convert recorded audio into text. They support single-file and batch transcription, ideal for use cases that do not require real-time results, such as meeting transcription, post-call analytics, and caption generation. Qwen Cloud also offers Qwen-ASR for recognition with enhanced semantic understanding and Qwen-Omni for prompt-based transcription with contextual understanding.
For model availability, supported languages, and feature comparison, see Speech-to-text models.

Core features

  • Multilingual recognition: Recognizes Chinese (including multiple dialects), English, Japanese, Korean, German, French, Russian, and 30+ other languages.
  • Format compatibility: Accepts any sample rate and supports major audio and video formats, including AAC, WAV, and MP3.
  • Long audio file processing: Handles asynchronous transcription for a single audio file up to 12 hours long and 2 GB in size. If speaker diarization is enabled, audio longer than 2 hours is not recommended.
  • Singing voice recognition: Transcribes entire songs, even with background music (BGM). Only the fun-asr and fun-asr-2025-11-07 models support this feature.
  • Recognition features: Configurable features include speaker diarization, sensitive word filtering, sentence-level and word-level timestamps, and hotword enhancement.
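The limits in the list above can be checked before a task is submitted. The sketch below is illustrative only: the function name `check_audio_limits` is hypothetical, it assumes you already know each file's duration and size, and it assumes "2 GB" means 2 GiB, which the text does not specify.

```python
# Pre-submission check against the limits listed above.
# Assumption: "2 GB" is treated as 2 GiB; the document does not say which.
MAX_DURATION_S = 12 * 60 * 60             # 12 hours per file
MAX_SIZE_BYTES = 2 * 1024 ** 3            # 2 GB per file
DIARIZATION_ADVISED_MAX_S = 2 * 60 * 60   # advisory limit with diarization


def check_audio_limits(duration_s, size_bytes, diarization=False):
    """Return (errors, warnings) for one file described by duration and size."""
    errors, warnings = [], []
    if duration_s > MAX_DURATION_S:
        errors.append("duration exceeds 12 hours")
    if size_bytes > MAX_SIZE_BYTES:
        errors.append("size exceeds 2 GB")
    if diarization and duration_s > DIARIZATION_ADVISED_MAX_S:
        warnings.append("speaker diarization is not recommended beyond 2 hours")
    return errors, warnings


if __name__ == "__main__":
    # A 3-hour, 500 MB recording with diarization: allowed, but flagged.
    print(check_audio_limits(3 * 3600, 500 * 1024 ** 2, diarization=True))
```

Errors correspond to hard limits, while the diarization entry is returned as a warning because the text describes it as a recommendation rather than a restriction.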

Supported models

  • Fun-ASR: fun-asr (stable, currently equivalent to fun-asr-2025-11-07), fun-asr-2025-11-07 (snapshot), fun-asr-2025-08-25 (snapshot), fun-asr-mtl (stable, currently equivalent to fun-asr-mtl-2025-08-25, fun-asr is recommended), fun-asr-mtl-2025-08-25 (snapshot)

Getting started


Model availability

| Model | Version | Unit price | Free quota (Note) |
| --- | --- | --- | --- |
| fun-asr (currently fun-asr-2025-11-07) | Stable | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-2025-11-07 (improved far-field VAD over fun-asr-2025-08-25 for higher accuracy) | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-2025-08-25 | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-mtl (currently fun-asr-mtl-2025-08-25) | Stable | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-mtl-2025-08-25 | Snapshot | $0.000035/second | 36,000 seconds (10 hours), valid for 90 days |
  • Supported languages:
    • fun-asr, fun-asr-2025-11-07, fun-asr-mtl, and fun-asr-mtl-2025-08-25: 30 languages
    • fun-asr-2025-08-25: Mandarin and English.
  • Sample rates supported: Any
  • Audio formats supported: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv
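As a rough worked example of the pricing above, the sketch below estimates the charge for a given amount of audio. It assumes billing is strictly per second of audio and that any remaining free quota is consumed first; rounding rules and billing granularity are not stated here, so treat the result as an estimate. The function name `estimate_cost` is hypothetical.

```python
# Figures taken from the model availability table above.
UNIT_PRICE_USD_PER_SECOND = 0.000035
FREE_QUOTA_SECONDS = 36_000  # 10 hours


def estimate_cost(total_seconds, free_seconds_left=FREE_QUOTA_SECONDS):
    """Estimate the USD charge, assuming the free quota is used up first."""
    billable_seconds = max(0, total_seconds - free_seconds_left)
    return round(billable_seconds * UNIT_PRICE_USD_PER_SECOND, 6)


if __name__ == "__main__":
    # 20 hours of audio with the full free quota still available:
    # 36,000 billable seconds at $0.000035 each.
    print(estimate_cost(20 * 3600))
    # One hour of audio with no free quota left.
    print(estimate_cost(3600, free_seconds_left=0))
```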

Make your first call

Get an API key and set it as an environment variable. To use the SDK, install it.
Because audio and video files are often large, file transfer and speech recognition can take a long time, so the file recognition API is asynchronous: you submit a task, then call the query API to retrieve the recognition results once the task is complete.

Async submit and sync wait

Submit a task and block until done.
from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# If you have not configured an environment variable, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
  model='fun-asr',
  file_urls=['https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/bjgrbu/hello_world_female_en.wav',
      'https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/rlrbee/hello_world_male_en.wav'],
  language_hints=['zh', 'en']  # language_hints is an optional parameter that specifies the language code of the audio to be recognized. For the value range, see the API reference.
)

transcription_response = Transcription.wait(task=task_response.output.task_id)

if transcription_response.status_code == HTTPStatus.OK:
  for transcription in transcription_response.output['results']:
    if transcription['subtask_status'] == 'SUCCEEDED':
      url = transcription['transcription_url']
      result = json.loads(request.urlopen(url).read().decode('utf8'))
      print(json.dumps(result, indent=4, ensure_ascii=False))
    else:
      print('transcription failed!')
      print(transcription)
else:
  # On a failed request, the error text is on the response itself; output may be empty.
  print('Error: ', transcription_response.message)
First result
{
  "file_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/bjgrbu/hello_world_female_en.wav",
  "properties": {
    "audio_format": "pcm_s16le",
    "channels": [
      0
    ],
    "original_sampling_rate": 24000,
    "original_duration_in_milliseconds": 3280
  },
  "transcripts": [
    {
      "channel_id": 0,
      "content_duration_in_milliseconds": 3000,
      "text": "Hello world, this is Alibaba Speech Lab. ",
      "sentences": [
        {
          "begin_time": 240,
          "end_time": 3240,
          "text": "Hello world, this is Alibaba Speech Lab. ",
          "sentence_id": 1,
          "words": [
            {
              "begin_time": 240,
              "end_time": 640,
              "text": "Hello",
              "punctuation": ""
            },
            {
              "begin_time": 640,
              "end_time": 960,
              "text": " world",
              "punctuation": ","
            },
            {
              "begin_time": 1280,
              "end_time": 1480,
              "text": " this",
              "punctuation": ""
            },
            {
              "begin_time": 1480,
              "end_time": 1840,
              "text": " is",
              "punctuation": ""
            },
            {
              "begin_time": 1840,
              "end_time": 2520,
              "text": " Alibaba",
              "punctuation": ""
            },
            {
              "begin_time": 2520,
              "end_time": 2920,
              "text": " Speech",
              "punctuation": ""
            },
            {
              "begin_time": 2920,
              "end_time": 3240,
              "text": " Lab",
              "punctuation": ". "
            }
          ]
        }
      ]
    }
  ]
}
Second result
{
  "file_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/rlrbee/hello_world_male_en.wav",
  "properties": {
    "audio_format": "pcm_s16le",
    "channels": [
      0
    ],
    "original_sampling_rate": 24000,
    "original_duration_in_milliseconds": 4000
  },
  "transcripts": [
    {
      "channel_id": 0,
      "content_duration_in_milliseconds": 3160,
      "text": "Hello world, this is Alibaba Speech Lab. ",
      "sentences": [
        {
          "begin_time": 800,
          "end_time": 3960,
          "text": "Hello world, this is Alibaba Speech Lab. ",
          "sentence_id": 1,
          "words": [
            {
              "begin_time": 800,
              "end_time": 1200,
              "text": "Hello",
              "punctuation": ""
            },
            {
              "begin_time": 1200,
              "end_time": 1640,
              "text": " world",
              "punctuation": ","
            },
            {
              "begin_time": 1880,
              "end_time": 2120,
              "text": " this",
              "punctuation": ""
            },
            {
              "begin_time": 2120,
              "end_time": 2560,
              "text": " is",
              "punctuation": ""
            },
            {
              "begin_time": 2560,
              "end_time": 3360,
              "text": " Alibaba",
              "punctuation": ""
            },
            {
              "begin_time": 3360,
              "end_time": 3720,
              "text": " Speech",
              "punctuation": ""
            },
            {
              "begin_time": 3720,
              "end_time": 3960,
              "text": " Lab",
              "punctuation": ". "
            }
          ]
        }
      ]
    }
  ]
}
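The result JSON above carries sentence-level and word-level timestamps in milliseconds, which is enough to build captions. The sketch below converts a result of that shape into SRT cues, one per sentence. The helper names and the trimmed `SAMPLE_RESULT` are illustrative, and only the fields visible in the sample output are assumed to exist.

```python
def ms_to_srt_time(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def result_to_srt(result: dict) -> str:
    """Build one SRT cue per sentence, in the order the result lists them."""
    cues = []
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            start = ms_to_srt_time(sentence["begin_time"])
            end = ms_to_srt_time(sentence["end_time"])
            text = sentence["text"].strip()
            cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Trimmed copy of the first result shown above (word-level entries omitted).
SAMPLE_RESULT = {
    "transcripts": [
        {
            "channel_id": 0,
            "sentences": [
                {
                    "begin_time": 240,
                    "end_time": 3240,
                    "text": "Hello world, this is Alibaba Speech Lab. ",
                    "sentence_id": 1,
                }
            ],
        }
    ]
}

if __name__ == "__main__":
    print(result_to_srt(SAMPLE_RESULT))
```

For multi-channel audio this sketch emits cues channel by channel rather than in time order; sorting the sentences by `begin_time` first would be a reasonable extension.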

Async submit and async query

Submit a task and poll for results instead of blocking.
from http import HTTPStatus
from dashscope.audio.asr import Transcription
import dashscope
import os
import json
import time

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# If you have not configured an environment variable, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

transcribe_response = Transcription.async_call(
  model='fun-asr',
  file_urls=['https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/bjgrbu/hello_world_female_en.wav',
      'https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/rlrbee/hello_world_male_en.wav']
)

# Poll until the task reaches a terminal state. Stop early if a request fails,
# and pause between queries so the loop does not hit the task query rate limit.
while transcribe_response.status_code == HTTPStatus.OK:
  if transcribe_response.output.task_status in ('SUCCEEDED', 'FAILED'):
    break
  time.sleep(1)  # polling interval chosen for illustration
  transcribe_response = Transcription.fetch(task=transcribe_response.output.task_id)

if transcribe_response.status_code == HTTPStatus.OK:
  print(json.dumps(transcribe_response.output, indent=4, ensure_ascii=False))
  print('task finished with status:', transcribe_response.output.task_status)
else:
  print('Error: ', transcribe_response.message)

RESTful API

Use any HTTP library to submit tasks and poll for results. This Python sample demonstrates the workflow:
import requests
import json
import os
import time

# If you have not configured environment variables, replace the following line with your API key: api_key = "sk-xxx"
api_key = os.getenv("DASHSCOPE_API_KEY")
file_urls = [
  "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/bjgrbu/hello_world_female_en.wav",
  "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260401/rlrbee/hello_world_male_en.wav",
]

region = "dashscope-intl.aliyuncs.com"

# Submit a file transcription task, including a list of file URLs to be transcribed
def submit_task(apikey, file_urls) -> str:

  headers = {
    "Authorization": f"Bearer {apikey}",
    "Content-Type": "application/json",
    "X-DashScope-Async": "enable",
  }
  data = {
    "model": "fun-asr",
    "input": {"file_urls": file_urls},
    "parameters": {
      "channel_id": [0],
      # "vocabulary_id": "vocab-Xxxx", # Optional, hotword ID.
    },
  }
  # URL of the audio file transcription service
  service_url = (
    f"https://{region}/api/v1/services/audio/asr/transcription"
  )
  response = requests.post(
    service_url, headers=headers, data=json.dumps(data)
  )

  # Return the task ID on success; otherwise print the error response
  if response.status_code == 200:
    return response.json()["output"]["task_id"]
  else:
    print("task failed!")
    print(response.json())
    return None


# Poll the task status in a loop until the task succeeds or fails
def wait_for_complete(task_id):
  headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
    "X-DashScope-Async": "enable",
  }

  pending = True
  while pending:
    # URL of the task status query service
    service_url = f"https://{region}/api/v1/tasks/{task_id}"
    response = requests.post(
      service_url, headers=headers
    )
    if response.status_code == 200:
      status = response.json()['output']['task_status']
      if status == 'SUCCEEDED':
        print("task succeeded!")
        pending = False
        return response.json()['output']['results']
      elif status == 'RUNNING' or status == 'PENDING':
        pass
      else:
        print("task failed!")
        pending = False
    else:
      print("query failed!")
      pending = False
    print(response.json())
    time.sleep(0.1)


task_id = submit_task(apikey=api_key, file_urls=file_urls)
print("task_id: ", task_id)
# Only poll when the submission actually returned a task ID
if task_id:
  result = wait_for_complete(task_id)
  print("transcription result: ", result)
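The sample above polls at a fixed interval with no upper bound on waiting time. For long recordings, a polling loop with exponential backoff and a timeout may be gentler on the query API. The sketch below is one possible shape, not part of the SDK: `poll_until_done` and its parameters are hypothetical, and `fetch_status` stands for any callable that returns the task status string (`PENDING`, `RUNNING`, `SUCCEEDED`, or `FAILED`).

```python
import time


def poll_until_done(fetch_status, initial_delay=0.5, max_delay=8.0,
                    timeout=600.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal status.

    The delay doubles after each non-terminal status, capped at max_delay.
    Raises TimeoutError if the next wait would pass the deadline.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = fetch_status()
        if status not in ("PENDING", "RUNNING"):
            return status
        if clock() + delay > deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(delay)
        delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    # Simulated status sequence standing in for real task queries.
    statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
    print(poll_until_done(lambda: next(statuses), initial_delay=0.01))
```

With the REST sample, `fetch_status` could wrap the task query request and return `response.json()['output']['task_status']`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.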

Compare models

| Feature | Fun-ASR |
| --- | --- |
| Supported languages | Varies by model. fun-asr, fun-asr-2025-11-07, fun-asr-mtl, fun-asr-mtl-2025-08-25: Chinese (Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, Jin; also supports accents from Central Plains, Southwest, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeast, Beijing, Hong Kong, and Taiwan, including official dialects from regions such as Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia), English, Japanese, Korean, Vietnamese, Indonesian, Thai, Malay, Filipino, Arabic, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, Greek, Hindi, Hungarian, Irish, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish. fun-asr-2025-08-25: Chinese (Mandarin), English |
| Supported audio formats | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv |
| Sample rate | Any |
| Sound channels | Any |
| Input format | Publicly accessible URLs of files to be recognized. Up to 100 audio files are supported. |
| Audio size/duration | Each audio file must be no larger than 2 GB and no longer than 12 hours. |
| Emotion recognition | Not supported |
| Timestamp | Supported (always on) |
| Punctuation prediction | Supported (always on) |
| Hotwords | Supported. The hotword feature is supported only in the primary workspace and is not available in sub-workspaces. |
| ITN | Supported (always on) |
| Singing voice recognition | Supported (fun-asr and fun-asr-2025-11-07 only) |
| Noise rejection | Supported (always on) |
| Sensitive word filtering | Supported (filters content from the Qwen Cloud sensitive word list by default) |
| Speaker diarization | Supported (off by default, can be enabled) |
| Filler word filtering | Not supported |
| VAD | Supported (always on) |
| Rate limiting (RPS) | Job submission API: 10, Task query API: 20 |
| Connection types | DashScope: Java/Python SDK, RESTful API |
| Pricing | International: $0.000035/second |
API reference

FAQ


How can I improve recognition accuracy?

Several factors affect accuracy. Review each and apply the corresponding optimization.
Key factors:
  1. Sound quality: Recording device quality, sample rate, and ambient noise directly affect clarity. High-quality audio input is essential.
  2. Speaker characteristics: Variations in pitch, speech rate, accent, and dialect increase recognition difficulty, especially for rare dialects or heavy accents.
  3. Language and vocabulary: Mixed languages, technical terms, or slang increase recognition difficulty. Configure hotwords to improve accuracy for domain-specific terms.
  4. Contextual understanding: Insufficient context can cause semantic ambiguity, especially in situations where surrounding context is needed for correct recognition.
Optimization methods:
  1. Optimize audio quality: Use high-performance microphones at the recommended sample rate. Minimize ambient noise and echo.
  2. Adapt to the speaker: For audio with strong accents or dialects, select a model that supports those specific dialects.
  3. Configure hotwords: Set hotwords for technical terms, proper nouns, and other specific words. For more information, see Customize hotwords.
  4. Preserve context: Avoid splitting audio into excessively short clips.
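To illustrate the hotword step, the sketch below builds a REST request body that includes a hotword list ID. It mirrors the `vocabulary_id` parameter shown commented out in the RESTful sample. The helper name `build_payload` is hypothetical, `vocab-Xxxx` is a placeholder rather than a real ID, and creating the hotword list itself is covered in Customize hotwords, not here.

```python
def build_payload(file_urls, model="fun-asr", vocabulary_id=None, channel_id=(0,)):
    """Build the JSON body for a transcription task, optionally with hotwords."""
    parameters = {"channel_id": list(channel_id)}
    if vocabulary_id:
        # ID of a previously created hotword list
        parameters["vocabulary_id"] = vocabulary_id
    return {
        "model": model,
        "input": {"file_urls": list(file_urls)},
        "parameters": parameters,
    }


if __name__ == "__main__":
    payload = build_payload(
        ["https://example.com/audio/meeting.wav"],  # placeholder URL
        vocabulary_id="vocab-Xxxx",                 # placeholder hotword ID
    )
    print(payload)
```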