Speech-to-text

Audio file transcription

Convert files to text

Qwen Cloud offers three model families for audio file transcription: Fun-ASR for high-accuracy multilingual transcription with singing recognition, Qwen-ASR for recognition with enhanced semantic understanding, and Qwen-Omni for prompt-based transcription with contextual understanding.
For model availability, supported languages, and feature comparison, see Speech-to-text models.

Getting started

  • Fun-ASR
  • Qwen-ASR
  • Qwen-Omni
The following sections provide sample code for API calls. Get an API key and set it as an environment variable. To use the SDK, install it. Because audio and video files are often large, file transfer and speech recognition can take a long time. The file transcription API therefore uses asynchronous invocation: you submit a task, and after recognition is complete, you retrieve the results through the query API.
from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# If you have not configured an environment variable, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
  model='fun-asr',
  file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/hello_world_female2.wav',
      'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/hello_world_male2.wav'],
  language_hints=['zh', 'en']  # language_hints is an optional parameter that specifies the language code of the audio to be recognized. For the value range, see the API reference.
)

transcription_response = Transcription.wait(task=task_response.output.task_id)

if transcription_response.status_code == HTTPStatus.OK:
  for transcription in transcription_response.output['results']:
    if transcription['subtask_status'] == 'SUCCEEDED':
      url = transcription['transcription_url']
      result = json.loads(request.urlopen(url).read().decode('utf8'))
      print(json.dumps(result, indent=4, ensure_ascii=False))
    else:
      print('transcription failed!')
      print(transcription)
else:
  print('Error: ', transcription_response.output.message)
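Transcription.wait blocks until the task finishes. If you prefer to poll yourself (for example with Transcription.fetch, checking the task status on each response), the loop can be sketched generically. This is a minimal sketch: the fetch_status callable and the status strings are assumptions for illustration, matching the subtask_status values shown in the sample code.

```python
import time

def poll_until_done(fetch_status, interval_s=3.0, timeout_s=600.0):
    """Poll fetch_status() until it reports a terminal status.

    fetch_status: callable returning one of 'PENDING', 'RUNNING',
                  'SUCCEEDED', or 'FAILED'.
    Returns the final status string, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ('SUCCEEDED', 'FAILED'):
            return status
        time.sleep(interval_s)
    raise TimeoutError('transcription task did not finish in time')
```

In practice you would wrap a call such as `Transcription.fetch(task=task_id)` in the fetch_status callable and read the status field off its response.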
First result
{
  "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/hello_world_female2.wav",
  "properties": {
    "audio_format": "pcm_s16le",
    "channels": [
      0
    ],
    "original_sampling_rate": 16000,
    "original_duration_in_milliseconds": 3834
  },
  "transcripts": [
    {
      "channel_id": 0,
      "content_duration_in_milliseconds": 2480,
      "text": "Hello World, this is Alibaba Speech Lab.",
      "sentences": [
        {
          "begin_time": 760,
          "end_time": 3240,
          "text": "Hello World, this is Alibaba Speech Lab.",
          "sentence_id": 1,
          "words": [
            {
              "begin_time": 760,
              "end_time": 1000,
              "text": "Hello",
              "punctuation": ""
            },
            {
              "begin_time": 1000,
              "end_time": 1120,
              "text": " World",
              "punctuation": ", "
            },
            {
              "begin_time": 1400,
              "end_time": 1920,
              "text": "this is",
              "punctuation": ""
            },
            {
              "begin_time": 1920,
              "end_time": 2520,
              "text": "Alibaba",
              "punctuation": ""
            },
            {
              "begin_time": 2520,
              "end_time": 2840,
              "text": "Speech",
              "punctuation": ""
            },
            {
              "begin_time": 2840,
              "end_time": 3240,
              "text": "Lab",
              "punctuation": "."
            }
          ]
        }
      ]
    }
  ]
}
Second result
{
  "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/hello_world_male2.wav",
  "properties": {
    "audio_format": "pcm_s16le",
    "channels": [
      0
    ],
    "original_sampling_rate": 16000,
    "original_duration_in_milliseconds": 4726
  },
  "transcripts": [
    {
      "channel_id": 0,
      "content_duration_in_milliseconds": 3800,
      "text": "Hello World, this is Alibaba Speech Lab.",
      "sentences": [
        {
          "begin_time": 680,
          "end_time": 4480,
          "text": "Hello World, this is Alibaba Speech Lab.",
          "sentence_id": 1,
          "words": [
            {
              "begin_time": 680,
              "end_time": 960,
              "text": "Hello",
              "punctuation": ""
            },
            {
              "begin_time": 960,
              "end_time": 1080,
              "text": " World",
              "punctuation": ", "
            },
            {
              "begin_time": 1480,
              "end_time": 2160,
              "text": "this is",
              "punctuation": ""
            },
            {
              "begin_time": 2160,
              "end_time": 3080,
              "text": "Alibaba",
              "punctuation": ""
            },
            {
              "begin_time": 3080,
              "end_time": 3520,
              "text": "Speech",
              "punctuation": ""
            },
            {
              "begin_time": 3520,
              "end_time": 4480,
              "text": "Lab",
              "punctuation": "."
            }
          ]
        }
      ]
    }
  ]
}
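The result JSON above nests the recognized text under transcripts → sentences, with begin_time and end_time in milliseconds. A small sketch, assuming only the field names shown in the sample results, that flattens a result into timestamped lines (for example, to build SRT-style subtitles):

```python
def transcript_sentences(result):
    """Yield (begin_ms, end_ms, text) tuples from a transcription
    result shaped like the JSON samples above."""
    for transcript in result.get('transcripts', []):
        for sentence in transcript.get('sentences', []):
            yield sentence['begin_time'], sentence['end_time'], sentence['text']

def to_timestamp(ms):
    """Format a millisecond offset as HH:MM:SS,mmm (SRT-style)."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'
```

For the first sample result, this yields one sentence spanning 760 ms to 3240 ms, which to_timestamp renders as 00:00:00,760 and 00:00:03,240.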

API reference

FAQ

  • Fun-ASR
  • Qwen-ASR
  • Qwen-Omni

How can I improve recognition accuracy?

Consider all relevant factors and take appropriate action. Key factors include the following:
  1. Sound quality: The quality of the recording device, the sample rate, and environmental noise affect audio clarity. High-quality audio is essential for accurate recognition.
  2. Speaker characteristics: Differences in pitch, speech rate, accent, and dialect can make recognition more difficult, especially for rare dialects or heavy accents.
  3. Language and vocabulary: Mixed languages, professional jargon, or slang can make recognition more difficult. You can configure hotwords to optimize recognition for these cases.
  4. Contextual understanding: Lack of context can lead to semantic ambiguity, especially in situations where context is necessary for correct recognition.
Optimization methods:
  1. Optimize audio quality: Use high-performance microphones and devices that support the recommended sample rate. Reduce environmental noise and echo.
  2. Adapt to the speaker: For scenarios that involve strong accents or diverse dialects, choose a model that supports those dialects.
  3. Configure hotwords: Set hotwords for technical terms, proper nouns, and other specific terms. For more information, see Customize hotwords.
  4. Preserve context: Avoid segmenting audio into clips that are too short.
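For point 4, if you must split a long recording before submitting it, segment planning can merge a too-short trailing remainder into the previous clip so no segment loses its surrounding context. This is a minimal sketch; the 60-second target and 10-second minimum are illustrative choices, not API limits:

```python
def plan_segments(total_ms, target_ms=60_000, min_ms=10_000):
    """Split a recording of total_ms into segments of roughly
    target_ms each, absorbing a trailing remainder shorter than
    min_ms into the previous segment."""
    segments = []
    start = 0
    while start < total_ms:
        end = min(start + target_ms, total_ms)
        # If the leftover after this segment would be too short,
        # extend this segment to the end instead.
        if 0 < total_ms - end < min_ms:
            end = total_ms
        segments.append((start, end))
        start = end
    return segments
```

For a 125-second recording this produces two segments, (0 s, 60 s) and (60 s, 125 s), rather than leaving a 5-second clip that would be too short to recognize in context.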