Speech-to-text

Audio file transcription

Convert files to text

Qwen Cloud offers three model families for audio file transcription: Fun-ASR for high-accuracy multilingual transcription with singing recognition, Qwen-ASR for recognition with enhanced semantic understanding, and Qwen-Omni for prompt-based transcription with contextual understanding.
For model availability, supported languages, and feature comparison, see Speech-to-text models.

Getting started

  • Fun-ASR
  • Qwen-ASR
  • Qwen-Omni
The following sections provide sample code for API calls. Get an API key and set it as an environment variable. To use the SDK, install it. Because audio and video files are often large, file transfer and speech recognition can take a long time. The file transcription API therefore uses asynchronous invocation: you submit a task, and after recognition is complete, you retrieve the results through the query API.
from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# If you have not configured an environment variable, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

task_response = Transcription.async_call(
  model='fun-asr',
  file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/hello_world_female2.wav',
      'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/hello_world_male2.wav'],
  language_hints=['zh', 'en']  # language_hints is an optional parameter that specifies the language code of the audio to be recognized. For the value range, see the API reference.
)

transcription_response = Transcription.wait(task=task_response.output.task_id)

if transcription_response.status_code == HTTPStatus.OK:
  for transcription in transcription_response.output['results']:
    if transcription['subtask_status'] == 'SUCCEEDED':
      url = transcription['transcription_url']
      result = json.loads(request.urlopen(url).read().decode('utf8'))
      print(json.dumps(result, indent=4, ensure_ascii=False))
    else:
      print('transcription failed!')
      print(transcription)
else:
  print('Error: ', transcription_response.output.message)
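Transcription.wait blocks until the task finishes. If you prefer to poll yourself (for example with Transcription.fetch, checking the task status on each response), the loop can be sketched generically. This is a minimal sketch: the fetch_status callable and the status strings are assumptions for illustration, matching the subtask_status values shown in the sample code.

```python
import time

def poll_until_done(fetch_status, interval_s=3.0, timeout_s=600.0):
    """Poll fetch_status() until it reports a terminal status.

    fetch_status: callable returning one of 'PENDING', 'RUNNING',
                  'SUCCEEDED', or 'FAILED'.
    Returns the final status string, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ('SUCCEEDED', 'FAILED'):
            return status
        time.sleep(interval_s)
    raise TimeoutError('transcription task did not finish in time')
```

In practice you would wrap a call such as `Transcription.fetch(task=task_id)` in the fetch_status callable and read the status field off its response.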
First result
{
  "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/hello_world_female2.wav",
  "properties": {
    "audio_format": "pcm_s16le",
    "channels": [
      0
    ],
    "original_sampling_rate": 16000,
    "original_duration_in_milliseconds": 3834
  },
  "transcripts": [
    {
      "channel_id": 0,
      "content_duration_in_milliseconds": 2480,
      "text": "Hello World, this is Alibaba Speech Lab.",
      "sentences": [
        {
          "begin_time": 760,
          "end_time": 3240,
          "text": "Hello World, this is Alibaba Speech Lab.",
          "sentence_id": 1,
          "words": [
            {
              "begin_time": 760,
              "end_time": 1000,
              "text": "Hello",
              "punctuation": ""
            },
            {
              "begin_time": 1000,
              "end_time": 1120,
              "text": " World",
              "punctuation": ", "
            },
            {
              "begin_time": 1400,
              "end_time": 1920,
              "text": "this is",
              "punctuation": ""
            },
            {
              "begin_time": 1920,
              "end_time": 2520,
              "text": "Alibaba",
              "punctuation": ""
            },
            {
              "begin_time": 2520,
              "end_time": 2840,
              "text": "Speech",
              "punctuation": ""
            },
            {
              "begin_time": 2840,
              "end_time": 3240,
              "text": "Lab",
              "punctuation": "."
            }
          ]
        }
      ]
    }
  ]
}
Second result
{
  "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/hello_world_male2.wav",
  "properties": {
    "audio_format": "pcm_s16le",
    "channels": [
      0
    ],
    "original_sampling_rate": 16000,
    "original_duration_in_milliseconds": 4726
  },
  "transcripts": [
    {
      "channel_id": 0,
      "content_duration_in_milliseconds": 3800,
      "text": "Hello World, this is Alibaba Speech Lab.",
      "sentences": [
        {
          "begin_time": 680,
          "end_time": 4480,
          "text": "Hello World, this is Alibaba Speech Lab.",
          "sentence_id": 1,
          "words": [
            {
              "begin_time": 680,
              "end_time": 960,
              "text": "Hello",
              "punctuation": ""
            },
            {
              "begin_time": 960,
              "end_time": 1080,
              "text": " World",
              "punctuation": ", "
            },
            {
              "begin_time": 1480,
              "end_time": 2160,
              "text": "this is",
              "punctuation": ""
            },
            {
              "begin_time": 2160,
              "end_time": 3080,
              "text": "Alibaba",
              "punctuation": ""
            },
            {
              "begin_time": 3080,
              "end_time": 3520,
              "text": "Speech",
              "punctuation": ""
            },
            {
              "begin_time": 3520,
              "end_time": 4480,
              "text": "Lab",
              "punctuation": "."
            }
          ]
        }
      ]
    }
  ]
}
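The result JSON above nests the recognized text under transcripts → sentences, with begin_time and end_time in milliseconds. A small sketch, assuming only the field names shown in the sample results, that flattens a result into timestamped lines (for example, to build SRT-style subtitles):

```python
def transcript_sentences(result):
    """Yield (begin_ms, end_ms, text) tuples from a transcription
    result shaped like the JSON samples above."""
    for transcript in result.get('transcripts', []):
        for sentence in transcript.get('sentences', []):
            yield sentence['begin_time'], sentence['end_time'], sentence['text']

def to_timestamp(ms):
    """Format a millisecond offset as HH:MM:SS,mmm (SRT-style)."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'
```

For the first sample result, this yields one sentence spanning 760 ms to 3240 ms, which to_timestamp renders as 00:00:00,760 and 00:00:03,240.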

API reference

FAQ

  • Fun-ASR
  • Qwen-ASR
  • Qwen-Omni

How can I improve recognition accuracy?

Consider all relevant factors and take appropriate action. Key factors include the following:
  1. Sound quality: The quality of the recording device, the sample rate, and environmental noise affect audio clarity. High-quality audio is essential for accurate recognition.
  2. Speaker characteristics: Differences in pitch, speech rate, accent, and dialect can make recognition more difficult, especially for rare dialects or heavy accents.
  3. Language and vocabulary: Mixed languages, professional jargon, or slang can make recognition more difficult. You can configure hotwords to optimize recognition for these cases.
  4. Contextual understanding: Lack of context can lead to semantic ambiguity, especially in situations where context is necessary for correct recognition.
Optimization methods:
  1. Optimize audio quality: Use high-performance microphones and devices that support the recommended sample rate. Reduce environmental noise and echo.
  2. Adapt to the speaker: For scenarios that involve strong accents or diverse dialects, choose a model that supports those dialects.
  3. Configure hotwords: Set hotwords for technical terms, proper nouns, and other specific terms. For more information, see Customize hotwords.
  4. Preserve context: Avoid segmenting audio into clips that are too short.
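For point 4, if you must split a long recording before submitting it, segment planning can merge a too-short trailing remainder into the previous clip so no segment loses its surrounding context. This is a minimal sketch; the 60-second target and 10-second minimum are illustrative choices, not API limits:

```python
def plan_segments(total_ms, target_ms=60_000, min_ms=10_000):
    """Split a recording of total_ms into segments of roughly
    target_ms each, absorbing a trailing remainder shorter than
    min_ms into the previous segment."""
    segments = []
    start = 0
    while start < total_ms:
        end = min(start + target_ms, total_ms)
        # If the leftover after this segment would be too short,
        # extend this segment to the end instead.
        if 0 < total_ms - end < min_ms:
            end = total_ms
        segments.append((start, end))
        start = end
    return segments
```

For a 125-second recording this produces two segments, (0 s, 60 s) and (60 s, 125 s), rather than leaving a 5-second clip that would be too short to recognize in context.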