Convert files to text
Qwen Cloud offers three model families for audio file transcription: Fun-ASR for high-accuracy multilingual transcription with singing recognition, Qwen-ASR for recognition with enhanced semantic understanding, and Qwen-Omni for prompt-based transcription with contextual understanding.
For model availability, supported languages, and feature comparison, see Speech-to-text models.
Getting started
- Fun-ASR
- Qwen-ASR
- Qwen-Omni
The following sections provide sample code for API calls. Get an API key and set it as an environment variable. To use the SDK, install it. Because audio and video files are often large, file transfer and speech recognition can take a long time, so the file transcription API uses asynchronous invocation: you submit a task, and after the task completes, you call the query API to retrieve the speech recognition results.
from http import HTTPStatus
from dashscope.audio.asr import Transcription
from urllib import request
import dashscope
import os
import json

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# If you have not configured an environment variable, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

# Submit the transcription task
task_response = Transcription.async_call(
    model='fun-asr',
    file_urls=['https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/hello_world_female2.wav',
               'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/hello_world_male2.wav'],
    language_hints=['zh', 'en']  # Optional. Specifies the language codes of the audio to be recognized. For the value range, see the API reference.
)

# Block until the task completes, then fetch and print each subtask's result
transcription_response = Transcription.wait(task=task_response.output.task_id)
if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        if transcription['subtask_status'] == 'SUCCEEDED':
            url = transcription['transcription_url']
            result = json.loads(request.urlopen(url).read().decode('utf8'))
            print(json.dumps(result, indent=4, ensure_ascii=False))
        else:
            print('transcription failed!')
            print(transcription)
else:
    print('Error: ', transcription_response.output.message)
The complete recognition result is printed to the console in JSON format. The result includes the transcribed text and the start and end times of the text in the audio or video file, specified in milliseconds.
First result:
{
    "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/hello_world_female2.wav",
    "properties": {
        "audio_format": "pcm_s16le",
        "channels": [0],
        "original_sampling_rate": 16000,
        "original_duration_in_milliseconds": 3834
    },
    "transcripts": [
        {
            "channel_id": 0,
            "content_duration_in_milliseconds": 2480,
            "text": "Hello World, this is Alibaba Speech Lab.",
            "sentences": [
                {
                    "begin_time": 760,
                    "end_time": 3240,
                    "text": "Hello World, this is Alibaba Speech Lab.",
                    "sentence_id": 1,
                    "words": [
                        {"begin_time": 760, "end_time": 1000, "text": "Hello", "punctuation": ""},
                        {"begin_time": 1000, "end_time": 1120, "text": " World", "punctuation": ", "},
                        {"begin_time": 1400, "end_time": 1920, "text": "this is", "punctuation": ""},
                        {"begin_time": 1920, "end_time": 2520, "text": "Alibaba", "punctuation": ""},
                        {"begin_time": 2520, "end_time": 2840, "text": "Speech", "punctuation": ""},
                        {"begin_time": 2840, "end_time": 3240, "text": "Lab", "punctuation": "."}
                    ]
                }
            ]
        }
    ]
}
Second result:
{
    "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/hello_world_male2.wav",
    "properties": {
        "audio_format": "pcm_s16le",
        "channels": [0],
        "original_sampling_rate": 16000,
        "original_duration_in_milliseconds": 4726
    },
    "transcripts": [
        {
            "channel_id": 0,
            "content_duration_in_milliseconds": 3800,
            "text": "Hello World, this is Alibaba Speech Lab.",
            "sentences": [
                {
                    "begin_time": 680,
                    "end_time": 4480,
                    "text": "Hello World, this is Alibaba Speech Lab.",
                    "sentence_id": 1,
                    "words": [
                        {"begin_time": 680, "end_time": 960, "text": "Hello", "punctuation": ""},
                        {"begin_time": 960, "end_time": 1080, "text": " World", "punctuation": ", "},
                        {"begin_time": 1480, "end_time": 2160, "text": "this is", "punctuation": ""},
                        {"begin_time": 2160, "end_time": 3080, "text": "Alibaba", "punctuation": ""},
                        {"begin_time": 3080, "end_time": 3520, "text": "Speech", "punctuation": ""},
                        {"begin_time": 3520, "end_time": 4480, "text": "Lab", "punctuation": "."}
                    ]
                }
            ]
        }
    ]
}
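The sentence-level timestamps in the results above lend themselves to subtitle generation. The following is an illustrative sketch (the helper names are hypothetical; the field names and millisecond timestamps match the sample output above) that converts one result document into SRT subtitle text:

```python
# Illustrative sketch (helper names are hypothetical): convert one transcription
# result document, shaped like the samples above, into SRT subtitle text.
# All timestamps in the result are in milliseconds.

def ms_to_srt_time(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, e.g. 3240 -> 00:00:03,240."""
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    seconds, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"

def result_to_srt(result: dict) -> str:
    """Emit one SRT cue per sentence across all channels."""
    cues = []
    index = 1
    for transcript in result["transcripts"]:
        for sentence in transcript["sentences"]:
            start = ms_to_srt_time(sentence["begin_time"])
            end = ms_to_srt_time(sentence["end_time"])
            cues.append(f"{index}\n{start} --> {end}\n{sentence['text']}\n")
            index += 1
    return "\n".join(cues)
```

Applied to the first sample result, this produces a cue such as `1`, `00:00:00,760 --> 00:00:03,240`, followed by the sentence text.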
Before you begin, get an API key. To use the SDK, install it.
- DashScope
- OpenAI compatible
- Qwen3-ASR-Flash-Filetrans
- Qwen3-ASR-Flash
Qwen3-ASR-Flash-Filetrans is designed for asynchronous transcription of audio files and supports recordings up to 12 hours long. This model requires a publicly accessible URL of an audio file as input and does not support direct uploads of local files. It is a non-streaming API that returns the complete recognition result after the task completes.
- cURL
- Java SDK
- Python SDK
When you use cURL for speech recognition, first submit a task to get a task ID (task_id), and then use the ID to retrieve the task result.
Submit a task
curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/audio/asr/transcription' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-Async: enable" \
-d '{
"model": "qwen3-asr-flash-filetrans",
"input": {
"file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
},
"parameters": {
"channel_id":[
0
],
"enable_itn": false,
"enable_words": true
}
}'
Get the task result
curl -X GET 'https://dashscope-intl.aliyuncs.com/api/v1/tasks/{task_id}' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "X-DashScope-Async: enable" \
-H "Content-Type: application/json"
Complete example
import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;
import okhttp3.*;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
public class Main {
private static final String API_URL_SUBMIT = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/asr/transcription";
private static final String API_URL_QUERY = "https://dashscope-intl.aliyuncs.com/api/v1/tasks/";
private static final Gson gson = new Gson();
public static void main(String[] args) {
// If you have not configured environment variables, replace the following line with: String apiKey = "sk-xxx"
String apiKey = System.getenv("DASHSCOPE_API_KEY");
OkHttpClient client = new OkHttpClient();
// 1. Submit task
String payloadJson = """
{
"model": "qwen3-asr-flash-filetrans",
"input": {
"file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
},
"parameters": {
"channel_id": [0],
"enable_itn": false,
"enable_words": true
}
}
""";
RequestBody body = RequestBody.create(payloadJson, MediaType.get("application/json; charset=utf-8"));
Request submitRequest = new Request.Builder()
.url(API_URL_SUBMIT)
.addHeader("Authorization", "Bearer " + apiKey)
.addHeader("Content-Type", "application/json")
.addHeader("X-DashScope-Async", "enable")
.post(body)
.build();
String taskId = null;
try (Response response = client.newCall(submitRequest).execute()) {
if (response.isSuccessful() && response.body() != null) {
String respBody = response.body().string();
ApiResponse apiResp = gson.fromJson(respBody, ApiResponse.class);
if (apiResp.output != null) {
taskId = apiResp.output.taskId;
System.out.println("Task submitted. task_id: " + taskId);
} else {
System.out.println("Submission response content: " + respBody);
return;
}
} else {
System.out.println("Task submission failed! HTTP code: " + response.code());
if (response.body() != null) {
System.out.println(response.body().string());
}
return;
}
} catch (IOException e) {
e.printStackTrace();
return;
}
// 2. Poll task status
boolean finished = false;
while (!finished) {
try {
TimeUnit.SECONDS.sleep(2); // Wait for 2 seconds before querying again
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
return;
}
String queryUrl = API_URL_QUERY + taskId;
Request queryRequest = new Request.Builder()
.url(queryUrl)
.addHeader("Authorization", "Bearer " + apiKey)
.addHeader("X-DashScope-Async", "enable")
.addHeader("Content-Type", "application/json")
.get()
.build();
try (Response response = client.newCall(queryRequest).execute()) {
if (response.body() != null) {
String queryResponse = response.body().string();
ApiResponse apiResp = gson.fromJson(queryResponse, ApiResponse.class);
if (apiResp.output != null && apiResp.output.taskStatus != null) {
String status = apiResp.output.taskStatus;
System.out.println("Current task status: " + status);
if ("SUCCEEDED".equalsIgnoreCase(status)
|| "FAILED".equalsIgnoreCase(status)
|| "UNKNOWN".equalsIgnoreCase(status)) {
finished = true;
System.out.println("Task completed. Final result: ");
System.out.println(queryResponse);
}
} else {
System.out.println("Query response content: " + queryResponse);
}
}
} catch (IOException e) {
e.printStackTrace();
return;
}
}
}
static class ApiResponse {
@SerializedName("request_id")
String requestId;
Output output;
}
static class Output {
@SerializedName("task_id")
String taskId;
@SerializedName("task_status")
String taskStatus;
}
}
import com.alibaba.dashscope.audio.qwen_asr.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonObject;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
public class Main {
public static void main(String[] args) {
Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
QwenTranscriptionParam param =
QwenTranscriptionParam.builder()
// If you have not configured environment variables, replace the following line with: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen3-asr-flash-filetrans")
.fileUrl("https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav")
//.parameter("language", "zh")
//.parameter("channel_id", new ArrayList<String>(){{add("0");add("1");}})
.parameter("enable_itn", false)
.parameter("enable_words", true)
.build();
try {
QwenTranscription transcription = new QwenTranscription();
// Submit the task
QwenTranscriptionResult result = transcription.asyncCall(param);
System.out.println("create task result: " + result);
// Query the task status
result = transcription.fetch(QwenTranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
System.out.println("task status: " + result);
// Wait for the task to complete
result =
transcription.wait(
QwenTranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
System.out.println("task result: " + result);
// Get the speech recognition result
QwenTranscriptionTaskResult taskResult = result.getResult();
if (taskResult != null) {
// Get the URL of the recognition result
String transcriptionUrl = taskResult.getTranscriptionUrl();
// Get the result from the URL
HttpURLConnection connection =
(HttpURLConnection) new URL(transcriptionUrl).openConnection();
connection.setRequestMethod("GET");
connection.connect();
BufferedReader reader =
new BufferedReader(new InputStreamReader(connection.getInputStream()));
// Format and print the JSON result
Gson gson = new GsonBuilder().setPrettyPrinting().create();
System.out.println(gson.toJson(gson.fromJson(reader, JsonObject.class)));
}
} catch (Exception e) {
System.out.println("error: " + e);
}
}
}
import json
import os
import sys
from http import HTTPStatus

import dashscope
from dashscope.audio.qwen_asr import QwenTranscription
from dashscope.api_entities.dashscope_response import TranscriptionResponse

if __name__ == '__main__':
    # If you have not configured environment variables, replace the following line with: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
    dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

    # Submit the transcription task
    task_response = QwenTranscription.async_call(
        model='qwen3-asr-flash-filetrans',
        file_url='https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav',
        # language="",
        enable_itn=False,
        enable_words=True
    )
    print(f'task_response: {task_response}')
    print(task_response.output.task_id)

    # Query the task status once
    query_response = QwenTranscription.fetch(task=task_response.output.task_id)
    print(f'query_response: {query_response}')

    # Block until the task completes and print the final result
    task_result = QwenTranscription.wait(task=task_response.output.task_id)
    print(f'task_result: {task_result}')
Qwen3-ASR-Flash supports recordings up to 5 minutes long. This model accepts a publicly accessible audio file URL or a direct upload of a local file as input. It can also return recognition results as a stream.
The example uses the audio file: welcome.mp3.
Input: Audio file URL
import os
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]
response = dashscope.MultiModalConversation.call(
# If you have not configured environment variables, replace the following line with: api_key = "sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen3-asr-flash",
messages=messages,
result_format="message",
asr_options={
#"language": "zh", # Optional. If the audio language is known, you can specify it using this parameter to improve recognition accuracy.
"enable_itn":False
}
)
print(response)
Input: Base64-encoded audio file
Provide the audio as Base64-encoded data (a data URL) in the format data:<mediatype>;base64,<data>.
- <mediatype>: the MIME type, which varies by audio format. For example: WAV: audio/wav; MP3: audio/mpeg.
- <data>: the Base64-encoded string of the audio. Base64 encoding increases file size, so keep the original file small enough that the encoded data stays within the 10 MB input limit.
- Example: data:audio/wav;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//PAxABQ/BXRbMPe4IQAhl9
See example code
import base64, pathlib
# input.mp3 is a local audio file. Replace it with your own audio file path and ensure it meets the audio requirements
file_path = pathlib.Path("input.mp3")
base64_str = base64.b64encode(file_path.read_bytes()).decode()
data_uri = f"data:audio/mpeg;base64,{base64_str}"
import base64
import dashscope
import os
import pathlib
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# Replace with your actual audio file path
file_path = "welcome.mp3"
# Replace with your actual audio file MIME type
audio_mime_type = "audio/mpeg"
file_path_obj = pathlib.Path(file_path)
if not file_path_obj.exists():
    raise FileNotFoundError(f"Audio file not found: {file_path}")
base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
data_uri = f"data:{audio_mime_type};base64,{base64_str}"
messages = [
{"role": "user", "content": [{"audio": data_uri}]}
]
response = dashscope.MultiModalConversation.call(
# If you have not configured environment variables, replace the following line with: api_key = "sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen3-asr-flash",
messages=messages,
result_format="message",
asr_options={
# "language": "zh", # Optional. If the audio language is known, you can specify it using this parameter to improve recognition accuracy.
"enable_itn":False
}
)
print(response)
Input: Absolute path to local audio file
When using the DashScope SDK to process local audio files, you must provide the file path. The following table shows how to construct the file path for your operating system and SDK.

| System | SDK | Input file path | Example |
|---|---|---|---|
| Linux or macOS | Python SDK | file://<absolute file path> | file:///home/audio/welcome.mp3 |
| Linux or macOS | Java SDK | file://<absolute file path> | file:///home/audio/welcome.mp3 |
| Windows | Python SDK | file://<absolute file path> | file://D:/audio/welcome.mp3 |
| Windows | Java SDK | file:///<absolute file path> | file:///D:/audio/welcome.mp3 |
When using local files, the API call limit is 100 QPS and cannot be scaled. Do not use this method in production environments, high-concurrency scenarios, or stress testing. For higher concurrency, upload files to OSS and call the API using the audio file URL.
import os
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path to your local audio file
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
{"role": "user", "content": [{"audio": audio_file_path}]}
]
response = dashscope.MultiModalConversation.call(
# If you have not configured environment variables, replace the following line with: api_key = "sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen3-asr-flash",
messages=messages,
result_format="message",
asr_options={
# "language": "zh", # Optional. If the audio language is known, you can specify it using this parameter to improve recognition accuracy.
"enable_itn":False
}
)
print(response)
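Rather than concatenating "file://" by hand, you can build the URI with the standard library. This is a small sketch; note that on Windows, Path.as_uri() yields the file:///D:/... form shown for the Java SDK in the table above, so Python SDK users on Windows may still need to construct file://D:/... manually:

```python
from pathlib import Path

# Build a file URI from a local path. On Linux/macOS this yields
# file:///absolute/path/welcome.mp3, matching the table above.
audio_uri = Path("welcome.mp3").resolve().as_uri()
print(audio_uri)
```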
Streaming output
The model generates results incrementally rather than all at once. Non-streaming output waits until the model finishes generating and then returns the complete result. Streaming output returns intermediate results in real time, letting you read results as they are generated and reducing wait time. Enable streaming output as follows, depending on your calling method:
- DashScope Python SDK: Set the stream parameter to True.
- DashScope Java SDK: Use the streamCall interface.
- DashScope HTTP: Set the X-DashScope-SSE header to enable.
- Python SDK
- Java SDK
- cURL
import os
import dashscope
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
messages = [
{"role": "user", "content": [{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}]}
]
response = dashscope.MultiModalConversation.call(
# If you have not configured environment variables, replace the following line with: api_key = "sk-xxx"
api_key=os.getenv("DASHSCOPE_API_KEY"),
model="qwen3-asr-flash",
messages=messages,
result_format="message",
asr_options={
# "language": "zh", # Optional. If the audio language is known, you can specify it using this parameter to improve recognition accuracy.
"enable_itn":False
},
stream=True
)
for chunk in response:
    try:
        print(chunk["output"]["choices"][0]["message"].content[0]["text"])
    except (IndexError, KeyError, TypeError):
        # Some chunks carry no text content; skip them
        pass
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.Flowable;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage userMessage = MultiModalMessage.builder()
.role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3")))
.build();
MultiModalMessage sysMessage = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
.build();
Map<String, Object> asrOptions = new HashMap<>();
asrOptions.put("enable_itn", false);
// asrOptions.put("language", "zh"); // Optional. If the audio language is known, you can specify it using this parameter to improve recognition accuracy.
MultiModalConversationParam param = MultiModalConversationParam.builder()
// If you have not configured environment variables, replace the following line with: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("qwen3-asr-flash")
.message(sysMessage)
.message(userMessage)
.parameter("asr_options", asrOptions)
.build();
Flowable<MultiModalConversationResult> resultFlowable = conv.streamCall(param);
resultFlowable.blockingForEach(item -> {
try {
System.out.println(item.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
} catch (Exception e){
System.exit(0);
}
});
}
public static void main(String[] args) {
try {
Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
curl -X POST "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
"model": "qwen3-asr-flash",
"input": {
"messages": [
{
"content": [
{
"text": ""
}
],
"role": "system"
},
{
"content": [
{
"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
}
],
"role": "user"
}
]
},
"parameters": {
"incremental_output": true,
"asr_options": {
"enable_itn": false
}
}
}'
Only the Qwen3-ASR-Flash series models support OpenAI-compatible calls. OpenAI-compatible mode accepts only publicly accessible audio file URLs and does not support local file paths. Use OpenAI Python SDK version 1.52.0 or later, or Node.js SDK version 4.68.0 or later. The asr_options parameter is not part of the OpenAI standard; when using the OpenAI SDK, pass it through extra_body.
Input: Audio file URL
- Python SDK
- Node.js SDK
- cURL
from openai import OpenAI
import os

try:
    client = OpenAI(
        # If you have not configured environment variables, replace the following line with: api_key="sk-xxx",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    stream_enabled = False  # Set to True to enable streaming output
    completion = client.chat.completions.create(
        model="qwen3-asr-flash",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
                        }
                    }
                ]
            }
        ],
        stream=stream_enabled,
        # Do not set stream_options when stream is False
        # stream_options={"include_usage": True},
        extra_body={
            "asr_options": {
                # "language": "zh",
                "enable_itn": False
            }
        }
    )
    if stream_enabled:
        full_content = ""
        print("Streaming output:")
        for chunk in completion:
            # If stream_options.include_usage is True, the last chunk's choices field is an empty list (token usage is available via chunk.usage)
            print(chunk)
            if chunk.choices and chunk.choices[0].delta.content:
                full_content += chunk.choices[0].delta.content
        print(f"Full content: {full_content}")
    else:
        print(f"Non-streaming output: {completion.choices[0].message.content}")
except Exception as e:
    print(f"Error: {e}")
// Preparation before running:
// Works on Windows/Mac/Linux:
// 1. Ensure Node.js is installed (version >= 14 recommended)
// 2. Run this command to install dependencies: npm install openai
import OpenAI from "openai";
const client = new OpenAI({
// If you have not configured environment variables, replace the following line with: apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});
async function main() {
try {
const streamEnabled = false; // Set to true to enable streaming output
const completion = await client.chat.completions.create({
model: "qwen3-asr-flash",
messages: [
{
role: "user",
content: [
{
type: "input_audio",
input_audio: {
data: "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
}
}
]
}
],
stream: streamEnabled,
// Do not set stream_options when stream is False
// stream_options: {
// "include_usage": true
// },
extra_body: {
asr_options: {
// language: "zh",
enable_itn: false
}
}
});
if (streamEnabled) {
let fullContent = "";
console.log("Streaming output:");
for await (const chunk of completion) {
console.log(JSON.stringify(chunk));
if (chunk.choices && chunk.choices.length > 0) {
const delta = chunk.choices[0].delta;
if (delta && delta.content) {
fullContent += delta.content;
}
}
}
console.log(`Full content: ${fullContent}`);
} else {
console.log(`Non-streaming output: ${completion.choices[0].message.content}`);
}
} catch (err) {
console.error(`Error: ${err}`);
}
}
main();
curl -X POST 'https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-asr-flash",
"messages": [
{
"content": [
{
"type": "input_audio",
"input_audio": {
"data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
}
}
],
"role": "user"
}
],
"stream":false,
"asr_options": {
"enable_itn": false
}
}'
Input: Base64-encoded audio file
Provide the audio as Base64-encoded data (a data URL) in the format data:<mediatype>;base64,<data>.
- <mediatype>: the MIME type, which varies by audio format. For example: WAV: audio/wav; MP3: audio/mpeg.
- <data>: the Base64-encoded string of the audio. Base64 encoding increases file size, so keep the original file small enough that the encoded data stays within the 10 MB input limit.
- Example: data:audio/wav;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//PAxABQ/BXRbMPe4IQAhl9
See example code
import base64, pathlib
# input.mp3 is a local audio file. Replace it with your own audio file path and ensure it meets the audio requirements
file_path = pathlib.Path("input.mp3")
base64_str = base64.b64encode(file_path.read_bytes()).decode()
data_uri = f"data:audio/mpeg;base64,{base64_str}"
- Python SDK
- Node.js SDK
The example uses the audio file: welcome.mp3.
import base64
import os
import pathlib

from openai import OpenAI

try:
    # Replace with your actual audio file path
    file_path = "welcome.mp3"
    # Replace with your actual audio file MIME type
    audio_mime_type = "audio/mpeg"
    file_path_obj = pathlib.Path(file_path)
    if not file_path_obj.exists():
        raise FileNotFoundError(f"Audio file not found: {file_path}")
    base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
    data_uri = f"data:{audio_mime_type};base64,{base64_str}"

    client = OpenAI(
        # If you have not configured environment variables, replace the following line with: api_key="sk-xxx",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    stream_enabled = False  # Set to True to enable streaming output
    completion = client.chat.completions.create(
        model="qwen3-asr-flash",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": data_uri
                        }
                    }
                ]
            }
        ],
        stream=stream_enabled,
        # Do not set stream_options when stream is False
        # stream_options={"include_usage": True},
        extra_body={
            "asr_options": {
                # "language": "zh",
                "enable_itn": False
            }
        }
    )
    if stream_enabled:
        full_content = ""
        print("Streaming output:")
        for chunk in completion:
            # If stream_options.include_usage is True, the last chunk's choices field is an empty list (token usage is available via chunk.usage)
            print(chunk)
            if chunk.choices and chunk.choices[0].delta.content:
                full_content += chunk.choices[0].delta.content
        print(f"Full content: {full_content}")
    else:
        print(f"Non-streaming output: {completion.choices[0].message.content}")
except Exception as e:
    print(f"Error: {e}")
The example uses the audio file: welcome.mp3.
// Preparation before running:
// Works on Windows/Mac/Linux:
// 1. Ensure Node.js is installed (version >= 14 recommended)
// 2. Run this command to install dependencies: npm install openai
import OpenAI from "openai";
import { readFileSync } from 'fs';
const client = new OpenAI({
// If you have not configured environment variables, replace the following line with: apiKey: "sk-xxx",
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
});
const encodeAudioFile = (audioFilePath) => {
const audioFile = readFileSync(audioFilePath);
return audioFile.toString('base64');
};
// Replace with your actual audio file path
const dataUri = `data:audio/mpeg;base64,${encodeAudioFile("welcome.mp3")}`;
async function main() {
try {
const streamEnabled = false; // Set to true to enable streaming output
const completion = await client.chat.completions.create({
model: "qwen3-asr-flash",
messages: [
{
role: "user",
content: [
{
type: "input_audio",
input_audio: {
data: dataUri
}
}
]
}
],
stream: streamEnabled,
// Do not set stream_options when stream is False
// stream_options: {
// "include_usage": true
// },
extra_body: {
asr_options: {
// language: "zh",
enable_itn: false
}
}
});
if (streamEnabled) {
let fullContent = "";
console.log("Streaming output:");
for await (const chunk of completion) {
console.log(JSON.stringify(chunk));
if (chunk.choices && chunk.choices.length > 0) {
const delta = chunk.choices[0].delta;
if (delta && delta.content) {
fullContent += delta.content;
}
}
}
console.log(`Full content: ${fullContent}`);
} else {
console.log(`Non-streaming output: ${completion.choices[0].message.content}`);
}
} catch (err) {
console.error(`Error: ${err}`);
}
}
main();
Use Qwen-Omni (qwen3-omni-flash) for file transcription with prompt-based context. This approach lets you describe your domain in the system prompt for improved accuracy. Before you begin, get an API key and set it as an environment variable, and install the SDK.
Qwen-Omni interprets all audio, not just speech. Music, typing, or ambient noise may produce descriptions instead of transcription. For mixed audio, preprocess with VAD to isolate speech, or add a system prompt instruction: "Transcribe only human speech. Ignore non-speech sounds."
import os
from openai import OpenAI
client = OpenAI(
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="qwen3-omni-flash",
messages=[
{"role": "system", "content": "Transcribe the following audio exactly as spoken. Output only the transcription text. Ignore non-speech sounds."},
{"role": "user", "content": [
{"type": "input_audio", "input_audio": {"data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav"}},
{"type": "text", "text": "Transcribe this audio."}
]}
],
modalities=["text"],
stream=True,
)
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
For full Qwen-Omni capabilities including multimodal conversation, see Audio and video file understanding.
API reference
- Fun-ASR
- Qwen-ASR
- Qwen-Omni
FAQ
- Fun-ASR
- Qwen-ASR
- Qwen-Omni
Q: How can I improve recognition accuracy?
You should consider all relevant factors and take appropriate action.
Key factors include the following:
- Sound quality: The quality of the recording device, the sample rate, and environmental noise affect audio clarity. High-quality audio is essential for accurate recognition.
- Speaker characteristics: Differences in pitch, speech rate, accent, and dialect can make recognition more difficult, especially for rare dialects or heavy accents.
- Language and vocabulary: Mixed languages, professional jargon, or slang can make recognition more difficult. You can configure hotwords to optimize recognition for these cases.
- Contextual understanding: Lack of context can lead to semantic ambiguity, especially in situations where context is necessary for correct recognition.
Recommended actions:
- Optimize audio quality: Use high-performance microphones and devices that support the recommended sample rate. Reduce environmental noise and echo.
- Adapt to the speaker: For scenarios that involve strong accents or diverse dialects, choose a model that supports those dialects.
- Configure hotwords: Set hotwords for technical terms, proper nouns, and other specific terms. For more information, see Customize hotwords.
- Preserve context: Avoid segmenting audio into clips that are too short.
Q: How do I provide a publicly accessible audio URL for the API?
We recommend using Object Storage Service (OSS), which provides highly available and reliable storage and makes it easy to generate public URLs.
Verify that your URL is publicly accessible: open it in a browser or use curl to confirm that the audio file downloads or plays successfully (HTTP status code 200).
Q: How do I check if my audio format meets requirements?
Use the open-source tool ffprobe to quickly get detailed audio information:
# Check container format (format_name), codec (codec_name), sample rate (sample_rate), and number of channels (channels)
ffprobe -v error -show_entries format=format_name:stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 your_audio_file.mp3
Q: How do I process audio to meet model requirements?
Use the open-source tool FFmpeg to trim or convert audio formats:
- Audio trimming: Extract a segment from a long audio file.
# -i: Input file
# -ss 00:01:30: Start time (1 minute 30 seconds)
# -t 00:02:00: Duration (2 minutes)
# -c copy: Copy audio stream without re-encoding (fast)
# output_clip.wav: Output file
ffmpeg -i long_audio.wav -ss 00:01:30 -t 00:02:00 -c copy output_clip.wav
- Format conversion: For example, convert any audio to 16 kHz, 16-bit, mono WAV.
# -i: Input file
# -ac 1: Set to 1 channel (mono)
# -ar 16000: Set sample rate to 16000 Hz (16 kHz)
# -sample_fmt s16: Set sample format to 16-bit signed integer PCM
# output.wav: Output file
ffmpeg -i input.mp3 -ac 1 -ar 16000 -sample_fmt s16 output.wav
Qwen-Omni interprets all audio, not just speech. Music, typing, or ambient noise may produce descriptions instead of transcription. For mixed audio, preprocess with VAD to isolate speech, or add a system prompt instruction: "Transcribe only human speech. Ignore non-speech sounds."
Q: When should I use Qwen-Omni instead of dedicated ASR models?
Use Qwen-Omni when:
- Domain-specific terminology: You need to transcribe audio with specialized vocabulary. Describe your domain in the system prompt.
- Context-aware transcription: You want to provide conversation context for improved accuracy.
- Multimodal understanding: You need to process audio alongside images or video.
- OpenAI compatibility: You prefer using the OpenAI-compatible API.
Use a dedicated ASR model (Fun-ASR or Qwen-ASR) when you need:
- Lower latency: Dedicated ASR models have lower per-request latency.
- Hotwords: You need hotword support (only available in Fun-ASR).
- Speaker diarization: You need to identify different speakers (only available in Fun-ASR).
- Long audio files: Fun-ASR supports up to 12 hours of audio.
Q: How do I improve transcription accuracy with Qwen-Omni?
Use the system prompt to provide context:
messages=[
    {"role": "system", "content": "You are transcribing a medical consultation. Key terms: diabetes, hypertension, metformin."},
    {"role": "user", "content": [{"type": "input_audio", "input_audio": {"data": "..."}}, {"type": "text", "text": "Transcribe this audio."}]}
]