OpenAI compatible ASR

POST /compatible-mode/v1/chat/completions

from openai import OpenAI
import os

try:
  client = OpenAI(
    # If you have not configured environment variables, replace the following line with your API key: api_key = "sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
  )
  

  stream_enabled = False  # Whether to enable streaming output
  completion = client.chat.completions.create(
    model="qwen3-asr-flash",
    messages=[
      {
        "content": [
          {
            "type": "input_audio",
            "input_audio": {
              "data": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
            }
          }
        ],
        "role": "user"
      }
    ],
    stream=stream_enabled,
    # When stream is set to False, the stream_options parameter cannot be set
    # stream_options={"include_usage": True},
    extra_body={
      "asr_options": {
        # "language": "zh",
        "enable_itn": False
      }
    }
  )
  if stream_enabled:
    full_content = ""
    print("Streaming output content is:")
    for chunk in completion:
      # If stream_options.include_usage is True, the choices field of the last chunk is an empty list and should be skipped (you can get token usage via chunk.usage)
      print(chunk)
      if chunk.choices and chunk.choices[0].delta.content:
        full_content += chunk.choices[0].delta.content
    print(f"Full content is: {full_content}")
  else:
    print(f"Non-streaming output content is: {completion.choices[0].message.content}")
except Exception as e:
  print(f"Error message: {e}")

{
  "id": "chatcmpl-487abe5f-d4f2-9363-a877-xxxxxxx",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "annotations": [
          {
            "emotion": "neutral",
            "language": "zh",
            "type": "audio_info"
          }
        ],
        "content": "Welcome to Qwen Cloud.",
        "role": "assistant"
      }
    }
  ],
  "created": 1767683986,
  "model": "qwen3-asr-flash",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 12,
    "completion_tokens_details": {
      "text_tokens": 12
    },
    "prompt_tokens": 42,
    "prompt_tokens_details": {
      "audio_tokens": 42,
      "text_tokens": 0
    },
    "seconds": 1,
    "total_tokens": 54
  }
}

Connection methods

Choose the method that matches your model.

Model	Connection method
Qwen3-ASR-Flash-Filetrans	DashScope asynchronous only
Qwen3-ASR-Flash	OpenAI compatible and DashScope synchronous

Supported audio formats

Qwen3-ASR-Flash accepts Base64-encoded audio or publicly accessible URLs.

Base64-encoded audio input

Use the Data URL format: data:<mediatype>;base64,<data>.

<mediatype>: The MIME type. For example, WAV: audio/wav, MP3: audio/mpeg.
<data>: The Base64-encoded string. Encoding increases file size. Keep the encoded audio within the 10 MB limit.

Example: data:audio/mpeg;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//PAxABQ/BXRbMPe4IQAhl9

Base64 encoding examples

Python

import base64
import pathlib

# Replace "input.mp3" with your audio file path
file_path = pathlib.Path("input.mp3")
base64_str = base64.b64encode(file_path.read_bytes()).decode()
data_uri = f"data:audio/mpeg;base64,{base64_str}"

Java

import java.nio.file.*;
import java.util.Base64;

public class Main {
  /**
   * Reads the audio file at filePath and returns it as a Base64 data URL.
   */
  public static String toDataUrl(String filePath) throws Exception {
    byte[] bytes = Files.readAllBytes(Paths.get(filePath));
    String encoded = Base64.getEncoder().encodeToString(bytes);
    return "data:audio/mpeg;base64," + encoded;
  }

  // Example usage
  public static void main(String[] args) throws Exception {
    System.out.println(toDataUrl("input.mp3"));
  }
}
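
The encoding examples above do not check the size limit. The following is a minimal Python sketch of that check, assuming the 10 MB limit means 10 * 1024 * 1024 bytes measured on the Base64 string (this page does not state the exact definition). The names to_data_url, encoded_size, and MAX_ENCODED_BYTES are illustrative and not part of any SDK.

```python
import base64
import mimetypes
import pathlib

# Assumption: "10 MB" is read as 10 * 1024 * 1024 bytes of Base64 text.
MAX_ENCODED_BYTES = 10 * 1024 * 1024

# MIME types named in this section; other extensions fall back to mimetypes.
KNOWN_TYPES = {".wav": "audio/wav", ".mp3": "audio/mpeg"}


def encoded_size(raw_size: int) -> int:
    """Length of the padded Base64 text produced from raw_size input bytes."""
    return 4 * ((raw_size + 2) // 3)


def to_data_url(path: str) -> str:
    """Return a data URL for the file, or raise if it would exceed the limit."""
    file_path = pathlib.Path(path)
    suffix = file_path.suffix.lower()
    media_type = KNOWN_TYPES.get(suffix) or mimetypes.guess_type(file_path.name)[0]
    if media_type is None:
        raise ValueError(f"Cannot determine the MIME type of {path}")
    raw = file_path.read_bytes()
    size = encoded_size(len(raw))
    if size > MAX_ENCODED_BYTES:
        raise ValueError(f"Encoded audio is {size} bytes, above the assumed limit")
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{media_type};base64,{encoded}"
```

Because Base64 turns every 3 input bytes into 4 characters, a raw file of roughly 7.5 MB already reaches the assumed limit; larger files are better passed as a public URL.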

asr_options is non-standard. With the OpenAI SDK, pass it via extra_body.
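
To illustrate the note above: the OpenAI Python SDK adds the keys of extra_body to the JSON request body as extra top-level properties, so asr_options should end up next to model and messages rather than under an extra_body key. The sketch below only builds that body and does not send it. build_request_body is a hypothetical helper, and the language set is copied from the asr_options language options listed on this page.

```python
import json

# Language codes copied from the asr_options language options on this page.
SUPPORTED_LANGUAGES = {
    "zh", "yue", "en", "ja", "de", "ko", "ru", "fr", "pt", "ar", "it", "es",
    "hi", "id", "th", "tr", "uk", "vi", "cs", "da", "fil", "fi", "is", "ms",
    "no", "pl", "sv",
}


def build_request_body(audio, language=None, enable_itn=False, stream=False):
    """Build the JSON body for one audio input (URL or data URL).

    Hypothetical helper: it mirrors the request shown at the top of this
    page, with asr_options placed at the top level of the body.
    """
    asr_options = {"enable_itn": enable_itn}
    if language is not None:
        if language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language code: {language}")
        asr_options["language"] = language
    return {
        "model": "qwen3-asr-flash",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "input_audio", "input_audio": {"data": audio}}
                ],
            }
        ],
        "stream": stream,
        "asr_options": asr_options,
    }


if __name__ == "__main__":
    url = "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"
    print(json.dumps(build_request_body(url, language="zh"), indent=2))
```

Leaving language as None omits the key entirely, which matches the guidance not to specify a language for multilingual audio.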

Authorizations

Authorization (string, header, required)

DashScope API key. Get your API key from the Qwen Cloud console.

Body

application/json

model (string, required)

The model name. This endpoint applies only to Qwen3-ASR-Flash.

messages (object[], required)

The list of messages.

messages.role (enum<string>, required)

The role of the message sender.

Available options: system, user

messages.content (object[], required)

The content of the message.

messages.content.type (enum<string>)

Set to input_audio for audio input.

Available options: input_audio

messages.content.input_audio (object)

The audio input object.

messages.content.input_audio.data (string)

The audio to recognize. Supports URLs of Internet-accessible files and Base64-encoded data (Data URL format: data:<mediatype>;base64,<data>).

string

Context for customized recognition (system message only). Provide background text, entity vocabularies, and other reference information. Length limit: 10,000 tokens.

asr_options (object)

Specifies whether to enable certain features. This is not a standard OpenAI parameter; pass it through extra_body when using an OpenAI SDK.

asr_options.language (enum<string>)

If you know the language of the audio, specify it to improve recognition accuracy. Specify only one language. If the audio contains multiple languages, do not specify this parameter.

Available options: zh, yue, en, ja, de, ko, ru, fr, pt, ar, it, es, hi, id, th, tr, uk, vi, cs, da, fil, fi, is, ms, no, pl, sv

asr_options.enable_itn (boolean, default: false)

Specifies whether to enable Inverse Text Normalization (ITN). Applicable only to Chinese and English audio.

stream (boolean, default: false)

Specifies whether to use streaming output. We recommend setting this to true to improve responsiveness and reduce the risk of timeouts.

stream_options (object)

Configuration for streaming output. Takes effect only when stream is true.

stream_options.include_usage (boolean, default: false)

Specifies whether to include token consumption information in the last chunk of the streamed response.
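
The streaming behavior described above can be sketched as a small collector: it concatenates the delta text of each chunk and keeps the usage object from the final chunk, whose choices list is empty when include_usage is true. This is an illustration under assumptions, not SDK code. collect_stream and fake_chunk are hypothetical names, and fake_chunk only imitates the attributes the example at the top of this page reads (choices, delta.content, usage).

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Return (full_text, usage) from an iterable of streamed chunks."""
    parts = []
    usage = None
    for chunk in chunks:
        # The usage-only chunk has an empty choices list, so check usage first.
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts), usage


def fake_chunk(text=None, usage=None):
    """Hypothetical stand-in for an SDK chunk, used only for local testing."""
    choices = []
    if text is not None:
        choices = [SimpleNamespace(delta=SimpleNamespace(content=text))]
    return SimpleNamespace(choices=choices, usage=usage)


if __name__ == "__main__":
    demo = [
        fake_chunk("Welcome "),
        fake_chunk("to Qwen Cloud."),
        fake_chunk(usage={"total_tokens": 54}),
    ]
    print(collect_stream(demo))
```

With a real client, the same function would presumably be called on the object returned by client.chat.completions.create(..., stream=True, stream_options={"include_usage": True}).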

Response

200 - application/json

id (string)

The unique identifier for this call.

Example: chatcmpl-487abe5f-d4f2-9363-a877-xxxxxxx

choices (object[])

The output information of the model.

Example:

[
  {
    "finish_reason": "stop",
    "index": 0,
    "message": {
      "annotations": [
        {
          "emotion": "neutral",
          "language": "zh",
          "type": "audio_info"
        }
      ],
      "content": "Welcome to Qwen Cloud.",
      "role": "assistant"
    }
  }
]

choices.finish_reason (enum<string>)

null while generation is in progress, stop when generation finished naturally, length when the output exceeded the maximum length.

Available options: stop, length, null

Example: stop

choices.index (integer)

The index of the current object in the choices array.

Example: 0

choices.message (object)

The message object output by the model.

choices.message.role (string)

The role of the output message. Always assistant.

Example: assistant

choices.message.content (string)

The speech recognition result text.

Example: Welcome to Qwen Cloud.

choices.message.annotations (object[])

Output annotation information, such as language and emotion.

Example:

[
  {
    "emotion": "neutral",
    "language": "zh",
    "type": "audio_info"
  }
]

choices.message.annotations.type (string)

Always audio_info.

Example: audio_info

choices.message.annotations.language (enum<string>)

The language of the recognized audio.

Available options: zh, yue, en, ja, de, ko, ru, fr, pt, ar, it, es, hi, id, th, tr, uk, vi, cs, da, fil, fi, is, ms, no, pl, sv

Example: zh

choices.message.annotations.emotion (enum<string>)

The emotion of the recognized audio.

Available options: surprised, neutral, happy, sad, disgusted, angry, fearful

Example: neutral
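
As a sketch of how the choices fields above fit together, the helper below pulls the transcript, language, and emotion out of a response that has already been converted to a plain dict. summarize is a hypothetical name, and SAMPLE is trimmed from the example response on this page. With the OpenAI SDK, something like completion.model_dump() should yield an equivalent dict, though that is an assumption about how the SDK exposes the non-standard annotations field.

```python
# Trimmed copy of the example response shown on this page.
SAMPLE = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "annotations": [
                    {"emotion": "neutral", "language": "zh", "type": "audio_info"}
                ],
                "content": "Welcome to Qwen Cloud.",
                "role": "assistant",
            },
        }
    ],
}


def summarize(response: dict) -> dict:
    """Return the transcript plus the audio_info annotation, if present."""
    message = response["choices"][0]["message"]
    annotations = message.get("annotations") or []
    # Pick the first annotation whose type is audio_info; default to empty.
    info = next((a for a in annotations if a.get("type") == "audio_info"), {})
    return {
        "text": message["content"],
        "language": info.get("language"),
        "emotion": info.get("emotion"),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Missing annotations produce None for language and emotion instead of raising, which keeps the helper usable if the field is absent.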

created (integer)

The UNIX timestamp (in seconds) when the request was created.

Example: 1767683986

model (string)

The model used for this request.

Example: qwen3-asr-flash

object (string)

Always chat.completion.

Example: chat.completion

usage (object)

Token consumption information.

usage.completion_tokens (integer)

The number of tokens in the model's output.

Example: 12

usage.completion_tokens_details (object)

usage.completion_tokens_details.text_tokens (integer)

The number of tokens in the model's output text.

Example: 12

usage.prompt_tokens (integer)

The number of tokens in the input.

Example: 42

usage.prompt_tokens_details (object)

usage.prompt_tokens_details.audio_tokens (integer)

The length of the input audio in tokens. Each second of audio converts to 25 tokens. Audio shorter than 1 second is counted as 1 second.

Example: 42

usage.prompt_tokens_details.text_tokens (integer)

Ignore this field.

Example: 0

usage.seconds (integer)

The duration of the audio in seconds.

Example: 1

usage.total_tokens (integer)

The total number of tokens in the input and output.

Example: 54
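
The audio token rule stated above (25 tokens per second, with audio under 1 second counted as 1 second) can be turned into a rough estimator. This is only a sketch: the page does not say how fractional seconds are rounded, so rounding to the nearest integer is an assumption, and estimate_audio_tokens is a hypothetical name. Treat the usage object returned by the service as authoritative.

```python
TOKENS_PER_SECOND = 25  # Rate stated in this reference.


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate the audio token count for a clip of the given duration."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must not be negative")
    # Audio shorter than 1 second is counted as 1 second.
    counted = max(duration_seconds, 1.0)
    # Assumption: fractional results are rounded to the nearest integer.
    return round(counted * TOKENS_PER_SECOND)


if __name__ == "__main__":
    for seconds in (0.4, 2, 10):
        print(seconds, estimate_audio_tokens(seconds))
```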

