Live speech to text
Convert continuous audio streams into text in real time for scenarios such as live streaming captions, online meetings, voice chats, smart assistants, and intelligent customer service. Real-time speech recognition supports transcription from microphones, meeting recordings, or local audio files with punctuation, timestamps, and custom hotwords.
For model availability, supported languages, and feature comparison, see Speech-to-text models.
Getting started
- Fun-ASR
- Qwen-ASR
For more code samples, see GitHub. Get an API key and set it as an environment variable. To use the SDK, install it.
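A typical setup, assuming the DashScope Python SDK and its `DASHSCOPE_API_KEY` environment variable:

```shell
# Install the DashScope SDK for Python
pip install -U dashscope

# Set the API key as an environment variable (bash/zsh)
export DASHSCOPE_API_KEY="sk-..."
```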
Recognize speech from a microphone
Recognizes speech from a microphone and outputs results in real time. Before running the Python example, run `pip install pyaudio` to install PyAudio, a third-party library for audio capture and playback.
Recognize a local audio file
This feature recognizes and transcribes local audio files. It is ideal for near real-time scenarios with short audio, such as voice chats, voice commands, voice input, and voice search. The audio file used in the examples below is asr_example.wav.
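Streaming a local file usually means reading it in small chunks, as if it were live audio. A minimal sketch of the reading side using only the standard library; `send_audio_frame` is a placeholder for the SDK's actual send method:

```python
import wave

CHUNK_MS = 100  # send 100 ms of audio per frame

def stream_wav(path, send_audio_frame):
    """Read a PCM WAV file and pass it to the recognizer in small chunks."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = wav.getframerate() * CHUNK_MS // 1000
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break
            send_audio_frame(chunk)  # placeholder for the SDK's send method
```

In a real client you would also sleep roughly `CHUNK_MS` milliseconds between sends to pace the stream like real time.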
Going live
Improve recognition accuracy
- Select a model with the correct sample rate: For 8 kHz telephone audio, use an 8 kHz model directly instead of upsampling the audio to 16 kHz for recognition. This avoids distortion and yields better results.
- Use the custom vocabulary feature: For proprietary nouns, names, and brand names specific to your business, you can configure a custom vocabulary to significantly improve recognition accuracy. For more information, see Customize a vocabulary.
- Optimize input audio quality: Use high-quality microphones whenever possible and ensure a high signal-to-noise ratio (SNR) and an echo-free recording environment. At the application level, you can integrate algorithms such as noise reduction (for example, RNNoise) and acoustic echo cancellation (AEC) to preprocess the audio to obtain a cleaner signal.
- Specify the recognition language: For multilingual models, if you can predetermine the audio language when making a call, it helps the model converge and avoid confusion between similarly pronounced languages, which improves accuracy.
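As a quick check for the first point, you can read the source sample rate before choosing a model. A minimal sketch using the standard `wave` module for PCM WAV files; the model labels are placeholders, not real model names:

```python
import wave

def pick_model_hint(path):
    """Return a rough model hint based on the WAV file's sample rate."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    # 8 kHz telephone audio should go to an 8 kHz model directly,
    # without upsampling to 16 kHz first.
    return ("8k-model" if rate <= 8000 else "16k-model", rate)
```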
Set a fault tolerance policy
- Client-side reconnection: The client should implement an automatic reconnection mechanism to handle network jitter. For the Python SDK, consider the following suggestions:
  - Catch exceptions: Implement the `on_error` method in the `Callback` class. The `dashscope` SDK calls this method when it encounters a network error or other issues.
  - Notify status: When `on_error` is triggered, set a reconnection signal. In Python, you can use `threading.Event`, which is a thread-safe flag.
  - Reconnection loop: Wrap the main logic in a `for` loop (for example, to retry 3 times). When the reconnection signal is detected, interrupt the current recognition, clean up resources, and restart the loop after a few seconds to create a new connection.
- Set a heartbeat to prevent connection loss: To maintain a persistent connection with the server, set the `heartbeat` parameter to `true`. This keeps the connection alive even during long periods of silence in the audio.
- Rate limits: When you call the model interface, take note of the model's rate limit rules.
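The reconnection suggestions above can be sketched as follows. `Callback` and `start_recognition` here are simplified stand-ins for illustration, not the actual `dashscope` classes:

```python
import threading
import time

class Callback:
    """Stand-in for the SDK callback; on_error sets a thread-safe flag."""
    def __init__(self):
        self.reconnect = threading.Event()

    def on_error(self, message):
        print(f"error: {message}, scheduling reconnect")
        self.reconnect.set()

def run_with_retries(start_recognition, max_retries=3, backoff_s=2):
    """Retry loop: restart recognition when the callback signals an error."""
    for attempt in range(max_retries):
        callback = Callback()
        start_recognition(callback)   # blocks until the session ends
        if not callback.reconnect.is_set():
            return True               # finished cleanly
        time.sleep(backoff_s)         # wait before opening a new connection
    return False
```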
Core usage: Context biasing (Qwen-ASR)
By providing context, you can optimize the recognition of domain-specific vocabulary, such as names, places, and product terms.
Length limit: The context content cannot exceed 10,000 tokens.
Usage:
- WebSocket API: Set the `session.input_audio_transcription.corpus.text` parameter in the `session.update` event.
- Python SDK: Set the `corpus_text` parameter.
- Java SDK: Set the `corpusText` parameter.
You can add any of the following content to the context:
- Hotword lists in various separator formats, such as Hotword 1, Hotword 2, Hotword 3, Hotword 4
- Text paragraphs or chapters of any format and length
- Mixed content: Any combination of word lists and paragraphs
- Irrelevant or meaningless text, including garbled text. The feature is highly fault-tolerant and is almost never negatively affected by irrelevant text.
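Over the WebSocket API, this means placing the context in the `session.update` event at `session.input_audio_transcription.corpus.text`. A sketch of the payload as a plain dict; the documented parameter path comes from above, while the `type`/`session` envelope shape is an assumption:

```python
import json

# Context for biasing; must stay under the 10,000-token limit.
hotwords = "Bulge Bracket, Goldman Sachs, Morgan Stanley"

session_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "corpus": {"text": hotwords}
        }
    },
}

payload = json.dumps(session_update)  # send this over the WebSocket
```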
| Without context enhancement | With context enhancement |
|---|---|
| Without context enhancement, some investment bank names may be misrecognized. For example, "Bird Rock" should be "Bulge Bracket". Recognition result: "What internal jargon from the investment banking circle do you know? First, the nine major foreign investment banks, Bird Rock, BB..." | With context enhancement, investment bank names are recognized correctly. Recognition result: "What internal jargon from the investment banking circle do you know? First, the nine major foreign investment banks, the Bulge Bracket, BB..." |
- Word lists:
- Word list 1:
- Word list 2:
- Word list 3:
- Natural language:
- Natural language with interference: Some text is irrelevant to the recognition content, such as the names in the example below.
API reference
- Fun-ASR
- Qwen-ASR
Interaction flow (Qwen-ASR-Realtime)
Qwen real-time speech recognition streams audio over WebSocket. Two modes are available: VAD mode (default) and Manual mode.
URL
Replace `<model_name>` with your model name.
Headers
VAD mode (default)
The server detects speech boundaries and segments sentences. The client streams audio, and the server returns results when each sentence ends. Best for conversations and meeting transcription.
Enable: Set `session.turn_detection` in `session.update`.
1. The client sends `input_audio_buffer.append` to add audio to the buffer.
2. The server returns `input_audio_buffer.speech_started` when it detects speech. If the client sends `session.finish` before this event, the server returns `session.finished` and the client must disconnect.
3. The client continues sending `input_audio_buffer.append`.
4. After all audio is sent, the client sends `session.finish` to end the session.
5. The server returns `input_audio_buffer.speech_stopped` when it detects the end of speech.
6. The server returns `input_audio_buffer.committed`.
7. The server returns `conversation.item.created`.
8. The server returns `conversation.item.input_audio_transcription.text` with real-time transcription results.
9. The server returns `conversation.item.input_audio_transcription.completed` with the final transcription result.
10. The server returns `session.finished` when recognition completes. The client must then disconnect.
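On the client side, the VAD-mode flow above reduces to two event types. A minimal sketch of how those messages might be built; only the event names come from the flow above, while the envelope fields (`type`, `audio`) are assumptions:

```python
import base64
import json

def append_event(pcm_chunk: bytes) -> str:
    """input_audio_buffer.append carries base64-encoded audio."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def finish_event() -> str:
    """session.finish tells the server that no more audio is coming."""
    return json.dumps({"type": "session.finish"})
```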
Manual mode
The client controls sentence segmentation by sending audio for a complete sentence, then sending `input_audio_buffer.commit`. Best when the client knows sentence boundaries, for example voice messages in a chat app.
Enable: Set `session.turn_detection` to `null` in `session.update`.
1. The client sends `input_audio_buffer.append` to add audio to the buffer.
2. The client sends `input_audio_buffer.commit` to create a new user message.
3. The client sends `session.finish` to end the session.
4. The server returns `input_audio_buffer.committed`.
5. The server returns `conversation.item.input_audio_transcription.text` with real-time transcription results.
6. The server returns `conversation.item.input_audio_transcription.completed` with the final transcription result.
7. The server returns `session.finished` when recognition completes. The client must then disconnect.
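The Manual-mode specifics above amount to two small messages: disabling server-side VAD and committing each sentence. A sketch, with the event envelope shape assumed:

```python
import json

# Disable server-side VAD: turn_detection is set to null in session.update.
manual_session = json.dumps({
    "type": "session.update",
    "session": {"turn_detection": None},
})

# After sending all audio for one complete sentence, commit the buffer.
commit = json.dumps({"type": "input_audio_buffer.commit"})
```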
Alternative: Use Qwen-Omni
You can also use Qwen-Omni (`qwen3-omni-flash-realtime`) for real-time speech recognition over WebSocket. Omni is an LLM that understands audio: you provide domain context through the system prompt instead of hotword lists.
When to use Omni for ASR: Clean speech inputs (microphone, voice calls) where you need domain-specific terminology handling via prompt.
When to use dedicated ASR models instead: Noisy or mixed audio (meetings with background music, videos with sound effects), or when you need hotwords, speaker diarization, or timestamps.
Qwen-Omni interprets all audio, not just speech. Music, typing, or ambient noise may produce descriptions instead of transcription. For mixed audio, preprocess with VAD to isolate speech, or use a dedicated ASR model.
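As a rough illustration of such preprocessing, here is a toy energy-threshold gate that drops near-silent 16-bit PCM frames before sending audio. A real deployment would use a proper VAD (for example, WebRTC VAD) rather than this sketch:

```python
import struct

def frame_energy(pcm16: bytes) -> float:
    """Mean absolute amplitude of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def keep_speechlike(frames, threshold=500):
    """Drop frames whose energy falls below the threshold (toy gate)."""
    return [f for f in frames if frame_energy(f) >= threshold]
```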
Qwen-Omni-Realtime uses WebSocket for bidirectional streaming. For the full API and SDK reference, see Realtime conversation.