
Realtime speech recognition


Convert continuous audio streams into text in real time for scenarios such as live streaming captions, online meetings, voice chats, smart assistants, and intelligent customer service. Real-time speech recognition supports transcription from microphones, meeting recordings, or local audio files with punctuation, timestamps, and custom hotwords.
For model availability, supported languages, and feature comparison, see Speech-to-text models.

Getting started

  • Fun-ASR
  • Qwen-ASR
Get an API key and set it as an environment variable. To use the SDK, install it first. For more code samples, see GitHub.

Recognize speech from a microphone

This example recognizes speech from a microphone and prints results in real time.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
  public static void main(String[] args) throws InterruptedException {
    Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
    ExecutorService executorService = Executors.newSingleThreadExecutor();
    executorService.submit(new RealtimeRecognitionTask());
    executorService.shutdown();
    executorService.awaitTermination(1, TimeUnit.MINUTES);
    System.exit(0);
  }
}

class RealtimeRecognitionTask implements Runnable {
  @Override
  public void run() {
    RecognitionParam param = RecognitionParam.builder()
        .model("fun-asr-realtime")
        // If you have not configured an environment variable, replace the following line with your API key: .apiKey("sk-xxx")
        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
        .format("wav")
        .sampleRate(16000)
        .build();
    Recognition recognizer = new Recognition();

    ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
      @Override
      public void onEvent(RecognitionResult result) {
        if (result.isSentenceEnd()) {
          System.out.println("Final Result: " + result.getSentence().getText());
        } else {
          System.out.println("Intermediate Result: " + result.getSentence().getText());
        }
      }

      @Override
      public void onComplete() {
        System.out.println("Recognition complete");
      }

      @Override
      public void onError(Exception e) {
        System.out.println("RecognitionCallback error: " + e.getMessage());
      }
    };
    try {
      recognizer.call(param, callback);
      // Create an audio format.
      AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
      // Match the default recording device based on the format.
      TargetDataLine targetDataLine =
          AudioSystem.getTargetDataLine(audioFormat);
      targetDataLine.open(audioFormat);
      // Start recording.
      targetDataLine.start();
      ByteBuffer buffer = ByteBuffer.allocate(1024);
      long start = System.currentTimeMillis();
      // Record for 50 seconds and perform real-time transcription.
      while (System.currentTimeMillis() - start < 50000) {
        int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
        if (read > 0) {
          buffer.limit(read);
          // Send the recorded audio data to the streaming recognition service.
          recognizer.sendAudioFrame(buffer);
          buffer = ByteBuffer.allocate(1024);
          // Audio arrives at a fixed rate; sleep briefly to avoid high CPU usage.
          Thread.sleep(20);
        }
      }
      recognizer.stop();
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      // Close the WebSocket connection after the task is complete.
      recognizer.getDuplexApi().close(1000, "bye");
    }

    System.out.println(
        "[Metric] requestId: "
            + recognizer.getLastRequestId()
            + ", first package delay ms: "
            + recognizer.getFirstPackageDelay()
            + ", last package delay ms: "
            + recognizer.getLastPackageDelay());
  }
}
Before running the Python example, run pip install pyaudio to install the third-party library for audio capture and playback.

Recognize a local audio file

This example recognizes and transcribes a local audio file. It is ideal for near real-time scenarios with short audio, such as voice chat, voice commands, voice input, and voice search.
The audio file used in the examples below is asr_example.wav.
import com.alibaba.dashscope.api.GeneralApi;
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.base.HalfDuplexParamBase;
import com.alibaba.dashscope.common.GeneralListParam;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.protocol.GeneralServiceOption;
import com.alibaba.dashscope.protocol.HttpMethod;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.protocol.StreamingMode;
import com.alibaba.dashscope.utils.Constants;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class TimeUtils {
  private static final DateTimeFormatter formatter =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

  public static String getTimestamp() {
    return LocalDateTime.now().format(formatter);
  }
}

public class Main {
  public static void main(String[] args) throws InterruptedException {
    Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
    // In a real application, call this method only once at program startup.
    warmUp();

    ExecutorService executorService = Executors.newSingleThreadExecutor();
    executorService.submit(new RealtimeRecognitionTask(Paths.get(System.getProperty("user.dir"), "asr_example.wav")));
    executorService.shutdown();

    // Wait for all tasks to complete.
    executorService.awaitTermination(1, TimeUnit.MINUTES);
    System.exit(0);
  }

  public static void warmUp() {
    try {
      // Lightweight GET request to establish a connection.
      GeneralServiceOption warmupOption = GeneralServiceOption.builder()
          .protocol(Protocol.HTTP)
          .httpMethod(HttpMethod.GET)
          .streamingMode(StreamingMode.OUT)
          .path("assistants")
          .build();

      warmupOption.setBaseHttpUrl(Constants.baseHttpApiUrl);
      GeneralApi<HalfDuplexParamBase> api = new GeneralApi<>();
      api.get(GeneralListParam.builder().limit(1L).build(), warmupOption);
    } catch (Exception e) {
      // Warm-up is best-effort; ignore failures and let the real request retry.
    }
  }
}

class RealtimeRecognitionTask implements Runnable {
  private Path filepath;

  public RealtimeRecognitionTask(Path filepath) {
    this.filepath = filepath;
  }

  @Override
  public void run() {
    RecognitionParam param = RecognitionParam.builder()
        .model("fun-asr-realtime")
        // If you have not configured an environment variable, replace the following line with your API key: .apiKey("sk-xxx")
        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
        .format("wav")
        .sampleRate(16000)
        .build();
    Recognition recognizer = new Recognition();

    String threadName = Thread.currentThread().getName();

    ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
      @Override
      public void onEvent(RecognitionResult message) {
        if (message.isSentenceEnd()) {

          System.out.println(TimeUtils.getTimestamp() + " "
              + "[process " + threadName + "] Final Result: " + message.getSentence().getText());
        } else {
          System.out.println(TimeUtils.getTimestamp() + " "
              + "[process " + threadName + "] Intermediate Result: " + message.getSentence().getText());
        }
      }

      @Override
      public void onComplete() {
        System.out.println(TimeUtils.getTimestamp() + " [" + threadName + "] Recognition complete");
      }

      @Override
      public void onError(Exception e) {
        System.out.println(TimeUtils.getTimestamp() + " ["
            + threadName + "] RecognitionCallback error: " + e.getMessage());
      }
    };

    try {
      recognizer.call(param, callback);
      // Replace the path with your audio file path.
      System.out.println(TimeUtils.getTimestamp() + " [" + threadName + "] Input file path: " + this.filepath);
      // Read the file and send audio in chunks.
      FileInputStream fis = new FileInputStream(this.filepath.toFile());
      byte[] allData = new byte[fis.available()];
      int ret = fis.read(allData);
      fis.close();

      int sendFrameLength = 3200;
      for (int i = 0; i * sendFrameLength < allData.length; i++) {
        int start = i * sendFrameLength;
        int end = Math.min(start + sendFrameLength, allData.length);
        ByteBuffer byteBuffer = ByteBuffer.wrap(allData, start, end - start);
        recognizer.sendAudioFrame(byteBuffer);
        Thread.sleep(100);
      }

      System.out.println(TimeUtils.getTimestamp() + " [" + threadName + "] All audio sent");
      recognizer.stop();
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      // Close the WebSocket connection after the task is complete.
      recognizer.getDuplexApi().close(1000, "bye");
    }

    System.out.println(
        "["
            + threadName
            + "][Metric] requestId: "
            + recognizer.getLastRequestId()
            + ", first package delay ms: "
            + recognizer.getFirstPackageDelay()
            + ", last package delay ms: "
            + recognizer.getLastPackageDelay());
  }
}
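The 3200-byte sendFrameLength and the 100 ms sleep in the loop above are consistent with each other: at 16 kHz, 16-bit, mono PCM, 100 ms of audio is exactly 3200 bytes. A quick check of that arithmetic:

```python
# Frame-size arithmetic for the streaming example above:
# 16 kHz sample rate x 2 bytes per sample (16-bit) x 1 channel = 32,000 bytes/s.
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHANNELS = 1
FRAME_MS = 100         # the example sends one frame every 100 ms

bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
frame_bytes = bytes_per_second * FRAME_MS // 1000
print(frame_bytes)  # 3200
```

If you change the sample rate or frame interval, recompute the frame size the same way so that each sendAudioFrame call still carries one interval's worth of audio.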

Going live

Improve recognition accuracy

  • Select a model with the correct sample rate: For 8 kHz telephone audio, use an 8 kHz model directly instead of upsampling the audio to 16 kHz for recognition. Upsampling adds no information and can distort the signal, so matching the model to the source rate yields better results.
  • Use the custom vocabulary feature: For proprietary nouns, names, and brand names specific to your business, you can configure a custom vocabulary to significantly improve recognition accuracy. For more information, see Customize a vocabulary.
  • Optimize input audio quality: Use high-quality microphones whenever possible and ensure a high signal-to-noise ratio (SNR) and an echo-free recording environment. At the application level, you can integrate algorithms such as noise reduction (for example, RNNoise) and acoustic echo cancellation (AEC) to preprocess the audio to obtain a cleaner signal.
  • Specify the recognition language: For multilingual models, if you can predetermine the audio language when making a call, it helps the model converge and avoid confusion between similarly pronounced languages, which improves accuracy.

Set a fault tolerance policy

  • Client-side reconnection: The client should implement an automatic reconnection mechanism to handle network jitter. For the Python SDK, consider the following suggestions:
    1. Catch exceptions: Implement the on_error method in the Callback class. The dashscope SDK calls this method when it encounters a network error or other issues.
    2. Notify status: When on_error is triggered, set a reconnection signal. In Python, you can use threading.Event, which is a thread-safe flag.
    3. Reconnection loop: Wrap the main logic in a for loop (for example, to retry 3 times). When the reconnection signal is detected, the current recognition is interrupted, resources are cleaned up, and the loop restarts after a few seconds to create a new connection.
  • Set a heartbeat to prevent connection loss: To maintain a persistent connection with the server, set the heartbeat parameter to true. This ensures that the connection to the server is not interrupted, even during long periods of silence in the audio.
  • Rate limits: When you call the model interface, take note of the model's rate limit rules.
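The reconnection steps above can be sketched as a generic retry skeleton. This is a minimal illustration, not part of the DashScope SDK: start_recognition is a hypothetical stand-in for a function that runs one recognition session and accepts an error callback, and the retry count and back-off delay are illustrative defaults.

```python
import threading
import time

def run_with_retries(start_recognition, max_retries=3, backoff_seconds=2):
    """Run a recognition session, reconnecting when the error callback fires."""
    reconnect = threading.Event()  # thread-safe flag set from the error callback

    def on_error(exc):
        # Called on network errors; request a reconnect instead of crashing.
        print(f"recognition error: {exc}")
        reconnect.set()

    for attempt in range(1, max_retries + 1):
        reconnect.clear()
        try:
            start_recognition(on_error)   # blocks until the session ends
        except Exception as exc:
            on_error(exc)
        if not reconnect.is_set():
            return True                   # session finished cleanly
        print(f"reconnecting (attempt {attempt}/{max_retries})...")
        time.sleep(backoff_seconds)       # clean up, wait, then reconnect
    return False
```

In a real client, start_recognition would create a fresh Recognition instance, wire on_error into its callback, and stream audio until completion; the loop then guarantees stale connections are discarded before each retry.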

Core usage: Context biasing (Qwen-ASR)

By providing context, you can improve the recognition of domain-specific vocabulary, such as names, places, and product terms. Length limit: the context cannot exceed 10,000 tokens. Usage:
  • WebSocket API: Set the session.input_audio_transcription.corpus.text parameter in the session.update event.
  • Python SDK: Set the corpus_text parameter.
  • Java SDK: Set the corpusText parameter.
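For the WebSocket API, the context travels inside a session.update event. The following is a sketch of such an event; only the session.input_audio_transcription.corpus.text path comes from this page, while the event envelope shape is an assumption based on OpenAI-Realtime-style events.

```python
import json

# Hypothetical hotword context for the session.update event.
corpus = "Bulge Bracket, Boutique, Middle Market, domestic securities firms"

session_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "corpus": {"text": corpus}  # documented limit: 10,000 tokens
        }
    },
}
payload = json.dumps(session_update)  # send this frame over the WebSocket
```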
Supported text types include, but are not limited to:
  • Hotword lists in various separator formats, such as Hotword 1, Hotword 2, Hotword 3, Hotword 4
  • Text paragraphs or chapters of any format and length
  • Mixed content: Any combination of word lists and paragraphs
  • Irrelevant or meaningless text, including garbled text. The feature is highly fault-tolerant and is almost never negatively affected by irrelevant text.
Example: The correct transcription of an audio segment should be: "What internal jargon from the investment banking circle do you know? First, the nine major foreign investment banks, the Bulge Bracket, BB..."
Without context enhancement, some investment bank names may be misrecognized; for example, "Bulge Bracket" is transcribed as "Bird Rock". Recognition result: "What internal jargon from the investment banking circle do you know? First, the nine major foreign investment banks, Bird Rock, BB..."
With context enhancement, investment bank names are recognized correctly. Recognition result: "What internal jargon from the investment banking circle do you know? First, the nine major foreign investment banks, the Bulge Bracket, BB..."
To achieve the result above, add any of the following content to the context:
  • Word lists:
    • Word list 1:
Bulge Bracket, Boutique, Middle Market, domestic securities firms
  • Word list 2:
Bulge Bracket Boutique Middle Market domestic securities firms
  • Word list 3:
['Bulge Bracket', 'Boutique', 'Middle Market', 'domestic securities firms']
  • Natural language:
Investment Banking Categories Revealed!
Recently, many friends from Australia have asked me, what exactly is an investment bank? Today, I'll explain it. For international students, investment banks can be mainly divided into four categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms.
Bulge Bracket Investment Banks: These are what we often call the nine major investment banks, including Goldman Sachs, Morgan Stanley, etc. These large banks are enormous in both business scope and scale.
Boutique Investment Banks: These banks are relatively small but highly specialized in their business areas. For example, Lazard, Evercore, etc., have deep professional knowledge and experience in specific fields.
Middle Market Investment Banks: This type of bank mainly serves medium-sized companies, providing services such as mergers and acquisitions, and IPOs. Although not as large as the major banks, they have a high influence in specific markets.
Domestic Securities Firms: With the rise of the Chinese market, domestic securities firms are also playing an increasingly important role in the international market.
In addition, there are breakdowns of specific positions and business divisions; you can refer to the relevant charts. I hope this information helps you better understand investment banking and prepare for your future career!
  • Natural language with interference: Some text is irrelevant to the recognition content, such as the names in the example below.
Investment Banking Categories Revealed!
Recently, many friends from Australia have asked me, what exactly is an investment bank? Today, I'll explain it. For international students, investment banks can be mainly divided into four categories: Bulge Bracket, Boutique, Middle Market, and domestic securities firms.
Bulge Bracket Investment Banks: These are what we often call the nine major investment banks, including Goldman Sachs, Morgan Stanley, etc. These large banks are enormous in both business scope and scale.
Boutique Investment Banks: These banks are relatively small but highly specialized in their business areas. For example, Lazard, Evercore, etc., have deep professional knowledge and experience in specific fields.
Middle Market Investment Banks: This type of bank mainly serves medium-sized companies, providing services such as mergers and acquisitions, and IPOs. Although not as large as the major banks, they have a high influence in specific markets.
Domestic Securities Firms: With the rise of the Chinese market, domestic securities firms are also playing an increasingly important role in the international market.
In addition, there are breakdowns of specific positions and business divisions; you can refer to the relevant charts. I hope this information helps you better understand investment banking and prepare for your future career!
Wang Haoxuan, Li Zihan, Zhang Jingxing, Liu Xinyi, Chen Junjie, Yang Siyuan, Zhao Yutong, Huang Zhiqiang, Zhou Zimo, Wu Yajing, Xu Ruoxi, Sun Haoran, Hu Jinyu, Zhu Chenxi, Guo Wenbo, He Jingshu, Gao Yuhang, Lin Yifei,
Zheng Xiaoyan, Liang Bowen, Luo Jiaqi, Song Mingzhe, Xie Wanting, Tang Ziqian, Han Mengyao, Feng Yiran, Cao Qinxue, Deng Zirui, Xiao Wangshu, Xu Jiashu,
Cheng Yinuo, Yuan Zhiruo, Peng Haoyu, Dong Simiao, Fan Jingyu, Su Zijin, Lv Wenxuan, Jiang Shihan, Ding Muchen,
Wei Shuyao, Ren Tianyou, Jiang Yichen, Hua Qingyu, Shen Xinghe, Fu Jinyu, Yao Xingchen, Zhong Lingyu, Yan Licheng, Jin Ruoshui, Taoranting, Qi Shaoshang, Xue Zhilan, Zou Yunfan, Xiong Ziang, Bai Wenfeng, Yi Qianfan

API reference

Interaction flow (Qwen-ASR-Realtime)

Qwen real-time speech recognition streams audio over WebSocket. Two modes are available: VAD mode (default) and Manual mode.

URL

Replace <model_name> with your model name.
wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=<model_name>

Headers

"Authorization": "Bearer $DASHSCOPE_API_KEY"
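As a sketch, the handshake parameters can be assembled as follows. build_connection is a hypothetical helper, and the commented websocket-client usage is one possible client library, not a requirement of the API.

```python
import os

REALTIME_ENDPOINT = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"

def build_connection(model: str):
    """Return the URL and headers for the realtime WebSocket handshake."""
    url = f"{REALTIME_ENDPOINT}?model={model}"
    headers = {"Authorization": f"Bearer {os.environ.get('DASHSCOPE_API_KEY', '')}"}
    return url, headers

# With the websocket-client package (an assumption; any WebSocket client works):
# import websocket
# url, headers = build_connection("<model_name>")
# ws = websocket.create_connection(url, header=[f"{k}: {v}" for k, v in headers.items()])
```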

VAD mode (default)

The server detects speech boundaries and segments sentences. The client streams audio, and the server returns results when each sentence ends. Best for conversations and meeting transcription. Enable: Set session.turn_detection in session.update.
VAD mode interaction flow

Manual mode

The client controls sentence segmentation by sending audio for a complete sentence, then sending input_audio_buffer.commit. Best when the client knows sentence boundaries, for example in chat app voice messages. Enable: Set session.turn_detection to null in session.update.
Manual mode interaction flow
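The two modes differ only in the turn_detection field of session.update. A sketch of both configurations follows; only session.turn_detection and input_audio_buffer.commit come from this page, while the "server_vad" detection type and the event envelope are assumptions based on OpenAI-Realtime-style events.

```python
import json

# VAD mode (default): the server detects speech boundaries itself.
vad_session = {"type": "session.update",
               "session": {"turn_detection": {"type": "server_vad"}}}

# Manual mode: disable server VAD; the client marks sentence ends.
manual_session = {"type": "session.update",
                  "session": {"turn_detection": None}}  # serialized as null

# In manual mode, after streaming one complete sentence of audio, commit it:
commit_event = {"type": "input_audio_buffer.commit"}

print(json.dumps(manual_session))
```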

Alternative: Use Qwen-Omni

You can also use Qwen-Omni (qwen3-omni-flash-realtime) for real-time speech recognition over WebSocket. Omni is an LLM that understands audio, so you provide domain context through the system prompt instead of hotword lists. Use Omni for ASR when the input is clean speech (microphone, voice calls) and you need domain-specific terminology handled via the prompt. Use a dedicated ASR model instead for noisy or mixed audio (meetings with background music, videos with sound effects), or when you need hotwords, speaker diarization, or timestamps.
Qwen-Omni interprets all audio, not just speech. Music, typing, or ambient noise may produce descriptions instead of transcription. For mixed audio, preprocess with VAD to isolate speech, or use a dedicated ASR model.
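As a rough illustration of the preprocessing idea, the following energy gate drops low-amplitude PCM frames before sending audio. A real deployment would use a proper VAD (such as WebRTC VAD, an external dependency); this standard-library sketch, with an arbitrary amplitude threshold, only shows the shape of the approach.

```python
import struct

def keep_speech_frames(pcm16: bytes, frame_bytes: int = 3200, threshold: int = 500) -> bytes:
    """Drop 16-bit little-endian PCM frames whose mean absolute amplitude is below threshold."""
    kept = bytearray()
    for i in range(0, len(pcm16) - frame_bytes + 1, frame_bytes):
        frame = pcm16[i:i + frame_bytes]
        samples = struct.unpack(f"<{len(frame) // 2}h", frame)
        energy = sum(abs(s) for s in samples) / len(samples)
        if energy >= threshold:   # keep only frames loud enough to contain speech
            kept.extend(frame)
    return bytes(kept)
```

The surviving frames can then be streamed to the recognizer; silent or near-silent stretches never reach the model, which reduces spurious output on mixed audio.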
ASR prompt template:
messages = [
  {"role": "system", "content": "Transcribe the following audio exactly as spoken. Output only the transcription text. Ignore non-speech sounds."},
  {"role": "user", "content": [{"type": "input_audio", "input_audio": {"data": audio_data, "format": "wav"}}]}
]
Qwen-Omni-Realtime uses WebSocket for bidirectional streaming. For the full API and SDK reference, see Realtime conversation.