
Fun-ASR realtime Java SDK


User guide: For model overview and selection, see Real-time speech recognition.

Prerequisites

Before running the examples below, obtain an API key and expose it in the DASHSCOPE_API_KEY environment variable (or pass it with the apiKey builder method), and install the DashScope Java SDK.

Model availability

| Model | Version | Unit price | Free quota |
| --- | --- | --- | --- |
| fun-asr-realtime (currently points to fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
  • Supported languages: Mandarin (including 9 regional accents), Cantonese, Wu, Minnan, Hakka, Gan, Xiang, Jin, English, and Japanese
  • Supported sample rate: 16 kHz
  • Supported audio formats: pcm, wav, mp3, opus, speex, aac, amr
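Because only 16 kHz input is supported, it can be worth checking a file's header before sending it. Below is a minimal, SDK-independent sketch using the JDK's built-in javax.sound.sampled; it writes a short silent WAV purely so there is something to probe, then reads the sample rate back from the header:

```java
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.ByteArrayInputStream;
import java.io.File;

public class WavProbe {
  // Write 100 ms of silence as a 16 kHz, 16-bit, mono WAV (a stand-in for real input).
  static File writeSilenceWav() throws Exception {
    AudioFormat fmt = new AudioFormat(16000f, 16, 1, true, false);
    byte[] silence = new byte[3200]; // 100 ms at 16 kHz, 16-bit mono
    AudioInputStream ais = new AudioInputStream(
        new ByteArrayInputStream(silence), fmt, silence.length / fmt.getFrameSize());
    File f = File.createTempFile("probe", ".wav");
    f.deleteOnExit();
    AudioSystem.write(ais, AudioFileFormat.Type.WAVE, f);
    return f;
  }

  // Read the WAV header back and report its sample rate.
  static float probeSampleRate(File wav) throws Exception {
    return AudioSystem.getAudioFileFormat(wav).getFormat().getSampleRate();
  }

  public static void main(String[] args) throws Exception {
    System.out.println("sample rate: " + probeSampleRate(writeSilenceWav()) + " Hz");
  }
}
```

If the reported rate is not 16000, resample the audio before calling the recognition API.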

Getting started

The Recognition class provides two interfaces:
  • Non-streaming: Returns the complete result at once. Use for pre-recorded audio.
  • Bidirectional streaming: Returns results in real time as audio streams in. Use for microphones or scenarios that need immediate feedback.

Non-streaming call

Pass a local file to synchronously receive the transcription. This call blocks the current thread. Instantiate the Recognition class and call call with request parameters and the file.
The audio file used in the example is asr_example.wav.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.utils.Constants;

import java.io.File;

public class Main {
  public static void main(String[] args) {
    Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
    // Create a Recognition instance.
    Recognition recognizer = new Recognition();
    // Create a RecognitionParam.
    RecognitionParam param =
        RecognitionParam.builder()
            .model("fun-asr-realtime")
            // If you have not configured an environment variable, replace the following line with your API key: .apiKey("sk-xxx")
            .apiKey(System.getenv("DASHSCOPE_API_KEY"))
            .format("wav")
            .sampleRate(16000)
            .parameter("language_hints", new String[]{"zh", "en"})
            .build();

    try {
      System.out.println("Recognition result: " + recognizer.call(param, new File("asr_example.wav")));
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      // Close the WebSocket connection after the task is complete.
      recognizer.getDuplexApi().close(1000, "bye");
    }
    System.out.println(
        "[Metric] requestId: "
            + recognizer.getLastRequestId()
            + ", first package delay ms: "
            + recognizer.getFirstPackageDelay()
            + ", last package delay ms: "
            + recognizer.getLastPackageDelay());
    System.exit(0);
  }
}

Bidirectional streaming call: callback-based

Receive real-time results by implementing a callback interface.
1. Start streaming recognition

Instantiate the Recognition class and call call with the request parameters and the callback interface (ResultCallback).

2. Stream audio

Call sendAudioFrame repeatedly to stream audio segments (from a file or microphone). The server returns results through the onEvent callback. Send segments of ~100 ms duration, 1-16 KB each.

3. End recognition

Call stop to end recognition. This call blocks until onComplete or onError is invoked.
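The ~100 ms guideline maps to a concrete byte count for the supported 16 kHz, 16-bit, mono input. A quick sanity check (plain arithmetic, no SDK involved):

```java
public class FrameSize {
  // bytes per segment = sample rate * bytes per sample * channels * duration in seconds
  static int frameBytes(int sampleRateHz, int bytesPerSample, int channels, int durationMs) {
    return sampleRateHz * bytesPerSample * channels * durationMs / 1000;
  }

  public static void main(String[] args) {
    // 100 ms at 16 kHz, 16-bit (2 bytes), mono
    System.out.println(frameBytes(16000, 2, 1, 100)); // 3200 bytes, within the 1-16 KB window
  }
}
```

A 1024-byte buffer, as used in the examples below, corresponds to about 32 ms of audio and is also well within the accepted range.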
The following example captures audio from the microphone. To recognize a local audio file instead, read it in small chunks and pass each chunk to sendAudioFrame in the same way.
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
  public static void main(String[] args) throws InterruptedException {
    Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
    ExecutorService executorService = Executors.newSingleThreadExecutor();
    executorService.submit(new RealtimeRecognitionTask());
    executorService.shutdown();
    executorService.awaitTermination(1, TimeUnit.MINUTES);
    System.exit(0);
  }
}

class RealtimeRecognitionTask implements Runnable {
  @Override
  public void run() {
    RecognitionParam param = RecognitionParam.builder()
        .model("fun-asr-realtime")
        // If you have not configured an environment variable, replace the following line with your API key: .apiKey("sk-xxx")
        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
        .format("pcm") // Microphone capture produces raw PCM frames, not WAV files.
        .sampleRate(16000)
        .build();
    Recognition recognizer = new Recognition();

    ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
      @Override
      public void onEvent(RecognitionResult result) {
        if (result.isSentenceEnd()) {
          System.out.println("Final Result: " + result.getSentence().getText());
        } else {
          System.out.println("Intermediate Result: " + result.getSentence().getText());
        }
      }

      @Override
      public void onComplete() {
        System.out.println("Recognition complete");
      }

      @Override
      public void onError(Exception e) {
        System.out.println("RecognitionCallback error: " + e.getMessage());
      }
    };
    try {
      recognizer.call(param, callback);
      // Create an audio format.
      AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
      // Match the default recording device based on the format.
      TargetDataLine targetDataLine =
          AudioSystem.getTargetDataLine(audioFormat);
      targetDataLine.open(audioFormat);
      // Start recording.
      targetDataLine.start();
      ByteBuffer buffer = ByteBuffer.allocate(1024);
      long start = System.currentTimeMillis();
      // Record for 50 seconds and perform real-time transcription.
      while (System.currentTimeMillis() - start < 50000) {
        int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
        if (read > 0) {
          buffer.limit(read);
          // Send the recorded audio data to the streaming recognition service.
          recognizer.sendAudioFrame(buffer);
          buffer = ByteBuffer.allocate(1024);
          // The recording rate is limited. Sleep for a short period to prevent high CPU usage.
          Thread.sleep(20);
        }
      }
      recognizer.stop();
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      // Close the WebSocket connection after the task is complete.
      recognizer.getDuplexApi().close(1000, "bye");
    }

    System.out.println(
        "[Metric] requestId: "
            + recognizer.getLastRequestId()
            + ", first package delay ms: "
            + recognizer.getFirstPackageDelay()
            + ", last package delay ms: "
            + recognizer.getLastPackageDelay());
  }
}

Bidirectional streaming call: Flowable-based

Receive real-time results using a Flowable workflow. Flowable is a reactive stream type from RxJava that supports backpressure. See Flowable API reference.
Call streamCall to start recognition. It returns Flowable<RecognitionResult>. Use blockingForEach or subscribe to process results. streamCall requires:
  • RecognitionParam: Model, sample rate, and audio format
  • Flowable<ByteBuffer>: Audio stream
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.utils.Constants;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;

public class Main {
  public static void main(String[] args) throws NoApiKeyException {
    Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
    // Create a Flowable<ByteBuffer>.
    Flowable<ByteBuffer> audioSource =
        Flowable.create(
            emitter -> {
              new Thread(
                  () -> {
                    try {
                      // Create an audio format.
                      AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
                      // Match the default recording device based on the format.
                      TargetDataLine targetDataLine =
                          AudioSystem.getTargetDataLine(audioFormat);
                      targetDataLine.open(audioFormat);
                      // Start recording.
                      targetDataLine.start();
                      ByteBuffer buffer = ByteBuffer.allocate(1024);
                      long start = System.currentTimeMillis();
                      // Record for 50 seconds and perform real-time transcription.
                      while (System.currentTimeMillis() - start < 50000) {
                        int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                        if (read > 0) {
                          buffer.limit(read);
                          // Send the recorded audio data to the streaming recognition service.
                          emitter.onNext(buffer);
                          buffer = ByteBuffer.allocate(1024);
                          // The recording rate is limited. Sleep for a short period to prevent high CPU usage.
                          Thread.sleep(20);
                        }
                      }
                      // Notify that the transcription is complete.
                      emitter.onComplete();
                    } catch (Exception e) {
                      emitter.onError(e);
                    }
                  })
                  .start();
            },
            BackpressureStrategy.BUFFER);

    // Create a Recognition instance.
    Recognition recognizer = new Recognition();
    // Create a RecognitionParam and pass the created Flowable<ByteBuffer> in the audioFrames parameter.
    RecognitionParam param = RecognitionParam.builder()
        .model("fun-asr-realtime")
        // If you have not configured an environment variable, replace the following line with your API key: .apiKey("sk-xxx")
        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
        .format("pcm")
        .sampleRate(16000)
        .build();

    // Call the streaming interface.
    recognizer
        .streamCall(param, audioSource)
        .blockingForEach(
            result -> {
              // Subscribe to the output result.
              if (result.isSentenceEnd()) {
                System.out.println("Final Result: " + result.getSentence().getText());
              } else {
                System.out.println("Intermediate Result: " + result.getSentence().getText());
              }
            });
    // Close the WebSocket connection after the task is complete.
    recognizer.getDuplexApi().close(1000, "bye");
    System.out.println(
        "[Metric] requestId: "
            + recognizer.getLastRequestId()
            + ", first package delay ms: "
            + recognizer.getFirstPackageDelay()
            + ", last package delay ms: "
            + recognizer.getLastPackageDelay());
    System.exit(0);
  }
}

High-concurrency calls

The SDK uses OkHttp3 connection pooling to reduce overhead. See High-concurrency management.

Request parameters

Use RecognitionParam builder methods to configure the model, sample rate, and audio format. Pass the configured object to call or streamCall.
RecognitionParam param = RecognitionParam.builder()
  .model("fun-asr-realtime")
  .format("pcm")
  .sampleRate(16000)
  .parameter("language_hints", new String[]{"zh", "en"})
  .build();
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | String | - | Yes | The model for real-time speech recognition. See Model availability. |
| sampleRate | Integer | - | Yes | Audio sample rate in Hz. fun-asr-realtime supports 16000 Hz. |
| format | String | - | Yes | Audio format. Supported: pcm, wav, mp3, opus, speex, aac, amr. Important: opus and speex must use Ogg encapsulation, wav must be PCM encoded, and only AMR-NB is supported for amr. |
| vocabularyId | String | - | No | Custom vocabulary ID. See Customize hotwords. Not set by default. |
| semantic_punctuation_enabled | boolean | false | No | Punctuation mode. true: semantic punctuation (higher accuracy, suited to meetings). false (default): VAD-based punctuation (lower latency, suited to interactive scenarios). |
| max_sentence_silence | Integer | 1300 | No | Silence threshold for VAD-based sentence segmentation, in ms. When silence after speech exceeds this value, the sentence ends. Range: 200 to 6000 ms. Only effective when semantic_punctuation_enabled is false. |
| multi_threshold_mode_enabled | boolean | false | No | Prevents VAD from segmenting long sentences prematurely. Only effective when semantic_punctuation_enabled is false. |
| punctuation_prediction_enabled | boolean | true | No | Automatically adds punctuation to the result. true (default): cannot be modified. |
| heartbeat | boolean | false | No | Maintains a persistent connection during silence. true: the connection stays alive. false (default): the connection is closed after 60 seconds, even if silent audio is still being sent. Requires SDK version 2.19.1 or later. |
| language_hints | String[] | ["zh", "en"] | No | Language codes for recognition. Leave unset to auto-detect. Supported codes: zh (Chinese), en (English), ja (Japanese). |
| speech_noise_threshold | float | - | No | VAD sensitivity. Range: [-1.0, 1.0]. Closer to -1: more noise is transcribed as speech. Closer to +1: more speech is filtered out as noise. Important: advanced parameter; adjustments significantly affect recognition quality. Make small adjustments (step size 0.1) based on your audio environment and test thoroughly. |
| apiKey | String | - | No | Your API key. |
For parameters not on the RecognitionParam builder directly (such as semantic_punctuation_enabled, heartbeat, and language_hints), use the parameter or parameters method:
  • Set using the parameter method
  • Set using the parameters method
RecognitionParam param = RecognitionParam.builder()
 .model("fun-asr-realtime")
 .format("pcm")
 .sampleRate(16000)
 .parameter("semantic_punctuation_enabled", true)
 .build();
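The parameters method mentioned in the second bullet sets several extended options in one call. A minimal sketch, assuming it accepts a Map<String, Object> (the single-option parameter call above is the documented form; verify the exact signature against your SDK version):

```java
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

import java.util.HashMap;
import java.util.Map;

// Collect extended options that are not direct builder methods.
Map<String, Object> extended = new HashMap<>();
extended.put("semantic_punctuation_enabled", true);
extended.put("heartbeat", true);
extended.put("language_hints", new String[]{"zh", "en"});

RecognitionParam param = RecognitionParam.builder()
    .model("fun-asr-realtime")
    .format("pcm")
    .sampleRate(16000)
    .parameters(extended) // apply all extended options at once
    .build();
```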

Key interfaces

Recognition class

Import: import com.alibaba.dashscope.audio.asr.recognition.Recognition;
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| public void call(RecognitionParam param, final ResultCallback<RecognitionResult> callback) | param: request parameters; callback: callback interface (ResultCallback) | None | Callback-based streaming recognition. Does not block the current thread. |
| public String call(RecognitionParam param, File file) | param: request parameters; file: audio file to recognize | Recognition result | Non-streaming call with a local file. Blocks until processing completes. The file must be readable. |
| public Flowable<RecognitionResult> streamCall(RecognitionParam param, Flowable<ByteBuffer> audioFrame) | param: request parameters; audioFrame: a Flowable<ByteBuffer> instance | Flowable<RecognitionResult> | Flowable-based streaming recognition. |
| public void sendAudioFrame(ByteBuffer audioFrame) | audioFrame: binary audio data (ByteBuffer) | None | Sends an audio segment. Each packet should be ~100 ms in duration and 1-16 KB in size. Results arrive through the onEvent method of ResultCallback. |
| public void stop() | None | None | Stops recognition. Blocks until onComplete or onError is called. |
| boolean getDuplexApi().close(int code, String reason) | code: WebSocket close code; reason: reason for closing. See The WebSocket Protocol. | true | Closes the WebSocket connection. Call it after each task, even on exceptions, to prevent connection leaks. To reuse connections, see High-concurrency management. |
| public String getLastRequestId() | None | requestId | Gets the current task's request ID. Call after starting a task with call or streamCall. Requires SDK version 2.18.0 or later. |
| public long getFirstPackageDelay() | None | First-packet latency | Gets the delay from when the first audio packet is sent to when the first recognition result is received. Call after the task completes. Requires SDK version 2.18.0 or later. |
| public long getLastPackageDelay() | None | Last-packet latency | Gets the time from when stop is sent to when the last recognition result is received. Call after the task completes. Requires SDK version 2.18.0 or later. |

Callback interface (ResultCallback)

In bidirectional streaming calls, implement the callback methods to handle results returned by the server: extend ResultCallback<RecognitionResult> and implement its methods. RecognitionResult encapsulates the server response. Because the Java SDK reuses connections, there are no onOpen or onClose methods.

Example

ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
  @Override
  public void onEvent(RecognitionResult result) {
    System.out.println("RequestId is: " + result.getRequestId());
    // Implement the logic to process the speech recognition result here.
  }

  @Override
  public void onComplete() {
    System.out.println("Task complete");
  }

  @Override
  public void onError(Exception e) {
    System.out.println("Task failed: " + e.getMessage());
  }
};
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| public void onEvent(RecognitionResult result) | result: real-time recognition result (RecognitionResult) | None | Called when the server returns a recognition result. |
| public void onComplete() | None | None | Called when recognition completes successfully. |
| public void onError(Exception e) | e: exception information | None | Called when an error occurs. |

Response results

Real-time recognition result (RecognitionResult)

RecognitionResult represents a single recognition result.
| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| public String getRequestId() | None | requestId | Gets the request ID. |
| public boolean isSentenceEnd() | None | Whether the sentence has ended | Returns whether the current sentence has ended (final result). |
| public Sentence getSentence() | None | Sentence | Gets sentence information, including timestamps and text. |

Sentence information (Sentence)

| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| public Long getBeginTime() | None | Sentence start time, in ms | Returns the sentence start time. |
| public Long getEndTime() | None | Sentence end time, in ms | Returns the sentence end time. |
| public String getText() | None | Recognized text | Returns the recognized text. |
| public List<Word> getWords() | None | A List of word timestamp information (Word) | Returns word-level timestamp information. |

Word timestamp information (Word)

| Interface/Method | Parameters | Return value | Description |
| --- | --- | --- | --- |
| public long getBeginTime() | None | Word start time, in ms | Returns the word start time. |
| public long getEndTime() | None | Word end time, in ms | Returns the word end time. |
| public String getText() | None | Word | Returns the recognized word. |
| public String getPunctuation() | None | Punctuation | Returns the punctuation. |
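The begin and end times above are plain millisecond offsets from the start of the audio. For display purposes, such as generating subtitles, a small helper (not part of the SDK) can format them:

```java
public class Timestamps {
  // Format a millisecond offset as an SRT-style HH:MM:SS,mmm timestamp.
  static String srt(long ms) {
    long h = ms / 3_600_000;
    long m = (ms % 3_600_000) / 60_000;
    long s = (ms % 60_000) / 1_000;
    long frac = ms % 1_000;
    return String.format("%02d:%02d:%02d,%03d", h, m, s, frac);
  }

  public static void main(String[] args) {
    // A word beginning 83,456 ms into the audio:
    System.out.println(srt(83_456)); // 00:01:23,456
  }
}
```

Feed getBeginTime() and getEndTime() from each Word (or Sentence) into a formatter like this to build caption files.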