Real-time ASR Java SDK
User guide: For model overview and selection, see Real-time speech recognition.
Prerequisites
Model availability
| Model | Version | Unit price | Free quota (Note) |
|---|---|---|---|
| fun-asr-realtime (currently points to fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
- Supported languages: Mandarin (including 9 regional accents), Cantonese, Wu, Minnan, Hakka, Gan, Xiang, Jin, English, and Japanese
- Supported sample rate: 16 kHz
- Supported audio formats: pcm, wav, mp3, opus, speex, aac, amr
Getting started
The Recognition class provides two interfaces:
- Non-streaming: Returns the complete result at once. Use for pre-recorded audio.
- Bidirectional streaming: Returns results in real time as audio streams in. Use for microphones or scenarios that need immediate feedback.
Non-streaming call
Pass a local file to synchronously receive the transcription. This call blocks the current thread.
Instantiate the Recognition class and call the call method with the request parameters and the file.
The audio file used in the example is asr_example.wav.
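A minimal sketch of the non-streaming call, assuming the DashScope Java SDK is on the classpath and the DASHSCOPE_API_KEY environment variable is set (the class name Main is illustrative):

```java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

import java.io.File;

public class Main {
    public static void main(String[] args) {
        // Configure the three required parameters.
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("wav")
                .sampleRate(16000)
                // .apiKey("sk-...") // optional if DASHSCOPE_API_KEY is set
                .build();

        // Blocking call: returns the full transcription once processing completes.
        Recognition recognizer = new Recognition();
        String result = recognizer.call(param, new File("asr_example.wav"));
        System.out.println(result);
    }
}
```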
Bidirectional streaming call: callback-based
Receive real-time results by implementing a callback interface.
1. Start streaming recognition: Instantiate the Recognition class and call the call method with the request parameters and the callback interface (ResultCallback).
2. Stream audio: Call sendAudioFrame repeatedly to stream audio segments (from a file or microphone). Send segments of ~100 ms duration, 1-16 KB each. The server returns results through the onEvent callback.
3. End recognition: Call stop to end recognition. This blocks until onComplete or onError is called.

Complete examples:
- Recognize speech from a microphone
- Recognize a local audio file
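The three steps above can be sketched for the local-file case as follows, assuming the DashScope Java SDK is on the classpath, DASHSCOPE_API_KEY is set, and a 16 kHz, 16-bit mono PCM file named asr_example.pcm exists (the file name, segment size of 3,200 bytes ≈ 100 ms at that rate, and pacing are illustrative):

```java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class Main {
    public static void main(String[] args) throws Exception {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("pcm")
                .sampleRate(16000)
                .build();

        // Step 1: start streaming recognition with a callback.
        Recognition recognizer = new Recognition();
        recognizer.call(param, new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                // Intermediate results arrive continuously;
                // isSentenceEnd() marks the final result for a sentence.
                if (result.isSentenceEnd()) {
                    System.out.println("Final: " + result.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println("Recognition complete.");
            }

            @Override
            public void onError(Exception e) {
                e.printStackTrace();
            }
        });

        // Step 2: stream the audio in ~100 ms segments.
        try (FileInputStream audio = new FileInputStream("asr_example.pcm")) {
            byte[] buffer = new byte[3200];
            int read;
            while ((read = audio.read(buffer)) > 0) {
                // Copy the bytes so the reused buffer is not overwritten mid-send.
                recognizer.sendAudioFrame(ByteBuffer.wrap(Arrays.copyOf(buffer, read)));
                Thread.sleep(100); // pace the stream roughly in real time
            }
        }

        // Step 3: end recognition; blocks until onComplete or onError fires.
        recognizer.stop();
    }
}
```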
Bidirectional streaming call: Flowable-based
Receive real-time results using a Flowable workflow.
Flowable is a reactive stream type from RxJava that supports backpressure. See Flowable API reference.
Call streamCall to start recognition. It returns Flowable<RecognitionResult>; use blockingForEach or subscribe to process the results. streamCall requires:
- RecognitionParam: the model, sample rate, and audio format
- Flowable<ByteBuffer>: the audio stream
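A sketch of the Flowable workflow, assuming RxJava 2 (io.reactivex.Flowable, which the SDK uses) and a 16 kHz, 16-bit mono PCM file named asr_example.pcm (the file name and 3,200-byte ≈ 100 ms segment size are illustrative):

```java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import io.reactivex.Flowable;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class Main {
    public static void main(String[] args) throws Exception {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("pcm")
                .sampleRate(16000)
                .build();

        // Build a Flowable<ByteBuffer> that emits ~100 ms audio segments from a file.
        Flowable<ByteBuffer> audioSource = Flowable.generate(
                () -> new FileInputStream("asr_example.pcm"),   // state: the open stream
                (stream, emitter) -> {
                    byte[] buffer = new byte[3200];
                    int read = stream.read(buffer);
                    if (read > 0) {
                        emitter.onNext(ByteBuffer.wrap(Arrays.copyOf(buffer, read)));
                    } else {
                        emitter.onComplete();
                    }
                    return stream;
                },
                FileInputStream::close);                         // dispose: close the stream

        // streamCall returns Flowable<RecognitionResult>; block and print final sentences.
        Recognition recognizer = new Recognition();
        recognizer.streamCall(param, audioSource)
                .blockingForEach(result -> {
                    if (result.isSentenceEnd()) {
                        System.out.println(result.getSentence().getText());
                    }
                });
    }
}
```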
High-concurrency calls
The SDK uses OkHttp3 connection pooling to reduce overhead. See High-concurrency management.
Request parameters
Use RecognitionParam builder methods to configure the model, sample rate, and audio format. Pass the configured object to call or streamCall.
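A minimal sketch of configuring the three required parameters with the RecognitionParam builder (apiKey can be omitted when the DASHSCOPE_API_KEY environment variable is set):

```java
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

public class ParamExample {
    public static void main(String[] args) {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")  // see Model availability
                .sampleRate(16000)          // Hz; the only rate fun-asr-realtime supports
                .format("wav")              // pcm, wav, mp3, opus, speex, aac, or amr
                // .apiKey("sk-...")        // optional if DASHSCOPE_API_KEY is set
                .build();
        // Pass param to call or streamCall.
    }
}
```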
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| model | String | - | Yes | The model for real-time speech recognition. See Model availability. |
| sampleRate | Integer | - | Yes | Audio sample rate in Hz. fun-asr-realtime supports 16000 Hz. |
| format | String | - | Yes | Audio format. Supported: pcm, wav, mp3, opus, speex, aac, amr. Important: opus/speex must use Ogg encapsulation. wav must be PCM encoded. amr: only AMR-NB is supported. |
| vocabularyId | String | - | No | Custom vocabulary ID. See Customize hotwords. Not set by default. |
| semantic_punctuation_enabled | boolean | false | No | Punctuation mode. true: semantic (higher accuracy, for meetings). false (default): VAD (lower latency, for interactive scenarios). |
| max_sentence_silence | Integer | 1300 | No | Silence threshold for VAD punctuation in ms. When silence after speech exceeds this value, the sentence ends. Range: 200 to 6000 ms. Only effective when semantic_punctuation_enabled is false. |
| multi_threshold_mode_enabled | boolean | false | No | Prevents VAD from segmenting long sentences prematurely. Only effective when semantic_punctuation_enabled is false. |
| punctuation_prediction_enabled | boolean | true | No | Adds punctuation to the result automatically. true (default): cannot be modified. |
| heartbeat | boolean | false | No | Maintains a persistent connection during silence. true: connection stays alive. false (default): disconnects after 60 seconds, even with silent audio. Requires SDK version 2.19.1 or later. |
| language_hints | String[] | ["zh", "en"] | No | Language codes for recognition. Leave unset to auto-detect. Supported codes: zh (Chinese), en (English), ja (Japanese). |
| speech_noise_threshold | float | - | No | VAD sensitivity. Range: [-1.0, 1.0]. Near -1: more noise transcribed as speech. Near +1: some speech filtered as noise. Important: Advanced parameter. Adjustments significantly affect quality. Test thoroughly. Make small adjustments (step size 0.1) based on your audio environment. |
| apiKey | String | - | No | Your API key. |
For parameters not exposed on the RecognitionParam builder directly (such as semantic_punctuation_enabled, heartbeat, and language_hints), use the parameter or parameters method:
- Set using the parameter method
- Set using the parameters method
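A sketch of both approaches (the exact signatures are assumed to take a String key with an Object value, and a Map for the batch form):

```java
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

import java.util.HashMap;
import java.util.Map;

public class ExtraParams {
    public static void main(String[] args) {
        // Set one value at a time with the parameter method:
        RecognitionParam single = RecognitionParam.builder()
                .model("fun-asr-realtime").sampleRate(16000).format("wav")
                .parameter("semantic_punctuation_enabled", true)
                .build();

        // Or set several values at once with the parameters method:
        Map<String, Object> extras = new HashMap<>();
        extras.put("heartbeat", true);
        extras.put("language_hints", new String[] {"zh", "en"});
        RecognitionParam multiple = RecognitionParam.builder()
                .model("fun-asr-realtime").sampleRate(16000).format("wav")
                .parameters(extras)
                .build();
    }
}
```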
Key interfaces
Recognition class
Import: import com.alibaba.dashscope.audio.asr.recognition.Recognition;
| Interface/Method | Parameters | Return value | Description |
|---|---|---|---|
| public void call(RecognitionParam param, final ResultCallback<RecognitionResult> callback) | param: Request parameters. callback: Callback interface (ResultCallback). | None | Callback-based streaming recognition. Does not block the current thread. |
| public String call(RecognitionParam param, File file) | param: Request parameters. file: Audio file to recognize. | Recognition result | Non-streaming call with a local file. Blocks until processing completes. The file must be readable. |
| public Flowable<RecognitionResult> streamCall(RecognitionParam param, Flowable<ByteBuffer> audioFrame) | param: Request parameters. audioFrame: A Flowable<ByteBuffer> instance. | Flowable<RecognitionResult> | Flowable-based streaming recognition. |
| public void sendAudioFrame(ByteBuffer audioFrame) | audioFrame: Binary audio stream (ByteBuffer). | None | Sends an audio segment. Each packet should be ~100 ms in duration and 1-16 KB in size. Results arrive through the onEvent method of ResultCallback. |
| public void stop() | None | None | Stops recognition. Blocks until onComplete or onError is called. |
| boolean getDuplexApi().close(int code, String reason) | code: WebSocket close code. reason: Reason for closing. See The WebSocket Protocol. | true | Closes the WebSocket connection after each task to prevent leaks (even on exceptions). To reuse connections, see High-concurrency management. |
| public String getLastRequestId() | None | requestId | Gets the current task's request ID. Call after starting a task with call or streamCall. Requires SDK version 2.18.0 or later. |
| public long getFirstPackageDelay() | None | First-packet latency | Gets the delay from when the first audio packet is sent to when the first result is received. Call after the task completes. Requires SDK version 2.18.0 or later. |
| public long getLastPackageDelay() | None | Last-packet latency | Gets the time from when stop is sent to when the last result is delivered. Call after the task completes. Requires SDK version 2.18.0 or later. |
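The post-task methods fit together as in this fragment, assuming recognizer is the Recognition instance from the examples above and stop() has already returned (1000 is the normal-closure code from RFC 6455):

```java
// Inspect the finished task, then release the connection.
String requestId = recognizer.getLastRequestId();      // requires SDK >= 2.18.0
long firstPacketDelayMs = recognizer.getFirstPackageDelay();
long lastPacketDelayMs = recognizer.getLastPackageDelay();
System.out.println(requestId + ": first=" + firstPacketDelayMs
        + " ms, last=" + lastPacketDelayMs + " ms");

// Close the WebSocket to prevent connection leaks
// (skip this if you reuse connections; see High-concurrency management).
recognizer.getDuplexApi().close(1000, "bye");
```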
Callback interface (ResultCallback)
In bidirectional streaming calls, implement callback methods to handle results from the server.
Inherit ResultCallback<RecognitionResult> and implement its methods. RecognitionResult encapsulates the server response.
Java supports connection reuse, so there are no onClose or onOpen methods.
Example
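A minimal sketch of a ResultCallback subclass that prints only final sentences (the class name PrintingCallback is illustrative; pass an instance to the call method):

```java
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;

// Prints each finished sentence; intermediate partial results are ignored.
class PrintingCallback extends ResultCallback<RecognitionResult> {
    @Override
    public void onEvent(RecognitionResult result) {
        if (result.isSentenceEnd()) {
            System.out.println(result.getSentence().getText());
        }
    }

    @Override
    public void onComplete() {
        System.out.println("Recognition complete.");
    }

    @Override
    public void onError(Exception e) {
        e.printStackTrace();
    }
}
```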
| Interface/Method | Parameters | Return value | Description |
|---|---|---|---|
| public void onEvent(RecognitionResult result) | result: Real-time recognition result (RecognitionResult) | None | Called when the server returns a result. |
| public void onComplete() | None | None | Called when recognition completes successfully. |
| public void onError(Exception e) | e: Exception information | None | Called when an error occurs. |
Response results
Real-time recognition result (RecognitionResult)
RecognitionResult represents a single recognition result.
| Interface/Method | Parameters | Return value | Description |
|---|---|---|---|
| public String getRequestId() | None | requestId | Gets the request ID. |
| public boolean isSentenceEnd() | None | Whether the sentence has ended | Returns whether the current sentence has ended (final result). |
| public Sentence getSentence() | None | Sentence | Gets sentence information including timestamps and text. |
Sentence information (Sentence)
| Interface/Method | Parameters | Return value | Description |
|---|---|---|---|
| public Long getBeginTime() | None | Sentence start time, in ms | Returns the sentence start time. |
| public Long getEndTime() | None | Sentence end time, in ms | Returns the sentence end time. |
| public String getText() | None | Recognized text | Returns the recognized text. |
| public List<Word> getWords() | None | A List of word timestamp information (Word) | Returns word timestamp information. |
Word timestamp information (Word)
| Interface/Method | Parameters | Return value | Description |
|---|---|---|---|
| public long getBeginTime() | None | Word start time, in ms | Returns the word start time. |
| public long getEndTime() | None | Word end time, in ms | Returns the word end time. |
| public String getText() | None | Word | Returns the recognized word. |
| public String getPunctuation() | None | Punctuation | Returns the punctuation. |
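Putting Sentence and Word together, a final result can be unpacked as in this fragment for the onEvent callback (import paths for Sentence and Word are not listed above and may vary by SDK version, so they are omitted here):

```java
// Inside onEvent, once the sentence is final:
if (result.isSentenceEnd()) {
    Sentence sentence = result.getSentence();
    System.out.printf("[%d-%d ms] %s%n",
            sentence.getBeginTime(), sentence.getEndTime(), sentence.getText());

    // Per-word timestamps, text, and trailing punctuation.
    for (Word word : sentence.getWords()) {
        System.out.printf("  %d-%d ms: %s%s%n",
                word.getBeginTime(), word.getEndTime(),
                word.getText(), word.getPunctuation());
    }
}
```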