Real-time ASR Java SDK
User guide: For model overview and selection, see Real-time speech recognition.
Prerequisites
Model availability
| Model | Version | Unit price | Free quota (Note) |
|---|---|---|---|
| fun-asr-realtime (currently points to fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
- Supported languages: Mandarin (including 9 regional accents), Cantonese, Wu, Minnan, Hakka, Gan, Xiang, Jin, English, and Japanese
- Supported sample rate: 16 kHz
- Supported audio formats: pcm, wav, mp3, opus, speex, aac, amr
Getting started
The Recognition class provides two interfaces:
- Non-streaming: Returns the complete result at once. Use for pre-recorded audio.
- Bidirectional streaming: Returns results in real time as audio streams in. Use for microphones or scenarios that need immediate feedback.
Non-streaming call
Pass a local file to synchronously receive the transcription. This call blocks the current thread.
Instantiate the Recognition class and call the call method with the request parameters and the file.
The audio file used in the example is asr_example.wav.
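A minimal sketch of the non-streaming call, assuming the DashScope Java SDK is on the classpath and the DASHSCOPE_API_KEY environment variable is set (the class name Main is illustrative):

```java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

import java.io.File;

public class Main {
    public static void main(String[] args) {
        // Configure the three required parameters.
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("wav")
                .sampleRate(16000)
                // .apiKey("sk-...") // optional if DASHSCOPE_API_KEY is set
                .build();

        // Blocking call: returns the full transcription once processing completes.
        Recognition recognizer = new Recognition();
        String result = recognizer.call(param, new File("asr_example.wav"));
        System.out.println(result);
    }
}
```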
Bidirectional streaming call: callback-based
Receive real-time results by implementing a callback interface.
1. Start streaming recognition: Instantiate the Recognition class and call the call method with the request parameters and the callback interface (ResultCallback).
2. Stream audio: Call sendAudioFrame repeatedly to stream audio segments (from a file or microphone). Send segments of ~100 ms duration, 1-16 KB each. The server returns results through the onEvent callback.
3. End recognition: Call stop to end recognition. This blocks until onComplete or onError is called.

Complete examples:
- Recognize speech from a microphone
- Recognize a local audio file
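The three steps above can be sketched for the local-file case as follows, assuming the DashScope Java SDK is on the classpath, DASHSCOPE_API_KEY is set, and a 16 kHz, 16-bit mono PCM file named asr_example.pcm exists (the file name, segment size of 3,200 bytes ≈ 100 ms at that rate, and pacing are illustrative):

```java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class Main {
    public static void main(String[] args) throws Exception {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("pcm")
                .sampleRate(16000)
                .build();

        // Step 1: start streaming recognition with a callback.
        Recognition recognizer = new Recognition();
        recognizer.call(param, new ResultCallback<RecognitionResult>() {
            @Override
            public void onEvent(RecognitionResult result) {
                // Intermediate results arrive continuously;
                // isSentenceEnd() marks the final result for a sentence.
                if (result.isSentenceEnd()) {
                    System.out.println("Final: " + result.getSentence().getText());
                }
            }

            @Override
            public void onComplete() {
                System.out.println("Recognition complete.");
            }

            @Override
            public void onError(Exception e) {
                e.printStackTrace();
            }
        });

        // Step 2: stream the audio in ~100 ms segments.
        try (FileInputStream audio = new FileInputStream("asr_example.pcm")) {
            byte[] buffer = new byte[3200];
            int read;
            while ((read = audio.read(buffer)) > 0) {
                // Copy the bytes so the reused buffer is not overwritten mid-send.
                recognizer.sendAudioFrame(ByteBuffer.wrap(Arrays.copyOf(buffer, read)));
                Thread.sleep(100); // pace the stream roughly in real time
            }
        }

        // Step 3: end recognition; blocks until onComplete or onError fires.
        recognizer.stop();
    }
}
```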
Bidirectional streaming call: Flowable-based
Receive real-time results using a Flowable workflow.
Flowable is a reactive stream type from RxJava that supports backpressure. See Flowable API reference.
Call streamCall to start recognition. It returns Flowable<RecognitionResult>; use blockingForEach or subscribe to process the results. streamCall requires:
- RecognitionParam: the model, sample rate, and audio format
- Flowable<ByteBuffer>: the audio stream
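A sketch of the Flowable workflow, assuming RxJava 2 (io.reactivex.Flowable, which the SDK uses) and a 16 kHz, 16-bit mono PCM file named asr_example.pcm (the file name and 3,200-byte ≈ 100 ms segment size are illustrative):

```java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import io.reactivex.Flowable;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class Main {
    public static void main(String[] args) throws Exception {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")
                .format("pcm")
                .sampleRate(16000)
                .build();

        // Build a Flowable<ByteBuffer> that emits ~100 ms audio segments from a file.
        Flowable<ByteBuffer> audioSource = Flowable.generate(
                () -> new FileInputStream("asr_example.pcm"),   // state: the open stream
                (stream, emitter) -> {
                    byte[] buffer = new byte[3200];
                    int read = stream.read(buffer);
                    if (read > 0) {
                        emitter.onNext(ByteBuffer.wrap(Arrays.copyOf(buffer, read)));
                    } else {
                        emitter.onComplete();
                    }
                    return stream;
                },
                FileInputStream::close);                         // dispose: close the stream

        // streamCall returns Flowable<RecognitionResult>; block and print final sentences.
        Recognition recognizer = new Recognition();
        recognizer.streamCall(param, audioSource)
                .blockingForEach(result -> {
                    if (result.isSentenceEnd()) {
                        System.out.println(result.getSentence().getText());
                    }
                });
    }
}
```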
High-concurrency calls
The SDK uses OkHttp3 connection pooling to reduce overhead. See High-concurrency management.
Request parameters
Use RecognitionParam builder methods to configure the model, sample rate, and audio format. Pass the configured object to call or streamCall.
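A minimal sketch of configuring the three required parameters with the RecognitionParam builder (apiKey can be omitted when the DASHSCOPE_API_KEY environment variable is set):

```java
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

public class ParamExample {
    public static void main(String[] args) {
        RecognitionParam param = RecognitionParam.builder()
                .model("fun-asr-realtime")  // see Model availability
                .sampleRate(16000)          // Hz; the only rate fun-asr-realtime supports
                .format("wav")              // pcm, wav, mp3, opus, speex, aac, or amr
                // .apiKey("sk-...")        // optional if DASHSCOPE_API_KEY is set
                .build();
        // Pass param to call or streamCall.
    }
}
```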
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| model | String | - | Yes | The model for real-time speech recognition. See Model availability. |
| sampleRate | Integer | - | Yes | Audio sample rate in Hz. fun-asr-realtime supports 16000 Hz. |
| format | String | - | Yes | Audio format. Supported: pcm, wav, mp3, opus, speex, aac, amr. Important: opus/speex must use Ogg encapsulation. wav must be PCM encoded. amr: only AMR-NB is supported. |
| vocabularyId | String | - | No | Custom vocabulary ID. See Customize hotwords. Not set by default. |
| semantic_punctuation_enabled | boolean | false | No | Punctuation mode. true: semantic (higher accuracy, for meetings). false (default): VAD (lower latency, for interactive scenarios). |
| max_sentence_silence | Integer | 1300 | No | Silence threshold for VAD punctuation in ms. When silence after speech exceeds this value, the sentence ends. Range: 200 to 6000 ms. Only effective when semantic_punctuation_enabled is false. |
| multi_threshold_mode_enabled | boolean | false | No | Prevents VAD from segmenting long sentences prematurely. Only effective when semantic_punctuation_enabled is false. |
| punctuation_prediction_enabled | boolean | true | No | Adds punctuation to the result automatically. true (default): cannot be modified. |
| heartbeat | boolean | false | No | Maintains a persistent connection during silence. true: connection stays alive. false (default): disconnects after 60 seconds, even with silent audio. Requires SDK version 2.19.1 or later. |
| language_hints | String[] | ["zh", "en"] | No | Language codes for recognition. Leave unset to auto-detect. Supported codes: zh (Chinese), en (English), ja (Japanese). |
| speech_noise_threshold | float | - | No | VAD sensitivity. Range: [-1.0, 1.0]. Near -1: more noise transcribed as speech. Near +1: some speech filtered as noise. Important: Advanced parameter. Adjustments significantly affect quality. Test thoroughly. Make small adjustments (step size 0.1) based on your audio environment. |
| apiKey | String | - | No | Your API key. |
For parameters not exposed on the RecognitionParam builder directly (such as semantic_punctuation_enabled, heartbeat, and language_hints), use the parameter or parameters method:
- Set using the parameter method
- Set using the parameters method
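A sketch of both approaches (the exact signatures are assumed to take a String key with an Object value, and a Map for the batch form):

```java
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;

import java.util.HashMap;
import java.util.Map;

public class ExtraParams {
    public static void main(String[] args) {
        // Set one value at a time with the parameter method:
        RecognitionParam single = RecognitionParam.builder()
                .model("fun-asr-realtime").sampleRate(16000).format("wav")
                .parameter("semantic_punctuation_enabled", true)
                .build();

        // Or set several values at once with the parameters method:
        Map<String, Object> extras = new HashMap<>();
        extras.put("heartbeat", true);
        extras.put("language_hints", new String[] {"zh", "en"});
        RecognitionParam multiple = RecognitionParam.builder()
                .model("fun-asr-realtime").sampleRate(16000).format("wav")
                .parameters(extras)
                .build();
    }
}
```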
Key interfaces
Recognition class
Import: import com.alibaba.dashscope.audio.asr.recognition.Recognition;
| Interface/Method | Parameters | Return value | Description |
|---|---|---|---|
| public void call(RecognitionParam param, final ResultCallback<RecognitionResult> callback) | param: Request parameters. callback: Callback interface (ResultCallback). | None | Callback-based streaming recognition. Does not block the current thread. |
| public String call(RecognitionParam param, File file) | param: Request parameters. file: Audio file to recognize. | Recognition result | Non-streaming call with a local file. Blocks until processing completes. The file must be readable. |
| public Flowable<RecognitionResult> streamCall(RecognitionParam param, Flowable<ByteBuffer> audioFrame) | param: Request parameters. audioFrame: A Flowable<ByteBuffer> instance. | Flowable<RecognitionResult> | Flowable-based streaming recognition. |
| public void sendAudioFrame(ByteBuffer audioFrame) | audioFrame: Binary audio stream (ByteBuffer). | None | Sends an audio segment. Each packet should be ~100 ms in duration and 1-16 KB in size. Results arrive through the onEvent method of ResultCallback. |
| public void stop() | None | None | Stops recognition. Blocks until onComplete or onError is called. |
| boolean getDuplexApi().close(int code, String reason) | code: WebSocket close code. reason: Reason for closing. See The WebSocket Protocol. | true | Closes the WebSocket connection after each task to prevent leaks (even on exceptions). To reuse connections, see High-concurrency management. |
| public String getLastRequestId() | None | requestId | Gets the current task's request ID. Call after starting a task with call or streamCall. Requires SDK version 2.18.0 or later. |
| public long getFirstPackageDelay() | None | First-packet latency | Gets the delay from when the first audio packet is sent to when the first result is received. Call after the task completes. Requires SDK version 2.18.0 or later. |
| public long getLastPackageDelay() | None | Last-packet latency | Gets the time from when stop is sent to when the last result is delivered. Call after the task completes. Requires SDK version 2.18.0 or later. |
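The post-task methods fit together as in this fragment, assuming recognizer is the Recognition instance from the examples above and stop() has already returned (1000 is the normal-closure code from RFC 6455):

```java
// Inspect the finished task, then release the connection.
String requestId = recognizer.getLastRequestId();      // requires SDK >= 2.18.0
long firstPacketDelayMs = recognizer.getFirstPackageDelay();
long lastPacketDelayMs = recognizer.getLastPackageDelay();
System.out.println(requestId + ": first=" + firstPacketDelayMs
        + " ms, last=" + lastPacketDelayMs + " ms");

// Close the WebSocket to prevent connection leaks
// (skip this if you reuse connections; see High-concurrency management).
recognizer.getDuplexApi().close(1000, "bye");
```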
Callback interface (ResultCallback)
In bidirectional streaming calls, implement callback methods to handle results from the server.
Inherit ResultCallback<RecognitionResult> and implement its methods. RecognitionResult encapsulates the server response.
Java supports connection reuse, so there are no onClose or onOpen methods.
Example
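A minimal sketch of a ResultCallback subclass that prints only final sentences (the class name PrintingCallback is illustrative; pass an instance to the call method):

```java
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;

// Prints each finished sentence; intermediate partial results are ignored.
class PrintingCallback extends ResultCallback<RecognitionResult> {
    @Override
    public void onEvent(RecognitionResult result) {
        if (result.isSentenceEnd()) {
            System.out.println(result.getSentence().getText());
        }
    }

    @Override
    public void onComplete() {
        System.out.println("Recognition complete.");
    }

    @Override
    public void onError(Exception e) {
        e.printStackTrace();
    }
}
```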
| Interface/Method | Parameters | Return value | Description |
|---|---|---|---|
| public void onEvent(RecognitionResult result) | result: Real-time recognition result (RecognitionResult) | None | Called when the server returns a result. |
| public void onComplete() | None | None | Called when recognition completes successfully. |
| public void onError(Exception e) | e: Exception information | None | Called when an error occurs. |
Response results
Real-time recognition result (RecognitionResult)
RecognitionResult represents a single recognition result.
| Interface/Method | Parameters | Return value | Description |
|---|---|---|---|
| public String getRequestId() | None | requestId | Gets the request ID. |
| public boolean isSentenceEnd() | None | Whether the sentence has ended | Returns whether the current sentence has ended (final result). |
| public Sentence getSentence() | None | Sentence | Gets sentence information including timestamps and text. |
Sentence information (Sentence)
| Interface/Method | Parameters | Return value | Description |
|---|---|---|---|
| public Long getBeginTime() | None | Sentence start time, in ms | Returns the sentence start time. |
| public Long getEndTime() | None | Sentence end time, in ms | Returns the sentence end time. |
| public String getText() | None | Recognized text | Returns the recognized text. |
| public List<Word> getWords() | None | A List of word timestamp information (Word) | Returns word timestamp information. |
Word timestamp information (Word)
| Interface/Method | Parameters | Return value | Description |
|---|---|---|---|
| public long getBeginTime() | None | Word start time, in ms | Returns the word start time. |
| public long getEndTime() | None | Word end time, in ms | Returns the word end time. |
| public String getText() | None | Word | Returns the recognized word. |
| public String getPunctuation() | None | Punctuation | Returns the punctuation. |
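Putting Sentence and Word together, a final result can be unpacked as in this fragment for the onEvent callback (import paths for Sentence and Word are not listed above and may vary by SDK version, so they are omitted here):

```java
// Inside onEvent, once the sentence is final:
if (result.isSentenceEnd()) {
    Sentence sentence = result.getSentence();
    System.out.printf("[%d-%d ms] %s%n",
            sentence.getBeginTime(), sentence.getEndTime(), sentence.getText());

    // Per-word timestamps, text, and trailing punctuation.
    for (Word word : sentence.getWords()) {
        System.out.printf("  %d-%d ms: %s%s%n",
                word.getBeginTime(), word.getEndTime(),
                word.getText(), word.getPunctuation());
    }
}
```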