Qwen ASR Java integration
User guide: For the model overview, features, and complete sample code, see Realtime speech recognition.
Qwen-ASR-Realtime supports two interaction modes, VAD mode and manual mode. For details, see Interaction modes.
OmniRealtimeConversation manages the session:
- Create a conversation: creates a conversation with the specified connection parameters and callback handler.
- Connect to the server: opens a WebSocket connection. The server sends session.created and session.updated events. Throws NoApiKeyException and InterruptedException.
- Configure the session: updates the session configuration after connecting. The server sends a session.updated event. If not called, the server uses defaults.
- Send audio data: appends Base64-encoded audio to the server-side audio buffer.
- Commit the audio buffer: submits buffered audio for recognition. The server sends an input_audio_buffer.committed event.
- End the session: tells the server to finish processing remaining audio and end the session. The server sends a session.finished event. Call it after you finish sending audio (VAD mode), or after commit() (manual mode).
- Close the connection: stops the task and closes the WebSocket connection immediately.

OmniRealtimeCallback receives server events. See Key interfaces for details.
Prerequisites
- DashScope SDK 2.22.5 or later (Install the SDK)
- Get an API key
- Understand the interaction flow
Interaction modes
Qwen-ASR-Realtime supports two modes:
| Mode | enableTurnDetection | How it works |
|---|---|---|
| VAD mode (default) | true | The server detects speech boundaries with VAD and commits the audio buffer for recognition. |
| Manual mode | false | You control when to commit audio by calling commit(). |
Request parameters
Connection parameters (OmniRealtimeParam)
Set with OmniRealtimeParam chained methods.
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | String | Yes | Model name. Example: qwen3-asr-flash-realtime. |
| url | String | Yes | Service endpoint: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime. |
| apikey | String | No | API key. If not set, the SDK reads it from the DASHSCOPE_API_KEY environment variable. |
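A builder-style setup might look like the following sketch. The builder method names are assumed to mirror the parameter names in the table above; reading the key from the DASHSCOPE_API_KEY environment variable is also an assumption.

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeParam;

// Connection parameters via chained builder methods (names assumed from the table above).
OmniRealtimeParam param = OmniRealtimeParam.builder()
        .model("qwen3-asr-flash-realtime")
        .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
        .apikey(System.getenv("DASHSCOPE_API_KEY")) // optional if the env var is set
        .build();
```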
Session configuration (OmniRealtimeConfig)
Set with OmniRealtimeConfig chained methods.
| Parameter | Type | Required | Description |
|---|---|---|---|
| modalities | List<OmniRealtimeModality> | Yes | Output modality. Fixed to [OmniRealtimeModality.TEXT]. |
| enableTurnDetection | boolean | No | Enables server-side VAD. When disabled, call commit() to trigger recognition manually. Default: true. |
| turnDetectionType | String | No | VAD type. Fixed to server_vad. |
| turnDetectionThreshold | float | No | VAD threshold. Recommended: 0.0. Default: 0.2. Valid range: [-1, 1]. Lower values increase sensitivity but risk treating background noise as speech; higher values reduce sensitivity, which suits noisy environments. |
| turnDetectionSilenceDurationMs | int | No | VAD silence threshold in milliseconds. Silence beyond this duration marks the end of a statement. Recommended: 400. Default: 800. Valid range: [200, 6000]. Lower values (such as 300 ms) speed up the response but may split at normal pauses; higher values (such as 1200 ms) handle long pauses better but add latency. |
| transcriptionConfig | OmniRealtimeTranscriptionParam | No | Speech recognition settings. See Transcription parameters. |
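A VAD-mode session configuration might be built as in this sketch. Builder method names are assumed to match the parameter names in the table above, using the recommended threshold and silence-duration values.

```java
import java.util.Collections;
import com.alibaba.dashscope.audio.omni.OmniRealtimeConfig;
import com.alibaba.dashscope.audio.omni.OmniRealtimeModality;

// Session configuration via chained builder methods (names assumed from the table above).
OmniRealtimeConfig config = OmniRealtimeConfig.builder()
        .modalities(Collections.singletonList(OmniRealtimeModality.TEXT)) // fixed to TEXT
        .enableTurnDetection(true)            // VAD mode (default)
        .turnDetectionThreshold(0.0f)         // recommended threshold
        .turnDetectionSilenceDurationMs(400)  // recommended silence duration
        .build();
```

For manual mode, set enableTurnDetection(false) and trigger recognition yourself with commit().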
Transcription parameters (OmniRealtimeTranscriptionParam)
Set with OmniRealtimeTranscriptionParam setter methods.
| Parameter | Type | Required | Description |
|---|---|---|---|
| language | String | No | Source language of the audio. Supported values: zh (Chinese - Mandarin, Sichuanese, Minnan, and Wu), yue (Cantonese), en (English), ja (Japanese), de (German), ko (Korean), ru (Russian), fr (French), pt (Portuguese), ar (Arabic), it (Italian), es (Spanish), hi (Hindi), id (Indonesian), th (Thai), tr (Turkish), uk (Ukrainian), vi (Vietnamese), cs (Czech), da (Danish), fil (Filipino), fi (Finnish), is (Icelandic), ms (Malay), no (Norwegian), pl (Polish), sv (Swedish). |
| inputSampleRate | int | No | Audio sample rate in Hz. Supported: 16000, 8000. Default: 16000. Setting 8000 triggers server-side upsampling to 16000 Hz, which may add a minor delay. Use 8000 only for 8000 Hz source audio (such as telephone lines). |
| inputAudioFormat | String | No | Audio format. Supported: pcm, opus. Default: pcm. |
| corpusText | String | No | Context (background text, entity vocabularies, reference information) to customize recognition. Limit: 10,000 tokens. See Context biasing. |
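A transcription configuration might look like this sketch. The setter names are assumed to follow the parameter names in the table above (setLanguage, setInputSampleRate, setInputAudioFormat).

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeTranscriptionParam;

// Transcription settings via setters (names assumed from the table above).
OmniRealtimeTranscriptionParam transcription = new OmniRealtimeTranscriptionParam();
transcription.setLanguage("en");          // source language of the audio
transcription.setInputSampleRate(16000);  // Hz; use 8000 only for telephone-line audio
transcription.setInputAudioFormat("pcm"); // or "opus"
```

Pass the resulting object to the session configuration via transcriptionConfig.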
Key interfaces
OmniRealtimeConversation
Import: com.alibaba.dashscope.audio.omni.OmniRealtimeConversation
Manages the WebSocket lifecycle: connect, send audio, and end the session.
Create a conversation
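Creation might look like the following fragment, assuming a constructor that takes the connection parameters and a callback handler as described above. The surrounding class, exception handling, and the param/config objects are omitted for brevity; connect() and updateSession() are assumed names for the connect and configure steps.

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;
import com.alibaba.dashscope.audio.omni.OmniRealtimeConversation;
import com.google.gson.JsonObject;

// Create the conversation with connection parameters and an inline callback.
OmniRealtimeConversation conversation = new OmniRealtimeConversation(param,
        new OmniRealtimeCallback() {
            @Override
            public void onOpen() {
                System.out.println("WebSocket open");
            }

            @Override
            public void onEvent(JsonObject message) {
                // Dispatch on the "type" field; see OmniRealtimeCallback below.
                System.out.println("event: " + message.get("type").getAsString());
            }

            @Override
            public void onClose(int code, String reason) {
                System.out.println("closed: " + code + " " + reason);
            }
        });

conversation.connect();             // throws NoApiKeyException, InterruptedException
conversation.updateSession(config); // optional; server defaults apply otherwise
```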
Connect to the server
Throws: NoApiKeyException, InterruptedException.
Configure the session
Send audio data
- VAD mode (enableTurnDetection=true): The server detects speech boundaries and decides when to process the buffer.
- Manual mode (enableTurnDetection=false): Audio accumulates until you call commit().

Each event can contain up to 15 MiB of audio data.
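Streaming a local PCM file could be sketched as below. The appendAudio method name, and that it accepts a Base64 string, are assumptions based on the description above; the file path and chunk pacing are illustrative.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Base64;

// Stream a 16 kHz, 16-bit mono PCM file in roughly 100 ms chunks.
byte[] chunk = new byte[3200]; // 16000 samples/s * 2 bytes/sample * 0.1 s
try (InputStream in = new FileInputStream("speech.pcm")) {
    int n;
    while ((n = in.read(chunk)) > 0) {
        // Base64-encode only the bytes actually read from the file.
        String b64 = Base64.getEncoder().encodeToString(Arrays.copyOf(chunk, n));
        conversation.appendAudio(b64);
        Thread.sleep(100); // pace the stream at roughly real time
    }
}
```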
Commit the audio buffer
Only available in manual mode (enableTurnDetection=false). Returns an error if the audio buffer is empty.
End the session
When to call:
- VAD mode: after you finish sending audio.
- Manual mode: after you call commit().
Close the connection
Get session and response IDs
- getSessionId() returns the session ID for the current task.
- getResponseId() returns the response ID from the most recent server response.
OmniRealtimeCallback
Import: com.alibaba.dashscope.audio.omni.OmniRealtimeCallback
Implement this class to handle server events.
| Method | Parameters | Triggered when |
|---|---|---|
| onOpen() | None | WebSocket connection established. |
| onEvent(JsonObject message) | message: A server event as JSON. Common types: session.created, session.updated, input_audio_buffer.committed, conversation.item.input_audio_transcription.completed, session.finished. | A server event is received. Parse the type field to determine the event type. |
| onClose(int code, String reason) | code: Status code. reason: Close reason. | WebSocket connection closed. |
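A standalone callback class might look like this sketch. It subclasses OmniRealtimeCallback (the table above suggests a class to extend rather than an interface) and dispatches on the type field; the exact payload layout of the transcription-completed event is an assumption, so the sketch prints the raw JSON rather than a specific field.

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;
import com.google.gson.JsonObject;

public class TranscriptCallback extends OmniRealtimeCallback {
    @Override
    public void onOpen() {
        System.out.println("WebSocket open");
    }

    @Override
    public void onEvent(JsonObject message) {
        String type = message.get("type").getAsString();
        if ("conversation.item.input_audio_transcription.completed".equals(type)) {
            // Payload layout is an assumption: inspect the raw JSON in your own
            // runs to confirm where the transcript text lives.
            System.out.println("transcription completed: " + message);
        } else if ("session.finished".equals(type)) {
            System.out.println("session finished");
        }
    }

    @Override
    public void onClose(int code, String reason) {
        System.out.println("closed: " + code + " / " + reason);
    }
}
```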