
Qwen-ASR realtime Java SDK

Qwen ASR Java integration

User guide: For the model overview, features, and complete sample code, see Realtime speech recognition.

Interaction modes

Qwen-ASR-Realtime supports two modes:
Mode | enableTurnDetection | How it works
VAD mode (default) | true | The server detects speech boundaries with VAD and commits the audio buffer for recognition.
Manual mode | false | You control when to commit audio by calling commit().
For details, see VAD mode and Manual mode.
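In practice, switching between the two modes only changes the enableTurnDetection flag and whether you call commit() yourself. A minimal sketch, using the builder methods shown in the samples on this page:

```java
// Sketch: choosing an interaction mode. enableTurnDetection is the
// session-level switch; the rest of the configuration is identical.
OmniRealtimeConfig vadConfig = OmniRealtimeConfig.builder()
    .modalities(Collections.singletonList(OmniRealtimeModality.TEXT))
    .enableTurnDetection(true)   // VAD mode: the server commits audio for you
    .build();

OmniRealtimeConfig manualConfig = OmniRealtimeConfig.builder()
    .modalities(Collections.singletonList(OmniRealtimeModality.TEXT))
    .enableTurnDetection(false)  // Manual mode: call conversation.commit() yourself
    .build();
```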

Request parameters

Connection parameters (OmniRealtimeParam)

Set connection parameters with the OmniRealtimeParam builder's chained methods.
OmniRealtimeParam param = OmniRealtimeParam.builder()
  .model("qwen3-asr-flash-realtime")
  .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
  // If you have not configured an environment variable, replace the following line with .apikey("sk-xxx").
  .apikey(System.getenv("DASHSCOPE_API_KEY"))
  .build();
Parameter | Type | Required | Description
model | String | Yes | Model name. Example: qwen3-asr-flash-realtime.
url | String | Yes | Service endpoint: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime.
apikey | String | No | API key.

Session configuration (OmniRealtimeConfig)

Set session configuration with the OmniRealtimeConfig builder's chained methods.
OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("zh");
transcriptionParam.setInputSampleRate(16000);
transcriptionParam.setInputAudioFormat("pcm");

OmniRealtimeConfig config = OmniRealtimeConfig.builder()
  .modalities(Collections.singletonList(OmniRealtimeModality.TEXT))
  .enableTurnDetection(true)
  .turnDetectionType("server_vad")
  .turnDetectionThreshold(0.0f)
  .turnDetectionSilenceDurationMs(400)
  .transcriptionConfig(transcriptionParam)
  .build();
Parameter | Type | Required | Description
modalities | List<OmniRealtimeModality> | Yes | Output modality. Fixed to [OmniRealtimeModality.TEXT].
enableTurnDetection | boolean | No | Enables server-side VAD. When disabled, call commit() to trigger recognition manually. Default: true.
turnDetectionType | String | No | VAD type. Fixed to server_vad.
turnDetectionThreshold | float | No | VAD threshold. Recommended: 0.0. Default: 0.2. Valid range: [-1, 1]. Lower values increase sensitivity but risk treating background noise as speech. Higher values reduce sensitivity, which suits noisy environments.
turnDetectionSilenceDurationMs | int | No | VAD silence threshold in milliseconds. Silence beyond this duration marks the end of a statement. Recommended: 400. Default: 800. Valid range: [200, 6000]. Lower values (such as 300 ms) speed up the response but may split at normal pauses. Higher values (such as 1200 ms) handle long pauses better but add latency.
transcriptionConfig | OmniRealtimeTranscriptionParam | No | Speech recognition settings. See Transcription parameters.

Transcription parameters (OmniRealtimeTranscriptionParam)

Set transcription parameters with OmniRealtimeTranscriptionParam setter methods.
OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("zh");
transcriptionParam.setInputSampleRate(16000);
transcriptionParam.setInputAudioFormat("pcm");
Parameter | Type | Required | Description
language | String | No | Source language of the audio. Supported values: zh (Chinese - Mandarin, Sichuanese, Minnan, and Wu), yue (Cantonese), en (English), ja (Japanese), de (German), ko (Korean), ru (Russian), fr (French), pt (Portuguese), ar (Arabic), it (Italian), es (Spanish), hi (Hindi), id (Indonesian), th (Thai), tr (Turkish), uk (Ukrainian), vi (Vietnamese), cs (Czech), da (Danish), fil (Filipino), fi (Finnish), is (Icelandic), ms (Malay), no (Norwegian), pl (Polish), sv (Swedish).
inputSampleRate | int | No | Audio sample rate in Hz. Supported: 16000, 8000. Default: 16000. Setting 8000 triggers server-side upsampling to 16000 Hz, which may add a minor delay. Use 8000 only for 8000 Hz source audio (such as telephone lines).
inputAudioFormat | String | No | Audio format. Supported: pcm, opus. Default: pcm.
corpusText | String | No | Context (background text, entity vocabularies, reference information) to customize recognition. Limit: 10,000 tokens. See Context biasing.
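For context biasing, corpusText can carry domain vocabulary alongside the other settings. A sketch, assuming a setCorpusText setter that follows the same naming pattern as the setters above (the exact method name is an assumption; check the SDK):

```java
OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("en");
transcriptionParam.setInputSampleRate(16000);
transcriptionParam.setInputAudioFormat("pcm");
// Hypothetical setter for the corpusText parameter; product names and
// speaker names here are placeholder examples.
transcriptionParam.setCorpusText("Product names: QuantumLeap, HyperDrive. Speaker: Dr. Alice Zhang.");
```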

Key interfaces

OmniRealtimeConversation

Import: com.alibaba.dashscope.audio.omni.OmniRealtimeConversation
Manages the WebSocket lifecycle: connect, send audio, and end the session.

Create a conversation

OmniRealtimeConversation conversation =
  new OmniRealtimeConversation(param, callback);
Creates a conversation with the specified connection parameters and callback handler.

Connect to the server

conversation.connect();
Opens a WebSocket connection. The server sends session.created and session.updated events. Throws: NoApiKeyException, InterruptedException.

Configure the session

conversation.updateSession(config);
Updates the session configuration after connecting. The server sends a session.updated event. If not called, the server uses defaults.

Send audio data

conversation.appendAudio(audioBase64);
Appends Base64-encoded audio to the server-side audio buffer.
  • VAD mode (enableTurnDetection=true): The server detects speech boundaries and decides when to process the buffer.
  • Manual mode (enableTurnDetection=false): Audio accumulates until you call commit(). Each event can contain up to 15 MiB of audio data.
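Because appendAudio() takes Base64 strings, raw PCM is typically sliced into small frames and encoded before sending. A self-contained sketch using only the JDK; the 3200-byte frame size (100 ms of 16 kHz, 16-bit mono audio) is a choice for illustration, not an SDK requirement:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

public class PcmChunker {
    // 16,000 samples/s * 2 bytes/sample * 0.1 s = 3200 bytes per 100 ms frame
    static final int FRAME_BYTES = 3200;

    /** Slices raw PCM into frames and Base64-encodes each one. */
    static List<String> toBase64Frames(byte[] pcm) {
        List<String> frames = new ArrayList<>();
        for (int off = 0; off < pcm.length; off += FRAME_BYTES) {
            int end = Math.min(off + FRAME_BYTES, pcm.length);
            frames.add(Base64.getEncoder().encodeToString(Arrays.copyOfRange(pcm, off, end)));
        }
        return frames;
    }

    public static void main(String[] args) {
        byte[] pcm = new byte[8000]; // 0.25 s of silence at 16 kHz, 16-bit mono
        List<String> frames = toBase64Frames(pcm);
        System.out.println(frames.size()); // 3 frames: 3200 + 3200 + 1600 bytes
        // Each frame would then be sent with conversation.appendAudio(frame).
    }
}
```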

Commit the audio buffer

conversation.commit();
Submits buffered audio for recognition. The server sends an input_audio_buffer.committed event.
Only available in manual mode (enableTurnDetection=false). Returns an error if the audio buffer is empty.

End the session

conversation.endSession();  // synchronous
// or
conversation.endSessionAsync();  // asynchronous
Tells the server to finish processing remaining audio and end the session. The server sends a session.finished event. When to call:
  • VAD mode: After you finish sending audio.
  • Manual mode: After you call commit().
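Putting the calls above together, a manual-mode session follows this shape. A sketch that assumes audio has already been encoded into Base64 frames (error handling and the callback implementation omitted):

```java
// Sketch: minimal manual-mode lifecycle.
conversation.connect();                 // server sends session.created
conversation.updateSession(config);     // config built with enableTurnDetection(false)
for (String frame : base64Frames) {     // base64Frames: your Base64-encoded PCM chunks
    conversation.appendAudio(frame);
}
conversation.commit();                  // triggers recognition of the buffered audio
conversation.endSession();              // server flushes and sends session.finished
conversation.close();                   // tear down the WebSocket
```

In VAD mode the flow is the same minus the commit() call: the server decides the speech boundaries itself.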

Close the connection

conversation.close();
Stops the task and closes the WebSocket connection immediately.

Get session and response IDs

String sessionId = conversation.getSessionId();
String responseId = conversation.getResponseId();
  • getSessionId() returns the session ID for the current task.
  • getResponseId() returns the response ID from the most recent server response.

OmniRealtimeCallback

Import: com.alibaba.dashscope.audio.omni.OmniRealtimeCallback
Implement this class to handle server events.
Method | Parameters | Triggered when
onOpen() | None | WebSocket connection established.
onEvent(JsonObject message) | message: a server event as JSON. Common types: session.created, session.updated, input_audio_buffer.committed, conversation.item.input_audio_transcription.completed, session.finished. | A server event is received. Parse the type field to determine the event type.
onClose(int code, String reason) | code: status code. reason: close reason. | WebSocket connection closed.
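A minimal handler dispatches on the event's type field. A sketch of an anonymous implementation; the exact payload layout of each event is defined by the server event schema, so this only inspects the type and prints the raw message:

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;
import com.google.gson.JsonObject;

// Sketch: route server events by their type field.
OmniRealtimeCallback callback = new OmniRealtimeCallback() {
    @Override
    public void onOpen() {
        System.out.println("connected");
    }

    @Override
    public void onEvent(JsonObject message) {
        String type = message.get("type").getAsString();
        switch (type) {
            case "session.created":
            case "session.updated":
                System.out.println("session event: " + type);
                break;
            case "conversation.item.input_audio_transcription.completed":
                // Inspect the full payload for the transcript; the field
                // layout is defined by the server event schema.
                System.out.println("transcription: " + message);
                break;
            case "session.finished":
                System.out.println("session finished");
                break;
            default:
                // Other events (e.g. input_audio_buffer.committed) arrive here.
                break;
        }
    }

    @Override
    public void onClose(int code, String reason) {
        System.out.println("closed: " + code + " " + reason);
    }
};
```

Pass this callback to the OmniRealtimeConversation constructor shown earlier on this page.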