Qwen ASR Java integration
User guide: For the model overview, features, and complete sample code, see Realtime speech recognition.
Qwen-ASR-Realtime supports two interaction modes, VAD mode and manual mode. For details, see Interaction modes.
OmniRealtimeConversation manages the session:
- Create a conversation: creates a conversation with the specified connection parameters and callback handler.
- Connect to the server: opens a WebSocket connection. The server sends session.created and session.updated events. Throws NoApiKeyException and InterruptedException.
- Configure the session: updates the session configuration after connecting. The server sends a session.updated event. If not called, the server uses defaults.
- Send audio data: appends Base64-encoded audio to the server-side audio buffer.
- Commit the audio buffer: submits buffered audio for recognition. The server sends an input_audio_buffer.committed event.
- End the session: tells the server to finish processing remaining audio and end the session. The server sends a session.finished event. Call it after you finish sending audio (VAD mode), or after commit() (manual mode).
- Close the connection: stops the task and closes the WebSocket connection immediately.

OmniRealtimeCallback receives server events. See Key interfaces for details.
Prerequisites
- DashScope SDK 2.22.5 or later (Install the SDK)
- Get an API key
- Understand the interaction flow
Interaction modes
Qwen-ASR-Realtime supports two modes:
| Mode | enableTurnDetection | How it works |
|---|---|---|
| VAD mode (default) | true | The server detects speech boundaries with VAD and commits the audio buffer for recognition. |
| Manual mode | false | You control when to commit audio by calling commit(). |
Request parameters
Connection parameters (OmniRealtimeParam)
Set with OmniRealtimeParam chained methods.
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | String | Yes | Model name. Example: qwen3-asr-flash-realtime. |
| url | String | Yes | Service endpoint: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime. |
| apikey | String | No | API key. If not set, the SDK reads it from the DASHSCOPE_API_KEY environment variable. |
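A builder-style setup might look like the following sketch. The builder method names are assumed to mirror the parameter names in the table above; reading the key from the DASHSCOPE_API_KEY environment variable is also an assumption.

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeParam;

// Connection parameters via chained builder methods (names assumed from the table above).
OmniRealtimeParam param = OmniRealtimeParam.builder()
        .model("qwen3-asr-flash-realtime")
        .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
        .apikey(System.getenv("DASHSCOPE_API_KEY")) // optional if the env var is set
        .build();
```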
Session configuration (OmniRealtimeConfig)
Set with OmniRealtimeConfig chained methods.
| Parameter | Type | Required | Description |
|---|---|---|---|
| modalities | List<OmniRealtimeModality> | Yes | Output modality. Fixed to [OmniRealtimeModality.TEXT]. |
| enableTurnDetection | boolean | No | Enables server-side VAD. When disabled, call commit() to trigger recognition manually. Default: true. |
| turnDetectionType | String | No | VAD type. Fixed to server_vad. |
| turnDetectionThreshold | float | No | VAD threshold. Recommended: 0.0. Default: 0.2. Valid range: [-1, 1]. Lower values increase sensitivity but risk treating background noise as speech; higher values reduce sensitivity, which suits noisy environments. |
| turnDetectionSilenceDurationMs | int | No | VAD silence threshold in milliseconds. Silence beyond this duration marks the end of a statement. Recommended: 400. Default: 800. Valid range: [200, 6000]. Lower values (such as 300 ms) speed up the response but may split at normal pauses; higher values (such as 1200 ms) handle long pauses better but add latency. |
| transcriptionConfig | OmniRealtimeTranscriptionParam | No | Speech recognition settings. See Transcription parameters. |
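A VAD-mode session configuration might be built as in this sketch. Builder method names are assumed to match the parameter names in the table above, using the recommended threshold and silence-duration values.

```java
import java.util.Collections;
import com.alibaba.dashscope.audio.omni.OmniRealtimeConfig;
import com.alibaba.dashscope.audio.omni.OmniRealtimeModality;

// Session configuration via chained builder methods (names assumed from the table above).
OmniRealtimeConfig config = OmniRealtimeConfig.builder()
        .modalities(Collections.singletonList(OmniRealtimeModality.TEXT)) // fixed to TEXT
        .enableTurnDetection(true)            // VAD mode (default)
        .turnDetectionThreshold(0.0f)         // recommended threshold
        .turnDetectionSilenceDurationMs(400)  // recommended silence duration
        .build();
```

For manual mode, set enableTurnDetection(false) and trigger recognition yourself with commit().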
Transcription parameters (OmniRealtimeTranscriptionParam)
Set with OmniRealtimeTranscriptionParam setter methods.
| Parameter | Type | Required | Description |
|---|---|---|---|
| language | String | No | Source language of the audio. Supported values: zh (Chinese - Mandarin, Sichuanese, Minnan, and Wu), yue (Cantonese), en (English), ja (Japanese), de (German), ko (Korean), ru (Russian), fr (French), pt (Portuguese), ar (Arabic), it (Italian), es (Spanish), hi (Hindi), id (Indonesian), th (Thai), tr (Turkish), uk (Ukrainian), vi (Vietnamese), cs (Czech), da (Danish), fil (Filipino), fi (Finnish), is (Icelandic), ms (Malay), no (Norwegian), pl (Polish), sv (Swedish). |
| inputSampleRate | int | No | Audio sample rate in Hz. Supported: 16000, 8000. Default: 16000. Setting 8000 triggers server-side upsampling to 16000 Hz, which may add a minor delay. Use 8000 only for 8000 Hz source audio (such as telephone lines). |
| inputAudioFormat | String | No | Audio format. Supported: pcm, opus. Default: pcm. |
| corpusText | String | No | Context (background text, entity vocabularies, reference information) to customize recognition. Limit: 10,000 tokens. See Context biasing. |
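A transcription configuration might look like this sketch. The setter names are assumed to follow the parameter names in the table above (setLanguage, setInputSampleRate, setInputAudioFormat).

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeTranscriptionParam;

// Transcription settings via setters (names assumed from the table above).
OmniRealtimeTranscriptionParam transcription = new OmniRealtimeTranscriptionParam();
transcription.setLanguage("en");          // source language of the audio
transcription.setInputSampleRate(16000);  // Hz; use 8000 only for telephone-line audio
transcription.setInputAudioFormat("pcm"); // or "opus"
```

Pass the resulting object to the session configuration via transcriptionConfig.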
Key interfaces
OmniRealtimeConversation
Import: com.alibaba.dashscope.audio.omni.OmniRealtimeConversation
Manages the WebSocket lifecycle: connect, send audio, and end the session.
Create a conversation
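Creation might look like the following fragment, assuming a constructor that takes the connection parameters and a callback handler as described above. The surrounding class, exception handling, and the param/config objects are omitted for brevity; connect() and updateSession() are assumed names for the connect and configure steps.

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;
import com.alibaba.dashscope.audio.omni.OmniRealtimeConversation;
import com.google.gson.JsonObject;

// Create the conversation with connection parameters and an inline callback.
OmniRealtimeConversation conversation = new OmniRealtimeConversation(param,
        new OmniRealtimeCallback() {
            @Override
            public void onOpen() {
                System.out.println("WebSocket open");
            }

            @Override
            public void onEvent(JsonObject message) {
                // Dispatch on the "type" field; see OmniRealtimeCallback below.
                System.out.println("event: " + message.get("type").getAsString());
            }

            @Override
            public void onClose(int code, String reason) {
                System.out.println("closed: " + code + " " + reason);
            }
        });

conversation.connect();             // throws NoApiKeyException, InterruptedException
conversation.updateSession(config); // optional; server defaults apply otherwise
```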
Connect to the server
Throws: NoApiKeyException, InterruptedException.
Configure the session
Send audio data
- VAD mode (enableTurnDetection=true): The server detects speech boundaries and decides when to process the buffer.
- Manual mode (enableTurnDetection=false): Audio accumulates until you call commit().

Each event can contain up to 15 MiB of audio data.
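Streaming a local PCM file could be sketched as below. The appendAudio method name, and that it accepts a Base64 string, are assumptions based on the description above; the file path and chunk pacing are illustrative.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Base64;

// Stream a 16 kHz, 16-bit mono PCM file in roughly 100 ms chunks.
byte[] chunk = new byte[3200]; // 16000 samples/s * 2 bytes/sample * 0.1 s
try (InputStream in = new FileInputStream("speech.pcm")) {
    int n;
    while ((n = in.read(chunk)) > 0) {
        // Base64-encode only the bytes actually read from the file.
        String b64 = Base64.getEncoder().encodeToString(Arrays.copyOf(chunk, n));
        conversation.appendAudio(b64);
        Thread.sleep(100); // pace the stream at roughly real time
    }
}
```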
Commit the audio buffer
Only available in manual mode (enableTurnDetection=false). Returns an error if the audio buffer is empty.
End the session
When to call:
- VAD mode: after you finish sending audio.
- Manual mode: after you call commit().
Close the connection
Get session and response IDs
- getSessionId() returns the session ID for the current task.
- getResponseId() returns the response ID from the most recent server response.
OmniRealtimeCallback
Import: com.alibaba.dashscope.audio.omni.OmniRealtimeCallback
Implement this class to handle server events.
| Method | Parameters | Triggered when |
|---|---|---|
| onOpen() | None | WebSocket connection established. |
| onEvent(JsonObject message) | message: A server event as JSON. Common types: session.created, session.updated, input_audio_buffer.committed, conversation.item.input_audio_transcription.completed, session.finished. | A server event is received. Parse the type field to determine the event type. |
| onClose(int code, String reason) | code: Status code. reason: Close reason. | WebSocket connection closed. |
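A standalone callback class might look like this sketch. It subclasses OmniRealtimeCallback (the table above suggests a class to extend rather than an interface) and dispatches on the type field; the exact payload layout of the transcription-completed event is an assumption, so the sketch prints the raw JSON rather than a specific field.

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;
import com.google.gson.JsonObject;

public class TranscriptCallback extends OmniRealtimeCallback {
    @Override
    public void onOpen() {
        System.out.println("WebSocket open");
    }

    @Override
    public void onEvent(JsonObject message) {
        String type = message.get("type").getAsString();
        if ("conversation.item.input_audio_transcription.completed".equals(type)) {
            // Payload layout is an assumption: inspect the raw JSON in your own
            // runs to confirm where the transcript text lives.
            System.out.println("transcription completed: " + message);
        } else if ("session.finished".equals(type)) {
            System.out.println("session finished");
        }
    }

    @Override
    public void onClose(int code, String reason) {
        System.out.println("closed: " + code + " / " + reason);
    }
}
```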