Qwen-Omni-Realtime Java SDK
Key interfaces and request parameters for the Qwen-Omni-Realtime DashScope Java SDK.
Prerequisites
Ensure that your Java SDK version is 2.20.9 or later. Before you begin, see Real-time multimodal interaction flow.
Getting started
You can download the sample code from GitHub. The following three usage scenarios are provided:
- Audio conversation example: Captures real-time audio from a microphone, enables VAD mode (automatic voice activity detection), and supports voice interruption. Set the enableTurnDetection parameter to true. Use headphones for audio playback to prevent echoes from triggering voice interruption.
- Audio and video conversation example: Captures real-time audio and video from a microphone and camera, enables VAD mode, and supports voice interruption. Set the enableTurnDetection parameter to true. Use headphones for audio playback to prevent echoes from triggering voice interruption.
- Local call example: Uses local audio and images as input and enables Manual mode (manual control over the sending pace). Set the enableTurnDetection parameter to false.

Request parameters
Configure the following request parameters using the chained methods or setters of the OmniRealtimeParam object. Then, pass this object and a callback instance to the OmniRealtimeConversation constructor.
| Parameter | Type | Description |
|---|---|---|
| model | String | The name of the Qwen-Omni real-time model. See Model list. |
| url | String | The endpoint URL: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
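As an illustration, the parameters above might be assembled like this. This is a minimal sketch: the builder-style method names (`model`, `url`, `apikey`) are assumptions based on the chained-method pattern the SDK describes, and the model name is a placeholder; check the SDK javadoc for exact signatures.

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeParam;

// Hypothetical builder usage; method names are assumptions, not verified SDK API.
OmniRealtimeParam param = OmniRealtimeParam.builder()
        .model("qwen3-omni-flash-realtime")  // placeholder model name; see Model list
        .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
        .apikey(System.getenv("DASHSCOPE_API_KEY")) // assumed setter for the API key
        .build();
```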
Configure the following session parameters using the chained methods or setters of the OmniRealtimeConfig object. Then, pass this object as a parameter to the updateSession interface.
| Parameter | Type | Description |
|---|---|---|
| modalities | List<OmniRealtimeModality> | The output modalities of the model. Set to [OmniRealtimeModality.TEXT] for text output only, or [OmniRealtimeModality.TEXT, OmniRealtimeModality.AUDIO] for audio and text output. |
| voice | String | The voice used for the model's audio output. For a list of supported voices, see Voice list. Default voice: Qwen3-Omni-Flash-Realtime: "Cherry", Qwen-Omni-Turbo-Realtime: "Chelsie". |
| inputAudioFormat | OmniRealtimeAudioFormat | The format of the user's input audio. Currently, only PCM_16000HZ_MONO_16BIT is supported. |
| outputAudioFormat | OmniRealtimeAudioFormat | The format of the model's output audio. Currently, only PCM is supported. |
| smooth_output | Boolean | Supported only by the Qwen3-Omni-Flash-Realtime series. true: Conversational responses. false: Formal responses (performance may be suboptimal if the content is difficult to read aloud). null: The model automatically chooses between conversational and formal response styles. |
| instructions | String | A system message that sets the goal or role for the model. Example: "You are an AI customer service agent for a five-star hotel. Answer customer questions about room types, facilities, prices, and booking policies." |
| enableInputAudioTranscription | Boolean | Specifies whether to enable speech recognition for the input audio. |
| inputAudioTranscription | String | The speech recognition model used for transcribing input audio. Currently, only gummy-realtime-v1 is supported. |
| enableTurnDetection | Boolean | Specifies whether to enable voice activity detection (VAD). If disabled, you must manually submit audio to trigger a model response. |
| turnDetectionType | String | The server-side VAD type. Fixed value: "server_vad". |
| turnDetectionThreshold | Float | The VAD threshold. Increase this value in noisy environments and decrease it in quiet environments. A value closer to -1 increases the probability that noise is detected as speech. A value closer to 1 decreases the probability. Default value: 0.5. Value range: [-1.0, 1.0]. |
| turnDetectionSilenceDurationMs | Integer | The duration of silence (in milliseconds) that indicates the end of speech. The model triggers a response after this duration elapses. Default value: 800. Value range: [200, 6000]. |
| temperature | float | The sampling temperature, which controls the diversity of the generated content. A higher value results in more diverse content. A lower value results in more deterministic content. Value range: [0, 2). Because both temperature and top_p control content diversity, we recommend setting only one of them. Defaults: qwen3-omni-flash-realtime series: 0.9, qwen-omni-turbo-realtime series: 1.0. |
| top_p | float | The probability threshold for nucleus sampling, which controls the diversity of the generated content. A higher value results in more diverse content. A lower value results in more deterministic content. Value range: (0, 1.0]. Because both temperature and top_p control content diversity, we recommend setting only one of them. Defaults: qwen3-omni-flash-realtime series: 1.0, qwen-omni-turbo-realtime series: 0.01. |
| top_k | integer | The size of the candidate token set for sampling during generation. A larger value increases randomness, while a smaller value increases determinism. If the value is None or greater than 100, top_k sampling is disabled and only top_p sampling takes effect. The value must be 0 or greater. Defaults: qwen3-omni-flash-realtime series: 50, qwen-omni-turbo-realtime series: 20. |
| max_tokens | integer | The maximum number of tokens that can be returned in the response. This parameter does not affect the model's generation process. If the number of tokens generated by the model exceeds max_tokens, the returned content is truncated. The default and maximum values are the maximum output length of the model. Use max_tokens in scenarios where you need to limit the output length, control costs, or reduce response latency. |
| repetition_penalty | float | Controls the repetition penalty for consecutive sequences during model generation. A higher value reduces repetition. A value of 1.0 means no penalty is applied. The value must be greater than 0. Default value: 1.05. |
| presence_penalty | float | Controls the likelihood of repeated tokens in the generated content. Default value: 0.0. Value range: [-2.0, 2.0]. Positive values reduce repetition, while negative values increase it. A higher value is suitable for creative writing or brainstorming. A lower value is suitable for technical documents or formal writing. |
| seed | integer | Setting the seed parameter makes the model's output more deterministic. If you pass the same seed value in each model call and keep other parameters unchanged, the model returns the same result whenever possible. Value range: 0 to 2^31-1. Default value: -1. |
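Tying the table together, a session update might be sketched as follows. Assumptions: the builder method names mirror the parameter names above, and `conversation` is an already-constructed, connected OmniRealtimeConversation; verify the exact names against the SDK javadoc.

```java
import java.util.Arrays;

import com.alibaba.dashscope.audio.omni.OmniRealtimeConfig;
import com.alibaba.dashscope.audio.omni.OmniRealtimeModality;

// Hypothetical builder usage; names mirror the parameter table and are not verified SDK API.
OmniRealtimeConfig config = OmniRealtimeConfig.builder()
        .modalities(Arrays.asList(OmniRealtimeModality.TEXT, OmniRealtimeModality.AUDIO))
        .voice("Cherry")                      // default voice for Qwen3-Omni-Flash-Realtime
        .enableInputAudioTranscription(true)  // transcribe user audio with gummy-realtime-v1
        .enableTurnDetection(true)            // VAD mode; set to false for Manual mode
        .build();
conversation.updateSession(config);
```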
qwen-omni-turbo models do not support modification of temperature, top_p, top_k, max_tokens, repetition_penalty, presence_penalty, or seed. For other models, set these parameters (along with smooth_output and instructions) using the parameters method of the OmniRealtimeConfig instance.

Key interfaces
OmniRealtimeConversation class
Import the OmniRealtimeConversation class using import com.alibaba.dashscope.audio.omni.OmniRealtimeConversation;.
Constructor
Creates an OmniRealtimeConversation instance.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| param | OmniRealtimeParam | The configuration parameters for the conversation, including model, URL, and API key. |
| callback | OmniRealtimeCallback | The callback instance that handles server-side events. See Callback interface. |
connect
Creates a connection to the server.
Server response events: session.created (Session created), session.updated (Session configuration updated).
updateSession
Updates the default configuration for the current session. For parameter settings, see the Request parameters section.
When you establish a connection, the server returns the default input and output configurations for the session. To update the default session configuration, we recommend calling this method immediately after the connection is established.
After receiving the session.update event, the server validates the parameters. If the parameters are invalid, an error is returned. Otherwise, the server-side session configuration is updated.
Server response events: session.updated (Session configuration updated).
appendAudio
Appends Base64-encoded audio data to the cloud input audio buffer (temporary storage for writing data before submission).
- If turn_detection is enabled, the server uses the buffer to detect speech and decides when to submit it.
- If turn_detection is disabled, the client controls how much audio is sent per event (up to 15 MiB). Smaller blocks can improve VAD responsiveness.
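Because each append event is capped at 15 MiB and smaller blocks improve responsiveness, audio is typically sent in small Base64-encoded chunks. The helper below is a self-contained sketch of that chunking; the 3200-byte chunk size (100 ms of 16 kHz, 16-bit mono PCM) is an illustrative choice, and the `conversation.appendAudio` call in the comment refers to the SDK method described above.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class AudioChunker {
    // Split raw PCM bytes into fixed-size chunks and Base64-encode each one.
    // 3200 bytes = 100 ms of 16 kHz, 16-bit mono audio, well under the 15 MiB limit.
    static List<String> toBase64Chunks(byte[] pcm, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        for (int off = 0; off < pcm.length; off += chunkSize) {
            int len = Math.min(chunkSize, pcm.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(pcm, off, chunk, 0, len);
            chunks.add(Base64.getEncoder().encodeToString(chunk));
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] pcm = new byte[8000]; // 250 ms of silence at 16 kHz, 16-bit mono
        List<String> chunks = toBase64Chunks(pcm, 3200);
        // In a real session, each chunk would be passed to conversation.appendAudio(chunk).
        System.out.println(chunks.size()); // prints 3 (3200 + 3200 + 1600 bytes)
    }
}
```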
appendVideo
Adds Base64-encoded image data to the cloud video buffer (local images or real-time video stream captures).
Image input limits:
- Format: JPG or JPEG. Recommended resolution: 480p or 720p (max 1080p).
- Size: Max 500 KB per image (before Base64 encoding).
- Encoding: Must be Base64-encoded.
- Frequency: 1 image per second.
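The size limit applies to the raw image bytes, before Base64 encoding. Below is a self-contained sketch of validating and encoding one frame; the 500 KB check and the `conversation.appendVideo` call in the comment follow the limits listed above, and the JPEG byte array is a stand-in for real image data.

```java
import java.util.Base64;

public class ImageFrameEncoder {
    static final int MAX_IMAGE_BYTES = 500 * 1024; // 500 KB limit, before Base64 encoding

    // Validate the raw JPEG size, then Base64-encode it for appendVideo.
    static String encodeFrame(byte[] jpegBytes) {
        if (jpegBytes.length > MAX_IMAGE_BYTES) {
            throw new IllegalArgumentException(
                "Image is " + jpegBytes.length + " bytes; the limit is 500 KB before Base64 encoding");
        }
        return Base64.getEncoder().encodeToString(jpegBytes);
    }

    public static void main(String[] args) {
        byte[] fakeJpeg = new byte[100 * 1024]; // stand-in for real JPEG bytes
        String b64 = encodeFrame(fakeJpeg);
        // In a real session, send at most one frame per second:
        // conversation.appendVideo(b64);
        System.out.println(b64.length());
    }
}
```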
clearAppendedAudio
Clears the audio in the current cloud buffer.
Server response events: input_audio_buffer.cleared (Audio received by the server is cleared).
commit
Submits audio and video previously added to the cloud buffer via append. Returns an error if the input audio buffer is empty.
Server response events: input_audio_buffer.committed (Server received the submitted audio).
- If turn_detection is enabled, the server automatically submits the audio buffer (the client does not need to send this event).
- If turn_detection is disabled, the client must submit the audio buffer to create a user message item.
- If input_audio_transcription is configured, the system transcribes the audio.
- Submitting the buffer does not trigger a model response.
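In Manual mode (turn_detection disabled), the calls above combine into a simple client-driven turn. The following is a sketch only, assuming a connected `conversation` and pre-encoded Base64 chunks; whether `createResponse` takes arguments is not shown here, so check the SDK javadoc.

```java
// Manual mode: the client paces the input and explicitly requests a response.
for (String base64Chunk : audioChunks) {
    conversation.appendAudio(base64Chunk); // fill the cloud input audio buffer
}
conversation.commit();         // submit the buffer; does NOT trigger a response by itself
conversation.createResponse(); // ask the model to respond to the committed audio
```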
createResponse
Instructs the server to create a model response. When the session is configured with turn_detection mode enabled, the server automatically creates a model response.
Server response events: response.created, response.output_item.added, conversation.item.created, response.content_part.added, response.audio_transcript.delta, response.audio.delta, response.audio_transcript.done, response.audio.done, response.content_part.done, response.output_item.done, response.done.
cancelResponse
Cancels the in-progress response. Returns an error if no response exists to cancel.
Server response events: None.
close
Stops the task and closes the WebSocket connection.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| code | int | The status code for closing the WebSocket. |
| reason | String | The reason for closing the WebSocket. |
Server response events: None.
getSessionId
Gets the session ID of the current task.
Server response events: None.
getResponseId
Gets the response ID of the most recent response.
Server response events: None.
getFirstTextDelay
Gets the first-packet text latency of the most recent response.
Server response events: None.
getFirstAudioDelay
Gets the first-packet audio latency of the most recent response.
Server response events: None.
Callback interface (OmniRealtimeCallback)
The server uses callbacks to return events and data to the client. Implement these callback methods to handle server-side events and data.
Import the interface using import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;.
| Method | Parameter | Return value | Description |
|---|---|---|---|
| onOpen() | None | None | Called immediately after a connection is established to the server. |
| onEvent(JsonObject message) | message: The server response event. | None | Includes method call responses and model-generated text and audio. See Server events. |
| onClose(int code, String reason) | code: The status code for closing the WebSocket. reason: The reason for closing the WebSocket. | None | Called after the connection to the server is closed. |
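A callback implementation might dispatch on the event's `type` field, as sketched below. Assumptions: `OmniRealtimeCallback` can be implemented anonymously, and the `type` field and event names follow the server-event names listed above; the payload field for audio deltas is not shown here, so verify against the Server events reference.

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;
import com.google.gson.JsonObject;

// Sketch of a callback that dispatches on the server event type.
OmniRealtimeCallback callback = new OmniRealtimeCallback() {
    @Override
    public void onOpen() {
        System.out.println("connected");
    }

    @Override
    public void onEvent(JsonObject message) {
        String type = message.get("type").getAsString();
        switch (type) {
            case "response.audio.delta":
                // Base64-decode the audio payload and queue it for playback.
                break;
            case "response.audio_transcript.delta":
                // Incremental transcript of the model's spoken reply.
                break;
            case "response.done":
                // The current model response has finished.
                break;
            default:
                break;
        }
    }

    @Override
    public void onClose(int code, String reason) {
        System.out.println("closed: " + code + " " + reason);
    }
};
```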