Qwen-Omni-Realtime Java SDK
Key interfaces and request parameters for the Qwen-Omni-Realtime DashScope Java SDK.
Prerequisites
Ensure that your Java SDK version is 2.22.15 or later. Before you begin, see Real-time multimodal interaction flow.
Request parameters
Configure the following request parameters using the chained methods or setters of the OmniRealtimeParam object. Then, pass this object and a callback instance to the OmniRealtimeConversation constructor.
| Parameter | Type | Description |
|---|---|---|
| model | String | The name of the Qwen-Omni real-time model. See Model list. |
| url | String | The endpoint URL: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
Configure the following request parameters using the chained methods or setters of the OmniRealtimeConfig object. Then, pass this object as a parameter to the updateSession interface.
| Parameter | Type | Description |
|---|---|---|
| modalities | List<OmniRealtimeModality> | The output modalities of the model. Set to [OmniRealtimeModality.TEXT] for text output only, or [OmniRealtimeModality.TEXT, OmniRealtimeModality.AUDIO] for audio and text output. |
| voice | String | The voice used for the model's audio output. Accepts preset voices or cloned voices (Qwen3.5-Omni-Realtime only) created via the Voice cloning API. For a list of supported preset voices, see Voice list. Default voice: Qwen3.5-Omni-Realtime: "Tina", Qwen3-Omni-Flash-Realtime: "Cherry", Qwen-Omni-Turbo-Realtime: "Chelsie". |
| inputAudioFormat | OmniRealtimeAudioFormat | The format of the user's input audio. Currently, only PCM_16000HZ_MONO_16BIT is supported. |
| outputAudioFormat | OmniRealtimeAudioFormat | The format of the model's output audio. Currently, only pcm is supported. |
| instructions | String | A system message that sets the goal or role for the model. Example: "You are an AI customer service agent for a five-star hotel. Answer customer questions about room types, facilities, prices, and booking policies." |
| enableInputAudioTranscription | Boolean | Specifies whether to enable speech recognition for the input audio. |
| InputAudioTranscription | String | The speech recognition model used for transcribing input audio. Always qwen3-asr-flash-realtime. This parameter is not configurable. |
| enableTurnDetection | Boolean | Specifies whether to enable voice activity detection (VAD). If disabled, you must manually submit audio to trigger a model response. |
| turnDetectionType | String | The VAD type. Valid values: "server_vad" (default): Detects the end of user speech based on acoustic features. "semantic_vad": Detects the end of user speech based on semantic validity. This mode can filter out meaningless speech, such as backchannels and background noise. Supported only by the qwen3.5-omni-realtime model. |
| turnDetectionThreshold | Float | The VAD threshold. Increase this value in noisy environments and decrease it in quiet environments. A value closer to -1 increases the probability that noise is detected as speech. A value closer to 1 decreases the probability. Default value: 0.5. Value range: [-1.0, 1.0]. |
| turnDetectionSilenceDurationMs | Integer | The duration of silence (in milliseconds) that indicates the end of speech. The model triggers a response after this duration elapses. Default value: 800. Value range: [200, 6000]. |
| temperature | float | The sampling temperature, which controls the diversity of the generated content. A higher value results in more diverse content. A lower value results in more deterministic content. Value range: [0, 2). Because both temperature and top_p control content diversity, we recommend setting only one of them. Defaults: qwen3.5-omni-realtime series: 0.7, qwen3-omni-flash-realtime series: 0.9, qwen-omni-turbo-realtime series: 1.0. |
| top_p | float | The probability threshold for nucleus sampling, which controls the diversity of the generated content. A higher value results in more diverse content. A lower value results in more deterministic content. Value range: (0, 1.0]. Because both temperature and top_p control content diversity, we recommend setting only one of them. Defaults: qwen3.5-omni-realtime series: 0.8, qwen3-omni-flash-realtime series: 1.0, qwen-omni-turbo-realtime series: 0.01. |
| top_k | integer | The size of the candidate token set for sampling during generation. A larger value increases randomness, while a smaller value increases determinism. If the value is None or greater than 100, top_k sampling is disabled and only top_p sampling takes effect. The value must be 0 or greater. Defaults: qwen3.5-omni-realtime series: 20, qwen3-omni-flash-realtime series: 50, qwen-omni-turbo-realtime series: 20. |
| max_tokens | integer | The maximum number of tokens that can be returned in the response. This parameter does not affect the model's generation process. If the number of tokens generated by the model exceeds max_tokens, the returned content is truncated. The default and maximum values are the maximum output length of the model. Use max_tokens in scenarios where you need to limit the output length, control costs, or reduce response latency. |
| repetition_penalty | float | Controls the repetition penalty for consecutive sequences during model generation. A higher value reduces repetition. A value of 1.0 means no penalty is applied. The value must be greater than 0. Defaults: qwen3.5-omni-realtime series: 1.0, qwen3-omni-flash-realtime series: 1.05. |
| presence_penalty | float | Controls the likelihood of repeated tokens in the generated content. Value range: [-2.0, 2.0]. Positive values reduce repetition, while negative values increase it. A higher value is suitable for creative writing or brainstorming. A lower value is suitable for technical documents or formal writing. Defaults: qwen3.5-omni-realtime series: 1.5, qwen3-omni-flash-realtime series: 0.0. |
| seed | integer | Setting the seed parameter makes the model's output more deterministic. If you pass the same seed value in each model call and keep other parameters unchanged, the model returns the same result whenever possible. Value range: 0 to 2^31-1. Default value: -1. |
| tools | List<Map<String, Object>> | A list of tool definitions. When specified, the model autonomously determines whether to call external tools to respond to user questions. If a tool call is triggered, the model returns only tool call parameters without generating audio. This parameter takes effect only when you use the Qwen3.5-Omni-Realtime model. tools and enable_search are incompatible. You cannot enable them at the same time. Each tool is a Map containing: type (String, required): fixed to "function"; function (Map, required): contains name (String, required, e.g. get_current_weather), description (String, optional, used by the model to decide whether to call the tool), and parameters (Map, optional, contains type fixed to "object", properties describing each input parameter's type and description, and required listing required parameters). |
| enable_search | Boolean | Whether to enable web search. When enabled, the model autonomously determines whether to search the web to respond to user questions. Default: false. This parameter takes effect only when you use the Qwen3.5-Omni-Realtime model. tools and enable_search are incompatible. You cannot enable them at the same time. |
| search_options | Object | Web search options. Takes effect only when enable_search is enabled. Contains: enable_source (Boolean): set to true to return search result sources. |
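The tools structure and the value ranges in the table above can be sketched with plain Java collections. This is a minimal illustration, not SDK code: the class name OmniParametersSketch, the build helper, and the weather function are hypothetical, and the map keys are assumed to match the parameter names in the table.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; only the key names and ranges come from the parameter table. */
class OmniParametersSketch {

    /** One tool entry shaped as the tools row describes; the weather function is illustrative. */
    static Map<String, Object> weatherTool() {
        Map<String, Object> location = new LinkedHashMap<>();
        location.put("type", "string");
        location.put("description", "City name");

        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("location", location);

        Map<String, Object> parameters = new LinkedHashMap<>();
        parameters.put("type", "object");
        parameters.put("properties", properties);
        parameters.put("required", Collections.singletonList("location"));

        Map<String, Object> function = new LinkedHashMap<>();
        function.put("name", "get_current_weather");
        function.put("description", "Look up the current weather for a city.");
        function.put("parameters", parameters);

        Map<String, Object> tool = new LinkedHashMap<>();
        tool.put("type", "function");
        tool.put("function", function);
        return tool;
    }

    /** Collects optional settings into one map and rejects combinations the table rules out. */
    static Map<String, Object> build(Double temperature,
                                     List<Map<String, Object>> tools,
                                     boolean enableSearch) {
        boolean hasTools = tools != null && !tools.isEmpty();
        if (hasTools && enableSearch) {
            throw new IllegalArgumentException("tools and enable_search cannot be enabled together");
        }
        Map<String, Object> params = new LinkedHashMap<>();
        if (temperature != null) {
            // Documented range is [0, 2): 0 is allowed, 2 is not.
            if (temperature < 0 || temperature >= 2) {
                throw new IllegalArgumentException("temperature must be in [0, 2): " + temperature);
            }
            params.put("temperature", temperature);
        }
        if (hasTools) {
            params.put("tools", tools);
        }
        if (enableSearch) {
            params.put("enable_search", true);
        }
        return params;
    }
}
```

Checking the ranges on the client side surfaces mistakes before the session update is sent. The sketch sets only temperature, in line with the recommendation to adjust either temperature or top_p but not both.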
qwen-omni-turbo models do not support modification of temperature, top_p, top_k, max_tokens, repetition_penalty, presence_penalty, or seed. For other models, set these parameters (along with instructions, tools, enable_search, and search_options) using the parameters method of the OmniRealtimeConfig instance.
Key interfaces
OmniRealtimeConversation class
Import the OmniRealtimeConversation class using import com.alibaba.dashscope.audio.omni.OmniRealtimeConversation;.
Constructor
Creates an OmniRealtimeConversation instance.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| param | OmniRealtimeParam | The configuration parameters for the conversation, including model, URL, and API key. |
| callback | OmniRealtimeCallback | The callback instance that handles server-side events. See Callback interface. |
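connect
Creates a connection to the server.
Server response events: session.created (Session created), session.updated (Session configuration updated).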
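updateSession
Updates the default configuration for the current session. For parameter settings, see the Request parameters section.
When you establish a connection, the server returns the default input and output configurations for the session. To update the default session configuration, call this method immediately after the connection is established.
After receiving the session.update event, the server validates the parameters. If the parameters are invalid, the server returns an error. Otherwise, it updates the server-side session configuration.
Server response events: session.updated (Session configuration updated).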
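appendAudio
Appends Base64-encoded audio data to the cloud input audio buffer, which temporarily stores audio before it is submitted.
- If turn_detection is enabled, the server uses the buffer to detect speech and decides when to submit it.
- If turn_detection is disabled, the client controls how much audio each event carries (up to 15 MiB). Smaller blocks can improve VAD responsiveness.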
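As an illustration of preparing audio for appendAudio, the sketch below splits raw PCM into small blocks and Base64-encodes each one. The byte rate follows from the PCM_16000HZ_MONO_16BIT input format. The 100 ms block length and the class name AudioChunkSketch are assumptions, since this page does not prescribe a block size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

/** Hypothetical helper that turns raw PCM into Base64 blocks ready to append. */
class AudioChunkSketch {
    // 16,000 samples per second * 2 bytes per 16-bit mono sample = 32,000 bytes per second.
    static final int BYTES_PER_SECOND = 16_000 * 2;
    // Assumed block length of 100 ms, which is 3,200 bytes.
    static final int CHUNK_BYTES = BYTES_PER_SECOND / 10;

    static List<String> toBase64Chunks(byte[] pcm) {
        List<String> chunks = new ArrayList<>();
        for (int offset = 0; offset < pcm.length; offset += CHUNK_BYTES) {
            int end = Math.min(offset + CHUNK_BYTES, pcm.length);
            // The last block may be shorter than CHUNK_BYTES.
            byte[] block = Arrays.copyOfRange(pcm, offset, end);
            chunks.add(Base64.getEncoder().encodeToString(block));
        }
        return chunks;
    }
}
```

Each returned string would then be passed to appendAudio in order. Blocks of this size stay far below the 15 MiB per-event limit.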
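appendVideo
Adds Base64-encoded image data to the cloud video buffer. The images can be local files or frames captured from a real-time video stream.
Image input limits: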
- Format: JPG or JPEG. Recommended resolution: 480p or 720p (max 1080p).
- Size: Max 500 KB per image (before Base64 encoding).
- Encoding: Must be Base64-encoded.
- Frequency: 1 image per second.
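The image limits above can be checked on the client before a frame is encoded. The sketch below is a hypothetical gate, not part of the SDK: it assumes that 500 KB means 500 * 1024 bytes, recognizes JPEG data by its leading FF D8 FF bytes, and enforces the one-image-per-second rate with caller-supplied timestamps. It does not check resolution, which would require decoding the image.

```java
import java.util.Base64;
import java.util.Optional;

/** Hypothetical client-side gate for frames sent with appendVideo. */
class VideoFrameSketch {
    // Assumption: "500 KB" is read as 500 * 1024 bytes, measured before Base64 encoding.
    static final int MAX_IMAGE_BYTES = 500 * 1024;
    // One image per second.
    static final long MIN_INTERVAL_MS = 1_000;

    private boolean sentBefore = false;
    private long lastSentMs = 0;

    /** JPEG files start with the SOI marker FF D8 followed by another marker byte FF. */
    static boolean looksLikeJpeg(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8
                && (data[2] & 0xFF) == 0xFF;
    }

    /** Returns the Base64 payload, or empty when the frame should be skipped. */
    Optional<String> prepare(byte[] image, long nowMs) {
        if (!looksLikeJpeg(image) || image.length > MAX_IMAGE_BYTES) {
            return Optional.empty();
        }
        if (sentBefore && nowMs - lastSentMs < MIN_INTERVAL_MS) {
            return Optional.empty();
        }
        sentBefore = true;
        lastSentMs = nowMs;
        return Optional.of(Base64.getEncoder().encodeToString(image));
    }
}
```

A caller would pass System.currentTimeMillis() as nowMs and forward any returned payload to appendVideo.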
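clearAppendedAudio
Clears the audio in the current cloud buffer.
Server response events: input_audio_buffer.cleared (Audio received by the server is cleared).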
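commit
Submits the audio and video previously added to the cloud buffer via append. Returns an error if the input audio buffer is empty.
- If turn_detection is enabled, the server automatically submits the audio buffer, so the client does not need to send this event.
- If turn_detection is disabled, the client must submit the audio buffer to create a user message item.
- If input_audio_transcription is configured, the system transcribes the audio.
- Submitting the buffer does not trigger a model response.
Server response events: input_audio_buffer.committed (Server received the submitted audio).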
createResponse
Instructs the server to create a model response. When the session is configured with turn_detection mode enabled, the server automatically creates a model response.
Server response events: response.created, response.output_item.added, conversation.item.created, response.content_part.added, response.audio_transcript.delta, response.audio.delta, response.audio_transcript.done, response.audio.done, response.content_part.done, response.output_item.done, response.done.
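When turn_detection is disabled, the client drives each turn itself: append audio, commit the buffer, then request a response. The sketch below shows that ordering against a hypothetical RealtimeSession interface. Only the method names come from this page; the interface, the argument lists, and the recording test double are assumptions made so that the example runs without the SDK.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of one manual turn (turn_detection disabled). */
class ManualTurnSketch {

    /** Hypothetical stand-in for the conversation object; argument lists are assumed. */
    interface RealtimeSession {
        void appendAudio(String base64Audio);
        void commit();
        void createResponse();
    }

    static void sendTurn(RealtimeSession session, List<String> base64Chunks) {
        if (base64Chunks.isEmpty()) {
            // Committing an empty input audio buffer returns an error, so stop early.
            throw new IllegalArgumentException("no audio to commit");
        }
        for (String chunk : base64Chunks) {
            session.appendAudio(chunk);
        }
        session.commit();          // creates the user message item
        session.createResponse();  // committing alone does not trigger a response
    }

    /** Test double that records the order of calls. */
    static class RecordingSession implements RealtimeSession {
        final List<String> calls = new ArrayList<>();

        @Override
        public void appendAudio(String base64Audio) {
            calls.add("appendAudio");
        }

        @Override
        public void commit() {
            calls.add("commit");
        }

        @Override
        public void createResponse() {
            calls.add("createResponse");
        }
    }
}
```

In real code the interface would be replaced by the OmniRealtimeConversation instance, and the response would arrive through the callback events listed above.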
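cancelResponse
Cancels the in-progress response. Returns an error if no response exists to cancel.
Server response events: None.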
createItem
Sends the conversation.item.create event to the server. In a tool calling scenario, use this method to send the tool execution result back to the server.
The item parameter is a JsonObject and must contain the following fields:
| Field | Type | Description |
|---|---|---|
| type | String | Fixed value: "function_call_output". |
| call_id | String | Corresponds to the call_id in the response.function_call_arguments.done event. |
| output | String | A string that represents the tool execution result. |
Server response events: None.
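The three createItem fields described above can be assembled as follows. This sketch uses only the standard library so that it runs on its own: it builds the item as an ordered map and serializes it with a small hand-written JSON writer. The class name ToolResultItemSketch and the sample call_id are hypothetical. Because createItem takes a JsonObject, real code would set the same three fields on a JsonObject instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that shapes a tool result for createItem. */
class ToolResultItemSketch {

    static Map<String, String> functionCallOutput(String callId, String output) {
        Map<String, String> item = new LinkedHashMap<>();
        item.put("type", "function_call_output");
        item.put("call_id", callId);   // taken from response.function_call_arguments.done
        item.put("output", output);    // tool result as a string
        return item;
    }

    /** Minimal JSON writer for a flat string-to-string map. */
    static String toJson(Map<String, String> item) {
        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, String> entry : item.entrySet()) {
            if (json.length() > 1) {
                json.append(',');
            }
            json.append('"').append(escape(entry.getKey())).append("\":\"")
                .append(escape(entry.getValue())).append('"');
        }
        return json.append('}').toString();
    }

    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```

Note that output is itself a string, so a JSON tool result is embedded as escaped text rather than as a nested object.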
close
Stops the task and closes the WebSocket connection.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| code | int | The status code for closing the WebSocket. |
| reason | String | The reason for closing the WebSocket. |
Server response events: None.
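getSessionId
Gets the session ID of the current task.
Server response events: None.
getResponseId
Gets the response ID of the most recent response.
Server response events: None.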
Callback interface (OmniRealtimeCallback)
The server uses callbacks to return events and data to the client. Implement these callback methods to handle server-side events and data.
Import the interface using import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;.
| Method | Parameter | Return value | Description |
|---|---|---|---|
| onOpen() | None | None | Called immediately after a connection is established to the server. |
| onEvent(JsonObject message) | message: The server response event. | None | Includes method call responses and model-generated text and audio. See Server events. |
| onClose(int code, String reason) | code: The status code for closing the WebSocket. reason: The reason for closing the WebSocket. | None | Called after the connection to the server is closed. |
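Inside onEvent, a typical pattern is to branch on the event type and accumulate streamed output. The sketch below shows that dispatch with the standard library only. It assumes the caller has already read a type string and a delta string from the event JSON, and that audio deltas are Base64-encoded; neither field name nor encoding is stated on this page. The event names are taken from the createResponse event list, and the class name EventDispatchSketch is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Base64;

/** Hypothetical accumulator that an onEvent implementation could delegate to. */
class EventDispatchSketch {
    private final StringBuilder transcript = new StringBuilder();
    private final ByteArrayOutputStream audio = new ByteArrayOutputStream();
    private boolean done = false;

    /** type and delta are assumed to have been extracted from the event by the caller. */
    void handle(String type, String delta) {
        switch (type) {
            case "response.audio_transcript.delta": {
                transcript.append(delta);
                break;
            }
            case "response.audio.delta": {
                // Assumption: the delta carries Base64-encoded audio bytes.
                byte[] pcm = Base64.getDecoder().decode(delta);
                audio.write(pcm, 0, pcm.length);
                break;
            }
            case "response.done": {
                done = true;
                break;
            }
            default:
                // Other events are ignored in this sketch.
                break;
        }
    }

    String transcript() {
        return transcript.toString();
    }

    byte[] audioBytes() {
        return audio.toByteArray();
    }

    boolean isDone() {
        return done;
    }
}
```

Keeping the accumulation in a separate class makes the callback itself a thin adapter and lets the dispatch logic be tested without a live connection.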