Qwen-Omni-Realtime Java SDK
Key interfaces and request parameters for the Qwen-Omni-Realtime DashScope Java SDK.
Prerequisites
Ensure that your Java SDK version is 2.20.9 or later. Before you begin, see Real-time multimodal interaction flow.
Getting started
You can download the sample code from GitHub. The following three usage scenarios are provided:
- Audio conversation example: Captures real-time audio from a microphone, enables VAD mode (automatic voice activity detection), and supports voice interruption. Set the enableTurnDetection parameter to true. Use headphones for audio playback to prevent echoes from triggering voice interruption.
- Audio and video conversation example: Captures real-time audio and video from a microphone and camera, enables VAD mode, and supports voice interruption. Set the enableTurnDetection parameter to true. Use headphones for audio playback to prevent echoes from triggering voice interruption.
- Local call example: Uses local audio and images as input and enables Manual mode (manual control over the sending pace). Set the enableTurnDetection parameter to false.

Request parameters
Configure the following request parameters using the chained methods or setters of the OmniRealtimeParam object. Then, pass this object and a callback instance to the OmniRealtimeConversation constructor.
| Parameter | Type | Description |
|---|---|---|
| model | String | The name of the Qwen-Omni real-time model. See Model list. |
| url | String | The endpoint URL: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
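As an illustration, the parameters above might be assembled like this. This is a minimal sketch: the builder-style method names (`model`, `url`, `apikey`) are assumptions based on the chained-method pattern the SDK describes, and the model name is a placeholder; check the SDK javadoc for exact signatures.

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeParam;

// Hypothetical builder usage; method names are assumptions, not verified SDK API.
OmniRealtimeParam param = OmniRealtimeParam.builder()
        .model("qwen3-omni-flash-realtime")  // placeholder model name; see Model list
        .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
        .apikey(System.getenv("DASHSCOPE_API_KEY")) // assumed setter for the API key
        .build();
```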
Configure the following session parameters using the chained methods or setters of the OmniRealtimeConfig object. Then, pass this object as a parameter to the updateSession interface.
| Parameter | Type | Description |
|---|---|---|
| modalities | List<OmniRealtimeModality> | The output modalities of the model. Set to [OmniRealtimeModality.TEXT] for text output only, or [OmniRealtimeModality.TEXT, OmniRealtimeModality.AUDIO] for audio and text output. |
| voice | String | The voice used for the model's audio output. For a list of supported voices, see Voice list. Default voice: Qwen3-Omni-Flash-Realtime: "Cherry", Qwen-Omni-Turbo-Realtime: "Chelsie". |
| inputAudioFormat | OmniRealtimeAudioFormat | The format of the user's input audio. Currently, only PCM_16000HZ_MONO_16BIT is supported. |
| outputAudioFormat | OmniRealtimeAudioFormat | The format of the model's output audio. Currently, only PCM is supported. |
| smooth_output | Boolean | Supported only by the Qwen3-Omni-Flash-Realtime series. true: Conversational responses. false: Formal responses (performance may be suboptimal if the content is difficult to read aloud). null: The model automatically chooses between conversational and formal response styles. |
| instructions | String | A system message that sets the goal or role for the model. Example: "You are an AI customer service agent for a five-star hotel. Answer customer questions about room types, facilities, prices, and booking policies." |
| enableInputAudioTranscription | Boolean | Specifies whether to enable speech recognition for the input audio. |
| inputAudioTranscription | String | The speech recognition model used for transcribing input audio. Currently, only gummy-realtime-v1 is supported. |
| enableTurnDetection | Boolean | Specifies whether to enable voice activity detection (VAD). If disabled, you must manually submit audio to trigger a model response. |
| turnDetectionType | String | The server-side VAD type. Fixed value: "server_vad". |
| turnDetectionThreshold | Float | The VAD threshold. Increase this value in noisy environments and decrease it in quiet environments. A value closer to -1 increases the probability that noise is detected as speech. A value closer to 1 decreases the probability. Default value: 0.5. Value range: [-1.0, 1.0]. |
| turnDetectionSilenceDurationMs | Integer | The duration of silence (in milliseconds) that indicates the end of speech. The model triggers a response after this duration elapses. Default value: 800. Value range: [200, 6000]. |
| temperature | float | The sampling temperature, which controls the diversity of the generated content. A higher value results in more diverse content. A lower value results in more deterministic content. Value range: [0, 2). Because both temperature and top_p control content diversity, we recommend setting only one of them. Defaults: qwen3-omni-flash-realtime series: 0.9, qwen-omni-turbo-realtime series: 1.0. |
| top_p | float | The probability threshold for nucleus sampling, which controls the diversity of the generated content. A higher value results in more diverse content. A lower value results in more deterministic content. Value range: (0, 1.0]. Because both temperature and top_p control content diversity, we recommend setting only one of them. Defaults: qwen3-omni-flash-realtime series: 1.0, qwen-omni-turbo-realtime series: 0.01. |
| top_k | integer | The size of the candidate token set for sampling during generation. A larger value increases randomness, while a smaller value increases determinism. If the value is None or greater than 100, top_k sampling is disabled and only top_p sampling takes effect. The value must be 0 or greater. Defaults: qwen3-omni-flash-realtime series: 50, qwen-omni-turbo-realtime series: 20. |
| max_tokens | integer | The maximum number of tokens that can be returned in the response. This parameter does not affect the model's generation process. If the number of tokens generated by the model exceeds max_tokens, the returned content is truncated. The default and maximum values are the maximum output length of the model. Use max_tokens in scenarios where you need to limit the output length, control costs, or reduce response latency. |
| repetition_penalty | float | Controls the repetition penalty for consecutive sequences during model generation. A higher value reduces repetition. A value of 1.0 means no penalty is applied. The value must be greater than 0. Default value: 1.05. |
| presence_penalty | float | Controls the likelihood of repeated tokens in the generated content. Default value: 0.0. Value range: [-2.0, 2.0]. Positive values reduce repetition, while negative values increase it. A higher value is suitable for creative writing or brainstorming. A lower value is suitable for technical documents or formal writing. |
| seed | integer | Setting the seed parameter makes the model's output more deterministic. If you pass the same seed value in each model call and keep other parameters unchanged, the model returns the same result whenever possible. Value range: 0 to 2^31-1. Default value: -1. |
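Tying the table together, a session update might be sketched as follows. Assumptions: the builder method names mirror the parameter names above, and `conversation` is an already-constructed, connected OmniRealtimeConversation; verify the exact names against the SDK javadoc.

```java
import java.util.Arrays;

import com.alibaba.dashscope.audio.omni.OmniRealtimeConfig;
import com.alibaba.dashscope.audio.omni.OmniRealtimeModality;

// Hypothetical builder usage; names mirror the parameter table and are not verified SDK API.
OmniRealtimeConfig config = OmniRealtimeConfig.builder()
        .modalities(Arrays.asList(OmniRealtimeModality.TEXT, OmniRealtimeModality.AUDIO))
        .voice("Cherry")                      // default voice for Qwen3-Omni-Flash-Realtime
        .enableInputAudioTranscription(true)  // transcribe user audio with gummy-realtime-v1
        .enableTurnDetection(true)            // VAD mode; set to false for Manual mode
        .build();
conversation.updateSession(config);
```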
qwen-omni-turbo models do not support modification of temperature, top_p, top_k, max_tokens, repetition_penalty, presence_penalty, or seed. For other models, set these parameters (along with smooth_output and instructions) using the parameters method of the OmniRealtimeConfig instance.

Key interfaces
OmniRealtimeConversation class
Import the OmniRealtimeConversation class using import com.alibaba.dashscope.audio.omni.OmniRealtimeConversation;.
Constructor
Creates an OmniRealtimeConversation instance.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| param | OmniRealtimeParam | The configuration parameters for the conversation, including model, URL, and API key. |
| callback | OmniRealtimeCallback | The callback instance that handles server-side events. See Callback interface. |
connect
Creates a connection to the server.
Server response events: session.created (Session created), session.updated (Session configuration updated).
updateSession
Updates the default configuration for the current session. For parameter settings, see the Request parameters section.
When you establish a connection, the server returns the default input and output configurations for the session. To update the default session configuration, we recommend calling this method immediately after the connection is established.
After receiving the session.update event, the server validates the parameters. If the parameters are invalid, an error is returned. Otherwise, the server-side session configuration is updated.
Server response events: session.updated (Session configuration updated).
appendAudio
Appends Base64-encoded audio data to the cloud input audio buffer (temporary storage for writing data before submission).
- If turn_detection is enabled, the server uses the buffer to detect speech and decides when to submit it.
- If turn_detection is disabled, the client controls how much audio is sent per event (up to 15 MiB). Smaller blocks can improve VAD responsiveness.
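Because each append event is capped at 15 MiB and smaller blocks improve responsiveness, audio is typically sent in small Base64-encoded chunks. The helper below is a self-contained sketch of that chunking; the 3200-byte chunk size (100 ms of 16 kHz, 16-bit mono PCM) is an illustrative choice, and the `conversation.appendAudio` call in the comment refers to the SDK method described above.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class AudioChunker {
    // Split raw PCM bytes into fixed-size chunks and Base64-encode each one.
    // 3200 bytes = 100 ms of 16 kHz, 16-bit mono audio, well under the 15 MiB limit.
    static List<String> toBase64Chunks(byte[] pcm, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        for (int off = 0; off < pcm.length; off += chunkSize) {
            int len = Math.min(chunkSize, pcm.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(pcm, off, chunk, 0, len);
            chunks.add(Base64.getEncoder().encodeToString(chunk));
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] pcm = new byte[8000]; // 250 ms of silence at 16 kHz, 16-bit mono
        List<String> chunks = toBase64Chunks(pcm, 3200);
        // In a real session, each chunk would be passed to conversation.appendAudio(chunk).
        System.out.println(chunks.size()); // prints 3 (3200 + 3200 + 1600 bytes)
    }
}
```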
appendVideo
Adds Base64-encoded image data to the cloud video buffer (local images or real-time video stream captures).
Image input limits:
- Format: JPG or JPEG. Recommended resolution: 480p or 720p (max 1080p).
- Size: Max 500 KB per image (before Base64 encoding).
- Encoding: Must be Base64-encoded.
- Frequency: 1 image per second.
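The size limit applies to the raw image bytes, before Base64 encoding. Below is a self-contained sketch of validating and encoding one frame; the 500 KB check and the `conversation.appendVideo` call in the comment follow the limits listed above, and the JPEG byte array is a stand-in for real image data.

```java
import java.util.Base64;

public class ImageFrameEncoder {
    static final int MAX_IMAGE_BYTES = 500 * 1024; // 500 KB limit, before Base64 encoding

    // Validate the raw JPEG size, then Base64-encode it for appendVideo.
    static String encodeFrame(byte[] jpegBytes) {
        if (jpegBytes.length > MAX_IMAGE_BYTES) {
            throw new IllegalArgumentException(
                "Image is " + jpegBytes.length + " bytes; the limit is 500 KB before Base64 encoding");
        }
        return Base64.getEncoder().encodeToString(jpegBytes);
    }

    public static void main(String[] args) {
        byte[] fakeJpeg = new byte[100 * 1024]; // stand-in for real JPEG bytes
        String b64 = encodeFrame(fakeJpeg);
        // In a real session, send at most one frame per second:
        // conversation.appendVideo(b64);
        System.out.println(b64.length());
    }
}
```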
clearAppendedAudio
Clears the audio in the current cloud buffer.
Server response events: input_audio_buffer.cleared (Audio received by the server is cleared).
commit
Submits audio and video previously added to the cloud buffer via append. Returns an error if the input audio buffer is empty.
Server response events: input_audio_buffer.committed (Server received the submitted audio).
- If turn_detection is enabled, the server automatically submits the audio buffer (the client does not need to send this event).
- If turn_detection is disabled, the client must submit the audio buffer to create a user message item.
- If input_audio_transcription is configured, the system transcribes the audio.
- Submitting the buffer does not trigger a model response.
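In Manual mode (turn_detection disabled), the calls above combine into a simple client-driven turn. The following is a sketch only, assuming a connected `conversation` and pre-encoded Base64 chunks; whether `createResponse` takes arguments is not shown here, so check the SDK javadoc.

```java
// Manual mode: the client paces the input and explicitly requests a response.
for (String base64Chunk : audioChunks) {
    conversation.appendAudio(base64Chunk); // fill the cloud input audio buffer
}
conversation.commit();         // submit the buffer; does NOT trigger a response by itself
conversation.createResponse(); // ask the model to respond to the committed audio
```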
createResponse
Instructs the server to create a model response. When the session is configured with turn_detection mode enabled, the server automatically creates a model response.
Server response events: response.created, response.output_item.added, conversation.item.created, response.content_part.added, response.audio_transcript.delta, response.audio.delta, response.audio_transcript.done, response.audio.done, response.content_part.done, response.output_item.done, response.done.
cancelResponse
Cancels the in-progress response. Returns an error if no response exists to cancel.
Server response events: None.
close
Stops the task and closes the WebSocket connection.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| code | int | The status code for closing the WebSocket. |
| reason | String | The reason for closing the WebSocket. |
Server response events: None.
getSessionId
Gets the session ID of the current task.
Server response events: None.
getResponseId
Gets the response ID of the most recent response.
Server response events: None.
getFirstTextDelay
Gets the first-packet text latency of the most recent response.
Server response events: None.
getFirstAudioDelay
Gets the first-packet audio latency of the most recent response.
Server response events: None.
Callback interface (OmniRealtimeCallback)
The server uses callbacks to return events and data to the client. Implement these callback methods to handle server-side events and data.
Import the interface using import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;.
| Method | Parameter | Return value | Description |
|---|---|---|---|
| onOpen() | None | None | Called immediately after a connection is established to the server. |
| onEvent(JsonObject message) | message: The server response event. | None | Includes method call responses and model-generated text and audio. See Server events. |
| onClose(int code, String reason) | code: The status code for closing the WebSocket. reason: The reason for closing the WebSocket. | None | Called after the connection to the server is closed. |
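A callback implementation might dispatch on the event's `type` field, as sketched below. Assumptions: `OmniRealtimeCallback` can be implemented anonymously, and the `type` field and event names follow the server-event names listed above; the payload field for audio deltas is not shown here, so verify against the Server events reference.

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;
import com.google.gson.JsonObject;

// Sketch of a callback that dispatches on the server event type.
OmniRealtimeCallback callback = new OmniRealtimeCallback() {
    @Override
    public void onOpen() {
        System.out.println("connected");
    }

    @Override
    public void onEvent(JsonObject message) {
        String type = message.get("type").getAsString();
        switch (type) {
            case "response.audio.delta":
                // Base64-decode the audio payload and queue it for playback.
                break;
            case "response.audio_transcript.delta":
                // Incremental transcript of the model's spoken reply.
                break;
            case "response.done":
                // The current model response has finished.
                break;
            default:
                break;
        }
    }

    @Override
    public void onClose(int code, String reason) {
        System.out.println("closed: " + code + " " + reason);
    }
};
```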