Qwen-Omni-Realtime Java SDK

Key interfaces and request parameters for the Qwen-Omni-Realtime DashScope Java SDK.

Prerequisites

Ensure that your Java SDK version is 2.20.9 or later. Before you begin, see Real-time multimodal interaction flow.

Getting started

You can download the sample code from GitHub. The following three usage scenarios are provided:
  1. Audio conversation example: Captures real-time audio from a microphone, enables VAD mode (automatic voice activity detection), and supports voice interruption. Set the enableTurnDetection parameter to true. Use headphones for audio playback to prevent echoes from triggering voice interruption.
  2. Audio and video conversation example: Captures real-time audio and video from a microphone and camera, enables VAD mode, and supports voice interruption. Set the enableTurnDetection parameter to true. Use headphones for audio playback to prevent echoes from triggering voice interruption.
  3. Local call example: Uses local audio and images as input and enables Manual mode (manual control over the sending pace). Set the enableTurnDetection parameter to false.

Request parameters

Configure the following request parameters using the chained methods or setters of the OmniRealtimeParam object. Then, pass this object and a callback instance to the OmniRealtimeConversation constructor.
  • model (String): The name of the Qwen-Omni real-time model. See Model list.
  • url (String): The endpoint URL: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime
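A minimal sketch of assembling these parameters. The model name is a placeholder and the apiKey setter name is an assumption; verify both against the SDK's builder:

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeParam;

// Connection-level parameters; the model name is a placeholder and the
// apiKey setter name is an assumption (check the OmniRealtimeParam builder).
OmniRealtimeParam param = OmniRealtimeParam.builder()
        .model("qwen3-omni-flash-realtime")
        .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
        .build();
```

Reading the API key from an environment variable keeps it out of source control.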
Configure the following request parameters using the chained methods or setters of the OmniRealtimeConfig object. Then, pass this object as a parameter to the updateSession interface.
  • modalities (List<OmniRealtimeModality>): The output modalities of the model. Set to [OmniRealtimeModality.TEXT] for text-only output, or [OmniRealtimeModality.TEXT, OmniRealtimeModality.AUDIO] for audio and text output.
  • voice (String): The voice used for the model's audio output. For a list of supported voices, see Voice list. Defaults: "Cherry" for Qwen3-Omni-Flash-Realtime, "Chelsie" for Qwen-Omni-Turbo-Realtime.
  • inputAudioFormat (OmniRealtimeAudioFormat): The format of the user's input audio. Currently, only PCM_16000HZ_MONO_16BIT is supported.
  • outputAudioFormat (OmniRealtimeAudioFormat): The format of the model's output audio. Currently, only pcm is supported.
  • smooth_output (Boolean): Supported only by the Qwen3-Omni-Flash-Realtime series. true: conversational responses. false: formal responses (performance may be suboptimal if the content is difficult to read aloud). null: the model automatically chooses between conversational and formal response styles.
  • instructions (String): A system message that sets the goal or role for the model. Example: "You are an AI customer service agent for a five-star hotel. Answer customer questions about room types, facilities, prices, and booking policies."
  • enableInputAudioTranscription (Boolean): Specifies whether to enable speech recognition for the input audio.
  • inputAudioTranscription (String): The speech recognition model used for transcribing input audio. Currently, only gummy-realtime-v1 is supported.
  • enableTurnDetection (Boolean): Specifies whether to enable voice activity detection (VAD). If disabled, you must manually submit audio to trigger a model response.
  • turnDetectionType (String): The server-side VAD type. Fixed value: "server_vad".
  • turnDetectionThreshold (Float): The VAD threshold. Value range: [-1.0, 1.0]. Default value: 0.5. A value closer to -1 increases the probability that noise is detected as speech; a value closer to 1 decreases it. Increase this value in noisy environments and decrease it in quiet environments.
  • turnDetectionSilenceDurationMs (Integer): The duration of silence (in milliseconds) that indicates the end of speech. The model triggers a response after this duration elapses. Value range: [200, 6000]. Default value: 800.
  • temperature (float): The sampling temperature, which controls the diversity of the generated content. A higher value results in more diverse content; a lower value results in more deterministic content. Value range: [0, 2). Because both temperature and top_p control content diversity, we recommend setting only one of them. Defaults: 0.9 for the qwen3-omni-flash-realtime series, 1.0 for the qwen-omni-turbo-realtime series.
  • top_p (float): The probability threshold for nucleus sampling, which also controls the diversity of the generated content. A higher value results in more diverse content; a lower value results in more deterministic content. Value range: (0, 1.0]. Because both temperature and top_p control content diversity, we recommend setting only one of them. Defaults: 1.0 for the qwen3-omni-flash-realtime series, 0.01 for the qwen-omni-turbo-realtime series.
  • top_k (integer): The size of the candidate token set for sampling during generation. A larger value increases randomness; a smaller value increases determinism. If the value is null or greater than 100, top_k sampling is disabled and only top_p sampling takes effect. The value must be 0 or greater. Defaults: 50 for the qwen3-omni-flash-realtime series, 20 for the qwen-omni-turbo-realtime series.
  • max_tokens (integer): The maximum number of tokens that can be returned in the response. This parameter does not affect the model's generation process; if the number of generated tokens exceeds max_tokens, the returned content is truncated. The default and maximum values equal the model's maximum output length. Use max_tokens to limit output length, control costs, or reduce response latency.
  • repetition_penalty (float): Controls the repetition penalty for consecutive sequences during generation. A higher value reduces repetition; 1.0 means no penalty is applied. The value must be greater than 0. Default value: 1.05.
  • presence_penalty (float): Controls the likelihood of repeated tokens in the generated content. Value range: [-2.0, 2.0]. Default value: 0.0. Positive values reduce repetition; negative values increase it. Use a higher value for creative writing or brainstorming and a lower value for technical documents or formal writing.
  • seed (integer): Makes the model's output more deterministic. If you pass the same seed value in each call and keep the other parameters unchanged, the model returns the same result whenever possible. Value range: 0 to 2^31-1. Default value: -1.
qwen-omni-turbo models do not support modification of temperature, top_p, top_k, max_tokens, repetition_penalty, presence_penalty, or seed. For other models, set these parameters (along with smooth_output and instructions) using the parameters method of the OmniRealtimeConfig instance:
conversation.updateSession(OmniRealtimeConfig.builder()
  .modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
  .voice("Cherry")
  .enableTurnDetection(true)
  .enableInputAudioTranscription(true)
  .parameters(Map.of(
    "smooth_output", true))
  .build()
);

Key interfaces

OmniRealtimeConversation class

Import the OmniRealtimeConversation class using import com.alibaba.dashscope.audio.omni.OmniRealtimeConversation;.

Constructor

public OmniRealtimeConversation(OmniRealtimeParam param, OmniRealtimeCallback callback)
Creates an OmniRealtimeConversation instance. Parameters:
  • param (OmniRealtimeParam): The configuration parameters for the conversation, including the model, URL, and API key.
  • callback (OmniRealtimeCallback): The callback instance that handles server-side events. See Callback interface.

connect

public void connect() throws NoApiKeyException, InterruptedException
Creates a connection to the server. Server response events: session.created (Session created), session.updated (Session configuration updated).

updateSession

public void updateSession(OmniRealtimeConfig config)
Updates the default configuration for the current session. For parameter settings, see the Request parameters section. When you establish a connection, the server returns the default input and output configurations for the session. To update the default session configuration, we recommend calling this method immediately after the connection is established. After receiving the session.update event, the server validates the parameters. If the parameters are invalid, an error is returned. Otherwise, the server-side session configuration is updated. Server response events: session.updated (Session configuration updated).

appendAudio

public void appendAudio(String audioBase64)
Appends Base64-encoded audio data to the cloud input audio buffer (temporary storage for writing data before submission).
  • If turn_detection is enabled, the server uses the buffer to detect speech and decides when to submit it.
  • If turn_detection is disabled, the client controls audio amount per event (up to 15 MiB). Smaller blocks can improve VAD responsiveness.
Server response events: None.
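As noted above, sending audio in smaller blocks can improve VAD responsiveness. The chunking and Base64 encoding step can be done with the standard library alone; this is a sketch, with the appendAudio call left as a comment since it needs a live conversation:

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class AudioChunker {
    // Split raw PCM bytes into fixed-size chunks and Base64-encode each one.
    // 3200 bytes = 100 ms of 16 kHz, 16-bit, mono audio (PCM_16000HZ_MONO_16BIT).
    public static List<String> toBase64Chunks(byte[] pcm, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        for (int off = 0; off < pcm.length; off += chunkSize) {
            int len = Math.min(chunkSize, pcm.length - off);
            byte[] slice = new byte[len];
            System.arraycopy(pcm, off, slice, 0, len);
            chunks.add(Base64.getEncoder().encodeToString(slice));
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] pcm = new byte[8000]; // stand-in for captured microphone data
        for (String b64 : toBase64Chunks(pcm, 3200)) {
            // conversation.appendAudio(b64);  // send each chunk to the cloud buffer
            System.out.println(b64.length());
        }
    }
}
```

The 100 ms chunk size is a common choice for streaming audio, not an SDK requirement; any size up to the 15 MiB per-event limit is accepted.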

appendVideo

public void appendVideo(String videoBase64)
Adds Base64-encoded image data to the cloud video buffer (local images or real-time video stream captures). Image input limits:
  • Format: JPG or JPEG. Recommended resolution: 480p or 720p (max 1080p).
  • Size: Max 500 KB per image (before Base64 encoding).
  • Encoding: Must be Base64-encoded.
  • Frequency: 1 image per second.
Server response events: None.
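The 500 KB limit applies to the raw image bytes, before Base64 encoding, so it is worth checking client-side. A stdlib-only sketch of the encoding step:

```java
import java.util.Base64;

public class FrameEncoder {
    static final int MAX_IMAGE_BYTES = 500 * 1024; // documented 500 KB pre-encoding limit

    // Base64-encode one JPEG frame, rejecting frames over the size limit.
    public static String encodeFrame(byte[] jpegBytes) {
        if (jpegBytes.length > MAX_IMAGE_BYTES) {
            throw new IllegalArgumentException("frame exceeds 500 KB limit");
        }
        return Base64.getEncoder().encodeToString(jpegBytes);
    }
}
```

The resulting string is what appendVideo expects; given the one-image-per-second limit, frames are typically sampled on a timer (for example, a ScheduledExecutorService) rather than sent for every captured frame.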

clearAppendedAudio

public void clearAppendedAudio()
Clears the audio in the current cloud buffer. Server response events: input_audio_buffer.cleared (Audio received by the server is cleared).

commit

public void commit()
Submits audio and video previously added to the cloud buffer via append. Returns an error if the input audio buffer is empty.
  • If turn_detection is enabled, the server automatically submits the audio buffer; the client does not need to send this event.
  • If turn_detection is disabled, the client must submit the audio buffer to create a user message item. If input_audio_transcription is configured, the system transcribes the audio. Submitting the buffer does not by itself trigger a model response.
Server response events: input_audio_buffer.committed (Server received the submitted audio).
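In Manual mode (enableTurnDetection set to false), the client drives the whole turn. A sketch of the sequence, assuming conversation is a connected OmniRealtimeConversation and audioBase64 holds Base64-encoded PCM; whether null is accepted for the instructions argument of createResponse is an assumption:

```java
// Manual mode: append audio, commit the buffer, then explicitly request a response.
conversation.appendAudio(audioBase64);
conversation.commit();
conversation.createResponse(null, Arrays.asList(
        OmniRealtimeModality.TEXT, OmniRealtimeModality.AUDIO));
```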

createResponse

public void createResponse(String instructions, List<OmniRealtimeModality> modalities)
Instructs the server to create a model response. When the session is configured with turn_detection mode enabled, the server automatically creates a model response. Server response events: response.created, response.output_item.added, conversation.item.created, response.content_part.added, response.audio_transcript.delta, response.audio.delta, response.audio_transcript.done, response.audio.done, response.content_part.done, response.output_item.done, response.done.

cancelResponse

public void cancelResponse()
Cancels the in-progress response. Returns an error if no response exists to cancel. Server response events: None.

close

public void close(int code, String reason)
Stops the task and closes the WebSocket connection. Parameters:
  • code (int): The status code for closing the WebSocket.
  • reason (String): The reason for closing the WebSocket.
Server response events: None.

getSessionId

public String getSessionId()
Gets the session ID of the current task. Server response events: None.

getResponseId

public String getResponseId()
Gets the response ID of the most recent response. Server response events: None.

getFirstTextDelay

public long getFirstTextDelay()
Gets the first-packet text latency of the most recent response. Server response events: None.

getFirstAudioDelay

public long getFirstAudioDelay()
Gets the first-packet audio latency of the most recent response. Server response events: None.

Callback interface (OmniRealtimeCallback)

The server uses callbacks to return events and data to the client. Implement these callback methods to handle server-side events and data. Import the interface using import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;.
  • onOpen(): Called immediately after a connection to the server is established. No parameters; no return value.
  • onEvent(JsonObject message): Called for each server response event, including method call responses and model-generated text and audio. See Server events.
  • onClose(int code, String reason): Called after the connection to the server is closed. code is the status code and reason is the reason for closing the WebSocket.
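A minimal callback sketch, assuming Gson's JsonObject as the event payload (matching the onEvent signature above); the event field names ("type", "delta") and event type strings should be verified against the Server events reference:

```java
import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;
import com.google.gson.JsonObject;

OmniRealtimeCallback callback = new OmniRealtimeCallback() {
    @Override
    public void onOpen() {
        System.out.println("WebSocket connection established");
    }

    @Override
    public void onEvent(JsonObject message) {
        String type = message.get("type").getAsString();
        if ("response.audio.delta".equals(type)) {
            // Base64-decode the delta payload and feed it to the audio player.
        } else if ("response.done".equals(type)) {
            System.out.println("response finished");
        }
    }

    @Override
    public void onClose(int code, String reason) {
        System.out.println("closed: " + code + " " + reason);
    }
};
```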