Qwen-Omni Python SDK

This topic describes the key interfaces and request parameters for the Qwen-Omni-Realtime DashScope Python SDK.

Prerequisites

SDK version 1.25.17+ is required. See Real-time multimodal interaction flow.

Request parameters

Set the following request parameters using the constructor method (__init__) of the OmniRealtimeConversation class.

Parameter	Type	Description
model	str	The Qwen-Omni model name. See Model list.
callback	OmniRealtimeCallback	The callback object instance that handles server-side events.
url	str	The call address: `wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime`

Configure the following request parameters using the update_session method.

Parameter	Type	Description
output_modalities	list[MultiModality]	Model output modality. Set to `[MultiModality.TEXT]` for text only, or `[MultiModality.TEXT, MultiModality.AUDIO]` for both.
voice	str	The voice for the audio generated by the model. Accepts preset voices or cloned voices (Qwen3.5-Omni-Realtime only) created via the Voice cloning API. For a list of supported preset voices, see Voice list. Default voices: `Qwen3.5-Omni-Realtime`: `"Tina"`, `Qwen3-Omni-Flash-Realtime`: `"Cherry"`, `Qwen-Omni-Turbo-Realtime`: `"Chelsie"`.
input_audio_format	AudioFormat	The format of the user's input audio. Currently only supports `PCM_16000HZ_MONO_16BIT`, which represents a PCM audio stream at a 16 kHz sample rate.
output_audio_format	AudioFormat	The format of the model's output audio. Currently only supports `PCM_24000HZ_MONO_16BIT`, which represents a PCM audio stream at a 24 kHz sample rate.
instructions	str	A system message that sets the model's objective or role. For example: "You are an AI agent for a five-star hotel. Answer customer questions about room types, facilities, prices, and booking policies."
enable_input_audio_transcription	bool	Whether to enable speech recognition for input audio.
input_audio_transcription_model	str	Speech recognition model for transcribing input audio. Always `qwen3-asr-flash-realtime`. This parameter is not configurable.
turn_detection_type	str	Server-side Voice Activity Detection (VAD) type. Valid values: `server_vad` (default): Detects the end of user speech based on acoustic features. `semantic_vad`: Detects the end of user speech based on semantic validity. This mode can filter out meaningless speech, such as filler words and background noise. This mode is supported only by the `qwen3.5-omni-realtime` model.
turn_detection_threshold	float	The VAD detection threshold. Increase this value in noisy environments and decrease it in quiet environments. The closer the value is to -1, the more likely noise is to be detected as speech. The closer the value is to 1, the less likely noise is to be detected as speech. Default value: 0.2. Valid values: [-1.0, 1.0].
turn_detection_silence_duration_ms	int	The duration of silence that indicates the end of speech. If this duration is exceeded, the model triggers a response. Default value: 800. Valid values: [200, 6000].
turn_detection_param	dict	Additional `turn_detection` configuration parameters. Currently supports `idle_timeout_ms` (int): the idle timeout in milliseconds. Applies only to `qwen3.5-omni-plus-realtime` and `qwen3.5-omni-flash-realtime` models in `server_vad` mode. After the server finishes audio playback and the user remains silent beyond this duration (no `speech.started` triggered), the model proactively generates a response to prompt the user to continue the conversation. Valid range: [5000, 30000]. Example: `turn_detection_param={'idle_timeout_ms': 5000}`
temperature	float	Sampling temperature controls content diversity. Higher values increase diversity, and lower values increase determinism. Valid values: [0, 2). Because both `temperature` and `top_p` control content diversity, set only one of them. Defaults: `qwen3.5-omni-realtime` series: 0.7, `qwen3-omni-flash-realtime` series: 0.9, `qwen-omni-turbo-realtime` series: 1.0.
top_p	float	Nucleus sampling probability threshold controls content diversity. Higher values increase diversity, and lower values increase determinism. Valid values: (0, 1.0]. Because both `temperature` and `top_p` control content diversity, set only one of them. Defaults: `qwen3.5-omni-realtime` series: 0.8, `qwen3-omni-flash-realtime` series: 1.0, `qwen-omni-turbo-realtime` series: 0.01.
top_k	integer	The size of the candidate set for sampling during generation. A larger value increases randomness. A smaller value increases determinism. If set to `None` or a value greater than 100, the `top_k` policy is not enabled, and only the `top_p` policy takes effect. The value must be greater than or equal to 0. Defaults: `qwen3.5-omni-realtime` series: 20, `qwen3-omni-flash-realtime` series: 50, `qwen-omni-turbo-realtime` series: 20.
max_tokens	integer	The maximum number of tokens to return for the current request. This parameter doesn't affect model generation. If output exceeds `max_tokens`, truncated content is returned. The default and maximum values are the maximum output length of the model. The `max_tokens` parameter is suitable for scenarios where you need to limit the word count, control costs, or reduce response time.
repetition_penalty	float	The degree of repetition in consecutive sequences during model generation. A higher value reduces the repetition. A value of 1.0 means no penalty. The value must be greater than 0. Defaults: `qwen3.5-omni-realtime` series: 1.0, `qwen3-omni-flash-realtime` series: 1.05.
presence_penalty	float	Controls the repetition of the content that the model generates. Valid values: [-2.0, 2.0]. A positive value reduces repetition, and a negative value increases repetition. A higher value is suitable for creative writing or brainstorming. A lower value is suitable for technical documents or formal documents. Defaults: `qwen3.5-omni-realtime` series: 1.5, `qwen3-omni-flash-realtime` series: 0.0.
seed	integer	Makes model generation more deterministic, ensuring consistent results across runs. If you pass the same seed value and keep other parameters unchanged, the model returns the same result as much as possible. Valid values: 0 to 2^31-1. Default value: -1.
enable_search	bool	Whether to enable web search.
search_options	dict	Web search options. Takes effect only when `enable_search` is enabled. Currently, you can only set `enable_source` (Boolean), which controls whether to return search result sources. Set to `true` to enable. Example: `search_options={'enable_source': True}`.
tools	list[dict]	Tool definitions. When provided, the model can call external tools to respond to user questions. If a tool is called, the model does not generate audio and only returns the tool calling parameters. Each tool is a dictionary with the following fields: `type` (str, required): Must be set to `"function"`. `function` (dict, required): The definition of the tool function, containing `name` (str, required), `description` (str, optional), and `parameters` (dict, optional, JSON Schema format with `type`, `properties`, and `required` fields).

Tool calling (tools) and web search (enable_search) are incompatible. You cannot enable both at the same time.

qwen-omni-turbo models do not support modification of temperature, top_p, top_k, max_tokens, repetition_penalty, presence_penalty, or seed.

Key interfaces

OmniRealtimeConversation class

Import the OmniRealtimeConversation class using the statement from dashscope.audio.qwen_omni import OmniRealtimeConversation.

connect

def connect(self) -> None

Creates a connection with the server. Server response events: session.created (Session created), session.updated (Session configuration updated).

update_session

def update_session(self,
                   output_modalities: list[MultiModality],
                   voice: str,
                   input_audio_format: AudioFormat = AudioFormat.PCM_16000HZ_MONO_16BIT,
                   output_audio_format: AudioFormat = AudioFormat.PCM_24000HZ_MONO_16BIT,
                   enable_input_audio_transcription: bool = True,
                   input_audio_transcription_model: str = None,
                   enable_turn_detection: bool = True,
                   turn_detection_type: str = 'server_vad',
                   prefix_padding_ms: int = 300,
                   turn_detection_threshold: float = 0.2,
                   turn_detection_silence_duration_ms: int = 800,
                   turn_detection_param: dict = None,
                   **kwargs) -> None

Updates session configuration. After establishing the connection, the server returns default configurations. Call this method immediately to override defaults. The server validates parameters and returns an error if invalid. For parameter settings, see the Request parameters section. Server response events: session.updated (Session configuration updated).

append_audio

def append_audio(self, audio_b64: str) -> None

Appends Base64-encoded audio to cloud input buffer (temporary storage for later commit).

If turn_detection is enabled, the audio buffer is used for voice detection, and the server decides when to commit.
If turn_detection is disabled, the client can choose the amount of audio to place in each event, up to a maximum of 15 MiB. Streaming smaller blocks of data from the client can make VAD more responsive.

Server response events: None.

append_video

def append_video(self, video_b64: str) -> None

Adds Base64-encoded image data to cloud video buffer (local or real-time stream). The following limits apply to image input:

The image format must be JPG or JPEG. The recommended image resolution is 480p or 720p, with a maximum of 1080p.
A single image after Base64 encoding must not exceed 256 KB. We recommend keeping the raw image size below 190 KB before encoding.
The image data must be Base64 encoded.
Send images to the server at a frequency of 1 image/second.

Server response events: None.

clear_appended_audio

def clear_appended_audio(self) -> None

Clears audio from cloud buffer. Server response events: input_audio_buffer.cleared (Clears the audio received by the server).

commit

def commit(self) -> None

Commits audio and video added via append. Returns an error if the buffer is empty.

If turn_detection is enabled, the client does not need to send this event. The server automatically commits the audio buffer.
If turn_detection is disabled, the client must commit the audio buffer to create a user message item.

If audio transcription is configured for the session using input_audio_transcription, the system transcribes the audio.
Committing the input audio buffer does not create a response from the model.

Server response events: input_audio_buffer.committed (Server received the committed audio).

create_response

def create_response(self,
          instructions: str = None,
          output_modalities: list[MultiModality] = None) -> None

Instructs server to create model response (automatic when turn_detection is enabled). Server response events: response.created, response.output_item.added, conversation.item.created, response.content_part.added, response.audio_transcript.delta, response.audio.delta, response.audio_transcript.done, response.audio.done, response.content_part.done, response.output_item.done, response.done.

cancel_response

def cancel_response(self) -> None

Cancels the in-progress response (returns an error if none exists). Server response events: None.

create_item

def create_item(self, item: dict) -> None

Sends a conversation.item.create event to the server. In tool calling scenarios, use this method to send the tool execution result back to the server. The item parameter is a dictionary that must contain the following fields:

type: Must be "function_call_output".
call_id: The call_id from the response.function_call_arguments.done event.
output: A string that contains the tool execution result.

Server response events: None.

close

def close(self) -> None

Terminates the task and closes the connection. Server response events: None.

get_session_id

def get_session_id(self) -> str

Gets the session_id of the current task. Server response events: None.

get_last_response_id

def get_last_response_id(self) -> str

Gets the response_id of the last response. Server response events: None.

Callback interface (OmniRealtimeCallback)

The server returns events and data via callbacks. Implement callback methods to handle server responses. Import the interface using the statement from dashscope.audio.qwen_omni import OmniRealtimeCallback.

Method	Parameters	Return value	Description
`on_open(self)`	None	None	Called immediately after the connection is established.
`on_event(self, message: str)`	`message`: A server response event.	None	Handles interface call responses and model-generated text/audio. See Server events.
`on_close(self, close_status_code, close_msg)`	`close_status_code`: The status code for closing the WebSocket. `close_msg`: The closing message for the WebSocket.	None	Called after the connection is closed.

​Prerequisites

​Request parameters

​Key interfaces

​OmniRealtimeConversation class

​connect

​update_session

​append_audio

​append_video

​clear_appended_audio

​commit

​create_response

​cancel_response

​create_item

​close

​get_session_id

​get_last_response_id

​Callback interface (OmniRealtimeCallback)

Prerequisites

Request parameters

Key interfaces

OmniRealtimeConversation class

connect

update_session

append_audio

append_video

clear_appended_audio

commit

create_response

cancel_response

create_item

close

get_session_id

get_last_response_id

Callback interface (OmniRealtimeCallback)