Qwen-Omni-Realtime Python SDK
This topic describes the key interfaces and request parameters for the Qwen-Omni-Realtime DashScope Python SDK.
Prerequisites
SDK version 1.23.9+ is required. See Real-time multimodal interaction flow.
Getting started
Visit GitHub to download the sample code. We provide sample code for three calling methods:
- Audio conversation example: Captures real-time audio from a microphone, enables VAD mode for automatic detection of speech start and end (enable_turn_detection=True), and supports voice interruption. Use headphones for audio playback to prevent echoes from triggering voice interruption.
- Audio and video conversation example: Captures real-time audio and video from a microphone and a camera, enables VAD mode for automatic detection of speech start and end (enable_turn_detection=True), and supports voice interruption. Use headphones for audio playback to prevent echoes from triggering voice interruption.
- Local call example: Uses local audio and images as input and enables manual mode (enable_turn_detection=False), in which the client controls the pace of sending.
Request parameters
Set the following request parameters using the constructor method (__init__) of the OmniRealtimeConversation class.
| Parameter | Type | Description |
|---|---|---|
| model | str | The Qwen-Omni model name. See Model list. |
| callback | OmniRealtimeCallback | The callback object instance that handles server-side events. |
| url | str | The call address: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
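The constructor parameters above can be sketched as follows. This is a minimal, illustrative example: the import path, class names, and parameter names are taken from this topic, but it assumes dashscope is installed (`pip install dashscope`) and an API key is configured, and the model name shown is one plausible choice from the Model list.

```python
# Sketch only: requires `pip install dashscope` and a configured DashScope
# API key. Import path and parameter names follow this topic; verify them
# against your installed SDK version.
def build_conversation():
    from dashscope.audio.qwen_omni import (
        OmniRealtimeCallback,
        OmniRealtimeConversation,
    )

    class PrintingCallback(OmniRealtimeCallback):
        """Minimal callback that just logs server activity."""

        def on_open(self) -> None:
            print("connection opened")

        def on_event(self, message: str) -> None:
            print("server event:", message)

        def on_close(self, close_status_code, close_msg) -> None:
            print("connection closed:", close_status_code, close_msg)

    return OmniRealtimeConversation(
        model="qwen3-omni-flash-realtime",  # see Model list
        callback=PrintingCallback(),
        url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
    )
```

The function only builds the object; calling `connect` on the result is covered under Key interfaces below.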
Configure the following request parameters using the update_session method.
| Parameter | Type | Description |
|---|---|---|
| output_modalities | list[MultiModality] | Model output modality. Set to [MultiModality.TEXT] for text only, or [MultiModality.TEXT, MultiModality.AUDIO] for both. |
| voice | str | The voice for the audio generated by the model. For a list of supported voices, see Voice list. Default voices: Qwen3-Omni-Flash-Realtime: "Cherry", Qwen-Omni-Turbo-Realtime: "Chelsie". |
| input_audio_format | AudioFormat | Input audio format. Currently only PCM_16000HZ_MONO_16BIT is supported. |
| output_audio_format | AudioFormat | The format of the model's output audio. Currently, only pcm is supported. |
| smooth_output | bool | Supported only by the Qwen3-Omni-Flash-Realtime series. True: Get a conversational response. False: Get a more formal, written-style response (may result in poor quality if the content is difficult to read aloud). None: The model automatically selects a response style. |
| instructions | str | A system message that sets the model's objective or role. For example: "You are an AI agent for a five-star hotel. Answer customer questions about room types, facilities, prices, and booking policies." |
| enable_input_audio_transcription | bool | Whether to enable speech recognition for input audio. |
| input_audio_transcription_model | str | Speech recognition model for transcribing input audio. Currently only gummy-realtime-v1 is supported. |
| turn_detection_type | str | Server-side Voice Activity Detection (VAD) type. This is fixed to server_vad. |
| turn_detection_threshold | float | The VAD detection threshold. Increase this value in noisy environments and decrease it in quiet environments. The closer the value is to -1, the more likely noise is to be detected as speech. The closer the value is to 1, the less likely noise is to be detected as speech. Default value: 0.5. Valid values: [-1.0, 1.0]. |
| turn_detection_silence_duration_ms | int | The duration of silence that indicates the end of speech. If this duration is exceeded, the model triggers a response. Default value: 800. Valid values: [200, 6000]. |
| temperature | float | Sampling temperature controls content diversity. Higher values increase diversity, and lower values increase determinism. Valid values: [0, 2). Because both temperature and top_p control content diversity, set only one of them. Defaults: qwen3-omni-flash-realtime series: 0.9, qwen-omni-turbo-realtime series: 1.0. |
| top_p | float | Nucleus sampling probability threshold controls content diversity. Higher values increase diversity, and lower values increase determinism. Valid values: (0, 1.0]. Because both temperature and top_p control content diversity, set only one of them. Defaults: qwen3-omni-flash-realtime series: 1.0, qwen-omni-turbo-realtime series: 0.01. |
| top_k | integer | The size of the candidate set for sampling during generation. A larger value increases randomness. A smaller value increases determinism. If set to None or a value greater than 100, the top_k policy is not enabled, and only the top_p policy takes effect. The value must be greater than or equal to 0. Defaults: qwen3-omni-flash-realtime series: 50, qwen-omni-turbo-realtime series: 20. |
| max_tokens | integer | The maximum number of tokens to return for the current request. This parameter does not change how the model generates content; if the output exceeds max_tokens, the response is truncated at that point. The default and maximum values are the maximum output length of the model. The max_tokens parameter is suitable for scenarios where you need to limit the word count, control costs, or reduce response time. |
| repetition_penalty | float | The degree of repetition in consecutive sequences during model generation. A higher value reduces the repetition. A value of 1.0 means no penalty. The value must be greater than 0. Default value: 1.05. |
| presence_penalty | float | Controls the repetition of the content that the model generates. Default value: 0.0. Valid values: [-2.0, 2.0]. A positive value reduces repetition, and a negative value increases repetition. A higher value is suitable for creative writing or brainstorming. A lower value is suitable for technical documents or formal documents. |
| seed | integer | Makes model generation more deterministic, ensuring consistent results across runs. If you pass the same seed value and keep other parameters unchanged, the model returns the same result as much as possible. Valid values: 0 to 2^31-1. Default value: -1. |
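A hedged sketch of an update_session call using parameters from the table above. The keyword names follow this topic; the use of `enable_turn_detection` as the switch for server-side VAD is inferred from the sample descriptions in Getting started, and `conversation` is assumed to be an already-connected OmniRealtimeConversation instance.

```python
# Sketch only: parameter names follow the update_session table in this topic.
# Requires dashscope; `conversation` is an already-connected
# OmniRealtimeConversation instance.
def configure_session(conversation):
    from dashscope.audio.qwen_omni import AudioFormat, MultiModality

    conversation.update_session(
        output_modalities=[MultiModality.TEXT, MultiModality.AUDIO],
        voice="Cherry",  # see Voice list; "Cherry" is the Qwen3-Omni-Flash-Realtime default
        input_audio_format=AudioFormat.PCM_16000HZ_MONO_16BIT,
        enable_input_audio_transcription=True,
        input_audio_transcription_model="gummy-realtime-v1",
        enable_turn_detection=True,  # server-side VAD (assumed switch name)
        turn_detection_threshold=0.5,
        turn_detection_silence_duration_ms=800,
        instructions="You are a helpful assistant.",
    )
```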
qwen-omni-turbo models do not support modification of temperature, top_p, top_k, max_tokens, repetition_penalty, presence_penalty, or seed.
Key interfaces
OmniRealtimeConversation class
Import the OmniRealtimeConversation class using the statement from dashscope.audio.qwen_omni import OmniRealtimeConversation.
connect
Creates a connection with the server.
Server response events: session.created (session created), session.updated (session configuration updated).
update_session
Updates the session configuration. After the connection is established, the server returns a default configuration. Call this method immediately to override the defaults. The server validates the parameters and returns an error if they are invalid. For parameter settings, see the Request parameters section.
Server response events: session.updated (session configuration updated).
append_audio
Appends Base64-encoded audio to the cloud input buffer (temporary storage for a later commit).
- If turn_detection is enabled, the audio buffer is used for voice detection, and the server decides when to commit.
- If turn_detection is disabled, the client can choose the amount of audio to place in each event, up to a maximum of 15 MiB. Streaming smaller blocks of data from the client can make VAD more responsive.
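The points above can be illustrated with a small helper that splits raw PCM into short frames and Base64-encodes each one. The helper itself is illustrative (not part of the SDK); each encoded frame would then be passed to the SDK's append_audio method.

```python
import base64

# 16 kHz, mono, 16-bit PCM is 32,000 bytes per second, so 100 ms = 3,200 bytes.
CHUNK_BYTES = 3200


def pcm_to_base64_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Split raw PCM into small frames and Base64-encode each one.

    Each yielded string would be passed to the SDK's append_audio;
    streaming small frames keeps server-side VAD responsive.
    """
    for offset in range(0, len(pcm), chunk_bytes):
        yield base64.b64encode(pcm[offset:offset + chunk_bytes]).decode("ascii")


# Example: one second of silence becomes ten 100 ms frames.
frames = list(pcm_to_base64_chunks(b"\x00" * 32000))
```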
append_video
Adds Base64-encoded image data to the cloud video buffer (from a local file or a real-time stream).
The following limits apply to image input:
- The image format must be JPG or JPEG. The recommended image resolution is 480p or 720p, with a maximum of 1080p.
- The size of a single image cannot exceed 500 KB before Base64 encoding.
- The image data must be Base64 encoded.
- Send images to the server at a frequency of 1 image/second.
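The limits above can be enforced client-side before calling append_video. The validation helper and pacing loop below are illustrative (not part of the SDK); only the append_video call they feed is the SDK method described in this section.

```python
import base64
import time
from pathlib import Path

MAX_IMAGE_BYTES = 500 * 1024  # 500 KB limit, measured before Base64 encoding


def encode_frame(path: str) -> str:
    """Validate and Base64-encode one JPG/JPEG frame for append_video."""
    if not path.lower().endswith((".jpg", ".jpeg")):
        raise ValueError("only JPG/JPEG images are accepted")
    data = Path(path).read_bytes()
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"image is {len(data)} bytes; limit is {MAX_IMAGE_BYTES}")
    return base64.b64encode(data).decode("ascii")


def send_frames(conversation, paths):
    """Send one frame per second, as the input limits require.

    `conversation.append_video` is the SDK method described above;
    this pacing loop is illustrative.
    """
    for path in paths:
        conversation.append_video(encode_frame(path))
        time.sleep(1.0)  # 1 image/second
```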
clear_appended_audio
Clears the audio from the cloud buffer.
Server response events: input_audio_buffer.cleared (the audio received by the server has been cleared).
commit
Commits the audio and video added via append_audio and append_video. Returns an error if the buffer is empty.
Server response events: input_audio_buffer.committed (the server received the committed audio).
- If turn_detection is enabled, the client does not need to send this event; the server automatically commits the audio buffer.
- If turn_detection is disabled, the client must commit the audio buffer to create a user message item.
- If audio transcription is configured for the session using input_audio_transcription, the system transcribes the audio.
- Committing the input audio buffer does not create a response from the model.
create_response
Instructs the server to create a model response (automatic when turn_detection is enabled).
Server response events: response.created, response.output_item.added, conversation.item.created, response.content_part.added, response.audio_transcript.delta, response.audio.delta, response.audio_transcript.done, response.audio.done, response.content_part.done, response.output_item.done, response.done.
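When turn_detection is disabled, the client drives the whole cycle itself. A hedged sketch of that manual-mode sequence, using only method names from this topic (`conversation` is assumed to be connected with turn detection disabled, and `b64_chunks` an iterable of Base64-encoded PCM frames):

```python
# Sketch only: `conversation` is a connected OmniRealtimeConversation with
# turn detection disabled; `b64_chunks` yields Base64-encoded PCM frames.
def ask_with_audio(conversation, b64_chunks):
    for chunk in b64_chunks:
        conversation.append_audio(chunk)  # stage audio in the input buffer
    conversation.commit()                 # create the user message item
    conversation.create_response()        # ask the model to respond
```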
cancel_response
Cancels the in-progress response. Returns an error if no response is in progress.
Server response events: None.
close
Terminates the task and closes the connection.
Server response events: None.
get_session_id
Gets the session_id of the current task.
Server response events: None.
get_last_response_id
Gets the response_id of the last response.
Server response events: None.
get_last_first_text_delay
Gets the first-packet text latency of the last response.
Server response events: None.
get_last_first_audio_delay
Gets the first-packet audio latency of the last response.
Server response events: None.
Callback interface (OmniRealtimeCallback)
The server returns events and data via callbacks. Implement callback methods to handle server responses.
Import the interface using the statement from dashscope.audio.qwen_omni import OmniRealtimeCallback.
| Method | Parameters | Return value | Description |
|---|---|---|---|
| on_open(self) | None | None | Called immediately after the connection is established. |
| on_event(self, message: str) | message: A server response event. | None | Handles interface call responses and model-generated text/audio. See Server events. |
| on_close(self, close_status_code, close_msg) | close_status_code: The status code for closing the WebSocket. close_msg: The closing message for the WebSocket. | None | Called after the connection is closed. |
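The event-dispatch logic of on_event can be shown on its own. In real code this would live in a subclass of OmniRealtimeCallback; the plain class below avoids the dashscope dependency so it is runnable as-is. It assumes each server event is a JSON string with a `type` field, and that transcript deltas carry a `delta` field; check the Server events reference for the exact payload shapes.

```python
import json


class TranscriptCollector:
    """Illustrative on_event dispatch; in real code, subclass OmniRealtimeCallback.

    Assumes server events are JSON strings with a `type` field and that
    response.audio_transcript.delta events carry a `delta` text field
    (see Server events for the authoritative payload shapes).
    """

    def __init__(self):
        self.transcript = []

    def on_open(self):
        print("connection opened")

    def on_event(self, message: str):
        event = json.loads(message)
        etype = event.get("type")
        if etype == "response.audio_transcript.delta":
            # Incremental transcript text of the spoken reply.
            self.transcript.append(event.get("delta", ""))
        elif etype == "response.done":
            print("reply:", "".join(self.transcript))

    def on_close(self, close_status_code, close_msg):
        print("closed:", close_status_code, close_msg)
```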