Model details
qwen3-livetranslate-flash-realtime is a multilingual, real-time audio and video translation model. It can recognize 18 languages and translate them into audio in 11 languages in real time.
Core features:
- Multilingual support: Supports 18 languages, such as Chinese, English, French, German, Russian, Japanese, and Korean, and 6 Chinese dialects, including Mandarin, Cantonese, and Sichuanese.
- Vision enhancement: Uses visual content to improve translation accuracy. The model analyzes lip movements, actions, and on-screen text to enhance translation in noisy environments or for words with multiple meanings.
- 3-second latency: Achieves simultaneous interpretation latency as low as 3 seconds.
- Lossless simultaneous interpretation: Uses semantic unit prediction technology to resolve word order issues between languages. The real-time translation quality is close to that of offline translation.
- Natural voice: Generates natural, human-like speech. The model automatically adjusts its tone and emotion based on the source audio.
| Model | Version | Context window | Max input | Max output |
|---|---|---|---|---|
| qwen3-livetranslate-flash-realtime (currently equivalent to qwen3-livetranslate-flash-realtime-2025-09-22) | Stable | 53,248 | 49,152 | 4,096 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | Snapshot | 53,248 | 49,152 | 4,096 |
Getting started
Prepare the environment
Your Python version must be 3.10 or later.
First, install pyaudio.
- macOS
- Debian/Ubuntu
- CentOS
- Windows
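PyAudio wraps the PortAudio C library, so on macOS and Linux you typically install PortAudio before the Python package. The commands below are common installation steps for each platform listed above; they are illustrative, not taken from this document:

```shell
# macOS (Homebrew): install the PortAudio library first
brew install portaudio
pip install pyaudio

# Debian/Ubuntu: install the PortAudio development headers
sudo apt-get install -y portaudio19-dev
pip install pyaudio

# CentOS: install the PortAudio development headers
sudo yum install -y portaudio-devel
pip install pyaudio

# Windows: prebuilt wheels bundle PortAudio
pip install pyaudio
```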
Create the client
Create a new Python file locally, name it livetranslate_client.py, and copy the following code into the file:
Client code - livetranslate_client.py
Interact with the model
In the same folder as livetranslate_client.py, create another Python file, name it main.py, and copy the following code into the file:
main.py
Run main.py and speak the sentences you want to translate into the microphone. The model outputs the translated audio and text in real time. The system automatically detects your speech and sends the audio to the server, so no manual action is required.
Request parameters
Configure the connection
qwen3-livetranslate-flash-realtime connects using the WebSocket protocol. The connection requires the following configuration items:
| Configuration | Description |
|---|---|
| Endpoint | wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
| Query parameter | The query parameter is model. Set it to the name of the model you want to access. Example: ?model=qwen3-livetranslate-flash-realtime |
| Message header | Use Bearer Token for authentication: Authorization: Bearer $DASHSCOPE_API_KEY. DASHSCOPE_API_KEY is the API key that you request from Qwen Cloud. |
WebSocket connection Python sample code
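The configuration table above can be sketched as follows. The helper builds the URL and authentication header; the commented-out connection step assumes the third-party `websocket-client` package (`pip install websocket-client`), which is one common way to open a WebSocket from Python:

```python
import os

ENDPOINT = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"

def build_connection_args(model: str, api_key: str):
    """Return the (url, headers) pair needed to open the Realtime WebSocket.

    The model name goes in the `model` query parameter, and the API key
    goes in the Authorization header as a Bearer token.
    """
    url = f"{ENDPOINT}?model={model}"
    headers = [f"Authorization: Bearer {api_key}"]
    return url, headers

# Opening the socket itself (requires `pip install websocket-client`):
# import websocket
# url, headers = build_connection_args(
#     "qwen3-livetranslate-flash-realtime", os.environ["DASHSCOPE_API_KEY"])
# ws = websocket.WebSocket()
# ws.connect(url, header=headers)
```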
Set the language, output modality, and voice
Send the client event session.update:
- Language
  - Source language: Use the session.input_audio_transcription.language parameter. The default value is en (English).
  - Target language: Use the session.translation.language parameter. The default value is en (English).
- Output source language recognition results: Use the session.input_audio_transcription.model parameter. When you set the parameter to qwen3-asr-flash-realtime, the server returns the speech recognition result of the input audio (the original source language text) in addition to the translation. When this feature is enabled, the server returns the following events:
  - conversation.item.input_audio_transcription.text: Returns the recognition result as a stream.
  - conversation.item.input_audio_transcription.completed: Returns the final result after recognition is complete.
- Output modality: Use the session.modalities parameter. Supported values are ["text"] (text only) and ["text","audio"] (text and audio).
- Voice: Use the session.voice parameter. See Supported voices.
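The settings above can be combined into one session.update payload. The nesting below follows the dotted parameter names in this section; the exact JSON schema is an assumption, so check the API reference for the authoritative shape:

```python
import json

def build_session_update(source_lang="en", target_lang="zh",
                         modalities=("text", "audio"), voice="Cherry",
                         asr_model=None):
    """Build a session.update event configuring languages, modality, and voice."""
    session = {
        "input_audio_transcription": {"language": source_lang},
        "translation": {"language": target_lang},
        "modalities": list(modalities),
        "voice": voice,
    }
    # Set asr_model="qwen3-asr-flash-realtime" to also receive the
    # source-language recognition text alongside the translation.
    if asr_model:
        session["input_audio_transcription"]["model"] = asr_model
    return {"type": "session.update", "session": session}

# Send it over the open WebSocket:
# ws.send(json.dumps(build_session_update(source_lang="en", target_lang="zh")))
```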
Input audio and images
The client sends Base64-encoded audio and image data using the input_audio_buffer.append and input_image_buffer.append events. Audio input is required. Image input is optional.
Images can be from local files or a real-time video stream. The server automatically detects the start and end of the audio and triggers a model response.
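The two append events can be sketched as small JSON messages carrying Base64 payloads. The field names ("audio", "image") are assumptions based on the event names in this section:

```python
import base64
import json

def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap a raw audio chunk in an input_audio_buffer.append event (required)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def image_append_event(image_bytes: bytes) -> str:
    """Wrap image data in an input_image_buffer.append event (optional)."""
    return json.dumps({
        "type": "input_image_buffer.append",
        "image": base64.b64encode(image_bytes).decode("ascii"),
    })

# Stream microphone chunks as they arrive; the server detects the start
# and end of speech automatically, so no commit event is needed here:
# ws.send(audio_append_event(mic_chunk))
```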
Receive the model response
When the server detects the end of the audio, the model responds. The response format depends on the configured output modality.
- Text-only output The server returns the complete translated text in a response.text.done event.
-
Text and audio output
- Text: The complete translated text is returned in a response.audio_transcript.done event.
- Audio: Incremental, Base64-encoded audio data is returned in response.audio.delta events.
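A minimal dispatcher for the events above might look like this. The payload field names ("delta", "transcript", "text") are assumptions inferred from the event names, not confirmed by this document:

```python
import base64
import json

def handle_event(raw: str, audio_out: bytearray):
    """Process one server event; return the final translated text when done.

    Accumulates decoded audio into audio_out for response.audio.delta
    events and returns the complete text for either output modality.
    """
    event = json.loads(raw)
    etype = event.get("type")
    if etype == "response.audio.delta":
        # Incremental Base64-encoded audio (text-and-audio mode).
        audio_out.extend(base64.b64decode(event["delta"]))
    elif etype == "response.audio_transcript.done":
        # Complete translated text (text-and-audio mode).
        return event.get("transcript")
    elif etype == "response.text.done":
        # Complete translated text (text-only mode).
        return event.get("text")
    return None
```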
Parse the response
The interaction flow for real-time speech translation follows the standard WebSocket event-driven model, where the server automatically detects the start and end of speech and responds.
| Lifecycle | Client event | Server event |
|---|---|---|
| Session initialization | session.update (Session configuration) | session.created (Session created), session.updated (Session configuration updated) |
| User audio input | input_audio_buffer.append (Add audio to buffer), input_image_buffer.append (Add image to buffer) | None |
| Server audio output | None | response.created (Server starts generating response), response.output_item.added (New output content in response), response.content_part.added (New output content added to assistant message), response.audio_transcript.text (Incrementally generated transcript text), response.audio.delta (Incrementally generated audio from the model), response.audio_transcript.done (Text transcription complete), response.audio.done (Audio generation complete), response.content_part.done (Streaming of text or audio content for the assistant message is complete), response.output_item.done (Streaming of the entire output item for the assistant message is complete), response.done (Response complete) |
Use images to improve translation accuracy
qwen3-livetranslate-flash-realtime can accept image input to assist with audio translation. This is useful for scenarios involving homonyms or recognizing uncommon proper nouns. You can send a maximum of two images per second.
Download the following sample images to your local computer: medical mask.png and masquerade mask.png
Download the following code to the same folder as livetranslate_client.py and run it. Say "What is mask?" into the microphone. When you input the medical mask image, the model translates the phrase as "What is a medical mask?". When you input the masquerade mask image, the model translates the phrase as "What is a masquerade mask?".
Billing
- Audio: Each second of audio input or output consumes 12.5 tokens.
- Image: Each 28×28 block of pixels in the input consumes 0.5 tokens.
- Text: If you enable the source language speech recognition feature, the service returns the speech recognition text of the input audio (the original source language text) in addition to the translation result. This recognition text is billed based on the standard token rate for output text.
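The audio and image rates above can be turned into a quick cost estimate (text output is billed separately at the standard token rate and is omitted here):

```python
def estimate_tokens(audio_seconds: float, image_pixels: int = 0) -> float:
    """Estimate audio and image token consumption from the published rates."""
    audio_tokens = audio_seconds * 12.5              # 12.5 tokens per second of audio
    image_tokens = (image_pixels / (28 * 28)) * 0.5  # 0.5 tokens per 28x28 pixel block
    return audio_tokens + image_tokens

# A 60-second clip: 60 * 12.5 = 750 tokens
# A 224x224 image: (224*224) / (28*28) * 0.5 = 32 tokens
```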
Supported languages
The following language codes can be used for source and target languages. Some target languages support text output only.
| Language code | Language | Supported output |
|---|---|---|
| en | English | Audio, text |
| zh | Chinese | Audio, text |
| ru | Russian | Audio, text |
| fr | French | Audio, text |
| de | German | Audio, text |
| pt | Portuguese | Audio, text |
| es | Spanish | Audio, text |
| it | Italian | Audio, text |
| id | Indonesian | Text |
| ko | Korean | Audio, text |
| ja | Japanese | Audio, text |
| vi | Vietnamese | Text |
| th | Thai | Text |
| ar | Arabic | Text |
| yue | Cantonese | Audio, text |
| hi | Hindi | Text |
| el | Greek | Text |
| tr | Turkish | Text |
Supported voices
Set the voice parameter when the output includes synthesized audio.
| Voice name | voice parameter | Description | Supported languages |
|---|---|---|---|
| Cherry | Cherry | A cheerful, friendly, and genuine young woman. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Nofish | Nofish | A designer who has difficulty pronouncing retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Shanghai-Jada | Jada | A bustling and energetic Shanghai lady. | Chinese |
| Beijing-Dylan | Dylan | A young man who grew up in the hutongs of Beijing. | Chinese |
| Sichuan-Sunny | Sunny | A sweet girl from Sichuan. | Chinese |
| Tianjin-Peter | Peter | A voice in the style of a Tianjin crosstalk performer (the supporting role). | Chinese |
| Cantonese-Kiki | Kiki | A sweet best friend from Hong Kong. | Cantonese |
| Sichuan-Eric | Eric | A man from Chengdu, Sichuan, who is unconventional and stands out from the crowd. | Chinese |
| Ethan | Ethan | Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
Alternative: Use Qwen-Omni
You can also use Qwen-Omni (qwen3-omni-flash-realtime) with a translation prompt for real-time audio and video translation via WebSocket.
Qwen-Omni-Realtime uses WebSocket for bidirectional streaming. For the full API and SDK reference, see Realtime audio and video understanding.