Real-time speech translation with 3-second latency
Model details
qwen3.5-livetranslate-flash-realtime is a vision-enhanced real-time translation model supporting 60 languages (29 with audio + text, 31 text-only). It processes audio and image input from video streams or local files, uses visual context to improve accuracy, and outputs translated text and audio in real time.
Key features:
- Multi-language support: Translates between 60 languages — 29 with audio and text output, 31 with text-only output — including Chinese, English, French, German, Russian, Japanese, Korean, Spanish, Portuguese, and Arabic.
- Visual enhancement: Analyzes visual cues, such as lip movements, gestures, and on-screen text, to improve translation accuracy, especially in noisy environments or for ambiguous words.
- 3-second latency: Delivers simultaneous interpretation with latency as low as 3 seconds.
- Lossless simultaneous interpretation: Predicts semantic units to resolve cross-language word order differences, achieving quality comparable to offline translation.
- Natural voice: Matches the intonation and emotion of the source audio automatically.
- Hotword configuration: Configurable hotwords improve translation accuracy for specific terms.
- Voice cloning: Clones the speaker's voice for translated output. Supports server-side real-time cloning and pre-cloned voice profiles.
| Model | Version | Context window (tokens) | Max input (tokens) | Max output (tokens) |
|---|---|---|---|---|
| qwen3.5-livetranslate-flash-realtime (Alias for qwen3.5-livetranslate-flash-realtime-2026-05-19) | Stable | 53,248 | 49,152 | 4,096 |
| qwen3.5-livetranslate-flash-realtime-2026-05-19 | Snapshot | 53,248 | 49,152 | 4,096 |
| qwen3-livetranslate-flash-realtime (Alias for qwen3-livetranslate-flash-realtime-2025-09-22) | Stable | 53,248 | 49,152 | 4,096 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | Snapshot | 53,248 | 49,152 | 4,096 |
Getting started
Prepare the environment
Requires Python 3.10 or later.
First, install pyaudio for your operating system (macOS, Debian/Ubuntu, CentOS, or Windows).
Create the client
Create a file named livetranslate_client.py with the following code:
Client code - livetranslate_client.py
Interact with the model
In the same directory, create a file named main.py with the following code:
main.py
Run main.py and speak into your microphone. The system automatically detects speech and sends it to the server, and the model outputs translated audio and text in real time.
How to use
1. Configure the connection
The qwen3.5-livetranslate-flash-realtime model uses the WebSocket protocol. The connection requires the following parameters:
| Parameter | Description |
|---|---|
| endpoint | wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
| query parameter | The model query parameter must be set to the model name. Example: ?model=qwen3.5-livetranslate-flash-realtime |
| message header | Use a Bearer Token for authentication: Authorization: Bearer DASHSCOPE_API_KEY |
DASHSCOPE_API_KEY is your API key from Qwen Cloud.
Python sample code for WebSocket connection
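A minimal sketch of assembling the connection parameters from the table above. The helper name build_connection is hypothetical, and the code only builds the URL and headers; opening the socket is left to whichever WebSocket client library you use.

```python
import os
from urllib.parse import urlencode

# Endpoint taken from the connection table above.
ENDPOINT = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"


def build_connection(model: str, api_key: str) -> tuple[str, dict[str, str]]:
    """Return the WebSocket URL and the authentication header (hypothetical helper)."""
    # The model name is passed as the `model` query parameter.
    url = f"{ENDPOINT}?{urlencode({'model': model})}"
    # Bearer-token authentication, as described in the table.
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


if __name__ == "__main__":
    url, headers = build_connection(
        "qwen3.5-livetranslate-flash-realtime",
        os.environ.get("DASHSCOPE_API_KEY", ""),
    )
    print(url)
```

Pass the returned URL and headers to your client's connect call. The keyword used for extra headers varies between libraries and versions, so check the documentation of the one you use.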
2. Configure language, modality, and voice
Send the session.update client event with the following parameters:
- Language
  - Source language: Set the session.input_audio_transcription.language parameter. The default value is en (English).
  - Target language: Set the session.translation.language parameter. The default value is en (English).
- Output source language recognition results: To enable this feature, set the session.input_audio_transcription.model parameter to qwen3-asr-flash-realtime. The server then returns both the translation and the speech recognition result (original text) for the input audio, using the following events:
  - conversation.item.input_audio_transcription.text: Streams the recognition results.
  - conversation.item.input_audio_transcription.completed: Returns the final result after recognition is complete.
- Output modality: Set the session.modalities parameter to ["text"] (text only) or ["text","audio"] (text and audio).
- Voice: Set the session.voice parameter. See Supported voices.
- Hotword: Set the session.translation.corpus.phrases parameter. Hotwords are key-value pairs that map source terms to target translations, improving accuracy for specific terms. Example: Map "artificial intelligence" to "Artificial Intelligence".
- Voice cloning: Set the session.enable_voice_clone, session.voice_clone_options.frequency, and session.voice parameters. Three modes are supported: a pre-cloned voice profile (frequency: never), a server-side clone performed once at session start (once), or a real-time clone before each response (always). See Voice cloning.
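The parameters above can be combined into one session.update payload, sketched below. The nesting follows the parameter paths listed in this step. The outer envelope (a type field next to a session object) and the helper name build_session_update are assumptions, so compare the result against the API reference before relying on it.

```python
import json
from typing import Optional


def build_session_update(
    source_language: str = "en",
    target_language: str = "en",
    modalities: Optional[list[str]] = None,
    voice: Optional[str] = None,
    phrases: Optional[dict[str, str]] = None,
    transcription_model: Optional[str] = None,
) -> str:
    """Serialize a session.update event (hypothetical helper, assumed envelope)."""
    session: dict = {
        "modalities": modalities or ["text", "audio"],
        "input_audio_transcription": {"language": source_language},
        "translation": {"language": target_language},
    }
    if transcription_model:
        # Enables source-language recognition results.
        session["input_audio_transcription"]["model"] = transcription_model
    if voice:
        session["voice"] = voice
    if phrases:
        # Hotwords: source term -> preferred target translation.
        session["translation"]["corpus"] = {"phrases": phrases}
    return json.dumps({"type": "session.update", "session": session}, ensure_ascii=False)
```

For example, build_session_update("zh", "en", transcription_model="qwen3-asr-flash-realtime") requests Chinese-to-English translation and also asks for the source transcript.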
3. Input audio and images
Send Base64-encoded audio and image data using the input_audio_buffer.append and input_image_buffer.append events. Audio input is required; image input is optional.
Images can be from a local file or captured in real time from a video stream. The server automatically detects speech boundaries and triggers the model response.
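As an illustration of the Base64 step, the sketch below wraps raw bytes in the two append events. The event type names come from this section; the payload field names audio and image are assumptions, as are the helper names.

```python
import base64
import json


def audio_append_event(audio_chunk: bytes) -> str:
    """Wrap one chunk of raw audio in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        # Field name "audio" is an assumption; check the API reference.
        "audio": base64.b64encode(audio_chunk).decode("ascii"),
    })


def image_append_event(image_bytes: bytes) -> str:
    """Wrap one encoded image in an input_image_buffer.append event."""
    return json.dumps({
        "type": "input_image_buffer.append",
        # Field name "image" is an assumption; check the API reference.
        "image": base64.b64encode(image_bytes).decode("ascii"),
    })
```

Each returned string is one WebSocket text message. Audio chunks are sent continuously, while image events can be omitted entirely.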
4. Receive the model response
The model responds when the server detects the end of speech. The response format depends on the output modality.
- Text-only output: The server returns the complete translated text in a response.text.done event.
- Text and audio output:
  - Text: The complete translated text is returned in a response.audio_transcript.done event.
  - Audio: Incremental, Base64-encoded audio data is returned in response.audio.delta events.
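One way to consume these events is a small collector that accumulates audio chunks and completed translations, sketched below. The event type names come from the list above; the payload field names delta, transcript, and text are assumptions, and TranslationCollector is a hypothetical class.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class TranslationCollector:
    """Accumulates translated audio bytes and completed text segments."""

    audio: bytearray = field(default_factory=bytearray)
    texts: list[str] = field(default_factory=list)

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "response.audio.delta":
            # Incremental Base64 audio; field name "delta" is assumed.
            self.audio.extend(base64.b64decode(event["delta"]))
        elif kind == "response.audio_transcript.done":
            # Final text in text+audio mode; field name "transcript" is assumed.
            self.texts.append(event["transcript"])
        elif kind == "response.text.done":
            # Final text in text-only mode; field name "text" is assumed.
            self.texts.append(event["text"])
        # Other event types are ignored in this sketch.
```

Call handle once per decoded server message. In a real client the audio bytes would usually be streamed to a player as they arrive rather than buffered.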
5. End the session
After sending all audio, send a session.finish event, then wait for the server to return a session.finished event before closing the WebSocket connection.
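The shutdown sequence can be sketched as a coroutine that sends session.finish and then reads messages until session.finished arrives. The send and recv callables stand in for your WebSocket client's methods, and the function name and timeout are hypothetical choices rather than part of the API.

```python
import asyncio
import json
from typing import Awaitable, Callable


async def finish_session(
    send: Callable[[str], Awaitable[None]],
    recv: Callable[[], Awaitable[str]],
    timeout: float = 10.0,
) -> dict:
    """Send session.finish, then wait for session.finished before closing."""
    await send(json.dumps({"type": "session.finish"}))

    async def wait_for_finished() -> dict:
        while True:
            event = json.loads(await recv())
            if event.get("type") == "session.finished":
                return event
            # Late response events may still arrive; a real client
            # would forward them to its normal event handler here.

    # Raises asyncio.TimeoutError if the server never confirms.
    return await asyncio.wait_for(wait_for_finished(), timeout)
```

Close the WebSocket only after this coroutine returns, so the final speech segment has been translated and delivered.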
If you close the WebSocket without sending session.finish, the server's VAD cannot detect the end of the final speech segment. This causes translation results for that segment to be lost entirely, and the connection may hang indefinitely. Always send this event before disconnecting.
Voice cloning
The model clones the speaker's voice from the input audio and uses the cloned voice for translated output, so the translation sounds like the speaker delivering it in another language. Use a pre-cloned voice profile, or let the server clone the voice in real time. This is useful in scenarios where preserving the speaker's voice matters, such as conference interpreting, live streaming, and video dubbing.
Set the following parameters in session.update to enable voice cloning:
- session.enable_voice_clone: Set to true to enable voice cloning.
- session.voice_clone_options.frequency: Controls when voice cloning occurs. Accepted values:
  - never: Does not clone on the server. Uses a pre-cloned voice profile instead. Set session.voice to your custom cloned voice ID.
  - once: Clones the voice from the input audio once at session start, then reuses it for all subsequent output. Best for single-speaker scenarios. Set session.voice to default.
  - always: Clones the voice before each response, dynamically adapting to speaker changes. Best for multi-speaker conversations. Set session.voice to default.
- session.voice: Specifies the output voice. The value depends on the frequency setting:
  - default: Use with frequency set to once or always. The server clones the speaker's voice from the input audio. A default voice is used until cloning completes.
  - A custom cloned voice ID (for example, qwen-translate-vc-xxx-yyy-zzz): Use with frequency set to never. You must prepare the voice in advance using the Voice Cloning API with targetModel set to qwen3.5-livetranslate-flash-realtime.
When frequency is set to once or always, the voice parameter must be set to default. Any other value causes the server to return an error.
Voice cloning configuration examples
Pre-cloned voice profile (consistent quality; recommended when a stable voice identity is required):
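A sketch of the session fields for all three modes, with a client-side check of the frequency and voice pairing described above. The helper name voice_clone_settings is hypothetical, and rejecting default together with never is an extra sanity check assumed here, not a documented server rule.

```python
def voice_clone_settings(frequency: str, voice: str = "default") -> dict:
    """Return the voice-cloning fields to merge into the session object."""
    if frequency not in ("never", "once", "always"):
        raise ValueError(f"unsupported frequency: {frequency!r}")
    if frequency in ("once", "always") and voice != "default":
        # Documented rule: server-side cloning requires voice == "default".
        raise ValueError("voice must be 'default' when frequency is 'once' or 'always'")
    if frequency == "never" and voice == "default":
        # Assumed check: a pre-cloned profile needs its own voice ID.
        raise ValueError("frequency 'never' needs a pre-cloned voice ID")
    return {
        "enable_voice_clone": True,
        "voice_clone_options": {"frequency": frequency},
        "voice": voice,
    }


# Pre-cloned voice profile; the ID below is the placeholder used in this document.
pre_cloned = voice_clone_settings("never", "qwen-translate-vc-xxx-yyy-zzz")
# Server-side clone once at session start.
clone_once = voice_clone_settings("once")
# Real-time clone before each response.
clone_always = voice_clone_settings("always")
```

Validating the pairing before sending session.update avoids a round trip that would end in a server error.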
Interaction flow
Real-time speech translation follows an event-driven WebSocket model. The server automatically detects speech boundaries and responds.
| Lifecycle | Client event | Server event |
|---|---|---|
| Session initialization | session.update (Session configuration) | session.created (Session created), session.updated (Session configuration updated) |
| User audio input | input_audio_buffer.append (Append audio to the buffer) | None |
| Server audio output | None | response.created (Signals that the server starts generating a response), response.output_item.added (Signals that a new output item is available), response.content_part.added (Signals that a new content part has been added to the assistant message), response.audio_transcript.text (Contains an incremental update to the text transcript), response.audio.delta (Contains an incremental chunk of the synthesized audio), response.audio_transcript.done (Signals that the full text transcript is complete), response.audio.done (Signals that the synthesized audio is complete), response.content_part.done (Signals that a text or audio content part for the assistant message is complete), response.output_item.done (Signals that the entire output item for the assistant message is complete), response.done (Signals that the entire response is complete) |
Improve translation with images
The qwen3.5-livetranslate-flash-realtime model uses image input to improve audio translation, helping disambiguate homonyms and recognize uncommon proper nouns. Send no more than 2 images per second.
Download the following sample images: medical mask.png and masquerade mask.png
Download the following code to the same directory as livetranslate_client.py and run it. Say "What is mask?" into your microphone. The model uses the provided image to disambiguate the word "mask." For example, using the medical mask.png file translates the phrase as "What is a medical mask?", while using the masquerade mask.png file translates it as "What is a masquerade mask?".
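To stay within the limit of 2 images per second when frames come from a video stream, a client can drop frames that arrive too soon after the last one sent. The sketch below uses a minimum-interval throttle; ImageThrottle is a hypothetical class and the strategy is one possible choice, not the one used by the sample code.

```python
import time
from typing import Callable, Optional


class ImageThrottle:
    """Allows a frame only if enough time has passed since the last allowed one."""

    def __init__(
        self,
        max_per_second: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._last_sent: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a frame may be sent now, and record the send time."""
        now = self._clock()
        if self._last_sent is None or now - self._last_sent >= self._min_interval:
            self._last_sent = now
            return True
        return False
```

Call allow() for each captured frame and send an image event only when it returns True. With the default setting, sent frames are spaced at least 0.5 seconds apart.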
Billing
Qwen3.5-LiveTranslate-Flash-Realtime
- Audio: 7 tokens per second of input audio; 12.5 tokens per second of output audio.
- Image: Every 32x32 pixels consumes 0.5 tokens.
- Text: When source language speech recognition is enabled, the service returns a transcript of the input audio in addition to the translation. This transcript is billed as output text tokens.
Qwen3-LiveTranslate-Flash-Realtime
- Audio: Each second of audio input or output consumes 12.5 tokens.
- Image: Every 28x28 pixels consumes 0.5 tokens.
- Text: When source language speech recognition is enabled, the service returns a transcript of the input audio in addition to the translation. This transcript is billed as output text tokens.
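The rates above translate into a rough token estimate as sketched below. The defaults use the first rate list in this section (7 and 12.5 tokens per second, 32x32 pixels per 0.5 tokens). The names TokenRates and estimate_tokens are hypothetical, and because the source does not say how partial pixel blocks are rounded, the image figure is a plain area ratio.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TokenRates:
    input_audio_per_second: float = 7.0
    output_audio_per_second: float = 12.5
    image_block_size: int = 32          # block edge length in pixels
    tokens_per_image_block: float = 0.5


def estimate_tokens(
    input_seconds: float,
    output_seconds: float,
    image_sizes: Iterable[tuple[int, int]] = (),
    rates: TokenRates = TokenRates(),
) -> dict[str, float]:
    """Estimate audio and image token usage (text tokens are not covered)."""
    block_area = rates.image_block_size ** 2
    image_tokens = sum(
        width * height / block_area * rates.tokens_per_image_block
        for width, height in image_sizes
    )
    return {
        "input_audio": input_seconds * rates.input_audio_per_second,
        "output_audio": output_seconds * rates.output_audio_per_second,
        "image": image_tokens,
    }
```

For one minute of speech in and out plus a single 640x480 frame, estimate_tokens(60, 60, [(640, 480)]) gives 420 input-audio tokens, 750 output-audio tokens, and 150 image tokens. To apply the second rate list, pass TokenRates(12.5, 12.5, 28, 0.5).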
Supported languages
Use the following language codes to specify the source and target languages.
Some target languages support text output only. The legacy model qwen3-livetranslate-flash-realtime supports only the following 18 languages: en, zh, ru, fr, de, pt, es, it, id, ko, ja, vi, th, ar, yue, hi, el, tr.
| Language code | Language | Output |
|---|---|---|
| zh | Chinese | Audio + text |
| en | English | Audio + text |
| ar | Arabic | Audio + text |
| de | German | Audio + text |
| fr | French | Audio + text |
| es | Spanish | Audio + text |
| pt | Portuguese | Audio + text |
| id | Indonesian | Audio + text |
| it | Italian | Audio + text |
| ko | Korean | Audio + text |
| ru | Russian | Audio + text |
| th | Thai | Audio + text |
| vi | Vietnamese | Audio + text |
| ja | Japanese | Audio + text |
| tr | Turkish | Audio + text |
| hi | Hindi | Audio + text |
| ms | Malay | Audio + text |
| nl | Dutch | Audio + text |
| ur | Urdu | Audio + text |
| nb | Norwegian Bokmål | Audio + text |
| sv | Swedish | Audio + text |
| da | Danish | Audio + text |
| he | Hebrew | Audio + text |
| fi | Finnish | Audio + text |
| pl | Polish | Audio + text |
| is | Icelandic | Audio + text |
| cs | Czech | Audio + text |
| fil | Filipino | Audio + text |
| fa | Persian | Audio + text |
| yue | Cantonese | Text |
| el | Greek | Text |
| af | Afrikaans | Text |
| ast | Asturian | Text |
| be | Belarusian | Text |
| bg | Bulgarian | Text |
| bn | Bengali | Text |
| bs | Bosnian | Text |
| ca | Catalan | Text |
| ceb | Cebuano | Text |
| et | Estonian | Text |
| gl | Galician | Text |
| gu | Gujarati | Text |
| hr | Croatian | Text |
| hu | Hungarian | Text |
| jv | Javanese | Text |
| kk | Kazakh | Text |
| kn | Kannada | Text |
| ky | Kyrgyz | Text |
| lv | Latvian | Text |
| mk | Macedonian | Text |
| ml | Malayalam | Text |
| mr | Marathi | Text |
| pa | Punjabi | Text |
| ro | Romanian | Text |
| sk | Slovak | Text |
| sl | Slovenian | Text |
| sw | Swahili | Text |
| tg | Tajik | Text |
| az | Azerbaijani | Text |
| uk | Ukrainian | Text |
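Because the available output depends on the target language, a client may want to pick session.modalities from the table above. The sketch below encodes the table as two sets; modalities_for is a hypothetical helper, and defaulting to the richest available modality is a design choice made here.

```python
# Language codes copied from the table above.
AUDIO_OUTPUT_LANGUAGES = frozenset({
    "zh", "en", "ar", "de", "fr", "es", "pt", "id", "it", "ko",
    "ru", "th", "vi", "ja", "tr", "hi", "ms", "nl", "ur", "nb",
    "sv", "da", "he", "fi", "pl", "is", "cs", "fil", "fa",
})
TEXT_ONLY_LANGUAGES = frozenset({
    "yue", "el", "af", "ast", "be", "bg", "bn", "bs", "ca", "ceb",
    "et", "gl", "gu", "hr", "hu", "jv", "kk", "kn", "ky", "lv",
    "mk", "ml", "mr", "pa", "ro", "sk", "sl", "sw", "tg", "az", "uk",
})


def modalities_for(target_language: str) -> list[str]:
    """Return the richest output modalities listed for a target language."""
    if target_language in AUDIO_OUTPUT_LANGUAGES:
        return ["text", "audio"]
    if target_language in TEXT_ONLY_LANGUAGES:
        return ["text"]
    raise ValueError(f"unsupported language code: {target_language!r}")
```

For example, modalities_for("fr") returns ["text", "audio"], while modalities_for("yue") returns ["text"].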
Supported voices
For supported voices and the corresponding voice parameter values, see the API reference.