
Realtime audio and video translation

3-second latency streaming

Model details

qwen3-livetranslate-flash-realtime is a multilingual, real-time audio and video translation model. It can recognize 18 languages and translate them into audio in 11 languages in real time. Core features:
  • Multilingual support: Supports 18 languages, such as Chinese, English, French, German, Russian, Japanese, and Korean, and 6 Chinese dialects, including Mandarin, Cantonese, and Sichuanese.
  • Vision enhancement: Uses visual content to improve translation accuracy. The model analyzes lip movements, actions, and on-screen text to enhance translation in noisy environments or for words with multiple meanings.
  • 3-second latency: Achieves simultaneous interpretation latency as low as 3 seconds.
  • Lossless simultaneous interpretation: Uses semantic unit prediction technology to resolve word order issues between languages. The real-time translation quality is close to that of offline translation.
  • Natural voice: Generates natural, human-like speech. The model automatically adjusts its tone and emotion based on the source audio.
Model | Version | Context window | Max input | Max output
qwen3-livetranslate-flash-realtime (current capabilities are equivalent to qwen3-livetranslate-flash-realtime-2025-09-22) | Stable | 53,248 | 49,152 | 4,096
qwen3-livetranslate-flash-realtime-2025-09-22 | Snapshot | 53,248 | 49,152 | 4,096

Getting started

Prepare the environment

Your Python version must be 3.10 or later. First, install PyAudio. The installation command depends on your operating system (macOS, Debian/Ubuntu, CentOS, or Windows). On macOS, for example:
brew install portaudio && pip install pyaudio
After the installation is complete, use pip to install the required WebSocket dependencies:
pip install websocket-client==1.8.0 websockets

Create the client

Create a new Python file locally, name it livetranslate_client.py, and copy the following code into the file:
import os
import time
import base64
import asyncio
import json
import websockets
import pyaudio
import queue
import threading
import traceback

class LiveTranslateClient:
  def __init__(self, api_key: str, target_language: str = "en", voice: str | None = "Cherry", *, audio_enabled: bool = True):
    if not api_key:
      raise ValueError("API key cannot be empty.")

    self.api_key = api_key
    self.target_language = target_language
    self.audio_enabled = audio_enabled
    self.voice = voice if audio_enabled else "Cherry"
    self.ws = None
    self.api_url = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-livetranslate-flash-realtime"

    # Audio input configuration (from microphone)
    self.input_rate = 16000
    self.input_chunk = 1600
    self.input_format = pyaudio.paInt16
    self.input_channels = 1

    # Audio output configuration (for playback)
    self.output_rate = 24000
    self.output_chunk = 2400
    self.output_format = pyaudio.paInt16
    self.output_channels = 1

    # State management
    self.is_connected = False
    self.audio_player_thread = None
    self.audio_playback_queue = queue.Queue()
    self.pyaudio_instance = pyaudio.PyAudio()

  async def connect(self):
    """Establish a WebSocket connection to the translation service."""
    headers = {"Authorization": f"Bearer {self.api_key}"}
    try:
      self.ws = await websockets.connect(self.api_url, additional_headers=headers)
      self.is_connected = True
      print(f"Successfully connected to the server: {self.api_url}")
      await self.configure_session()
    except Exception as e:
      print(f"Connection failed: {e}")
      self.is_connected = False
      raise

  async def configure_session(self):
    """Configure the translation session, setting the target language, voice, and other parameters."""
    config = {
      "event_id": f"event_{int(time.time() * 1000)}",
      "type": "session.update",
      "session": {
        # 'modalities' controls the output type.
        # ["text", "audio"]: Returns both translated text and synthesized audio (recommended).
        # ["text"]: Returns only the translated text.
        "modalities": ["text", "audio"] if self.audio_enabled else ["text"],
        **({"voice": self.voice} if self.audio_enabled and self.voice else {}),
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm24",
        # 'input_audio_transcription' configures source language recognition.
        # Set 'model' to 'qwen3-asr-flash-realtime' to also output the source language recognition result.
        # "input_audio_transcription": {
        #     "model": "qwen3-asr-flash-realtime",
        #     "language": "zh"  # Source language, default is 'en'
        # },
        "translation": {
          "language": self.target_language
        }
      }
    }
    print(f"Sending session configuration: {json.dumps(config, indent=2, ensure_ascii=False)}")
    await self.ws.send(json.dumps(config))

  async def send_audio_chunk(self, audio_data: bytes):
    """Encode and send an audio data block to the server."""
    if not self.is_connected:
      return

    event = {
      "event_id": f"event_{int(time.time() * 1000)}",
      "type": "input_audio_buffer.append",
      "audio": base64.b64encode(audio_data).decode()
    }
    await self.ws.send(json.dumps(event))

  async def send_image_frame(self, image_bytes: bytes, *, event_id: str | None = None):
    # Send image data to the server
    if not self.is_connected:
      return

    if not image_bytes:
      raise ValueError("image_bytes cannot be empty")

    # Encode to Base64
    image_b64 = base64.b64encode(image_bytes).decode()

    event = {
      "event_id": event_id or f"event_{int(time.time() * 1000)}",
      "type": "input_image_buffer.append",
      "image": image_b64,
    }

    await self.ws.send(json.dumps(event))

  def _audio_player_task(self):
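    """Play synthesized audio chunks from the playback queue until a termination signal (None) arrives; runs in a background thread."""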
    stream = self.pyaudio_instance.open(
      format=self.output_format,
      channels=self.output_channels,
      rate=self.output_rate,
      output=True,
      frames_per_buffer=self.output_chunk,
    )
    try:
      while self.is_connected or not self.audio_playback_queue.empty():
        try:
          audio_chunk = self.audio_playback_queue.get(timeout=0.1)
          if audio_chunk is None: # Termination signal
            break
          stream.write(audio_chunk)
          self.audio_playback_queue.task_done()
        except queue.Empty:
          continue
    finally:
      stream.stop_stream()
      stream.close()

  def start_audio_player(self):
    """Start the audio player thread (only when audio output is enabled)."""
    if not self.audio_enabled:
      return
    if self.audio_player_thread is None or not self.audio_player_thread.is_alive():
      self.audio_player_thread = threading.Thread(target=self._audio_player_task, daemon=True)
      self.audio_player_thread.start()

  async def handle_server_messages(self, on_text_received):
    """Loop to process messages from the server."""
    try:
      async for message in self.ws:
        event = json.loads(message)
        event_type = event.get("type")
        if event_type == "response.audio.delta" and self.audio_enabled:
          audio_b64 = event.get("delta", "")
          if audio_b64:
            audio_data = base64.b64decode(audio_b64)
            self.audio_playback_queue.put(audio_data)

        elif event_type == "response.done":
          print("\n[INFO] A round of response is complete.")
          usage = event.get("response", {}).get("usage", {})
          if usage:
            print(f"[INFO] Token usage: {json.dumps(usage, indent=2, ensure_ascii=False)}")
        # Process source language recognition results (requires enabling input_audio_transcription.model)
        # elif event_type == "conversation.item.input_audio_transcription.text":
        #     stash = event.get("stash", "")  # Unconfirmed recognition text
        #     print(f"[Recognizing] {stash}")
        # elif event_type == "conversation.item.input_audio_transcription.completed":
        #     transcript = event.get("transcript", "")  # Complete recognition result
        #     print(f"[Source language] {transcript}")
        elif event_type == "response.audio_transcript.done":
          print("\n[INFO] Text translation complete.")
          text = event.get("transcript", "")
          if text:
            print(f"[INFO] Translated text: {text}")
        elif event_type == "response.text.done":
          print("\n[INFO] Text translation complete.")
          text = event.get("text", "")
          if text:
            print(f"[INFO] Translated text: {text}")

    except websockets.exceptions.ConnectionClosed as e:
      print(f"[WARNING] Connection closed: {e}")
      self.is_connected = False
    except Exception as e:
      print(f"[ERROR] An unknown error occurred while processing messages: {e}")
      traceback.print_exc()
      self.is_connected = False

  async def start_microphone_streaming(self):
    """Capture audio from the microphone and stream it to the server."""
    stream = self.pyaudio_instance.open(
      format=self.input_format,
      channels=self.input_channels,
      rate=self.input_rate,
      input=True,
      frames_per_buffer=self.input_chunk
    )
    print("Microphone is on. Start speaking...")
    try:
      while self.is_connected:
        audio_chunk = await asyncio.get_event_loop().run_in_executor(
          None, stream.read, self.input_chunk
        )
        await self.send_audio_chunk(audio_chunk)
    finally:
      stream.stop_stream()
      stream.close()

  async def close(self):
    """Gracefully close the connection and release resources."""
    self.is_connected = False
    if self.ws:
      await self.ws.close()
      print("WebSocket connection closed.")

    if self.audio_player_thread:
      self.audio_playback_queue.put(None) # Send termination signal
      self.audio_player_thread.join(timeout=1)
      print("Audio player thread stopped.")

    self.pyaudio_instance.terminate()
    print("PyAudio instance released.")

Interact with the model

In the same folder as livetranslate_client.py, create another Python file, name it main.py, and copy the following code into the file:
import os
import asyncio
from livetranslate_client import LiveTranslateClient

def print_banner():
  print("=" * 60)
  print("  Powered by Qwen qwen3-livetranslate-flash-realtime")
  print("=" * 60 + "\n")

def get_user_config():
  """Get user configuration"""
  print("Select a mode:")
  print("1. Voice + Text [Default] | 2. Text Only")
  mode_choice = input("Enter your choice (press Enter for Voice + Text): ").strip()
  audio_enabled = (mode_choice != "2")

  if audio_enabled:
    lang_map = {
      "1": "en", "2": "zh", "3": "ru", "4": "fr", "5": "de", "6": "pt",
      "7": "es", "8": "it", "9": "ko", "10": "ja", "11": "yue"
    }
    print("Select the target translation language (Voice + Text mode):")
    print("1. English | 2. Chinese | 3. Russian | 4. French | 5. German | 6. Portuguese | 7. Spanish | 8. Italian | 9. Korean | 10. Japanese | 11. Cantonese")
  else:
    lang_map = {
      "1": "en", "2": "zh", "3": "ru", "4": "fr", "5": "de", "6": "pt", "7": "es", "8": "it",
      "9": "id", "10": "ko", "11": "ja", "12": "vi", "13": "th", "14": "ar",
      "15": "yue", "16": "hi", "17": "el", "18": "tr"
    }
    print("Select the target translation language (Text Only mode):")
    print("1. English | 2. Chinese | 3. Russian | 4. French | 5. German | 6. Portuguese | 7. Spanish | 8. Italian | 9. Indonesian | 10. Korean | 11. Japanese | 12. Vietnamese | 13. Thai | 14. Arabic | 15. Cantonese | 16. Hindi | 17. Greek | 18. Turkish")

  choice = input("Enter your choice (defaults to the first option): ").strip()
  target_language = lang_map.get(choice, next(iter(lang_map.values())))

  voice = None
  if audio_enabled:
    print("\nSelect a speech synthesis voice:")
    voice_map = {"1": "Cherry", "2": "Nofish", "3": "Sunny", "4": "Jada", "5": "Dylan", "6": "Peter", "7": "Eric", "8": "Kiki", "9": "Ethan"}
    print("1. Cherry (Female) [Default] | 2. Nofish (Male) | 3. Sunny (Sichuan Female) | 4. Jada (Shanghai Female) | 5. Dylan (Beijing Male) | 6. Peter (Tianjin Male) | 7. Eric (Sichuan Male) | 8. Kiki (Cantonese Female) | 9. Ethan (Male)")
    voice_choice = input("Enter your choice (press Enter for Cherry): ").strip()
    voice = voice_map.get(voice_choice, "Cherry")
  return target_language, voice, audio_enabled

async def main():
  """Main program entry point"""
  print_banner()

  api_key = os.environ.get("DASHSCOPE_API_KEY")
  if not api_key:
    print("[ERROR] Set the DASHSCOPE_API_KEY environment variable.")
    print("  For example: export DASHSCOPE_API_KEY='YOUR_API_KEY'")
    return

  target_language, voice, audio_enabled = get_user_config()
  print("\nConfiguration complete:")
  print(f"  - Target language: {target_language}")
  if audio_enabled:
    print(f"  - Synthesized voice: {voice}")
  else:
    print("  - Output mode: Text Only")

  client = LiveTranslateClient(api_key=api_key, target_language=target_language, voice=voice, audio_enabled=audio_enabled)

  # Define the callback function
  def on_translation_text(text):
    print(text, end="", flush=True)

  try:
    print("Connecting to the translation service...")
    await client.connect()

    # Start audio playback based on the mode
    client.start_audio_player()

    print("\n" + "-" * 60)
    print("Connection successful! Speak into the microphone.")
    print("The program will translate your speech in real time and play the result. Press Ctrl+C to exit.")
    print("-" * 60 + "\n")

    # Run message handling and microphone recording concurrently
    message_handler = asyncio.create_task(client.handle_server_messages(on_translation_text))
    tasks = [message_handler]
    # Audio must be captured from the microphone for translation, regardless of whether audio output is enabled
    microphone_streamer = asyncio.create_task(client.start_microphone_streaming())
    tasks.append(microphone_streamer)

    await asyncio.gather(*tasks)

  except KeyboardInterrupt:
    print("\n\nUser interrupted. Exiting...")
  except Exception as e:
    print(f"\nA critical error occurred: {e}")
  finally:
    print("\nCleaning up resources...")
    await client.close()
    print("Program exited.")

if __name__ == "__main__":
  asyncio.run(main())
Run main.py and speak the sentences you want to translate into the microphone. The model outputs the translated audio and text in real time. The system automatically detects your speech and sends the audio to the server, so no manual action is required.

Request parameters

Configure the connection

qwen3-livetranslate-flash-realtime connects using the WebSocket protocol. The connection requires the following configuration items:
Configuration | Description
Endpoint | wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime
Query parameter | The query parameter is model. Set it to the name of the model you want to access. Example: ?model=qwen3-livetranslate-flash-realtime
Message header | Use Bearer Token for authentication: Authorization: Bearer $DASHSCOPE_API_KEY. DASHSCOPE_API_KEY is the API key that you request from Qwen Cloud.
Use the following Python sample code to establish a connection.
# pip install websocket-client
import json
import websocket
import os

API_KEY=os.getenv("DASHSCOPE_API_KEY")
API_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-livetranslate-flash-realtime"

headers = [
  "Authorization: Bearer " + API_KEY
]

def on_open(ws):
  print(f"Connected to server: {API_URL}")
def on_message(ws, message):
  data = json.loads(message)
  print("Received event:", json.dumps(data, indent=2))
def on_error(ws, error):
  print("Error:", error)

ws = websocket.WebSocketApp(
  API_URL,
  header=headers,
  on_open=on_open,
  on_message=on_message,
  on_error=on_error
)

ws.run_forever()

Set the language, output modality, and voice

Send the client event session.update (a sample event body is shown after the following list):
  • Language
    • Source language: Use the session.input_audio_transcription.language parameter.
      The default value is en (English).
    • Target language: Use the session.translation.language parameter.
      The default value is en (English).
    See Supported languages.
  • Output source language recognition results: Use the session.input_audio_transcription.model parameter. When you set the parameter to qwen3-asr-flash-realtime, the server returns the speech recognition result of the input audio (the original source language text) in addition to the translation. When this feature is enabled, the server returns the following events:
    • conversation.item.input_audio_transcription.text: Returns the recognition result as a stream.
    • conversation.item.input_audio_transcription.completed: Returns the final result after recognition is complete.
  • Output modality: Use the session.modalities parameter. Supported values are ["text"] (text only) and ["text","audio"] (text and audio).
  • Voice: Use the session.voice parameter. See Supported voices.
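For reference, the sketch below builds a session.update event that enables audio output with the Cherry voice, turns on source-language recognition for Chinese input, and translates into English. The field names mirror the configure_session method in livetranslate_client.py; treat it as a minimal example rather than an exhaustive configuration.
import json
import time

session_update = {
  "event_id": f"event_{int(time.time() * 1000)}",
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],   # use ["text"] for text-only output
    "voice": "Cherry",                 # see Supported voices
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm24",
    "input_audio_transcription": {     # optional: also return the source-language text
      "model": "qwen3-asr-flash-realtime",
      "language": "zh"                 # source language; the default is en
    },
    "translation": {"language": "en"}  # target language; see Supported languages
  }
}
# With an established connection `ws` (see the client above): await ws.send(json.dumps(session_update))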

Input audio and images

The client sends Base64-encoded audio and image data using the input_audio_buffer.append and input_image_buffer.append events. Audio input is required. Image input is optional.
Images can be from local files or a real-time video stream. The server automatically detects the start and end of the audio and triggers a model response.
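As a minimal sketch, the two append events can be built as follows. This mirrors the send_audio_chunk and send_image_frame methods in livetranslate_client.py; the byte strings are placeholders for real microphone audio and image frames.
import base64
import time

pcm_chunk = b"..."    # raw 16 kHz, 16-bit mono PCM captured from the microphone (placeholder)
image_bytes = b"..."  # an encoded image frame, for example a PNG or JPEG (placeholder)

audio_event = {
  "event_id": f"event_{int(time.time() * 1000)}",
  "type": "input_audio_buffer.append",
  "audio": base64.b64encode(pcm_chunk).decode()
}
image_event = {
  "event_id": f"event_{int(time.time() * 1000)}",
  "type": "input_image_buffer.append",
  "image": base64.b64encode(image_bytes).decode()
}
# Send each event over the open WebSocket connection as a JSON string, e.g. await ws.send(json.dumps(audio_event))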

Receive the model response

When the server detects the end of the audio, the model responds. The response format depends on the configured output modality.

Parse the response

The interaction flow for real-time speech translation follows the standard WebSocket event-driven model, where the server automatically detects the start and end of speech and responds.
Lifecycle | Client event | Server event
Session initialization | session.update (Session configuration) | session.created (Session created), session.updated (Session configuration updated)
User audio input | input_audio_buffer.append (Add audio to buffer), input_image_buffer.append (Add image to buffer) | None
Server audio output | None | response.created (Server starts generating the response), response.output_item.added (New output item in the response), response.content_part.added (New output content added to the assistant message), response.audio_transcript.text (Incrementally generated transcript text), response.audio.delta (Incrementally generated audio from the model), response.audio_transcript.done (Text transcription complete), response.audio.done (Audio generation complete), response.content_part.done (Streaming of text or audio content for the assistant message is complete), response.output_item.done (Streaming of the entire output item for the assistant message is complete), response.done (Response complete)
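Putting this lifecycle together, a minimal receive loop only needs to branch on a few event types. The sketch below uses the same websockets-based connection as the client above, collects the Base64 audio deltas, and prints the final translated text; the field names match those consumed by livetranslate_client.py.
import base64
import json

async def consume_events(ws):
  # Minimal event dispatch for a single translation session
  audio_frames = []
  async for message in ws:
    event = json.loads(message)
    event_type = event.get("type")
    if event_type == "response.audio.delta":
      # Incrementally generated audio, Base64-encoded 24 kHz PCM
      audio_frames.append(base64.b64decode(event.get("delta", "")))
    elif event_type == "response.audio_transcript.done":
      print("Translated text:", event.get("transcript", ""))
    elif event_type == "response.text.done":
      # Final translated text in text-only mode
      print("Translated text:", event.get("text", ""))
    elif event_type == "response.done":
      usage = event.get("response", {}).get("usage", {})
      print("Round complete. Token usage:", usage)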

Use images to improve translation accuracy

qwen3-livetranslate-flash-realtime can accept image input to assist with audio translation. This is useful for scenarios that involve homonyms or uncommon proper nouns. You can send a maximum of two images per second.
Download the following sample images to your local computer: medical mask.png and masquerade mask.png. Then save the following code in the same folder as livetranslate_client.py and run it. Say "What is mask?" into the microphone. When you input the medical mask image, the model translates the phrase as "What is a medical mask?". When you input the masquerade mask image, the model translates the phrase as "What is a masquerade mask?".
import os
import time
import asyncio
import contextlib

from livetranslate_client import LiveTranslateClient

IMAGE_PATH = "mask_medical.png"
# IMAGE_PATH = "mask_masquerade.png"

def print_banner():
  print("=" * 60)
  print("  Powered by Qwen qwen3-livetranslate-flash-realtime - Single-turn interaction example (mask)")
  print("=" * 60 + "\n")

async def stream_microphone_once(client: LiveTranslateClient, image_bytes: bytes):
  pa = client.pyaudio_instance
  stream = pa.open(
    format=client.input_format,
    channels=client.input_channels,
    rate=client.input_rate,
    input=True,
    frames_per_buffer=client.input_chunk,
  )
  print(f"[INFO] Recording started. Please speak...")
  loop = asyncio.get_event_loop()
  last_img_time = 0.0
  frame_interval = 0.5  # 2 fps
  try:
    while client.is_connected:
      data = await loop.run_in_executor(None, stream.read, client.input_chunk)
      await client.send_audio_chunk(data)

      # Append an image frame every 0.5 seconds
      now = time.time()
      if now - last_img_time >= frame_interval:
        await client.send_image_frame(image_bytes)
        last_img_time = now
  finally:
    stream.stop_stream()
    stream.close()

async def main():
  print_banner()
  api_key = os.environ.get("DASHSCOPE_API_KEY")
  if not api_key:
    print("[ERROR] First, configure the API KEY in the DASHSCOPE_API_KEY environment variable.")
    return

  client = LiveTranslateClient(api_key=api_key, target_language="zh", voice="Cherry", audio_enabled=True)

  def on_text(text: str):
    print(text, end="", flush=True)

  message_task = None
  try:
    await client.connect()
    client.start_audio_player()
    message_task = asyncio.create_task(client.handle_server_messages(on_text))
    with open(IMAGE_PATH, "rb") as f:
      img_bytes = f.read()
    await stream_microphone_once(client, img_bytes)
    await asyncio.sleep(15)
  finally:
    await client.close()
    # Only cancel the message task if it was actually created (connect() may have failed)
    if message_task and not message_task.done():
      message_task.cancel()
      with contextlib.suppress(asyncio.CancelledError):
        await message_task

if __name__ == "__main__":
  asyncio.run(main())

Billing

  • Audio: Each second of audio input or output consumes 12.5 tokens.
  • Image: Each 28 × 28-pixel block of image input consumes 0.5 tokens.
  • Text: If you enable the source language speech recognition feature, the service returns the speech recognition text of the input audio (the original source language text) in addition to the translation result. This recognition text is billed based on the standard token rate for output text.
For token pricing, see the Model list.
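As a rough illustration of how these rates combine, the sketch below estimates token consumption for a hypothetical 30-second exchange with periodic 640 × 480 image frames. It assumes tokens scale linearly with audio duration and pixel count and ignores any rounding the service may apply; rely on the Model list for authoritative pricing.
# Back-of-the-envelope token estimate (illustrative only; all values are hypothetical)
audio_seconds_in = 30                 # microphone audio sent to the model
audio_seconds_out = 30                # assume a similar amount of synthesized audio is returned
image_width, image_height = 640, 480  # hypothetical frame size
images_sent = 10                      # e.g. roughly one frame every 3 seconds

audio_tokens = (audio_seconds_in + audio_seconds_out) * 12.5                  # 12.5 tokens per second of audio
image_tokens = images_sent * (image_width * image_height) / (28 * 28) * 0.5   # 0.5 tokens per 28x28 pixel block

print(f"Estimated audio tokens: {audio_tokens:.0f}")   # 750
print(f"Estimated image tokens: {image_tokens:.0f}")   # about 1960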

Supported languages

The following language codes can be used for source and target languages. Some target languages support text output only.
Language code | Language | Supported output
en | English | Audio, text
zh | Chinese | Audio, text
ru | Russian | Audio, text
fr | French | Audio, text
de | German | Audio, text
pt | Portuguese | Audio, text
es | Spanish | Audio, text
it | Italian | Audio, text
id | Indonesian | Text
ko | Korean | Audio, text
ja | Japanese | Audio, text
vi | Vietnamese | Text
th | Thai | Text
ar | Arabic | Text
yue | Cantonese | Audio, text
hi | Hindi | Text
el | Greek | Text
tr | Turkish | Text

Supported voices

Set the voice parameter when the output includes synthesized audio.
Voice name | voice parameter | Description | Supported languages
Cherry | Cherry | A cheerful, friendly, and genuine young woman. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Nofish | Nofish | A designer who has difficulty pronouncing retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean
Shanghai-Jada | Jada | A bustling and energetic Shanghai lady. | Chinese
Beijing-Dylan | Dylan | A young man who grew up in the hutongs of Beijing. | Chinese
Sichuan-Sunny | Sunny | A sweet girl from Sichuan. | Chinese
Tianjin-Peter | Peter | A voice in the style of a Tianjin crosstalk performer (the supporting role). | Chinese
Cantonese-Kiki | Kiki | A sweet best friend from Hong Kong. | Cantonese
Sichuan-Eric | Eric | A man from Chengdu, Sichuan, who is unconventional and stands out from the crowd. | Chinese
Ethan | Ethan | Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean

Alternative: Use Qwen-Omni

You can also use Qwen-Omni (qwen3-omni-flash-realtime) with a translation prompt for real-time audio and video translation.
Qwen-Omni-Realtime streams bidirectionally over WebSocket; for the full API and SDK reference, see Realtime audio and video understanding.
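A minimal sketch of that approach is shown below. It assumes the Omni realtime session accepts an instructions field for the system prompt; confirm the exact session schema in Realtime audio and video understanding before relying on it.
import json
import time

# Connect to the realtime endpoint with ?model=qwen3-omni-flash-realtime, then send:
omni_session_update = {
  "event_id": f"event_{int(time.time() * 1000)}",
  "type": "session.update",
  "session": {
    "modalities": ["text", "audio"],
    "voice": "Cherry",
    # Assumed field: a system prompt that instructs the model to act as a simultaneous interpreter
    "instructions": "You are a simultaneous interpreter. Translate everything you hear into English and output nothing else."
  }
}
# await ws.send(json.dumps(omni_session_update))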

API reference