Model details
qwen3-livetranslate-flash-realtime is a multilingual, real-time audio and video translation model. It can recognize 18 languages and translate them into audio in 11 languages in real time.
Core features:
- Multilingual support: Supports 18 languages, such as Chinese, English, French, German, Russian, Japanese, and Korean, and 6 Chinese dialects, including Mandarin, Cantonese, and Sichuanese.
- Vision enhancement: Uses visual content to improve translation accuracy. The model analyzes lip movements, actions, and on-screen text to enhance translation in noisy environments or for words with multiple meanings.
- 3-second latency: Achieves simultaneous interpretation latency as low as 3 seconds.
- Lossless simultaneous interpretation: Uses semantic unit prediction technology to resolve word order issues between languages. The real-time translation quality is close to that of offline translation.
- Natural voice: Generates natural, human-like speech. The model automatically adjusts its tone and emotion based on the source audio.
| Model | Version | Context window | Max input | Max output |
|---|---|---|---|---|
| qwen3-livetranslate-flash-realtime (currently equivalent to qwen3-livetranslate-flash-realtime-2025-09-22) | Stable | 53,248 | 49,152 | 4,096 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | Snapshot | 53,248 | 49,152 | 4,096 |
Getting started
Prepare the environment
Your Python version must be 3.10 or later.
First, install pyaudio.
- macOS
- Debian/Ubuntu
- CentOS
- Windows
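PyAudio wraps the PortAudio C library, so on macOS and Linux you typically install PortAudio before the Python package. The commands below are common installation steps for each platform listed above; they are illustrative, not taken from this document:

```shell
# macOS (Homebrew): install the PortAudio library first
brew install portaudio
pip install pyaudio

# Debian/Ubuntu: install the PortAudio development headers
sudo apt-get install -y portaudio19-dev
pip install pyaudio

# CentOS: install the PortAudio development headers
sudo yum install -y portaudio-devel
pip install pyaudio

# Windows: prebuilt wheels bundle PortAudio
pip install pyaudio
```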
Create the client
Create a new Python file locally, name it livetranslate_client.py, and copy the following code into the file:
Client code - livetranslate_client.py
Interact with the model
In the same folder as livetranslate_client.py, create another Python file, name it main.py, and copy the following code into the file:
main.py
Run main.py and speak the sentences you want to translate into the microphone. The model outputs the translated audio and text in real time. The system automatically detects your speech and sends the audio to the server, so no manual action is required.
Request parameters
Configure the connection
qwen3-livetranslate-flash-realtime connects using the WebSocket protocol. The connection requires the following configuration items:
| Configuration | Description |
|---|---|
| Endpoint | wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
| Query parameter | The query parameter is model. Set it to the name of the model you want to access. Example: ?model=qwen3-livetranslate-flash-realtime |
| Message header | Use Bearer Token for authentication: Authorization: Bearer $DASHSCOPE_API_KEY. DASHSCOPE_API_KEY is the API key that you request from Qwen Cloud. |
WebSocket connection Python sample code
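The configuration table above can be sketched as follows. The helper builds the URL and authentication header; the commented-out connection step assumes the third-party `websocket-client` package (`pip install websocket-client`), which is one common way to open a WebSocket from Python:

```python
import os

ENDPOINT = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"

def build_connection_args(model: str, api_key: str):
    """Return the (url, headers) pair needed to open the Realtime WebSocket.

    The model name goes in the `model` query parameter, and the API key
    goes in the Authorization header as a Bearer token.
    """
    url = f"{ENDPOINT}?model={model}"
    headers = [f"Authorization: Bearer {api_key}"]
    return url, headers

# Opening the socket itself (requires `pip install websocket-client`):
# import websocket
# url, headers = build_connection_args(
#     "qwen3-livetranslate-flash-realtime", os.environ["DASHSCOPE_API_KEY"])
# ws = websocket.WebSocket()
# ws.connect(url, header=headers)
```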
Set the language, output modality, and voice
Send the client event session.update:
- Language
  - Source language: Use the session.input_audio_transcription.language parameter. The default value is en (English).
  - Target language: Use the session.translation.language parameter. The default value is en (English).
- Output source language recognition results: Use the session.input_audio_transcription.model parameter. When you set the parameter to qwen3-asr-flash-realtime, the server returns the speech recognition result of the input audio (the original source language text) in addition to the translation. When this feature is enabled, the server returns the following events:
  - conversation.item.input_audio_transcription.text: Returns the recognition result as a stream.
  - conversation.item.input_audio_transcription.completed: Returns the final result after recognition is complete.
- Output modality: Use the session.modalities parameter. Supported values are ["text"] (text only) and ["text","audio"] (text and audio).
- Voice: Use the session.voice parameter. See Supported voices.
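The settings above can be combined into one session.update payload. The nesting below follows the dotted parameter names in this section; the exact JSON schema is an assumption, so check the API reference for the authoritative shape:

```python
import json

def build_session_update(source_lang="en", target_lang="zh",
                         modalities=("text", "audio"), voice="Cherry",
                         asr_model=None):
    """Build a session.update event configuring languages, modality, and voice."""
    session = {
        "input_audio_transcription": {"language": source_lang},
        "translation": {"language": target_lang},
        "modalities": list(modalities),
        "voice": voice,
    }
    # Set asr_model="qwen3-asr-flash-realtime" to also receive the
    # source-language recognition text alongside the translation.
    if asr_model:
        session["input_audio_transcription"]["model"] = asr_model
    return {"type": "session.update", "session": session}

# Send it over the open WebSocket:
# ws.send(json.dumps(build_session_update(source_lang="en", target_lang="zh")))
```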
Input audio and images
The client sends Base64-encoded audio and image data using the input_audio_buffer.append and input_image_buffer.append events. Audio input is required. Image input is optional.
Images can be from local files or a real-time video stream. The server automatically detects the start and end of the audio and triggers a model response.
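The two append events can be sketched as small JSON messages carrying Base64 payloads. The field names ("audio", "image") are assumptions based on the event names in this section:

```python
import base64
import json

def audio_append_event(pcm_chunk: bytes) -> str:
    """Wrap a raw audio chunk in an input_audio_buffer.append event (required)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def image_append_event(image_bytes: bytes) -> str:
    """Wrap image data in an input_image_buffer.append event (optional)."""
    return json.dumps({
        "type": "input_image_buffer.append",
        "image": base64.b64encode(image_bytes).decode("ascii"),
    })

# Stream microphone chunks as they arrive; the server detects the start
# and end of speech automatically, so no commit event is needed here:
# ws.send(audio_append_event(mic_chunk))
```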
Receive the model response
When the server detects the end of the audio, the model responds. The response format depends on the configured output modality.
- Text-only output The server returns the complete translated text in a response.text.done event.
-
Text and audio output
- Text: The complete translated text is returned in a response.audio_transcript.done event.
- Audio: Incremental, Base64-encoded audio data is returned in response.audio.delta events.
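A minimal dispatcher for the events above might look like this. The payload field names ("delta", "transcript", "text") are assumptions inferred from the event names, not confirmed by this document:

```python
import base64
import json

def handle_event(raw: str, audio_out: bytearray):
    """Process one server event; return the final translated text when done.

    Accumulates decoded audio into audio_out for response.audio.delta
    events and returns the complete text for either output modality.
    """
    event = json.loads(raw)
    etype = event.get("type")
    if etype == "response.audio.delta":
        # Incremental Base64-encoded audio (text-and-audio mode).
        audio_out.extend(base64.b64decode(event["delta"]))
    elif etype == "response.audio_transcript.done":
        # Complete translated text (text-and-audio mode).
        return event.get("transcript")
    elif etype == "response.text.done":
        # Complete translated text (text-only mode).
        return event.get("text")
    return None
```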
Parse the response
The interaction flow for real-time speech translation follows the standard WebSocket event-driven model, where the server automatically detects the start and end of speech and responds.
| Lifecycle | Client event | Server event |
|---|---|---|
| Session initialization | session.update (Session configuration) | session.created (Session created), session.updated (Session configuration updated) |
| User audio input | input_audio_buffer.append (Add audio to buffer), input_image_buffer.append (Add image to buffer) | None |
| Server audio output | None | response.created (Server starts generating response), response.output_item.added (New output content in response), response.content_part.added (New output content added to assistant message), response.audio_transcript.text (Incrementally generated transcript text), response.audio.delta (Incrementally generated audio from the model), response.audio_transcript.done (Text transcription complete), response.audio.done (Audio generation complete), response.content_part.done (Streaming of text or audio content for the assistant message is complete), response.output_item.done (Streaming of the entire output item for the assistant message is complete), response.done (Response complete) |
Use images to improve translation accuracy
qwen3-livetranslate-flash-realtime can accept image input to assist with audio translation. This is useful for scenarios involving homonyms or recognizing uncommon proper nouns. You can send a maximum of two images per second.
Download the following sample images to your local computer: medical mask.png and masquerade mask.png
Download the following code to the same folder as livetranslate_client.py and run it. Say "What is mask?" into the microphone. When you input the medical mask image, the model translates the phrase as "What is a medical mask?". When you input the masquerade mask image, the model translates the phrase as "What is a masquerade mask?".
Billing
- Audio: Each second of audio input or output consumes 12.5 tokens.
- Image: Each 28×28 block of pixels in the input consumes 0.5 tokens.
- Text: If you enable the source language speech recognition feature, the service returns the speech recognition text of the input audio (the original source language text) in addition to the translation result. This recognition text is billed based on the standard token rate for output text.
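The audio and image rates above can be turned into a quick cost estimate (text output is billed separately at the standard token rate and is omitted here):

```python
def estimate_tokens(audio_seconds: float, image_pixels: int = 0) -> float:
    """Estimate audio and image token consumption from the published rates."""
    audio_tokens = audio_seconds * 12.5              # 12.5 tokens per second of audio
    image_tokens = (image_pixels / (28 * 28)) * 0.5  # 0.5 tokens per 28x28 pixel block
    return audio_tokens + image_tokens

# A 60-second clip: 60 * 12.5 = 750 tokens
# A 224x224 image: (224*224) / (28*28) * 0.5 = 32 tokens
```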
Supported languages
The following language codes can be used for source and target languages. Some target languages support text output only.
| Language code | Language | Supported output |
|---|---|---|
| en | English | Audio, text |
| zh | Chinese | Audio, text |
| ru | Russian | Audio, text |
| fr | French | Audio, text |
| de | German | Audio, text |
| pt | Portuguese | Audio, text |
| es | Spanish | Audio, text |
| it | Italian | Audio, text |
| id | Indonesian | Text |
| ko | Korean | Audio, text |
| ja | Japanese | Audio, text |
| vi | Vietnamese | Text |
| th | Thai | Text |
| ar | Arabic | Text |
| yue | Cantonese | Audio, text |
| hi | Hindi | Text |
| el | Greek | Text |
| tr | Turkish | Text |
Supported voices
Set the voice parameter when the output includes synthesized audio.
| Voice name | voice parameter | Description | Supported languages |
|---|---|---|---|
| Cherry | Cherry | A cheerful, friendly, and genuine young woman. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Nofish | Nofish | A designer who has difficulty pronouncing retroflex consonants. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
| Shanghai-Jada | Jada | A bustling and energetic Shanghai lady. | Chinese |
| Beijing-Dylan | Dylan | A young man who grew up in the hutongs of Beijing. | Chinese |
| Sichuan-Sunny | Sunny | A sweet girl from Sichuan. | Chinese |
| Tianjin-Peter | Peter | A voice in the style of a Tianjin crosstalk performer (the supporting role). | Chinese |
| Cantonese-Kiki | Kiki | A sweet best friend from Hong Kong. | Cantonese |
| Sichuan-Eric | Eric | A man from Chengdu, Sichuan, who is unconventional and stands out from the crowd. | Chinese |
| Ethan | Ethan | Standard Mandarin with a slight northern accent. Sunny, warm, energetic, and vibrant. | Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean |
Alternative: Use Qwen-Omni
You can also use Qwen-Omni (qwen3-omni-flash-realtime) with a translation prompt for real-time audio and video translation via WebSocket.
Qwen-Omni-Realtime uses WebSocket for bidirectional streaming. For the full API and SDK reference, see Realtime audio and video understanding.