Real-time speech synthesis

Stream TTS in real time

Qwen Cloud provides two families of real-time speech synthesis models: CosyVoice for streaming synthesis with SSML control, and Qwen-TTS-Realtime for real-time synthesis with instruction-based voice control, voice cloning, and voice design.

Core features

  • Generates high-fidelity speech in real time with multilingual support, including Chinese and English
  • Supports voice customization through Qwen-TTS-Realtime voice cloning and voice design
  • Supports streaming input and output with low first-packet latency, ideal for real-time conversational scenarios
  • Provides fine-grained control over speech rate, pitch, volume, and bitrate
  • Supports mainstream audio formats (PCM, WAV, MP3, Opus) with output sample rates up to 48 kHz
  • Supports instruction control, enabling natural language instructions to control vocal expressiveness

Supported models

  • CosyVoice: cosyvoice-v3-plus, cosyvoice-v3-flash

Getting started

Get an API key and set it as an environment variable. If you call the service through the SDK, install the DashScope SDK as well.
For more code examples, see GitHub.
  • Use system voices
For non-realtime synthesis (send complete text, receive full audio), see Non-realtime speech synthesis.
Convert LLM-generated text to speech in real time and play it through speakers
Play text from a Qwen model (qwen3.5-flash) as speech in real time on a local device.
Before you run the Python example, install PyAudio, a third-party audio playback library, using pip.
# coding=utf-8
# PyAudio installation instructions:
# macOS
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import os
from http import HTTPStatus

import pyaudio
import dashscope
from dashscope import Generation
from dashscope.audio.tts_v2 import AudioFormat, ResultCallback, SpeechSynthesizer

# If you have not configured environment variables, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# WebSocket endpoint for the speech synthesis service
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# cosyvoice-v3-flash/cosyvoice-v3-plus: Use voices such as longanyang.
# Each voice supports different languages. When synthesizing non-Chinese languages such as Japanese or Korean, select a voice that supports the corresponding language. For more information, see the CosyVoice voice list.
model = "cosyvoice-v3-flash"
voice = "longanyang"


class Callback(ResultCallback):
  _player = None
  _stream = None

  def on_open(self):
    print("websocket is open.")
    self._player = pyaudio.PyAudio()
    self._stream = self._player.open(
      format=pyaudio.paInt16, channels=1, rate=22050, output=True
    )

  def on_complete(self):
    print("speech synthesis task completed successfully.")

  def on_error(self, message: str):
    print(f"speech synthesis task failed, {message}")

  def on_close(self):
    print("websocket is closed.")
    # Stop the player; the stream exists only if on_open ran.
    if self._stream is not None:
      self._stream.stop_stream()
      self._stream.close()
    if self._player is not None:
      self._player.terminate()

  def on_event(self, message):
    print(f"recv speech synthesis message {message}")

  def on_data(self, data: bytes) -> None:
    print("audio result length:", len(data))
    self._stream.write(data)


def synthesizer_with_llm():
  callback = Callback()
  synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    format=AudioFormat.PCM_22050HZ_MONO_16BIT,
    callback=callback,
  )

  messages = [{"role": "user", "content": "Please introduce yourself"}]
  responses = Generation.call(
    model="qwen3.5-flash",
    messages=messages,
    result_format="message",  # set result format as 'message'
    stream=True,  # enable stream output
    incremental_output=True,  # enable incremental output
  )
  for response in responses:
    if response.status_code == HTTPStatus.OK:
      text = response.output.choices[0]["message"]["content"]
      # Skip empty increments so that no empty text is sent for synthesis.
      if text:
        print(text, end="", flush=True)
        synthesizer.streaming_call(text)
    else:
      print(
        "Request id: %s, Status code: %s, error code: %s, error message: %s"
        % (
          response.request_id,
          response.status_code,
          response.code,
          response.message,
        )
      )
  synthesizer.streaming_complete()
  print('requestId: ', synthesizer.get_last_request_id())


if __name__ == "__main__":
  synthesizer_with_llm()
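The example above plays raw PCM as it arrives and keeps nothing on disk. Raw PCM has no header, so most players cannot open it unless it is wrapped in a container. The following is a minimal sketch that uses only the Python standard library and assumes the 22,050 Hz, mono, 16-bit layout requested through AudioFormat.PCM_22050HZ_MONO_16BIT. The helper names pcm_duration_seconds and pcm_chunks_to_wav are illustrative and are not part of the DashScope SDK.

```python
import io
import wave

SAMPLE_RATE = 22050   # Hz, matches AudioFormat.PCM_22050HZ_MONO_16BIT
CHANNELS = 1          # mono
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)


def pcm_duration_seconds(num_bytes: int) -> float:
  """Return the playback time of raw PCM data of the given size."""
  return num_bytes / (SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH)


def pcm_chunks_to_wav(chunks, target) -> int:
  """Write PCM chunks to target (a path or a binary file object) as WAV.

  Returns the number of PCM bytes written.
  """
  total = 0
  with wave.open(target, "wb") as wav_file:
    wav_file.setnchannels(CHANNELS)
    wav_file.setsampwidth(SAMPLE_WIDTH)
    wav_file.setframerate(SAMPLE_RATE)
    for chunk in chunks:
      wav_file.writeframes(chunk)
      total += len(chunk)
  return total
```

One way to use it is to append each data argument received in on_data to a list, then call pcm_chunks_to_wav(chunks, "output.wav") after the task completes.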
Stream audio via callback
Send full text and receive audio data incrementally through a callback function. This approach is ideal for short text when you need low-latency audio output without blocking the main thread.
# coding=utf-8

import os
from datetime import datetime

import dashscope
from dashscope.audio.tts_v2 import ResultCallback, SpeechSynthesizer


def get_timestamp():
  """Return the current local time as a bracketed string with microseconds."""
  return datetime.now().strftime("[%Y-%m-%d %H:%M:%S.%f]")


# If you have not configured environment variables, replace the next line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# WebSocket endpoint for the speech synthesis service
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Model
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"


# Define callback interface
class Callback(ResultCallback):
  file = None

  def on_open(self):
    self.file = open("output.mp3", "wb")
    print("Connection established: " + get_timestamp())

  def on_complete(self):
    print("Speech synthesis completed, all results received: " + get_timestamp())
    # Call get_first_package_delay only after on_complete triggers.
    # The first request includes WebSocket connection setup time.
    # synthesizer is the module-level instance created after this class.
    print('[Metric] requestId: {}, first-package delay: {} ms'.format(
      synthesizer.get_last_request_id(),
      synthesizer.get_first_package_delay()))

  def on_error(self, message: str):
    print(f"Speech synthesis error: {message}")

  def on_close(self):
    print("Connection closed: " + get_timestamp())
    # The file exists only if on_open ran.
    if self.file is not None:
      self.file.close()

  def on_event(self, message):
    pass

  def on_data(self, data: bytes) -> None:
    print(get_timestamp() + " Binary audio length: " + str(len(data)))
    self.file.write(data)

callback = Callback()

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor
synthesizer = SpeechSynthesizer(
  model=model,
  voice=voice,
  callback=callback,
)

# Send text for synthesis and retrieve binary audio in real time through the on_data method of the callback interface
synthesizer.call("How is the weather today?")
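The example above reads first-packet latency from get_first_package_delay(). If you also want a client-side figure to compare against it, one option is to time the interval between sending the text and the first on_data call yourself. The following is a minimal sketch under that assumption. FirstPacketTimer is a hypothetical helper, not part of the DashScope SDK, and its clock is injectable so that the logic can be checked without a network connection.

```python
import time


class FirstPacketTimer:
  """Measure the time from request start to the first audio chunk."""

  def __init__(self, clock=time.monotonic):
    self._clock = clock
    self._started_at = None
    self._first_chunk_at = None

  def start(self) -> None:
    """Call immediately before sending text for synthesis."""
    self._started_at = self._clock()
    self._first_chunk_at = None

  def on_chunk(self) -> None:
    """Call from on_data; only the first call after start() is recorded."""
    if self._started_at is not None and self._first_chunk_at is None:
      self._first_chunk_at = self._clock()

  def delay_ms(self):
    """Return the first-packet delay in milliseconds, or None if unknown."""
    if self._started_at is None or self._first_chunk_at is None:
      return None
    return (self._first_chunk_at - self._started_at) * 1000.0
```

To use it, call timer.start() right before synthesizer.call(...), call timer.on_chunk() at the top of on_data, and read timer.delay_ms() in on_complete. As with the SDK metric, the first request on a new connection includes WebSocket setup time.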
Synthesize streaming text in real time
Send text fragments incrementally and receive audio in real time through callbacks. This bidirectional streaming approach is best for long text and for LLM integration, where text arrives in chunks.
# coding=utf-8
#
# PyAudio installation instructions:
# For macOS, run:
#   brew install portaudio
#   pip install pyaudio
# For Debian/Ubuntu, run:
#   sudo apt-get install python3-pyaudio
#   or
#   pip install pyaudio
# For CentOS, run:
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# For Microsoft Windows, run:
#   python -m pip install pyaudio

import os
import time
from datetime import datetime

import pyaudio
import dashscope
from dashscope.audio.tts_v2 import AudioFormat, ResultCallback, SpeechSynthesizer


def get_timestamp():
  """Return the current local time as a bracketed string with microseconds."""
  return datetime.now().strftime("[%Y-%m-%d %H:%M:%S.%f]")


# If you have not configured environment variables, replace the next line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# WebSocket endpoint for the speech synthesis service
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Model
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"


# Define callback interface
class Callback(ResultCallback):
  _player = None
  _stream = None

  def on_open(self):
    print("Connection established: " + get_timestamp())
    self._player = pyaudio.PyAudio()
    self._stream = self._player.open(
      format=pyaudio.paInt16, channels=1, rate=22050, output=True
    )

  def on_complete(self):
    print("Speech synthesis completed, all results received: " + get_timestamp())

  def on_error(self, message: str):
    print(f"Speech synthesis error: {message}")

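  def on_close(self):
    print("Connection closed: " + get_timestamp())
    # Stop the player; the stream exists only if on_open ran.
    if self._stream is not None:
      self._stream.stop_stream()
      self._stream.close()
    if self._player is not None:
      self._player.terminate()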

  def on_event(self, message):
    pass

  def on_data(self, data: bytes) -> None:
    print(get_timestamp() + " Binary audio length: " + str(len(data)))
    self._stream.write(data)


callback = Callback()

test_text = [
  "Streaming text-to-speech SDK,",
  "converts input text",
  "into binary audio data.",
  "Compared with non-streaming speech synthesis,",
  "streaming synthesis offers better real-time performance.",
  "Users hear near-synchronous audio output while typing,",
  "greatly improving interaction experience",
  "and reducing wait time.",
  "Ideal for large language model (LLM) integration,",
  "where text is streamed for speech synthesis.",
]

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor
synthesizer = SpeechSynthesizer(
  model=model,
  voice=voice,
  format=AudioFormat.PCM_22050HZ_MONO_16BIT,
  callback=callback,
)


# Stream text for synthesis. Retrieve binary audio in real time through the on_data method of the callback interface
for text in test_text:
  synthesizer.streaming_call(text)
  time.sleep(0.1)
# End streaming speech synthesis
synthesizer.streaming_complete()

# First request includes WebSocket connection setup time
print('[Metric] requestId: {}, first-package delay: {} ms'.format(
  synthesizer.get_last_request_id(),
  synthesizer.get_first_package_delay()))
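When the text comes from an LLM, fragments are often only a few characters long. One optional pattern is to group them into sentence-sized pieces before passing them to streaming_call, which reduces the number of calls. The following is a minimal sketch of such a buffer. SentenceBuffer, the punctuation set, and the 200-character cap are assumptions made for illustration; they are not part of the DashScope SDK, and the service does not require this step.

```python
SENTENCE_ENDINGS = ".!?;。！？；\n"


class SentenceBuffer:
  """Accumulate streamed text and release it at sentence boundaries."""

  def __init__(self, max_chars: int = 200):
    self._pending = ""
    self._max_chars = max_chars

  def feed(self, fragment: str):
    """Add a fragment and return a list of complete pieces ready to send."""
    self._pending += fragment
    ready = []
    start = 0
    for index, char in enumerate(self._pending):
      # Cut after sentence punctuation, or when a piece reaches the cap.
      if char in SENTENCE_ENDINGS or index - start + 1 >= self._max_chars:
        ready.append(self._pending[start:index + 1])
        start = index + 1
    self._pending = self._pending[start:]
    return ready

  def flush(self) -> str:
    """Return whatever is left; call once the text stream has ended."""
    rest, self._pending = self._pending, ""
    return rest
```

In the loop above, this would replace the direct call with `for piece in buffer.feed(text): synthesizer.streaming_call(piece)`, followed by sending any non-empty buffer.flush() result before streaming_complete(). Note that this simple rule also splits at the period in a decimal number such as 1.5.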

Interaction flow

CosyVoice uses a WebSocket-based streaming protocol. For protocol details, see the CosyVoice WebSocket API reference.

Instruction control

CosyVoice instruction control is available only with cosyvoice-v3-flash. Use SSML for fine-grained pronunciation and prosody control with other CosyVoice models.
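When you build SSML from user-supplied text, the text must be escaped so that characters such as & and < do not break the markup. The following is a minimal sketch, assuming the `<speak>` root element and the `<break time="...">` element defined by the W3C SSML specification. Which elements and attributes CosyVoice accepts is not stated on this page, so check the CosyVoice SSML documentation before relying on them. build_ssml is a hypothetical helper, not an SDK function.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences, pause_ms: int = 300) -> str:
  """Join sentences into one SSML document with a pause between them."""
  if pause_ms < 0:
    raise ValueError("pause_ms must not be negative")
  pause = f'<break time="{pause_ms}ms"/>'
  # Escape &, <, and > so that user text cannot break the markup.
  body = pause.join(escape(sentence) for sentence in sentences)
  return f"<speak>{body}</speak>"
```

For example, build_ssml(["A & B", "C < D"], 500) returns `<speak>A &amp; B<break time="500ms"/>C &lt; D</speak>`.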

Voice customization


Voice cloning: Input audio formats

High-quality input audio is the foundation for achieving excellent cloning results.
| Item | Requirements |
| --- | --- |
| Supported formats | WAV (16-bit), MP3, M4A |
| Audio duration | Recommended: 10 to 20 seconds. Maximum: 60 seconds. |
| File size | ≤ 10 MB |
| Sample rate | ≥ 16 kHz |
| Sound channel | Mono or stereo. For stereo audio, only the first channel is processed. Make sure that the first channel contains a clear human voice. |
| Content | The audio must contain at least 5 seconds of continuous, clear speech without background sound. Pauses in the rest of the audio must be short (≤ 2 seconds). The entire recording should be free of background music, noise, and other voices. Use normal spoken audio as input. Do not upload songs or singing, because they reduce the accuracy and usability of the cloned voice. |
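Several of these requirements can be checked locally before you upload a sample. The following is a minimal sketch for WAV input that uses only the Python standard library. It assumes that "10 MB" means 10 × 1024 × 1024 bytes, which this page does not specify. It does not decode MP3 or M4A files, and it cannot judge the content requirements such as speech clarity or background noise. check_wav_for_cloning is a hypothetical helper, not an SDK function.

```python
import os
import wave

MAX_FILE_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB
MIN_SAMPLE_RATE = 16000            # Hz
MAX_DURATION_S = 60.0
RECOMMENDED_RANGE_S = (10.0, 20.0)


def check_wav_for_cloning(path: str):
  """Return a list of problems found in a WAV file; an empty list means none."""
  problems = []
  if os.path.getsize(path) > MAX_FILE_BYTES:
    problems.append("file is larger than 10 MB")
  with wave.open(path, "rb") as wav_file:
    sample_rate = wav_file.getframerate()
    duration = wav_file.getnframes() / sample_rate
    if wav_file.getsampwidth() != 2:
      problems.append("WAV data is not 16-bit")
    if sample_rate < MIN_SAMPLE_RATE:
      problems.append("sample rate is below 16 kHz")
    if wav_file.getnchannels() not in (1, 2):
      problems.append("audio is neither mono nor stereo")
  if duration > MAX_DURATION_S:
    problems.append("audio is longer than 60 seconds")
  elif not RECOMMENDED_RANGE_S[0] <= duration <= RECOMMENDED_RANGE_S[1]:
    problems.append("duration is outside the recommended 10 to 20 seconds")
  return problems
```

The duration message for clips shorter than 10 seconds or between 20 and 60 seconds is advisory, because the table marks that range as a recommendation rather than a limit.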

Voice design: Write high-quality voice descriptions

Limitations

When writing voice descriptions (voice_prompt), follow these technical constraints:
  • Length limit: The content of voice_prompt must not exceed 500 characters.
  • Supported languages: The description text supports only Chinese and English.
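Both constraints can be approximated with a local check before the request is sent. The following is a minimal sketch. It assumes that the 500-character limit counts Unicode code points, as Python's len() does, and it treats any letter outside ASCII and the basic CJK Unified Ideographs range as unsupported, which is only a rough stand-in for "Chinese and English". check_voice_prompt is a hypothetical helper, not an SDK function.

```python
MAX_PROMPT_CHARS = 500


def check_voice_prompt(voice_prompt: str):
  """Return a list of problems with a voice description; empty means none."""
  problems = []
  text = voice_prompt.strip()
  if not text:
    problems.append("description is empty")
  if len(text) > MAX_PROMPT_CHARS:
    problems.append(f"description has {len(text)} characters; the limit is 500")
  if text and not any(char.isalpha() for char in text):
    problems.append("description contains no words")
  # Heuristic: letters must be ASCII or basic CJK Unified Ideographs.
  unsupported = sorted(
    {c for c in text if c.isalpha() and not (c.isascii() or "\u4e00" <= c <= "\u9fff")}
  )
  if unsupported:
    problems.append("letters outside English and Chinese: " + "".join(unsupported))
  return problems
```

The server remains the authority on what is accepted; a check like this only catches obvious mistakes early.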

Core principles

The voice_prompt guides the model to generate voices with specific characteristics. Follow these core principles when describing voices:
  • Be specific, not vague: Use words that describe concrete sound qualities, such as "deep," "crisp," or "fast-paced." Avoid subjective, uninformative terms such as "nice-sounding" or "ordinary."
  • Be multidimensional, not single-dimensional: Excellent descriptions typically combine multiple dimensions, such as gender, age, and emotion. Single-dimensional descriptions, such as "female voice," are too broad to generate distinctive voices.
  • Be objective, not subjective: Focus on the physical and perceptual characteristics of the sound itself, not your personal preferences. For example, use "high-pitched with energetic delivery" instead of "my favorite voice."
  • Be original, not imitative: Describe sound characteristics rather than requesting imitation of specific individuals, such as celebrities or actors. Such requests pose copyright risks, and the model does not support direct imitation.
  • Be concise, not redundant: Ensure every word adds meaning. Avoid repeating synonyms or using meaningless intensifiers, such as "very very nice voice."

Dimension example

| Dimension | Example |
| --- | --- |
| Gender | Male, female, neutral |
| Age | Child (5-12 years), teenager (13-18 years), young adult (19-35 years), middle-aged (36-55 years), senior (55+ years) |
| Pitch | High, medium, low, slightly high, slightly low |
| Speech rate | Fast, medium, slow, slightly fast, slightly slow |
| Emotion | Cheerful, calm, gentle, serious, lively, cool, soothing |
| Characteristics | Magnetic, crisp, raspy, mellow, sweet, rich, powerful |
| Purpose | News broadcasting, advertisement voice-over, audiobooks, animated characters, voice assistants, documentary narration |
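One way to keep descriptions multidimensional is to fill in the dimensions as separate fields and generate the voice_prompt from them. The following is a minimal sketch of that idea. The VoiceProfile class and the sentence template are illustrative assumptions; the model accepts free-form text and does not require this structure.

```python
from dataclasses import dataclass


@dataclass
class VoiceProfile:
  gender: str
  age: str
  pitch: str = ""
  speech_rate: str = ""
  emotion: str = ""
  characteristics: str = ""
  purpose: str = ""

  def to_prompt(self) -> str:
    """Combine the filled-in dimensions into one description."""
    head = " ".join(
      word for word in (self.emotion, self.age, self.gender, "voice") if word
    )
    details = []
    if self.pitch:
      details.append(f"{self.pitch} pitch")
    if self.speech_rate:
      details.append(f"{self.speech_rate} speech rate")
    if self.characteristics:
      details.append(f"{self.characteristics} voice quality")
    if self.purpose:
      details.append(f"suitable for {self.purpose}")
    prompt = ", ".join([head] + details) + "."
    # Capitalize only the first character and leave the rest unchanged.
    return prompt[0].upper() + prompt[1:]
```

For example, a profile with gender "male", age "middle-aged", speech rate "slow", emotion "calm", characteristics "deep and magnetic", and purpose "documentary narration" produces "Calm middle-aged male voice, slow speech rate, deep and magnetic voice quality, suitable for documentary narration."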

Example comparison

Good cases:
  • "Young and lively female voice, fast speech rate with noticeable rising intonation, suitable for introducing fashion products."
    • Analysis: This description combines age, personality, speech rate, and intonation, and specifies the use case, creating a clear voice profile.
  • "Calm middle-aged male, slow speech rate, deep and magnetic voice quality, suitable for reading news or documentary narration."
    • Analysis: This description clearly defines gender, age range, speech rate, voice quality, and intended use.
  • "Cute child's voice, approximately 8-year-old girl, slightly childish speech, suitable for animated character dubbing."
    • Analysis: This description pinpoints the specific age and voice quality (childishness) and has a clear purpose.
  • "Gentle and intellectual female, around 30 years old, calm tone, suitable for audiobook narration."
    • Analysis: This description effectively conveys voice emotion and style through terms such as "intellectual" and "calm."
Bad cases and suggestions:
| Bad case | Main issue | Improvement suggestion |
| --- | --- | --- |
| 'Nice-sounding voice' | This description is too vague and subjective, and lacks actionable detail. | Add specific dimensions, such as "Clear-toned young female voice with gentle intonation." |
| 'Voice like a celebrity' | This poses a copyright risk. The model does not support direct imitation. | Extract the voice characteristics for the description, such as "Mature, magnetic, steady-paced male voice." |
| 'Very very very nice female voice' | This description is redundant. Repeating words does not help define the voice. | Remove repetitions and add effective descriptions, such as "A 20- to 24-year-old female voice with a light, cheerful tone, lively pitch, and sweet quality." |
| 123456 | This is an invalid input. It cannot be parsed as voice characteristics. | Provide a meaningful text description. For more information, see the recommended examples above. |
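Some of the weaknesses in the table can be flagged automatically before a description is submitted. The following is a minimal sketch of such a check for English descriptions only. The word list, the four-word threshold, and the function name lint_voice_prompt are assumptions chosen for illustration, not rules that the model applies.

```python
import re

VAGUE_TERMS = {"nice", "nice-sounding", "good", "ordinary", "favorite"}


def lint_voice_prompt(voice_prompt: str):
  """Return warnings for patterns that the comparison table marks as weak."""
  warnings = []
  # Words start with a letter and may contain hyphens or apostrophes.
  words = re.findall(r"[a-z][a-z'-]*", voice_prompt.lower())
  if not words:
    warnings.append("no descriptive words found")
    return warnings
  if any(first == second for first, second in zip(words, words[1:])):
    warnings.append("repeated word")
  vague = sorted(set(words) & VAGUE_TERMS)
  if vague:
    warnings.append("vague term: " + ", ".join(vague))
  if len(set(words)) < 4:
    warnings.append("too few distinct words to cover several dimensions")
  return warnings
```

A description that passes these checks is not necessarily good; the check only catches the vague, redundant, and unparseable patterns listed above. Requests to imitate a specific person still need a human review.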

API reference

System voices


FAQ

If a character or word is pronounced incorrectly, try the following:
  • Replace characters that have multiple pronunciations with homophones to quickly resolve pronunciation issues.
  • Use the Speech Synthesis Markup Language (SSML) to control pronunciation.