CosyVoice Python reference
User guide: For model overviews and voice selection, see Text-to-speech models.
For more examples, see GitHub.
Prerequisites
- Sign in to Qwen Cloud and create an API key. To avoid security risks, export the API key as an environment variable instead of hard-coding it.
For temporary access by third-party apps or users, or to control high-risk operations such as accessing or deleting sensitive data, use a temporary authentication token. Temporary tokens expire in 60 seconds, reducing leakage risk compared to long-term API keys. Replace the API key in your authentication code with the temporary token.
Models and pricing
See Text-to-speech models.
Text and format limits
Text length limits
- Non-streaming and unidirectional streaming: Maximum of 20,000 characters per request.
- Bidirectional streaming: Maximum of 20,000 characters per request and 200,000 cumulative across all requests.
Character counting rules
- Chinese characters (simplified/traditional, Japanese Kanji, Korean Hanja) count as 2. All other characters (punctuation, letters, numbers, Kana, Hangul) count as 1.
- SSML tags are excluded from the character count.
- Examples:
- "你好" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
- "中A文123" → 2 (Chinese character) + 1 (A) + 2 (Chinese character) + 1 (1) + 1 (2) + 1 (3) = 8 characters
- "中文。" → 2 (Chinese character) + 2 (Chinese character) + 1 (。) = 5 characters
- "中 文。" → 2 (Chinese character) + 1 (space) + 2 (Chinese character) + 1 (。) = 6 characters
- "<speak>你好</speak>" → 2 (Chinese character) + 2 (Chinese character) = 4 characters (SSML tags are excluded)
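The counting rules above can be sketched in Python. This is a hedged approximation: it strips SSML-style tags and treats characters in the CJK Unified Ideographs ranges as 2, everything else as 1, which matches the documented examples but may not cover every edge case the service bills for.

```python
import re

# SSML tags are excluded from the character count, so strip them first.
SSML_TAG = re.compile(r"<[^>]+>")

def billed_characters(text: str) -> int:
    """Approximate the documented rules: Chinese characters (and Japanese
    Kanji / Korean Hanja, which share the CJK ideograph ranges) count as 2;
    all other characters (punctuation, letters, digits, Kana, Hangul,
    spaces) count as 1."""
    count = 0
    for ch in SSML_TAG.sub("", text):
        # CJK Unified Ideographs Extension A (U+3400) through the main
        # CJK block (U+9FFF) covers the "counts as 2" set used here.
        if "\u3400" <= ch <= "\u9fff":
            count += 2
        else:
            count += 1
    return count

print(billed_characters("<speak>你好</speak>"))  # → 4
```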
Encoding format
Use UTF-8 encoding.
Math expression support
Math expression parsing (basic operations, algebra, geometry) is available for cosyvoice-v3-flash and cosyvoice-v3-plus. This feature supports primary and secondary school level expressions.
This feature only supports Chinese.
SSML support
SSML is available for custom voices (voice design/cloning) using cosyvoice-v3-flash and v3-plus, plus system voices marked as supported in the voice list.
- Requires DashScope SDK 1.23.4 or later.
- Supported methods: Non-streaming and unidirectional streaming (call method only). Bidirectional streaming (streaming_call) is not supported.
- Pass SSML text to the call method as with regular text.
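Since SSML text goes through the same call method as plain text, a minimal sketch looks like the following. The voice name "longxiaochun" is a placeholder; use a voice marked as supporting SSML in the voice list, and assume `pip install dashscope` (1.23.4+) with DASHSCOPE_API_KEY exported.

```python
import os

def synthesize_ssml(out_path: str = "ssml.mp3") -> None:
    # Import inside the function so this sketch only needs the dashscope
    # package when actually run.
    from dashscope.audio.tts_v2 import SpeechSynthesizer

    # Placeholder voice name; pick an SSML-capable voice from the voice list.
    synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash", voice="longxiaochun")
    # SSML is passed to call exactly like regular text.
    audio = synthesizer.call("<speak>你好,欢迎使用语音合成服务。</speak>")
    with open(out_path, "wb") as f:
        f.write(audio)

if os.getenv("DASHSCOPE_API_KEY"):  # only run when credentials are configured
    synthesize_ssml()
```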
Getting started
The SpeechSynthesizer class supports three call methods:
- Non-streaming: Blocking call. Sends full text, returns complete audio. Best for short text.
- Unidirectional streaming: Non-blocking. Sends full text, receives audio via callback. Best for short text with low latency.
- Bidirectional streaming: Non-blocking. Sends text fragments incrementally, receives audio in real time via callback. Best for long text with low latency.
Non-streaming
Send full text at once without a callback. Returns complete audio in one response.
Instantiate SpeechSynthesizer with request parameters, then call the call method to get binary audio.
Maximum 20,000 characters (see SpeechSynthesizer call method).
Re-initialize the SpeechSynthesizer instance before each call.
View full example
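A minimal non-streaming sketch under the stated constraints. The voice name "longxiaochun" is a placeholder (pick one from the voice list); it assumes `pip install dashscope` and DASHSCOPE_API_KEY exported as an environment variable.

```python
import os

def synthesize_to_file(text: str, out_path: str = "output.mp3") -> None:
    # Import inside the function so this sketch only needs the dashscope
    # package when actually run.
    from dashscope.audio.tts_v2 import SpeechSynthesizer

    # Re-initialize SpeechSynthesizer before each call, as required above.
    synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash", voice="longxiaochun")
    # With no callback set, call blocks and returns complete binary audio.
    audio = synthesizer.call(text)
    with open(out_path, "wb") as f:
        f.write(audio)
    print("request id:", synthesizer.get_last_request_id())

if os.getenv("DASHSCOPE_API_KEY"):  # only run when credentials are configured
    synthesize_to_file("今天天气怎么样?")
```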
Unidirectional streaming
Send full text, stream audio via ResultCallback. Receive results in real time.
Instantiate SpeechSynthesizer with request parameters and a callback (ResultCallback), then call the call method. Receive results via the on_data callback.
Maximum 20,000 characters (see SpeechSynthesizer call method).
Re-initialize the SpeechSynthesizer instance before each call.
View full example
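A unidirectional-streaming sketch: the full text is sent once and audio chunks arrive through on_data. Same assumptions as before (`pip install dashscope`, DASHSCOPE_API_KEY exported, "longxiaochun" as a placeholder voice).

```python
import os

def stream_to_file(text: str, out_path: str = "output.mp3") -> None:
    # Import inside the function so this sketch only needs the dashscope
    # package when actually run.
    from dashscope.audio.tts_v2 import ResultCallback, SpeechSynthesizer

    class SaveCallback(ResultCallback):
        def on_open(self):
            # Write every chunk to one file; for WAV/MP3 only the first
            # frame carries header information.
            self.file = open(out_path, "wb")

        def on_data(self, data: bytes):
            self.file.write(data)  # binary audio chunk as it arrives

        def on_complete(self):
            self.file.close()

        def on_error(self, message):
            print("synthesis failed:", message)

    synthesizer = SpeechSynthesizer(
        model="cosyvoice-v3-flash",
        voice="longxiaochun",  # placeholder; pick from the voice list
        callback=SaveCallback(),
    )
    synthesizer.call(text)  # returns None immediately; audio arrives via on_data

if os.getenv("DASHSCOPE_API_KEY"):  # only run when credentials are configured
    stream_to_file("今天天气怎么样?")
```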
Bidirectional streaming
Submit text in multiple parts within a single task and receive results in real time through callbacks.
- Call streaming_call multiple times to submit text fragments in order. The server auto-splits the text into sentences:
  - Complete sentences: synthesized immediately.
  - Incomplete sentences: cached until complete, then synthesized. Call streaming_complete to force synthesis of all remaining fragments.
- Fragment submission interval: max 23 seconds (fixed server timeout, non-configurable). Exceeding this throws a "request timeout after 23 seconds" error. Call streaming_complete promptly when done.
1. Instantiate SpeechSynthesizer: Instantiate SpeechSynthesizer with request parameters and the callback (ResultCallback).
2. Stream text: Call streaming_call multiple times to submit text fragments. The server returns audio in real time via the on_data callback. Each fragment must not exceed 20,000 characters, and the cumulative total must not exceed 200,000 characters.
3. Finish: Call streaming_complete to end the synthesis. This blocks until on_complete or on_error triggers. Always call this method; otherwise, trailing text may not convert to speech.
View full example
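The three steps above can be sketched as follows. Same assumptions as the earlier sketches (`pip install dashscope`, DASHSCOPE_API_KEY exported, "longxiaochun" as a placeholder voice).

```python
import os

def streaming_synthesis(fragments, out_path: str = "output.mp3") -> None:
    # Import inside the function so this sketch only needs the dashscope
    # package when actually run.
    from dashscope.audio.tts_v2 import ResultCallback, SpeechSynthesizer

    class SaveCallback(ResultCallback):
        def on_open(self):
            self.file = open(out_path, "wb")

        def on_data(self, data: bytes):
            self.file.write(data)  # real-time audio chunks

        def on_complete(self):
            self.file.close()

        def on_error(self, message):
            print("synthesis failed:", message)

    # Step 1: instantiate with request parameters and the callback.
    synthesizer = SpeechSynthesizer(
        model="cosyvoice-v3-flash",
        voice="longxiaochun",  # placeholder; pick from the voice list
        callback=SaveCallback(),
    )
    # Step 2: submit fragments in order, less than 23 s apart.
    for fragment in fragments:
        synthesizer.streaming_call(fragment)
    # Step 3: always finish; blocks until on_complete or on_error fires.
    synthesizer.streaming_complete()

if os.getenv("DASHSCOPE_API_KEY"):  # only run when credentials are configured
    streaming_synthesis(["今天天气", "怎么样?"])
```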
Request parameters
Set parameters via the SpeechSynthesizer constructor.
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | The model for text-to-speech. See Voice list for all options. |
| voice | str | Yes | The voice for synthesis. See Voice list for available system voices. |
| format | enum | No | Audio format and sample rate. Default: MP3 at 22.05 kHz. Note: The default rate is optimal for the selected voice. Downsampling and upsampling are supported. All models support WAV, MP3, and PCM at 8/16/22.05/24/44.1/48 kHz. OPUS (OGG_OPUS) at 8/16/24/48 kHz with configurable bitrate (requires SDK 1.24.0+). See format reference. |
| volume | int | No | Volume. Default: 50. Range: [0, 100]. Scales linearly (0 = silent, 50 = default, 100 = max). Important: SDK 1.20.10 and later: field name is volume. |
| speech_rate | float | No | Speech rate. Default: 1.0. Range: [0.5, 2.0]. Values below 1.0 slow down speech; values above 1.0 speed it up. |
| pitch_rate | float | No | Pitch multiplier. The relationship with perceived pitch is non-linear; test to find a suitable value. Default: 1.0. Range: [0.5, 2.0]. >1.0 = higher pitch, <1.0 = lower pitch. |
| bit_rate | int | No | Audio bitrate in kbps. For Opus format, adjust with bit_rate. Default: 32. Range: [6, 510]. Set via additional_params (see example below). |
| word_timestamp_enabled | bool | No | Enable word-level timestamps. Default: False. Supports system voices marked as supported in the voice list. Timestamps are available only through the callback interface. Set via additional_params (see example below). |
| seed | int | No | Random seed for generation. Different seeds produce different results. With identical model, text, voice, and other parameters, the same seed reproduces the same output. Default: 0. Range: [0, 65535]. |
| language_hints | list[str] | No | Target language. Valid values: zh, en, fr, de, ja, ko, ru, pt, th, id, vi. Array parameter, but only the first element is used. |
| instruction | str | No | Controls dialect, emotion, or speaking style. Available only for system voices marked as supporting Instruct in the voice list. Max length: 100 characters. See instruction examples. |
| enable_aigc_tag | bool | No | Add an invisible AIGC identifier to generated audio. When True, the identifier is embedded in supported formats (WAV, MP3, OPUS). Default: False. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. Set via additional_params (see example below). |
| aigc_propagator | str | No | Set the ContentPropagator field in the AIGC identifier. Only effective when enable_aigc_tag is True. Default: UID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. Set via additional_params. |
| aigc_propagate_id | str | No | Set the PropagateID field in the AIGC identifier. Only effective when enable_aigc_tag is True. Default: the request ID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. Set via additional_params. |
| hot_fix | dict | No | Text hotpatching. Customize pronunciation of specific words or replace text before synthesis. Available only for cosyvoice-v3-flash. See hot_fix example. |
| enable_markdown_filter | bool | No | Enable Markdown filtering. When enabled, Markdown symbols are removed from input text before synthesis. Available only for cosyvoice-v3-flash. Default: False. Set via additional_params. |
| callback | ResultCallback | No | Callback interface (ResultCallback). |
additional_params examples
The following parameters are set through the additional_params argument rather than as top-level constructor arguments:
- bit_rate
- word_timestamp_enabled (view full example code for word timestamps)
- enable_aigc_tag
- aigc_propagator
- aigc_propagate_id
- enable_markdown_filter
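A hedged sketch of passing these values via additional_params, under the same assumptions as the earlier sketches (`pip install dashscope`, DASHSCOPE_API_KEY exported, "longxiaochun" as a placeholder voice; the exact accepted keys are those listed in the parameter table above).

```python
import os

def synthesize_with_extras(text: str) -> bytes:
    # Import inside the function so this sketch only needs the dashscope
    # package when actually run.
    from dashscope.audio.tts_v2 import AudioFormat, SpeechSynthesizer

    synthesizer = SpeechSynthesizer(
        model="cosyvoice-v3-flash",
        voice="longxiaochun",  # placeholder; pick from the voice list
        format=AudioFormat.OGG_OPUS_48KHZ_MONO_32KBPS,  # Opus supports bit_rate
        additional_params={
            "bit_rate": 64,                  # Opus bitrate in kbps, range [6, 510]
            "word_timestamp_enabled": True,  # timestamps arrive via the callback
            "enable_aigc_tag": True,         # embed the invisible AIGC identifier
        },
    )
    return synthesizer.call(text)

if os.getenv("DASHSCOPE_API_KEY"):  # only run when credentials are configured
    audio = synthesize_with_extras("今天天气怎么样?")
```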
hot_fix example
Instruction examples
- cosyvoice-v3-flash (cloned voices)
- cosyvoice-v3-flash (system voices)
Use any natural language instruction to control synthesis effects.
Format reference
All models support the following formats and sample rates:
- AudioFormat.WAV_8000HZ_MONO_16BIT: WAV format, 8 kHz sample rate
- AudioFormat.WAV_16000HZ_MONO_16BIT: WAV format, 16 kHz sample rate
- AudioFormat.WAV_22050HZ_MONO_16BIT: WAV format, 22.05 kHz sample rate
- AudioFormat.WAV_24000HZ_MONO_16BIT: WAV format, 24 kHz sample rate
- AudioFormat.WAV_44100HZ_MONO_16BIT: WAV format, 44.1 kHz sample rate
- AudioFormat.WAV_48000HZ_MONO_16BIT: WAV format, 48 kHz sample rate
- AudioFormat.MP3_8000HZ_MONO_128KBPS: MP3 format, 8 kHz sample rate
- AudioFormat.MP3_16000HZ_MONO_128KBPS: MP3 format, 16 kHz sample rate
- AudioFormat.MP3_22050HZ_MONO_256KBPS: MP3 format, 22.05 kHz sample rate
- AudioFormat.MP3_24000HZ_MONO_256KBPS: MP3 format, 24 kHz sample rate
- AudioFormat.MP3_44100HZ_MONO_256KBPS: MP3 format, 44.1 kHz sample rate
- AudioFormat.MP3_48000HZ_MONO_256KBPS: MP3 format, 48 kHz sample rate
- AudioFormat.PCM_8000HZ_MONO_16BIT: PCM format, 8 kHz sample rate
- AudioFormat.PCM_16000HZ_MONO_16BIT: PCM format, 16 kHz sample rate
- AudioFormat.PCM_22050HZ_MONO_16BIT: PCM format, 22.05 kHz sample rate
- AudioFormat.PCM_24000HZ_MONO_16BIT: PCM format, 24 kHz sample rate
- AudioFormat.PCM_44100HZ_MONO_16BIT: PCM format, 44.1 kHz sample rate
- AudioFormat.PCM_48000HZ_MONO_16BIT: PCM format, 48 kHz sample rate

The OPUS (OGG_OPUS) formats support a configurable bitrate via the bit_rate parameter. Requires DashScope SDK 1.24.0 or later:
- AudioFormat.OGG_OPUS_8KHZ_MONO_32KBPS: OPUS format, 8 kHz sample rate, 32 kbps bitrate
- AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS: OPUS format, 16 kHz sample rate, 16 kbps bitrate
- AudioFormat.OGG_OPUS_16KHZ_MONO_32KBPS: OPUS format, 16 kHz sample rate, 32 kbps bitrate
- AudioFormat.OGG_OPUS_16KHZ_MONO_64KBPS: OPUS format, 16 kHz sample rate, 64 kbps bitrate
- AudioFormat.OGG_OPUS_24KHZ_MONO_16KBPS: OPUS format, 24 kHz sample rate, 16 kbps bitrate
- AudioFormat.OGG_OPUS_24KHZ_MONO_32KBPS: OPUS format, 24 kHz sample rate, 32 kbps bitrate
- AudioFormat.OGG_OPUS_24KHZ_MONO_64KBPS: OPUS format, 24 kHz sample rate, 64 kbps bitrate
- AudioFormat.OGG_OPUS_48KHZ_MONO_16KBPS: OPUS format, 48 kHz sample rate, 16 kbps bitrate
- AudioFormat.OGG_OPUS_48KHZ_MONO_32KBPS: OPUS format, 48 kHz sample rate, 32 kbps bitrate
- AudioFormat.OGG_OPUS_48KHZ_MONO_64KBPS: OPUS format, 48 kHz sample rate, 64 kbps bitrate
Key interfaces
SpeechSynthesizer class
Import SpeechSynthesizer with from dashscope.audio.tts_v2 import *. This class provides the core text-to-speech interfaces.
| Method | Parameters | Return value | Description |
|---|---|---|---|
| def call(self, text: str, timeout_millis=None) | text: Text to synthesize. timeout_millis: Timeout in milliseconds. No effect if unset or 0. | Binary audio data if no ResultCallback is set; otherwise None. | Convert text (plain or SSML) to speech. No callback: Blocks until complete, returns binary audio. See Non-streaming. With callback: Returns None immediately, delivers results via on_data. See Unidirectional streaming. Important: Re-initialize SpeechSynthesizer before each call. |
| def streaming_call(self, text: str) | text: Text fragment to synthesize | None | Stream text fragments for synthesis (SSML not supported). Call multiple times to send fragments. Results arrive via on_data in ResultCallback. See Bidirectional streaming. |
| def streaming_complete(self, complete_timeout_millis=600000) | complete_timeout_millis: Wait time in milliseconds | None | End streaming synthesis. Blocks for complete_timeout_millis ms until the task ends. 0 = wait indefinitely. Default: 10 minutes. See Bidirectional streaming. Important: Always call this in bidirectional streaming to avoid missing speech. |
| def get_last_request_id(self) | None | Request ID | Get the request ID of the previous task. |
| def get_first_package_delay(self) | None | First-packet delay in ms | Get first-packet latency (time from sending text to receiving the first audio packet). Call after the task completes. Factors: WebSocket connection setup (first call), voice loading, service load, network latency. Typical range: ~500 ms (reusing connection/voice), 1,500-2,000 ms (first connection or voice switch). If consistently >2,000 ms: 1. Use connection pooling for high concurrency. 2. Check network quality. 3. Avoid peak hours. |
| def get_response(self) | None | Last message (JSON) | Get the last message, useful for detecting task-failed errors. |
Callback interface (ResultCallback)
In unidirectional streaming or bidirectional streaming, the server returns process information and data via callbacks. Implement these methods to handle server responses.
Import using from dashscope.audio.tts_v2 import *.
View example
| Method | Parameters | Return value | Description |
|---|---|---|---|
| def on_open(self) -> None | None | None | Called when the client connects to the server. |
| def on_event(self, message: str) -> None | message: Server message (JSON string) | None | Called when the server sends a message. Parse message to get the task ID (task_id) and billed character count (characters). |
| def on_complete(self) -> None | None | None | Called when all audio data has been returned. |
| def on_error(self, message) -> None | message: Error message | None | Called when an error occurs. |
| def on_data(self, data: bytes) -> None | data: Binary audio data | None | Called when synthesized audio arrives. Combine binary data into a complete file or play it with a streaming player. Important: For compressed formats (MP3, Opus), use a streaming player (FFmpeg, PyAudio, AudioFormat, MediaSource). Do not play frame by frame, as this causes decoding failures. When writing to a file, use append mode. For WAV and MP3, only the first frame contains header information. |
| def on_close(self) -> None | None | None | Called when the server closes the connection. |
Response
The server returns binary audio data:
- Non-streaming: Handle the binary data returned by the call method of SpeechSynthesizer.
- Unidirectional streaming or bidirectional streaming: Handle the data parameter (bytes) in on_data of ResultCallback.