CosyVoice Python reference
User guide: For model overviews and voice selection, see Text-to-speech models.
For more examples, see GitHub.
Prerequisites
- Sign in to Qwen Cloud and create an API key. To avoid security risks, export the API key as an environment variable instead of hard-coding it.
For temporary access by third-party apps or users, or to control high-risk operations such as accessing or deleting sensitive data, use a temporary authentication token. Temporary tokens expire in 60 seconds, reducing leakage risk compared to long-term API keys. Replace the API key in your authentication code with the temporary token.
Models and pricing
See Text-to-speech models.
Text and format limits
Text length limits
- Non-streaming and unidirectional streaming: Maximum of 20,000 characters per request.
- Bidirectional streaming: Maximum of 20,000 characters per request and 200,000 cumulative across all requests.
Character counting rules
- Chinese characters (simplified/traditional, Japanese Kanji, Korean Hanja) count as 2. All other characters (punctuation, letters, numbers, Kana, Hangul) count as 1.
- SSML tags are excluded from the character count.
- Examples:
- "你好" → 2 (Chinese character) + 2 (Chinese character) = 4 characters
- "中A文123" → 2 (Chinese character) + 1 (A) + 2 (Chinese character) + 1 (1) + 1 (2) + 1 (3) = 8 characters
- "中文。" → 2 (Chinese character) + 2 (Chinese character) + 1 (。) = 5 characters
- "中 文。" → 2 (Chinese character) + 1 (space) + 2 (Chinese character) + 1 (。) = 6 characters
- "<speak>你好</speak>" → 2 (Chinese character) + 2 (Chinese character) = 4 characters (SSML tags are excluded)
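The counting rules above can be sketched in Python. This is a hedged approximation: it strips SSML-style tags and treats characters in the CJK Unified Ideographs ranges as 2, everything else as 1, which matches the documented examples but may not cover every edge case the service bills for.

```python
import re

# SSML tags are excluded from the character count, so strip them first.
SSML_TAG = re.compile(r"<[^>]+>")

def billed_characters(text: str) -> int:
    """Approximate the documented rules: Chinese characters (and Japanese
    Kanji / Korean Hanja, which share the CJK ideograph ranges) count as 2;
    all other characters (punctuation, letters, digits, Kana, Hangul,
    spaces) count as 1."""
    count = 0
    for ch in SSML_TAG.sub("", text):
        # CJK Unified Ideographs Extension A (U+3400) through the main
        # CJK block (U+9FFF) covers the "counts as 2" set used here.
        if "\u3400" <= ch <= "\u9fff":
            count += 2
        else:
            count += 1
    return count

print(billed_characters("<speak>你好</speak>"))  # → 4
```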
Encoding format
Use UTF-8 encoding.
Math expression support
Math expression parsing (basic operations, algebra, geometry) is available for cosyvoice-v3-flash and cosyvoice-v3-plus. This feature supports primary and secondary school level expressions.
This feature only supports Chinese.
SSML support
SSML is available for custom voices (voice design/cloning) using cosyvoice-v3-flash and v3-plus, plus system voices marked as supported in the voice list.
- Requires DashScope SDK 1.23.4 or later.
- Supported methods: Non-streaming and unidirectional streaming (call method only). Bidirectional streaming (streaming_call) is not supported.
- Pass SSML text to the call method as with regular text.
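Since SSML text goes through the same call method as plain text, a minimal sketch looks like the following. The voice name "longxiaochun" is a placeholder; use a voice marked as supporting SSML in the voice list, and assume `pip install dashscope` (1.23.4+) with DASHSCOPE_API_KEY exported.

```python
import os

def synthesize_ssml(out_path: str = "ssml.mp3") -> None:
    # Import inside the function so this sketch only needs the dashscope
    # package when actually run.
    from dashscope.audio.tts_v2 import SpeechSynthesizer

    # Placeholder voice name; pick an SSML-capable voice from the voice list.
    synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash", voice="longxiaochun")
    # SSML is passed to call exactly like regular text.
    audio = synthesizer.call("<speak>你好,欢迎使用语音合成服务。</speak>")
    with open(out_path, "wb") as f:
        f.write(audio)

if os.getenv("DASHSCOPE_API_KEY"):  # only run when credentials are configured
    synthesize_ssml()
```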
Getting started
The SpeechSynthesizer class supports three call methods:
- Non-streaming: Blocking call. Sends full text, returns complete audio. Best for short text.
- Unidirectional streaming: Non-blocking. Sends full text, receives audio via callback. Best for short text with low latency.
- Bidirectional streaming: Non-blocking. Sends text fragments incrementally, receives audio in real time via callback. Best for long text with low latency.
Non-streaming
Send full text at once without a callback. Returns complete audio in one response.
Instantiate SpeechSynthesizer with request parameters, then call the call method to get binary audio.
Maximum 20,000 characters (see SpeechSynthesizer call method).
Re-initialize the SpeechSynthesizer instance before each call.
View full example
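A minimal non-streaming sketch under the stated constraints. The voice name "longxiaochun" is a placeholder (pick one from the voice list); it assumes `pip install dashscope` and DASHSCOPE_API_KEY exported as an environment variable.

```python
import os

def synthesize_to_file(text: str, out_path: str = "output.mp3") -> None:
    # Import inside the function so this sketch only needs the dashscope
    # package when actually run.
    from dashscope.audio.tts_v2 import SpeechSynthesizer

    # Re-initialize SpeechSynthesizer before each call, as required above.
    synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash", voice="longxiaochun")
    # With no callback set, call blocks and returns complete binary audio.
    audio = synthesizer.call(text)
    with open(out_path, "wb") as f:
        f.write(audio)
    print("request id:", synthesizer.get_last_request_id())

if os.getenv("DASHSCOPE_API_KEY"):  # only run when credentials are configured
    synthesize_to_file("今天天气怎么样?")
```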
Unidirectional streaming
Send full text, stream audio via ResultCallback. Receive results in real time.
Instantiate SpeechSynthesizer with request parameters and a callback (ResultCallback), then call the call method. Receive results via the on_data callback.
Maximum 20,000 characters (see SpeechSynthesizer call method).
Re-initialize the SpeechSynthesizer instance before each call.
View full example
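A unidirectional-streaming sketch: the full text is sent once and audio chunks arrive through on_data. Same assumptions as before (`pip install dashscope`, DASHSCOPE_API_KEY exported, "longxiaochun" as a placeholder voice).

```python
import os

def stream_to_file(text: str, out_path: str = "output.mp3") -> None:
    # Import inside the function so this sketch only needs the dashscope
    # package when actually run.
    from dashscope.audio.tts_v2 import ResultCallback, SpeechSynthesizer

    class SaveCallback(ResultCallback):
        def on_open(self):
            # Write every chunk to one file; for WAV/MP3 only the first
            # frame carries header information.
            self.file = open(out_path, "wb")

        def on_data(self, data: bytes):
            self.file.write(data)  # binary audio chunk as it arrives

        def on_complete(self):
            self.file.close()

        def on_error(self, message):
            print("synthesis failed:", message)

    synthesizer = SpeechSynthesizer(
        model="cosyvoice-v3-flash",
        voice="longxiaochun",  # placeholder; pick from the voice list
        callback=SaveCallback(),
    )
    synthesizer.call(text)  # returns None immediately; audio arrives via on_data

if os.getenv("DASHSCOPE_API_KEY"):  # only run when credentials are configured
    stream_to_file("今天天气怎么样?")
```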
Bidirectional streaming
Submit text in multiple parts within a single task and receive results in real time through callbacks.
- Call streaming_call multiple times to submit text fragments in order. The server auto-splits the text into sentences:
  - Complete sentences: synthesized immediately.
  - Incomplete sentences: cached until complete, then synthesized. Call streaming_complete to force synthesis of all remaining fragments.
- Fragment submission interval: max 23 seconds (fixed server timeout, non-configurable). Exceeding this throws a "request timeout after 23 seconds" error. Call streaming_complete promptly when done.
1. Instantiate SpeechSynthesizer: Instantiate SpeechSynthesizer with request parameters and the callback (ResultCallback).
2. Stream text: Call streaming_call multiple times to submit text fragments. The server returns audio in real time via the on_data callback. Each fragment must not exceed 20,000 characters, and the cumulative total must not exceed 200,000 characters.
3. Finish: Call streaming_complete to end the synthesis. This blocks until on_complete or on_error triggers. Always call this method; otherwise, trailing text may not convert to speech.
View full example
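The three steps above can be sketched as follows. Same assumptions as the earlier sketches (`pip install dashscope`, DASHSCOPE_API_KEY exported, "longxiaochun" as a placeholder voice).

```python
import os

def streaming_synthesis(fragments, out_path: str = "output.mp3") -> None:
    # Import inside the function so this sketch only needs the dashscope
    # package when actually run.
    from dashscope.audio.tts_v2 import ResultCallback, SpeechSynthesizer

    class SaveCallback(ResultCallback):
        def on_open(self):
            self.file = open(out_path, "wb")

        def on_data(self, data: bytes):
            self.file.write(data)  # real-time audio chunks

        def on_complete(self):
            self.file.close()

        def on_error(self, message):
            print("synthesis failed:", message)

    # Step 1: instantiate with request parameters and the callback.
    synthesizer = SpeechSynthesizer(
        model="cosyvoice-v3-flash",
        voice="longxiaochun",  # placeholder; pick from the voice list
        callback=SaveCallback(),
    )
    # Step 2: submit fragments in order, less than 23 s apart.
    for fragment in fragments:
        synthesizer.streaming_call(fragment)
    # Step 3: always finish; blocks until on_complete or on_error fires.
    synthesizer.streaming_complete()

if os.getenv("DASHSCOPE_API_KEY"):  # only run when credentials are configured
    streaming_synthesis(["今天天气", "怎么样?"])
```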
Request parameters
Set parameters via the SpeechSynthesizer constructor.
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | The model for text-to-speech. See Voice list for all options. |
| voice | str | Yes | The voice for synthesis. See Voice list for available system voices. |
| format | enum | No | Audio format and sample rate. Default: MP3 at 22.05 kHz. Note: The default rate is optimal for the selected voice. Downsampling and upsampling are supported. All models support WAV, MP3, and PCM at 8/16/22.05/24/44.1/48 kHz. OPUS (OGG_OPUS) at 8/16/24/48 kHz with configurable bitrate (requires SDK 1.24.0+). See format reference. |
| volume | int | No | Volume. Default: 50. Range: [0, 100]. Scales linearly (0 = silent, 50 = default, 100 = max). Important: SDK 1.20.10 and later: field name is volume. |
| speech_rate | float | No | Speech rate. Default: 1.0. Range: [0.5, 2.0]. Values below 1.0 slow down speech; values above 1.0 speed it up. |
| pitch_rate | float | No | Pitch multiplier. The relationship with perceived pitch is non-linear; test to find a suitable value. Default: 1.0. Range: [0.5, 2.0]. >1.0 = higher pitch, <1.0 = lower pitch. |
| bit_rate | int | No | Audio bitrate in kbps. For Opus format, adjust with bit_rate. Default: 32. Range: [6, 510]. Set via additional_params (see example below). |
| word_timestamp_enabled | bool | No | Enable word-level timestamps. Default: False. Supports system voices marked as supported in the voice list. Timestamps are available only through the callback interface. Set via additional_params (see example below). |
| seed | int | No | Random seed for generation. Different seeds produce different results. With identical model, text, voice, and other parameters, the same seed reproduces the same output. Default: 0. Range: [0, 65535]. |
| language_hints | list[str] | No | Target language. Valid values: zh, en, fr, de, ja, ko, ru, pt, th, id, vi. Array parameter, but only the first element is used. |
| instruction | str | No | Controls dialect, emotion, or speaking style. Available only for system voices marked as supporting Instruct in the voice list. Max length: 100 characters. See instruction examples. |
| enable_aigc_tag | bool | No | Add an invisible AIGC identifier to generated audio. When True, the identifier is embedded in supported formats (WAV, MP3, OPUS). Default: False. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. Set via additional_params (see example below). |
| aigc_propagator | str | No | Set the ContentPropagator field in the AIGC identifier. Only effective when enable_aigc_tag is True. Default: UID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. Set via additional_params. |
| aigc_propagate_id | str | No | Set the PropagateID field in the AIGC identifier. Only effective when enable_aigc_tag is True. Default: the request ID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. Set via additional_params. |
| hot_fix | dict | No | Text hotpatching. Customize pronunciation of specific words or replace text before synthesis. Available only for cosyvoice-v3-flash. See hot_fix example. |
| enable_markdown_filter | bool | No | Enable Markdown filtering. When enabled, Markdown symbols are removed from input text before synthesis. Available only for cosyvoice-v3-flash. Default: False. Set via additional_params. |
| callback | ResultCallback | No | Callback interface (ResultCallback). |
additional_params examples
The following parameters are set through the additional_params argument rather than as top-level constructor arguments:
- bit_rate
- word_timestamp_enabled (view full example code for word timestamps)
- enable_aigc_tag
- aigc_propagator
- aigc_propagate_id
- enable_markdown_filter
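A hedged sketch of passing these values via additional_params, under the same assumptions as the earlier sketches (`pip install dashscope`, DASHSCOPE_API_KEY exported, "longxiaochun" as a placeholder voice; the exact accepted keys are those listed in the parameter table above).

```python
import os

def synthesize_with_extras(text: str) -> bytes:
    # Import inside the function so this sketch only needs the dashscope
    # package when actually run.
    from dashscope.audio.tts_v2 import AudioFormat, SpeechSynthesizer

    synthesizer = SpeechSynthesizer(
        model="cosyvoice-v3-flash",
        voice="longxiaochun",  # placeholder; pick from the voice list
        format=AudioFormat.OGG_OPUS_48KHZ_MONO_32KBPS,  # Opus supports bit_rate
        additional_params={
            "bit_rate": 64,                  # Opus bitrate in kbps, range [6, 510]
            "word_timestamp_enabled": True,  # timestamps arrive via the callback
            "enable_aigc_tag": True,         # embed the invisible AIGC identifier
        },
    )
    return synthesizer.call(text)

if os.getenv("DASHSCOPE_API_KEY"):  # only run when credentials are configured
    audio = synthesize_with_extras("今天天气怎么样?")
```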
hot_fix example
Instruction examples
- cosyvoice-v3-flash (cloned voices)
- cosyvoice-v3-flash (system voices)
Use any natural language instruction to control synthesis effects.
Format reference
All models support the following formats and sample rates:
- AudioFormat.WAV_8000HZ_MONO_16BIT: WAV format, 8 kHz sample rate
- AudioFormat.WAV_16000HZ_MONO_16BIT: WAV format, 16 kHz sample rate
- AudioFormat.WAV_22050HZ_MONO_16BIT: WAV format, 22.05 kHz sample rate
- AudioFormat.WAV_24000HZ_MONO_16BIT: WAV format, 24 kHz sample rate
- AudioFormat.WAV_44100HZ_MONO_16BIT: WAV format, 44.1 kHz sample rate
- AudioFormat.WAV_48000HZ_MONO_16BIT: WAV format, 48 kHz sample rate
- AudioFormat.MP3_8000HZ_MONO_128KBPS: MP3 format, 8 kHz sample rate
- AudioFormat.MP3_16000HZ_MONO_128KBPS: MP3 format, 16 kHz sample rate
- AudioFormat.MP3_22050HZ_MONO_256KBPS: MP3 format, 22.05 kHz sample rate
- AudioFormat.MP3_24000HZ_MONO_256KBPS: MP3 format, 24 kHz sample rate
- AudioFormat.MP3_44100HZ_MONO_256KBPS: MP3 format, 44.1 kHz sample rate
- AudioFormat.MP3_48000HZ_MONO_256KBPS: MP3 format, 48 kHz sample rate
- AudioFormat.PCM_8000HZ_MONO_16BIT: PCM format, 8 kHz sample rate
- AudioFormat.PCM_16000HZ_MONO_16BIT: PCM format, 16 kHz sample rate
- AudioFormat.PCM_22050HZ_MONO_16BIT: PCM format, 22.05 kHz sample rate
- AudioFormat.PCM_24000HZ_MONO_16BIT: PCM format, 24 kHz sample rate
- AudioFormat.PCM_44100HZ_MONO_16BIT: PCM format, 44.1 kHz sample rate
- AudioFormat.PCM_48000HZ_MONO_16BIT: PCM format, 48 kHz sample rate

The OPUS (OGG_OPUS) formats support a configurable bitrate via the bit_rate parameter. Requires DashScope SDK 1.24.0 or later:
- AudioFormat.OGG_OPUS_8KHZ_MONO_32KBPS: OPUS format, 8 kHz sample rate, 32 kbps bitrate
- AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS: OPUS format, 16 kHz sample rate, 16 kbps bitrate
- AudioFormat.OGG_OPUS_16KHZ_MONO_32KBPS: OPUS format, 16 kHz sample rate, 32 kbps bitrate
- AudioFormat.OGG_OPUS_16KHZ_MONO_64KBPS: OPUS format, 16 kHz sample rate, 64 kbps bitrate
- AudioFormat.OGG_OPUS_24KHZ_MONO_16KBPS: OPUS format, 24 kHz sample rate, 16 kbps bitrate
- AudioFormat.OGG_OPUS_24KHZ_MONO_32KBPS: OPUS format, 24 kHz sample rate, 32 kbps bitrate
- AudioFormat.OGG_OPUS_24KHZ_MONO_64KBPS: OPUS format, 24 kHz sample rate, 64 kbps bitrate
- AudioFormat.OGG_OPUS_48KHZ_MONO_16KBPS: OPUS format, 48 kHz sample rate, 16 kbps bitrate
- AudioFormat.OGG_OPUS_48KHZ_MONO_32KBPS: OPUS format, 48 kHz sample rate, 32 kbps bitrate
- AudioFormat.OGG_OPUS_48KHZ_MONO_64KBPS: OPUS format, 48 kHz sample rate, 64 kbps bitrate
Key interfaces
SpeechSynthesizer class
Import SpeechSynthesizer with from dashscope.audio.tts_v2 import *. This class provides the core text-to-speech interfaces.
| Method | Parameters | Return value | Description |
|---|---|---|---|
| def call(self, text: str, timeout_millis=None) | text: Text to synthesize. timeout_millis: Timeout in milliseconds. No effect if unset or 0. | Binary audio data if no ResultCallback is set; otherwise None. | Convert text (plain or SSML) to speech. No callback: Blocks until complete, returns binary audio. See Non-streaming. With callback: Returns None immediately, delivers results via on_data. See Unidirectional streaming. Important: Re-initialize SpeechSynthesizer before each call. |
| def streaming_call(self, text: str) | text: Text fragment to synthesize | None | Stream text fragments for synthesis (SSML not supported). Call multiple times to send fragments. Results arrive via on_data in ResultCallback. See Bidirectional streaming. |
| def streaming_complete(self, complete_timeout_millis=600000) | complete_timeout_millis: Wait time in milliseconds | None | End streaming synthesis. Blocks for complete_timeout_millis ms until the task ends. 0 = wait indefinitely. Default: 10 minutes. See Bidirectional streaming. Important: Always call this in bidirectional streaming to avoid missing speech. |
| def get_last_request_id(self) | None | Request ID | Get the request ID of the previous task. |
| def get_first_package_delay(self) | None | First-packet delay in ms | Get first-packet latency (time from sending text to receiving the first audio packet). Call after the task completes. Factors: WebSocket connection setup (first call), voice loading, service load, network latency. Typical range: ~500 ms (reusing connection/voice), 1,500-2,000 ms (first connection or voice switch). If consistently >2,000 ms: 1. Use connection pooling for high concurrency. 2. Check network quality. 3. Avoid peak hours. |
| def get_response(self) | None | Last message (JSON) | Get the last message, useful for detecting task-failed errors. |
Callback interface (ResultCallback)
In unidirectional streaming or bidirectional streaming, the server returns process information and data via callbacks. Implement these methods to handle server responses.
Import using from dashscope.audio.tts_v2 import *.
View example
| Method | Parameters | Return value | Description |
|---|---|---|---|
| def on_open(self) -> None | None | None | Called when the client connects to the server. |
| def on_event(self, message: str) -> None | message: Server message (JSON string) | None | Called when the server sends a message. Parse message to get the task ID (task_id) and billed character count (characters). |
| def on_complete(self) -> None | None | None | Called when all audio data has been returned. |
| def on_error(self, message) -> None | message: Error message | None | Called when an error occurs. |
| def on_data(self, data: bytes) -> None | data: Binary audio data | None | Called when synthesized audio arrives. Combine binary data into a complete file or play it with a streaming player. Important: For compressed formats (MP3, Opus), use a streaming player (FFmpeg, PyAudio, AudioFormat, MediaSource). Do not play frame by frame, as this causes decoding failures. When writing to a file, use append mode. For WAV and MP3, only the first frame contains header information. |
| def on_close(self) -> None | None | None | Called when the server closes the connection. |
Response
The server returns binary audio data:
- Non-streaming: Handle the binary data returned by the call method of SpeechSynthesizer.
- Unidirectional streaming or bidirectional streaming: Handle the data parameter (bytes) in on_data of ResultCallback.