The voice used for speech synthesis.
- System voices: See Voice list
- Cloned voices: Custom voices created through voice cloning
- Designed voices: Custom voices created through voice design
The audio encoding format. Valid values: pcm, wav, mp3 (default), opus.
The audio sample rate in Hz. Valid values: 8000, 16000, 22050 (default), 24000, 44100, 48000.
The volume level. Default value: 50. Valid values: [0, 100].
The speech rate. Default value: 1.0. Valid values: [0.5, 2.0].
The pitch. Default value: 1.0. Valid values: [0.5, 2.0].
The audio bit rate in kbps. When the audio format is opus, use bit_rate to adjust the bit rate. Default value: 32. Valid values: [6, 510].
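The value ranges listed above can be checked on the client before a request is sent. The following Python sketch is illustrative only. Apart from bit_rate, the key names (format, sample_rate, volume, rate, pitch) are assumptions, because this section doesn't name the parameters; confirm them against the API reference before use.

```python
# Client-side range check for the audio parameters described above.
# Assumption: every key name except "bit_rate" is a hypothetical placeholder.
VALID_FORMATS = {"pcm", "wav", "mp3", "opus"}
VALID_SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
RANGES = {
    "volume": (0, 100),
    "rate": (0.5, 2.0),
    "pitch": (0.5, 2.0),
    "bit_rate": (6, 510),
}
# Default values as documented above.
DEFAULTS = {
    "format": "mp3",
    "sample_rate": 22050,
    "volume": 50,
    "rate": 1.0,
    "pitch": 1.0,
    "bit_rate": 32,
}


def validate_audio_params(params):
    """Return a list of error messages; an empty list means all values are in range."""
    merged = {**DEFAULTS, **params}
    errors = []
    if merged["format"] not in VALID_FORMATS:
        errors.append(f"format must be one of {sorted(VALID_FORMATS)}")
    if merged["sample_rate"] not in VALID_SAMPLE_RATES:
        errors.append(f"sample_rate must be one of {sorted(VALID_SAMPLE_RATES)}")
    for key, (low, high) in RANGES.items():
        if not low <= merged[key] <= high:
            errors.append(f"{key} must be in [{low}, {high}]")
    return errors


print(validate_audio_params({"format": "opus", "bit_rate": 64}))
print(validate_audio_params({"volume": 101, "rate": 0.4}))
```

Omitted keys fall back to the documented defaults, so only the values a caller actually overrides can fail the check.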
Specifies whether to enable SSML. Default value: false. When set to true, only one continue-task command is allowed.
Specifies whether to enable word-level timestamps. Default value: false. Supported for cloned voices of cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2, and for system voices marked as supported in Voice list.
A random seed for controlling variation in the synthesis output. When the model version, text, voice, and other parameters are unchanged, using the same seed produces identical results. Default value: 0. Valid values: [0, 65535].
Specifies the target language for speech synthesis to improve output quality.
- This parameter is an array, but the current version only processes the first element. Pass a single value.
- This parameter specifies the target language for speech synthesis. It's unrelated to the language of the audio sample used in voice cloning. To set the source language for a cloning task, see the voice cloning API reference.
Use this parameter when digit pronunciation, abbreviation expansion, symbol reading, or minority-language synthesis doesn't meet expectations.
Valid values: zh (Chinese), en (English), fr (French), de (German), ja (Japanese), ko (Korean), ru (Russian), pt (Portuguese), th (Thai), id (Indonesian), vi (Vietnamese).
Sets an instruction to control dialect, emotion, or voice character during synthesis. This feature is only available for cloned voices of cosyvoice-v3.5-flash, cosyvoice-v3.5-plus, and cosyvoice-v3-flash, as well as system voices marked as supporting Instruct in Voice list.
Length limit: 100 characters. Chinese characters (including simplified and traditional Chinese, Japanese kanji, and Korean hanja) count as 2 characters. All other characters count as 1 character.
Usage requirements (vary by model):
- cosyvoice-v3.5-flash and cosyvoice-v3.5-plus: Accept any instruction to control synthesis effects (such as emotion or speech rate).
  These models don't have system voices. Only designed or cloned voices are supported.
- cosyvoice-v3-flash:
  - Cloned voices: Accept any natural-language instruction to control synthesis effects.
  - System voices: Instructions must follow a fixed format. For details, see Voice list.
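The 100-character limit on instructions uses weighted counting, so a quick local check can catch over-long text before it is sent. The sketch below is one possible implementation, not the service's own logic: the helper names are hypothetical, and the Unicode ranges used to detect Chinese characters, kanji, and hanja are an approximation that may differ from how the service classifies characters.

```python
# Weighted length check for instruction text.
# Assumption: Han ideographs are detected by the Unicode ranges below,
# which approximate (but may not exactly match) the service's own rule.
HAN_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2FFFF),  # Supplementary Ideographic Plane (Extension B and later)
)


def is_han(ch):
    """Return True if the character falls in one of the Han ideograph ranges."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in HAN_RANGES)


def instruction_length(text):
    """Count Han ideographs as 2 and every other character as 1."""
    return sum(2 if is_han(ch) else 1 for ch in text)


def fits_limit(text, limit=100):
    """Return True if the weighted length is within the limit."""
    return instruction_length(text) <= limit


print(instruction_length("Speak happily"))  # 13 non-Han characters
print(instruction_length("请用开心的语气"))  # 7 Han characters, each counted as 2
```

Under this counting, an instruction written entirely in Chinese characters can hold at most 50 of them.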
Specifies whether to embed an AIGC watermark in the generated audio. When set to true, the watermark is embedded in audio files of supported formats (wav/mp3/opus). Default value: false. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.
Sets the ContentPropagator field in the AIGC watermark, identifying the content propagator. Takes effect only when enable_aigc_tag is true. Default value: Alibaba Cloud UID. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.
Sets the PropagateID field in the AIGC watermark, uniquely identifying a specific propagation action. Takes effect only when enable_aigc_tag is true. Default value: The request ID of the current speech synthesis request. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.
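The conditions attached to the three watermark settings above can be checked together before a request is built. This Python sketch is a rough illustration: enable_aigc_tag comes from this section, but the key names aigc_propagator, aigc_propagate_id, and format are assumed for the example and should be confirmed against the API reference.

```python
# Pre-flight check for the AIGC watermark settings described above.
# Assumption: "aigc_propagator", "aigc_propagate_id", and "format" are
# hypothetical key names; only "enable_aigc_tag" appears in this section.
WATERMARK_MODELS = {"cosyvoice-v3-flash", "cosyvoice-v3-plus", "cosyvoice-v2"}
WATERMARK_FORMATS = {"wav", "mp3", "opus"}
PROPAGATION_KEYS = ("aigc_propagator", "aigc_propagate_id")


def watermark_warnings(model, params):
    """Return human-readable warnings about watermark settings that won't take effect."""
    warnings = []
    if not params.get("enable_aigc_tag", False):
        # The propagation fields only take effect when the watermark is enabled.
        for key in PROPAGATION_KEYS:
            if key in params:
                warnings.append(f"{key} is ignored unless enable_aigc_tag is true")
        return warnings
    if model not in WATERMARK_MODELS:
        warnings.append(f"{model} does not support the AIGC watermark")
    if params.get("format", "mp3") not in WATERMARK_FORMATS:
        warnings.append("the watermark is only embedded in wav, mp3, or opus output")
    return warnings


print(watermark_warnings("cosyvoice-v3-flash", {"enable_aigc_tag": True, "format": "pcm"}))
print(watermark_warnings("cosyvoice-v2", {"aigc_propagator": "example-propagator"}))
```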
Configures pronunciation corrections and text replacements applied before synthesis. Only cloned voices of cosyvoice-v3-flash support this feature.
pronunciation: Custom pronunciation. Specifies pinyin annotations for words to correct inaccurate default pronunciations.
replace: Text replacement. Replaces specified words with target text before synthesis.
"hot_fix": {
  "pronunciation": [
    {"weather": "tian1 qi4"}
  ],
  "replace": [
    {"today": "gold day"}
  ]
}
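The fragment above can also be assembled programmatically. The following Python sketch builds the same structure from two plain dictionaries. The helper name build_hot_fix is hypothetical, and the sketch assumes each list entry holds exactly one word-to-value pair, as in the example; where the resulting object belongs in the full request is not covered here.

```python
import json


def build_hot_fix(pronunciation=None, replace=None):
    """Build a hot_fix object shaped like the fragment above.

    Assumption: each list entry is a single-pair object, mirroring the example.
    """
    hot_fix = {}
    if pronunciation:
        # One {word: pinyin} object per corrected word.
        hot_fix["pronunciation"] = [{word: pinyin} for word, pinyin in pronunciation.items()]
    if replace:
        # One {source: target} object per text replacement.
        hot_fix["replace"] = [{source: target} for source, target in replace.items()]
    return {"hot_fix": hot_fix}


payload = build_hot_fix(
    pronunciation={"weather": "tian1 qi4"},
    replace={"today": "gold day"},
)
# ensure_ascii=False keeps any non-ASCII words readable in the output.
print(json.dumps(payload, indent=2, ensure_ascii=False))
```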
Specifies whether to enable Markdown filtering. When enabled, the system automatically strips Markdown markup symbols from the input text before synthesis. Default value: false. Only cloned voices of cosyvoice-v3-flash support this feature.