CosyVoice client events

User guide: For model introduction and selection recommendations, see Speech synthesis.

run-task

Description: Starts a speech synthesis task and configures the model, voice, sample rate, and other parameters.
When to send: Immediately after the WebSocket connection is established.
Response event: The server returns a task-started event. Wait for this event before sending subsequent commands.

Example

{
    "header": {
        "action": "run-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "task_group": "audio",
        "task": "tts",
        "function": "SpeechSynthesizer",
        "model": "cosyvoice-v3-flash",
        "parameters": {
            "text_type": "PlainText",
            "voice": "longanyang",
            "format": "mp3",
            "sample_rate": 22050,
            "volume": 50,
            "rate": 1.0,
            "pitch": 1.0,
            "enable_ssml": false
        },
        "input": {}
    }
}
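The message above can also be assembled programmatically. The following is a minimal sketch, not an official client: the helper name build_run_task is hypothetical, and the model and voice values are copied from the example rather than recommended.

```python
import json
import uuid


def build_run_task(model: str, voice: str, **parameters) -> dict:
    """Build a run-task message; extra keyword arguments go into parameters."""
    params = {"text_type": "PlainText", "voice": voice}
    params.update(parameters)
    return {
        "header": {
            "action": "run-task",
            "task_id": str(uuid.uuid4()),  # client-generated UUID
            "streaming": "duplex",
        },
        "payload": {
            "task_group": "audio",
            "task": "tts",
            "function": "SpeechSynthesizer",
            "model": model,
            "parameters": params,
            "input": {},  # the text is sent later through continue-task
        },
    }


message = build_run_task(
    "cosyvoice-v3-flash", "longanyang", format="mp3", sample_rate=22050
)
print(json.dumps(message, indent=4))
```

Keep the generated task_id: the continue-task and finish-task commands for the same task must carry the same value.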

object

body

required

Message header.

string

body

required

The command type. Set to run-task.

string

body

required

A client-generated task ID in UUID format. This ID correlates subsequent events and must match the task_id in the continue-task and finish-task commands.

string

body

required

Set to duplex.

object

body

required

Request body.

string

body

required

The task group. Set to audio.

string

body

required

The task type. Set to tts.

string

body

required

The function type. Set to SpeechSynthesizer.

string

body

required

The model name.

object

body

required

Set to an empty object {}. Send the text to synthesize through the continue-task command.

object

body

required

Speech synthesis parameters.

string

body

required

Set to PlainText.

string

body

required

The voice used for speech synthesis.

System voices: See CosyVoice Voice list
Cloned voices: Custom voices created through voice cloning
Custom voices: Custom voices created through voice design

string

body

The audio encoding format. Valid values: pcm, wav, mp3 (default), opus.

integer

body

The audio sample rate in Hz. Valid values: 8000, 16000, 22050 (default), 24000, 44100, 48000.

integer

body

The volume level. Default value: 50. Valid values: [0, 100].

float

body

The speech rate. Default value: 1.0. Valid values: [0.5, 2.0].

float

body

The pitch. Default value: 1.0. Valid values: [0.5, 2.0].

integer

body

The audio bit rate in kbps. When the audio format is opus, use bit_rate to adjust the bit rate. Default value: 32. Valid values: [6, 510].
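The value ranges listed so far can be checked on the client before run-task is sent. The sketch below is illustrative only: check_audio_parameters is a hypothetical helper, and checking bit_rate only when the format is opus is an assumption based on the description above, not documented server behavior.

```python
VALID_FORMATS = {"pcm", "wav", "mp3", "opus"}
VALID_SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
# name -> (minimum, maximum, documented default)
RANGES = {
    "volume": (0, 100, 50),
    "rate": (0.5, 2.0, 1.0),
    "pitch": (0.5, 2.0, 1.0),
}


def check_audio_parameters(params: dict) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    fmt = params.get("format", "mp3")
    if fmt not in VALID_FORMATS:
        problems.append(f"format: {fmt!r} is not a valid value")
    sample_rate = params.get("sample_rate", 22050)
    if sample_rate not in VALID_SAMPLE_RATES:
        problems.append(f"sample_rate: {sample_rate!r} is not a valid value")
    for name, (low, high, default) in RANGES.items():
        value = params.get(name, default)
        if not low <= value <= high:
            problems.append(f"{name}: {value!r} is outside [{low}, {high}]")
    # Assumption: bit_rate only matters for opus output.
    if fmt == "opus" and not 6 <= params.get("bit_rate", 32) <= 510:
        problems.append("bit_rate: value is outside [6, 510]")
    return problems


print(check_audio_parameters({"format": "opus", "volume": 150, "bit_rate": 64}))
```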

boolean

body

Specifies whether to enable SSML. Default value: false. When set to true, only one continue-task command is allowed.

boolean

body

Specifies whether to enable word-level timestamps. Default value: false. Available only in streaming output mode. Supported voices: cloned voices of cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2, and system voices marked as supported in CosyVoice Voice list. Cloned voices of other models do not support this feature.

integer

body

A random seed for controlling variation in the synthesis output. When the model version, text, voice, and other parameters are unchanged, using the same seed produces identical results. Default value: 0. Valid values: [0, 65535].

array[string]

body

Specifies the target language for speech synthesis to improve output quality.

This parameter is an array, but the current version only processes the first element. Pass a single value.
This parameter specifies the target language for speech synthesis. It's unrelated to the language of the audio sample used in voice cloning. To set the source language for a cloning task, see the voice cloning API reference.

Use this parameter when digit pronunciation, abbreviation expansion, symbol reading, or minority-language synthesis doesn't meet expectations.

Valid values: zh (Chinese), en (English), fr (French), de (German), ja (Japanese), ko (Korean), ru (Russian), pt (Portuguese), th (Thai), id (Indonesian), vi (Vietnamese).

string

body

Sets an instruction to control dialect, emotion, or voice character during synthesis. This feature is only available for cloned voices of cosyvoice-v3-flash, as well as system voices marked as supporting Instruct in CosyVoice Voice list.

Length limit: 100 characters. Chinese characters (including simplified and traditional Chinese, Japanese kanji, and Korean hanja) count as 2 characters. All other characters count as 1 character.

Usage requirements:

cosyvoice-v3-flash:
- Cloned voices: Accept any natural-language instruction to control synthesis effects.
- System voices: Instructions must follow a fixed format. For details, see CosyVoice Voice list.
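The weighted length rule for instructions can be approximated on the client. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the Unicode ranges used to decide what counts as a Chinese character are a guess at the rule, since the exact set the service uses is not stated here.

```python
def is_han(ch: str) -> bool:
    """Approximate test for Han ideographs (hanzi, kanji, hanja)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF     # Extension A
        or 0xF900 <= cp <= 0xFAFF     # CJK Compatibility Ideographs
        or 0x20000 <= cp <= 0x2EBEF   # Extensions B to F (approximate span)
    )


def instruction_length(text: str) -> int:
    """Count Han ideographs as 2 and every other character as 1."""
    return sum(2 if is_han(ch) else 1 for ch in text)


def fits_limit(text: str, limit: int = 100) -> bool:
    """Check the weighted length against the documented 100-character limit."""
    return instruction_length(text) <= limit


print(instruction_length("Speak slowly"))  # 12 characters, each counted once
print(fits_limit("字" * 51))               # 51 ideographs weigh 102
```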

boolean

body

Specifies whether to embed an AIGC watermark in the generated audio. When set to true, the watermark is embedded in audio files of supported formats (wav/mp3/opus). Default value: false. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

string

body

Sets the ContentPropagator field in the AIGC watermark, identifying the content propagator. Takes effect only when enable_aigc_tag is true. Default value: Alibaba Cloud UID. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

string

body

Sets the PropagateID field in the AIGC watermark, uniquely identifying a specific propagation action. Takes effect only when enable_aigc_tag is true. Default value: The request ID of the current speech synthesis request. Only cosyvoice-v3-flash, cosyvoice-v3-plus, and cosyvoice-v2 support this feature.

object

body

Configures pronunciation corrections and text replacements applied before synthesis. cosyvoice-v2 does not support this feature.

pronunciation: Custom pronunciation. Specifies pinyin annotations for words to correct inaccurate default pronunciations.
replace: Text replacement. Replaces specified words with target text before synthesis.

"hot_fix": {
  "pronunciation": [
    {"weather": "tian1 qi4"}
  ],
  "replace": [
    {"today": "gold day"}
  ]
}

boolean

body

Specifies whether to enable Markdown filtering. When enabled, the system automatically strips Markdown markup symbols from the input text before synthesis. Default value: false. Only cloned voices of cosyvoice-v3-flash support this feature.

continue-task

Description: Sends the text to synthesize. The text can be sent all at once or in multiple segments.
When to send: After receiving the task-started event from the server.
Limits:

Maximum of 20,000 characters per message
Maximum of 200,000 characters cumulatively
The interval between consecutive continue-task messages must not exceed 23 seconds; otherwise, the connection times out.

Example

{
    "header": {
        "action": "continue-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {
            "text": "Before my bed, moonlight shines bright, I suspect it's frost upon the ground."
        }
    }
}
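Long input has to be split to stay within the per-message limit. The following sketch shows one way to do that. It assumes the limits are counted in Python string characters (len), which may differ from how the service counts them, and build_continue_tasks is a hypothetical helper name.

```python
MAX_PER_MESSAGE = 20_000
MAX_CUMULATIVE = 200_000


def build_continue_tasks(task_id: str, text: str,
                         chunk_size: int = MAX_PER_MESSAGE) -> list:
    """Split text into continue-task messages that respect both limits."""
    if not 0 < chunk_size <= MAX_PER_MESSAGE:
        raise ValueError("chunk_size must be between 1 and 20,000")
    if len(text) > MAX_CUMULATIVE:
        raise ValueError("text exceeds the cumulative 200,000-character limit")
    return [
        {
            "header": {
                "action": "continue-task",
                "task_id": task_id,  # must match the task_id sent in run-task
                "streaming": "duplex",
            },
            "payload": {"input": {"text": text[start:start + chunk_size]}},
        }
        for start in range(0, len(text), chunk_size)
    ]


messages = build_continue_tasks("2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx", "a" * 45_000)
print([len(m["payload"]["input"]["text"]) for m in messages])
```

This helper cuts at fixed offsets and ignores sentence boundaries; a real client may prefer to split at punctuation. It is also unsuitable when enable_ssml is true and the text needs more than one chunk, because only one continue-task command is allowed in that mode.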

object

body

required

Message header.

string

body

required

The command type. Set to continue-task.

string

body

required

The task ID in UUID format. Must match the task_id in run-task.

string

body

required

Set to duplex.

object

body

required

Request body.

object

body

required

Contains the text to synthesize.

string

body

required

The text to synthesize. Maximum of 20,000 characters per message and 200,000 characters cumulatively.

finish-task

Description: Notifies the server that all text has been sent and requests task completion.
When to send: Immediately after all text has been sent.
Response event: The server returns a task-finished event.

Example

{
    "header": {
        "action": "finish-task",
        "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
        "streaming": "duplex"
    },
    "payload": {
        "input": {}
    }
}
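Taken together, the three commands follow a fixed order: run-task, wait for task-started, one or more continue-task messages, finish-task, then wait for task-finished. The sketch below illustrates that order with the transport injected as two callables, so it runs without a network connection. The names run_session, send, and wait_for are hypothetical, and receiving the synthesized audio is outside the scope of this section and is not shown.

```python
import json
import uuid
from typing import Callable, Iterable


def run_session(send: Callable[[str], None],
                wait_for: Callable[[str], None],
                segments: Iterable[str],
                model: str, voice: str) -> str:
    """Send the client events in the documented order; return the task_id."""
    task_id = str(uuid.uuid4())  # shared by all three commands

    def command(action: str, payload: dict) -> str:
        header = {"action": action, "task_id": task_id, "streaming": "duplex"}
        return json.dumps({"header": header, "payload": payload})

    send(command("run-task", {
        "task_group": "audio", "task": "tts", "function": "SpeechSynthesizer",
        "model": model,
        "parameters": {"text_type": "PlainText", "voice": voice},
        "input": {},
    }))
    wait_for("task-started")  # do not send text before this event arrives
    for segment in segments:
        send(command("continue-task", {"input": {"text": segment}}))
    send(command("finish-task", {"input": {}}))
    wait_for("task-finished")
    return task_id


# Stand-in transport that only records what would happen on the wire.
log = []
run_session(
    send=lambda raw: log.append(("send", json.loads(raw)["header"]["action"])),
    wait_for=lambda event: log.append(("wait", event)),
    segments=["First segment.", "Second segment."],
    model="cosyvoice-v3-flash", voice="longanyang",
)
print(log)
```

In a real client, send would write to the WebSocket and wait_for would block until the named server event is received.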

object

body

required

Message header.

string

body

required

The command type. Set to finish-task.

string

body

required

The task ID in UUID format. Must match the task_id in run-task.

string

body

required

Set to duplex.

object

body

required

Request body.

object

body

required

Set to {}.