CosyVoice real-time speech synthesis WebSocket server event reference
User guide: For model introduction and selection recommendations, see Speech synthesis.
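The server sends the following events during a synthesis task:
- task-started: returned after the client sends the run-task command.
- result-generated: returned continuously after the client sends text.
- task-finished: returned when the task completes.
- task-failed: returned when the task fails.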
task-started
After the client sends the run-task command, the server returns a task-started event to signal that the task has started. The client can send subsequent commands only after receiving this event.
| Field | Type | Description |
|---|---|---|
| header.task_id | string | The task ID generated by the client. |
| header.event | string | Event type. Fixed value: task-started. |
| payload | object | Empty object. |
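
Because the client may only send further commands after this event arrives, a client typically gates its send loop on it. The following is a minimal sketch of that check, assuming events arrive as JSON text messages shaped like the field table above. The helper name `is_task_started` and the sample `task_id` value are hypothetical, not part of the API.

```python
import json


def is_task_started(raw_message: str, expected_task_id: str) -> bool:
    """Return True if raw_message is the task-started event for our task."""
    header = json.loads(raw_message).get("header") or {}
    return (
        header.get("event") == "task-started"
        and header.get("task_id") == expected_task_id
    )


# Hypothetical message assembled only from the documented fields.
sample_started = json.dumps({
    "header": {"task_id": "example-task-id", "event": "task-started"},
    "payload": {},
})

if is_task_started(sample_started, "example-task-id"):
    print("task started; safe to send subsequent commands")
```

Comparing `task_id` as well as the event type guards against acting on an event that belongs to a different task on a reused connection.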
result-generated
After the client sends text, the server continuously returns result-generated events. Each event carries sentence-level metadata and has one of the following sub-event types:
- sentence-begin
- sentence-synthesis
- sentence-end
| Field | Type | Description |
|---|---|---|
| header.task_id | string | The task ID generated by the client. |
| header.event | string | Event type. Fixed value: result-generated. |
| payload.output.type | string | Sub-event type. Valid values: sentence-begin (start of a sentence; returns the text to be synthesized), sentence-synthesis (marks an audio frame; one audio frame is transmitted over the WebSocket binary channel immediately after each such event), sentence-end (end of a sentence; returns the text content and the cumulative character count). |
| payload.output.sentence.index | integer | Sentence index, starting from 0. |
| payload.output.sentence.words | array | Word-level timestamp array. |
| payload.output.sentence.words[].text | string | Text content of the word. |
| payload.output.sentence.words[].begin_index | integer | Start character index of the word within the sentence. Starts at 0. |
| payload.output.sentence.words[].end_index | integer | End character index of the word within the sentence. Starts at 1. |
| payload.output.sentence.words[].begin_time | integer | Start time of the word's corresponding audio, in milliseconds. |
| payload.output.sentence.words[].end_time | integer | End time of the word's corresponding audio, in milliseconds. |
| payload.output.original_text | string | Text of the sentence as segmented for synthesis. |
| payload.usage.characters | integer | Cumulative number of billed characters (returned in the sentence-end event). |
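
The fields above can be read with ordinary JSON handling. The sketch below assumes the event arrives as a JSON text message with the documented field paths; `summarize_result` is a hypothetical helper, and every value in the sample event (task ID, words, times, character count) is made up for illustration.

```python
import json
from typing import Optional


def summarize_result(raw_message: str) -> Optional[dict]:
    """Extract the sub-event type, sentence index, and per-word durations."""
    event = json.loads(raw_message)
    if (event.get("header") or {}).get("event") != "result-generated":
        return None  # not a result-generated event
    payload = event.get("payload") or {}
    output = payload.get("output") or {}
    sentence = output.get("sentence") or {}
    summary = {
        "type": output.get("type"),
        "index": sentence.get("index"),
        # Duration of each word's audio in milliseconds.
        "word_durations_ms": [
            (w["text"], w["end_time"] - w["begin_time"])
            for w in (sentence.get("words") or [])
        ],
    }
    # The cumulative billed count is documented for sentence-end events.
    if summary["type"] == "sentence-end":
        summary["billed_characters"] = (payload.get("usage") or {}).get("characters")
    return summary


# Hypothetical sentence-end event built from the documented fields.
sample_result = json.dumps({
    "header": {"task_id": "example-task-id", "event": "result-generated"},
    "payload": {
        "output": {
            "type": "sentence-end",
            "sentence": {
                "index": 0,
                "words": [
                    {"text": "Hello", "begin_index": 0, "end_index": 5,
                     "begin_time": 0, "end_time": 420},
                    {"text": "world", "begin_index": 6, "end_index": 11,
                     "begin_time": 420, "end_time": 900},
                ],
            },
            "original_text": "Hello world",
        },
        "usage": {"characters": 11},
    },
})

print(summarize_result(sample_result))
```

For a sentence-synthesis event, a client would additionally read the next binary WebSocket message as the audio frame; that step depends on the WebSocket library in use and is left out here.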
task-finished
The server returns a task-finished event when the task completes. The client can then close the WebSocket connection or reuse it to start a new task.
| Field | Type | Description |
|---|---|---|
| header.task_id | string | The task ID generated by the client. |
| header.event | string | Event type. Fixed value: task-finished. |
| payload.usage.characters | integer | Cumulative number of billed characters. |
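
As a rough illustration, a client might detect completion and record the final billed character count as follows. This is a sketch under the assumption that the event is a JSON text message matching the table above; `billed_characters_if_finished` and the sample values are hypothetical.

```python
import json
from typing import Optional


def billed_characters_if_finished(raw_message: str,
                                  expected_task_id: str) -> Optional[int]:
    """Return the billed character count if this is our task-finished event."""
    event = json.loads(raw_message)
    header = event.get("header") or {}
    if header.get("event") != "task-finished":
        return None
    if header.get("task_id") != expected_task_id:
        return None
    usage = (event.get("payload") or {}).get("usage") or {}
    return usage.get("characters")


# Hypothetical message assembled only from the documented fields.
sample_finished = json.dumps({
    "header": {"task_id": "example-task-id", "event": "task-finished"},
    "payload": {"usage": {"characters": 11}},
})

total = billed_characters_if_finished(sample_finished, "example-task-id")
if total is not None:
    print(f"task finished; billed characters: {total}")
    # At this point the client can close the connection or start a new task.
```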
task-failed
The server returns a task-failed event when the task fails. On receiving this event, the client must close the WebSocket connection and handle the error.
| Field | Type | Description |
|---|---|---|
| header.task_id | string | The task ID generated by the client. |
| header.event | string | Event type. Fixed value: task-failed. |
| header.error_code | string | Error code. |
| header.error_message | string | Detailed error message. |
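
One way to surface this event in client code is to turn it into an exception that carries the documented header fields, so the caller can close the connection in one place. The sketch below assumes a JSON text message matching the table above; `TaskFailedError`, `raise_if_failed`, and the sample error code and message are hypothetical placeholders, not real values returned by the service.

```python
import json


class TaskFailedError(Exception):
    """Raised when the server reports a task-failed event."""

    def __init__(self, task_id: str, error_code: str, error_message: str):
        super().__init__(f"task {task_id} failed [{error_code}]: {error_message}")
        self.task_id = task_id
        self.error_code = error_code
        self.error_message = error_message


def raise_if_failed(raw_message: str) -> None:
    """Raise TaskFailedError if raw_message is a task-failed event."""
    header = json.loads(raw_message).get("header") or {}
    if header.get("event") == "task-failed":
        raise TaskFailedError(
            header.get("task_id", ""),
            header.get("error_code", ""),
            header.get("error_message", ""),
        )


# Hypothetical message; the error code and message are placeholders.
sample_failed = json.dumps({
    "header": {
        "task_id": "example-task-id",
        "event": "task-failed",
        "error_code": "ExampleErrorCode",
        "error_message": "Example error message.",
    },
    "payload": {},
})

try:
    raise_if_failed(sample_failed)
except TaskFailedError as err:
    # The client must close the WebSocket connection here, then handle the error.
    print(f"closing connection after failure: {err}")
```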