CosyVoice WebSocket API
Parameters and protocol for CosyVoice text to speech over WebSocket. The DashScope SDK supports Java and Python only -- use WebSocket for other languages.
User guide: For model overviews and voice selection, see Speech synthesis.
WebSocket enables full-duplex communication. The client and server establish a persistent connection with a single handshake, then push data to each other in real time.
Common WebSocket libraries:
- Go: gorilla/websocket
- PHP: Ratchet
- Node.js: ws
Prerequisites
Get an API key.
Models and pricing
See Speech synthesis.
Text and format limits
Text length limits
Send up to 20,000 characters per continue-task instruction. The total across all continue-task instructions must not exceed 200,000 characters.
Character counting rules
- Chinese characters (simplified, traditional, Japanese Kanji, Korean Hanja) count as two characters. All others (punctuation, letters, numbers, Kana/Hangul) count as one.
- SSML tags are excluded from the character count.
- Examples:
  - "你好" → 2 + 2 = 4 characters
  - "中A文123" → 2 + 1 + 2 + 1 + 1 + 1 = 8 characters
  - "中文。" → 2 + 2 + 1 = 5 characters
  - "中 文。" → 2 + 1 + 2 + 1 = 6 characters
  - "<speak>你好</speak>" → 2 + 2 = 4 characters
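These rules can be approximated in code. The following Python sketch assumes CJK ideographs are detected via common Unicode ranges; the service's exact counting may differ in edge cases:

```python
import re

def billed_characters(text: str) -> int:
    """Estimate billed characters: CJK ideographs count as 2; everything
    else (letters, digits, punctuation, spaces, Kana/Hangul) counts as 1.
    SSML tags are excluded from the count."""
    # Strip SSML/XML tags before counting.
    stripped = re.sub(r"<[^>]+>", "", text)
    total = 0
    for ch in stripped:
        cp = ord(ch)
        # CJK Unified Ideographs plus Extension A and compatibility blocks.
        if (0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF
                or 0xF900 <= cp <= 0xFAFF):
            total += 2
        else:
            total += 1
    return total
```

For example, `billed_characters("中A文123")` reproduces the 8-character count from the examples above.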
Encoding format
Use UTF-8 encoding.
Math expression support
Math expression parsing is available for cosyvoice-v3-flash and cosyvoice-v3-plus. It covers common primary and secondary school math including basic operations, algebra, and geometry.
This feature supports Chinese only. For usage details, see Convert LaTeX formulas to speech.
SSML support
SSML requires all of the following:
- Model: Only cosyvoice-v3-flash and cosyvoice-v3-plus support SSML.
- Voice: Use an SSML-enabled voice:
- All cloned voices (created through the Voice Cloning API).
- System voices marked as SSML-enabled in the voice list.
  System voices without SSML support (such as some basic voices) return the error "SSML text is not supported at the moment!" even with enable_ssml enabled.
- Parameter: Set enable_ssml to true in the run-task instruction.
Interaction flow
- Open a WebSocket connection.
- Send the run-task instruction to start a task.
- Wait for the task-started event before proceeding.
- Send text: Send one or more continue-task instructions in order. After receiving a complete sentence, the server returns a result-generated event and the audio stream. For text length constraints, see the text field in the continue-task instruction. The server segments text into sentences automatically:
  - Complete sentences are synthesized immediately.
  - Incomplete sentences are buffered until complete. No audio is returned for incomplete sentences.
- Receive the audio stream through the binary channel.
- After sending all text, send the finish-task instruction. Continue receiving the audio stream. Do not skip this step, or the ending portion of the audio may be lost.
- Receive the task-finished event from the server.
- Close the WebSocket connection.
Reusing a task_id across tasks, or mixing task_ids within one task, can cause:
- Disordered audio delivery.
- Misaligned speech content.
- Abnormal task state, possibly preventing receipt of the task-finished event.
- Billing failures or inaccurate usage statistics.
To manage task_ids correctly:
- Generate a unique task_id (for example, UUID) when sending run-task.
- Store the task_id in a variable.
- Use this task_id for all subsequent continue-task and finish-task instructions.
- After receiving task-finished, generate a new task_id for the next task.
Client implementation tips
Server and client responsibilities
Server responsibilities
The server delivers the complete audio stream in order. You do not need to handle audio ordering or completeness.
Client responsibilities
- Read and concatenate all audio chunks: The server delivers audio as multiple binary frames. Receive all frames and concatenate them in arrival order.
- Maintain a complete WebSocket lifecycle: Do not disconnect during the task, from sending run-task to receiving task-finished. Common mistakes:
  - Closing the connection before all audio chunks arrive, resulting in incomplete audio.
  - Forgetting to send finish-task, leaving text buffered and unprocessed.
  - Failing to handle WebSocket keepalive during page navigation or app backgrounding.
- Text integrity in ASR-to-LLM-to-TTS workflows: Ensure the text passed to TTS is complete:
  - Wait for the LLM to generate a full sentence before sending continue-task, rather than streaming character by character.
  - For streaming synthesis, send text at natural sentence boundaries (periods, question marks).
  - After the LLM finishes generating, always send finish-task to avoid missing trailing content.
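The receive side reduces to one decision per frame: binary frames are audio to append, text frames are JSON events. A minimal Python sketch (the helper name is illustrative):

```python
import json

def handle_frame(message, audio_buffer: bytearray):
    """Append binary audio frames to audio_buffer in arrival order;
    return the event name for JSON text frames (None for audio frames)."""
    if isinstance(message, (bytes, bytearray)):
        audio_buffer.extend(message)
        return None
    return json.loads(message)["header"]["event"]
```

A receive loop would call this for every frame and stop on task-finished or task-failed, then write audio_buffer to a file or hand it to a streaming player.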
Platform-specific tips
- Flutter: Close the connection in the dispose method to prevent memory leaks when using web_socket_channel. Handle app lifecycle events (such as AppLifecycleState.paused) for background transitions.
- Web (browser): Some browsers limit WebSocket connections. Reuse a single connection for multiple tasks. Use beforeunload to close the connection before the page closes.
- Mobile (iOS/Android native): The OS may pause or terminate network connections when the app enters the background. Use a background task or foreground service to keep the WebSocket active, or reinitialize the task on foreground return.
URL
- Wrong protocol: Use wss://, not http:// or https://.
- Auth in query string: Do not put Authorization in the URL (such as ?Authorization=bearer YOUR_API_KEY). Set it in the HTTP handshake headers. See Headers.
- Extra path segments: Do not append model names or other parameters to the URL. Specify the model in payload.model in the run-task instruction.
Headers
| Parameter | Type | Required | Description |
|---|---|---|---|
| Authorization | string | Yes | Authentication token. Format: Bearer $DASHSCOPE_API_KEY. |
| user-agent | string | No | Client identifier for source tracking. |
| X-DashScope-WorkSpace | string | No | Your Qwen Code workspace ID. |
| X-DashScope-DataInspection | string | No | Data compliance inspection. Default: enable. Do not set unless necessary. |
Troubleshoot authentication failures
Authentication occurs during the WebSocket handshake, not when sending run-task. If the Authorization header is missing or invalid, the server rejects the handshake with an HTTP 401 or 403 error; client libraries typically report this as a WebSocketBadStatus exception. If the WebSocket connection fails:
- Check API key format: Confirm the Authorization header uses bearer YOUR_API_KEY with a space between bearer and the key.
- Verify API key validity: Check your API keys page to confirm the key is active and authorized for CosyVoice models.
- Check header placement: Set the Authorization header during the WebSocket handshake. Examples by language:
  - Python (websockets): extra_headers={"Authorization": f"bearer {api_key}"}
  - JavaScript: The browser WebSocket API does not support custom headers. Use a server-side proxy or another library such as ws.
  - Go (gorilla/websocket): header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))
- Test network connectivity: Use curl or Postman to verify the API key by calling other HTTP-supported DashScope APIs.
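For example, in Python the handshake headers can be built and passed to the websockets library. The endpoint URL below is an assumption based on the dashscope-intl.aliyuncs.com domain used elsewhere on this page; confirm it against the official documentation before use:

```python
import os

def auth_headers(api_key: str) -> dict:
    # "bearer", a space, then the key -- sent in the handshake, never in the URL.
    return {"Authorization": f"bearer {api_key}"}

async def open_connection():
    import websockets  # third-party: pip install websockets
    url = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference"  # assumed endpoint
    # websockets >= 14 takes `additional_headers`; older releases use `extra_headers`.
    return await websockets.connect(
        url, additional_headers=auth_headers(os.environ["DASHSCOPE_API_KEY"])
    )
```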
Using WebSocket in browsers
The browser new WebSocket(url) API does not support custom request headers (including Authorization) during the handshake. You cannot authenticate directly from frontend code.
Solution: Use a backend proxy
- Connect to CosyVoice from your backend (Node.js, Java, or Python), where you can set the Authorization header.
- Have the frontend connect to your backend via WebSocket, which forwards messages to CosyVoice.
- This keeps the API key hidden and lets you add authentication, logging, or rate limiting.
- Frontend (native web) + Backend (Node.js Express): cosyvoiceNodeJs_en.zip
- Frontend (native web) + Backend (Python Flask): cosyvoiceFlask_en.zip
Instructions (client to server)
Instructions are JSON messages sent as WebSocket text frames. They control the task lifecycle.
Send instructions in this order:
- Send run-task: Starts the task. Use the same task_id in all subsequent continue-task and finish-task instructions.
- Send continue-task: Sends text to synthesize. Send only after receiving task-started.
- Send finish-task: Ends the task. Send after all continue-task instructions are sent.
1. run-task instruction: Start a task
Starts a text to speech task. Configure voice, sample rate, and other parameters here.
- Timing: Send after the WebSocket connection is established.
- Do not send text here. Send text using continue-task instead.
- The input field is required but must be {}. Omitting the input field, or adding unexpected fields (such as mode or content), causes "InvalidParameter: task can not be null" or closes the connection (WebSocket code 1007).
header parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| header.action | string | Yes | Fixed value: "run-task". |
| header.task_id | string | Yes | A 32-character UUID. Hyphens are optional (such as "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx" or "2bf83b9abaeb4fda8d9axxxxxxxxxxxx"). Most languages provide built-in UUID APIs. |
| header.streaming | string | Yes | Fixed value: "duplex". |
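A task ID can be generated with a standard UUID API; for example, in Python:

```python
import uuid

task_id = uuid.uuid4().hex  # 32 hexadecimal characters, no hyphens
```

Use the same task_id for all subsequent continue-task and finish-task instructions.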
payload parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| payload.task_group | string | Yes | Fixed value: "audio". |
| payload.task | string | Yes | Fixed value: "tts". |
| payload.function | string | Yes | Fixed value: "SpeechSynthesizer". |
| payload.model | string | Yes | The text to speech model. See Voice list. |
| payload.input | object | Yes | Required but must be empty ({}) in run-task. Send text using continue-task. |
payload.parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| text_type | string | Yes | Fixed value: "PlainText". |
| voice | string | Yes | Voice for synthesis. See Voice list for available system voices. |
| format | string | No | Audio format. Supports pcm, wav, mp3 (default), and opus. For opus, adjust bitrate with bit_rate. |
| sample_rate | integer | No | Sample rate in Hz. Default: 22050. Valid values: 8000, 16000, 22050, 24000, 44100, 48000. |
| volume | integer | No | Volume. Default: 50. Range: [0, 100]. Scales linearly. 0 is silent, 100 is maximum. |
| rate | float | No | Speech rate. Default: 1.0. Range: [0.5, 2.0]. Below 1.0 slows speech; above 1.0 speeds it up. |
| pitch | float | No | Pitch multiplier. Default: 1.0. Range: [0.5, 2.0]. The relationship with perceived pitch is not strictly linear. Test to find a suitable value. |
| enable_ssml | boolean | No | Enable SSML. When true, only one continue-task instruction is allowed. |
| bit_rate | int | No | Audio bitrate in kbps (for Opus format). Default: 32. Range: [6, 510]. |
| word_timestamp_enabled | boolean | No | Enable word-level timestamps. Default: false. Available for system voices marked as supported in the voice list. When enabled, timestamps appear in the result-generated event. |
| seed | int | No | Random seed for generation. Same seed with identical parameters reproduces the same output. Default: 0. Range: [0, 65535]. |
| language_hints | array[string] | No | Target language for synthesis. Valid values: zh, en, fr, de, ja, ko, ru, pt, th, id, vi. This is an array, but only the first element is processed. |
| instruction | string | No | Controls synthesis effects such as dialect, emotion, or speaking style. Available for system voices marked as supporting Instruct in the voice list. For cosyvoice-v3-flash with system voices, the instruction must use a fixed format (see the voice list). Max length: 100 characters. |
| enable_aigc_tag | boolean | No | Add an invisible AIGC identifier to generated audio. When true, an identifier is embedded in WAV, MP3, and Opus formats. Default: false. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. |
| aigc_propagator | string | No | Sets the ContentPropagator field in the AIGC identifier. Takes effect only when enable_aigc_tag is true. Default: UID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. |
| aigc_propagate_id | string | No | Sets the PropagateID field in the AIGC identifier. Takes effect only when enable_aigc_tag is true. Default: the current request ID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. |
| hot_fix | object | No | Text hotpatching configuration. Customize pronunciation or replace text before synthesis. Available only for cosyvoice-v3-flash. |
| enable_markdown_filter | boolean | No | Enable Markdown filtering. Removes Markdown symbols from input text before synthesis. Default: false. Available only for cosyvoice-v3-flash. |
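Putting the tables above together, a run-task message might be assembled as follows. The model and voice values are placeholders; pick real values from the voice list:

```python
import json
import uuid

task_id = uuid.uuid4().hex  # 32-char UUID, reused by continue-task/finish-task

run_task = {
    "header": {"action": "run-task", "task_id": task_id, "streaming": "duplex"},
    "payload": {
        "task_group": "audio",
        "task": "tts",
        "function": "SpeechSynthesizer",
        "model": "cosyvoice-v3-flash",
        "parameters": {
            "text_type": "PlainText",
            "voice": "<voice-name>",   # placeholder: pick one from the voice list
            "format": "mp3",
            "sample_rate": 22050,
        },
        "input": {},                   # required, must be empty in run-task
    },
}
message = json.dumps(run_task)  # send as a WebSocket text frame
```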
2. continue-task instruction
Sends text to synthesize. Send all text in one instruction, or split it across multiple instructions in order.
When to send: After receiving task-started.
Do not wait longer than 23 seconds between text fragments, or a "request timeout after 23 seconds" error occurs. If no more text remains, send finish-task to end the task. The 23-second timeout is server-enforced and cannot be modified.
header parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| header.action | string | Yes | Fixed value: "continue-task". |
| header.task_id | string | Yes | Must match the task_id from run-task. |
| header.streaming | string | Yes | Fixed value: "duplex". |
payload parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| input.text | string | Yes | Text to synthesize. Up to 20,000 characters per instruction and 200,000 characters total per task (see Text length limits). |
3. finish-task instruction: End task
Ends the task. Always send this instruction. Otherwise:
- Incomplete audio: The server won't force-synthesize cached sentences, causing missing endings.
- Connection timeout: Waiting more than 23 seconds after the last continue-task triggers a timeout.
- Billing issues: Usage information may be inaccurate.
header parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| header.action | string | Yes | Fixed value: "finish-task". |
| header.task_id | string | Yes | Must match the task_id from run-task. |
| header.streaming | string | Yes | Fixed value: "duplex". |
payload parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| payload.input | object | Yes | Fixed value: {}. |
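continue-task and finish-task reuse the header shape of run-task and differ only in payload. A sketch of both message builders:

```python
import json

def continue_task(task_id: str, text: str) -> str:
    # Text fragments are sent in order; the server segments them into sentences.
    return json.dumps({
        "header": {"action": "continue-task", "task_id": task_id,
                   "streaming": "duplex"},
        "payload": {"input": {"text": text}},
    })

def finish_task(task_id: str) -> str:
    # payload.input is a fixed empty object in finish-task.
    return json.dumps({
        "header": {"action": "finish-task", "task_id": task_id,
                   "streaming": "duplex"},
        "payload": {"input": {}},
    })
```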
Events (server to client)
Events are JSON messages from the server. Each marks a stage in the task lifecycle.
Binary audio is sent separately -- not included in any event.
1. task-started event: Task started
Confirms the task has started. Send continue-task or finish-task only after receiving this event. Otherwise, the task fails.
The task-started event's payload is empty.
header parameters:
| Parameter | Type | Description |
|---|---|---|
| header.event | string | Fixed value: "task-started". |
| header.task_id | string | Task ID generated by the client. |
2. result-generated event
While you send continue-task and finish-task instructions, the server returns result-generated events and binary audio frames.
Each result-generated event contains the current sentence index. Audio data arrives as binary frames between events. One sentence produces multiple binary audio frames. Receive frames in order and append to the same file.
header parameters:
| Parameter | Type | Description |
|---|---|---|
| header.event | string | Fixed value: "result-generated". |
| header.task_id | string | Task ID generated by the client. |
| header.attributes | object | Additional attributes -- usually empty. |
payload parameters:
| Parameter | Type | Description |
|---|---|---|
| payload.output.sentence.index | integer | Sentence number, starting from 0. |
| payload.output.sentence.words | array | Array of word information. |
| payload.output.sentence.words.text | string | Word text. |
| payload.output.sentence.words.begin_index | integer | Starting position of the word in the sentence, counting from 0. |
| payload.output.sentence.words.end_index | integer | Ending position of the word in the sentence, counting from 1. |
| payload.output.sentence.words.begin_time | integer | Start timestamp of the word's audio, in milliseconds. |
| payload.output.sentence.words.end_time | integer | End timestamp of the word's audio, in milliseconds. |
| payload.usage.characters | integer | Cumulative billed characters so far. The usage field appears in some result-generated events. Use the last occurrence. |
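When word_timestamp_enabled is true, word timings can be extracted from each result-generated event. A sketch against the schema above (the sample event in the test is illustrative, not server output):

```python
import json

def word_timestamps(event_json: str):
    """Return (text, begin_time_ms, end_time_ms) per word from a
    result-generated event; empty list when timestamps are absent."""
    sentence = (json.loads(event_json)
                .get("payload", {})
                .get("output", {})
                .get("sentence", {}) or {})
    return [(w["text"], w["begin_time"], w["end_time"])
            for w in sentence.get("words", []) or []]
```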
3. task-finished event: Task finished
Marks the end of the task.
After the task ends, close the WebSocket connection or reuse it to send a new run-task instruction (see Connection overhead and reuse).
header parameters:
| Parameter | Type | Description |
|---|---|---|
| header.event | string | Fixed value: "task-finished". |
| header.task_id | string | Task ID generated by the client. |
| header.attributes.request_uuid | string | Request ID. Provide this to CosyVoice developers for diagnosis. |
4. task-failed event: Task failed
Indicates the task has failed. Close the WebSocket connection and review the error message.
header parameters:
| Parameter | Type | Description |
|---|---|---|
| header.event | string | Fixed value: "task-failed". |
| header.task_id | string | Task ID generated by the client. |
| header.error_code | string | Error type. |
| header.error_message | string | Detailed error reason. |
Task interruption
During streaming synthesis, you can interrupt the current task early (for example, if the user cancels playback) using one of these methods:
| Interrupt method | Server behavior | Use case |
|---|---|---|
| Close the connection | Stops synthesis immediately. Discards unsent audio. No task-finished event. Connection cannot be reused. | Immediate stop: User cancels playback, switches content, or exits app. |
| Send finish-task | Forces synthesis of cached text. Returns remaining audio and task-finished event. Connection stays reusable. | Graceful end: Stop sending text but receive all cached audio. |
Connection overhead and reuse
The WebSocket service supports connection reuse.
Send run-task to start a task and finish-task to end it. After task-finished, reuse the same connection by sending a new run-task instruction.
- Send a new run-task only after receiving task-finished.
- Use different task_ids for different tasks on the same connection.
- Failed tasks trigger task-failed and close the connection (cannot reuse).
- Connections time out after 60 seconds of inactivity.
Performance and concurrency
Concurrency limits
See Rate limits.
To increase your concurrency quota, contact customer support. Quota adjustments require review and typically take 1 to 3 business days.
Best practice: Reuse a WebSocket connection for multiple tasks. See Connection overhead and reuse.
Connection latency
Typical connection time:
- Cross-border connections: 1 to 3 seconds. In rare cases, 10 to 30 seconds.
Common causes of slow connections:
- Network latency: Check cross-border connection quality or ISP performance.
- Slow DNS: Try public DNS (8.8.8.8) or configure a local hosts file for dashscope-intl.aliyuncs.com.
- TLS handshake: Update to TLS 1.2 or later.
- Proxy/firewall: Corporate networks may block or slow WebSocket connections.
To diagnose:
- Use Wireshark or tcpdump to analyze TCP handshake, TLS handshake, and WebSocket Upgrade timing.
- Test HTTP latency with curl: curl -w "@curl-format.txt" -o /dev/null -s https://dashscope-intl.aliyuncs.com
Audio generation speed
- Real-time factor (RTF): 0.1 to 0.5x real-time (1 second of audio takes 0.1 to 0.5 seconds to generate). Actual speed varies by model, text length, and server load.
- First packet latency: 200 to 800 ms from sending continue-task to receiving the first audio chunk.
Example code
Basic connectivity example. Implement production-ready logic for your use case. Use asynchronous programming to send and receive simultaneously:
- Connect: Call your WebSocket library's connect function with the headers and URL.
- Listen for messages: The server sends binary audio frames and JSON events.
  Events:
  - task-started: Task started. Send continue-task or finish-task only after this.
  - result-generated: Returned continuously after you send continue-task or finish-task.
  - task-finished: Task complete. Close the connection.
  - task-failed: Task failed. Close the connection and check the error.
  Audio handling:
  - For MP3/Opus streaming: Use a streaming player (FFmpeg, PyAudio, AudioFormat, MediaSource). Do not play frame by frame.
  - To save complete audio: Write frames to the same file in append mode.
  - For WAV/MP3: Only the first frame has header info; subsequent frames are audio data only.
- Send instructions: From a separate thread, send instructions to the server.
- Close connection: Close when done, on error, or after task-finished/task-failed.
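The steps above can be sketched end to end in Python with the third-party websockets library. The endpoint URL is an assumption (based on the dashscope-intl.aliyuncs.com domain used earlier on this page), and the model and voice are placeholders; treat this as a starting point, not a production client:

```python
import asyncio
import json
import os
import uuid

URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference"  # assumed endpoint

def build(action: str, task_id: str, payload: dict) -> str:
    """Wrap a payload in the common instruction header."""
    return json.dumps({
        "header": {"action": action, "task_id": task_id, "streaming": "duplex"},
        "payload": payload,
    })

async def synthesize(text: str, out_path: str = "output.mp3"):
    import websockets  # third-party: pip install websockets
    headers = {"Authorization": f"bearer {os.environ['DASHSCOPE_API_KEY']}"}
    task_id = uuid.uuid4().hex
    async with websockets.connect(URL, additional_headers=headers) as ws:
        await ws.send(build("run-task", task_id, {
            "task_group": "audio", "task": "tts",
            "function": "SpeechSynthesizer",
            "model": "cosyvoice-v3-flash",
            "parameters": {"text_type": "PlainText",
                           "voice": "<voice-name>",  # placeholder
                           "format": "mp3"},
            "input": {},                              # must be empty in run-task
        }))
        with open(out_path, "wb") as f:
            async for message in ws:
                if isinstance(message, bytes):
                    f.write(message)                  # append audio frames in order
                    continue
                event = json.loads(message)["header"]["event"]
                if event == "task-started":
                    await ws.send(build("continue-task", task_id,
                                        {"input": {"text": text}}))
                    await ws.send(build("finish-task", task_id, {"input": {}}))
                elif event in ("task-finished", "task-failed"):
                    break

# asyncio.run(synthesize("Hello!"))
```

The loop sends continue-task and finish-task only after task-started, appends every binary frame to one file, and exits on the terminal event, mirroring the interaction flow described above.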