CosyVoice WebSocket API

Parameters and protocol for CosyVoice text to speech over WebSocket. The DashScope SDK supports only Java and Python; use the WebSocket API from other languages. For model overviews and voice selection, see Speech synthesis.
WebSocket enables full-duplex communication: the client and server establish a persistent connection with a single handshake, then push data to each other in real time. Common WebSocket libraries:
  • Go: gorilla/websocket
  • PHP: Ratchet
  • Node.js: ws
CosyVoice models support only WebSocket -- not HTTP REST APIs. HTTP requests (POST, GET) return InvalidParameter or URL errors.

Prerequisites

Get an API key.

Models and pricing

See Speech synthesis.

Text and format limits

Text length limits

Send up to 20,000 characters per continue-task instruction. The total across all continue-task instructions must not exceed 200,000 characters.

Character counting rules

  • Chinese characters (simplified, traditional, Japanese Kanji, Korean Hanja) count as two characters. All others (punctuation, letters, numbers, Kana/Hangul) count as one.
  • SSML tags are excluded from the character count.
  • Examples:
    • "你好" → 2 + 2 = 4 characters
    • "中A文123" → 2 + 1 + 2 + 1 + 1 + 1 = 8 characters
    • "中文。" → 2 + 2 + 1 = 5 characters
    • "中 文。" → 2 + 1 + 2 + 1 = 6 characters
    • "<speak>你好</speak>" → 2 + 2 = 4 characters
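These counting rules can be sketched as a small Python helper. This is an illustrative sketch, not an official function: the ideograph test covers only the basic CJK Unified Ideographs block, and the tag-stripping regex is an assumption about how SSML tags are excluded.

```python
import re

def count_billable_chars(text: str) -> int:
    # SSML tags are excluded from the character count
    text = re.sub(r"<[^>]+>", "", text)
    # CJK ideographs count as 2; everything else (letters, digits,
    # punctuation, Kana/Hangul, spaces) counts as 1.
    return sum(2 if "\u4e00" <= ch <= "\u9fff" else 1 for ch in text)
```

Applied to the examples above, this returns 4, 8, 5, 6, and 4 respectively.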

Encoding format

Use UTF-8 encoding.

Math expression support

Math expression parsing is available for cosyvoice-v3-flash and cosyvoice-v3-plus. It covers common primary and secondary school math including basic operations, algebra, and geometry.
This feature supports Chinese only.
See Convert LaTeX formulas to speech (Chinese only).

SSML support

SSML requires all of the following:
  1. Model: Only cosyvoice-v3-flash and cosyvoice-v3-plus support SSML.
  2. Voice: Use an SSML-enabled voice:
    • All cloned voices (created through the Voice Cloning API).
    • System voices marked as SSML-enabled in the voice list.
    System voices without SSML support (such as some basic voices) return the error "SSML text is not supported at the moment!" even with enable_ssml enabled.
  3. Parameter: Set enable_ssml to true in the run-task instruction.
Then send SSML-formatted text through the continue-task instruction. For a complete example, see Getting started.

Interaction flow

Client-to-server messages are instructions. Server-to-client messages are JSON events or binary audio streams. The sequence:
  1. Open a WebSocket connection.
  2. Send the run-task instruction to start a task.
  3. Wait for the task-started event before proceeding.
  4. Send text: Send one or more continue-task instructions in order. After receiving a complete sentence, the server returns a result-generated event and the audio stream. For text length constraints, see the text field in the continue-task instruction.
    Send multiple continue-task instructions to submit text fragments in order. The server segments text into sentences automatically:
    • Complete sentences are synthesized immediately.
    • Incomplete sentences are buffered until complete. No audio is returned for incomplete sentences.
    After receiving the finish-task instruction, the server force-synthesizes all buffered content.
  5. Receive the audio stream through the binary channel.
  6. After sending all text, send the finish-task instruction. Continue receiving the audio stream. Do not skip this step, or the ending portion of the audio may be lost.
  7. Receive the task-finished event from the server.
  8. Close the WebSocket connection.
Reuse a WebSocket connection for multiple tasks instead of creating a new connection each time. See Connection overhead and reuse. Keep the task_id consistent: run-task, all continue-task, and finish-task instructions in a single task must use the same task_id. Mismatched task_ids cause:
  • Disordered audio delivery.
  • Misaligned speech content.
  • Abnormal task state, possibly preventing receipt of the task-finished event.
  • Billing failures or inaccurate usage statistics.
Best practice:
  • Generate a unique task_id (for example, UUID) when sending run-task.
  • Store the task_id in a variable.
  • Use this task_id for all subsequent continue-task and finish-task instructions.
  • After receiving task-finished, generate a new task_id for the next task.
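This best practice can be sketched in Python. The make_instruction helper is illustrative, not an SDK function; the header fields mirror the instruction formats documented below.

```python
import json
import uuid

def make_instruction(action: str, task_id: str, payload: dict) -> str:
    # Every instruction in one task carries the same task_id
    return json.dumps({
        "header": {"action": action, "task_id": task_id, "streaming": "duplex"},
        "payload": payload,
    })

task_id = uuid.uuid4().hex  # one fresh task_id per task
run_task = make_instruction("run-task", task_id, {
    "task_group": "audio", "task": "tts", "function": "SpeechSynthesizer",
    "model": "cosyvoice-v3-flash",
    "parameters": {"text_type": "PlainText", "voice": "longanyang"},
    "input": {},  # required but empty in run-task
})
continue_task = make_instruction("continue-task", task_id, {"input": {"text": "Hello"}})
finish_task = make_instruction("finish-task", task_id, {"input": {}})
```

After task-finished arrives, call uuid.uuid4().hex again for the next task.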

Client implementation tips

Server and client responsibilities

Server responsibilities: The server delivers the complete audio stream in order. You do not need to handle audio ordering or completeness.
Client responsibilities:
  1. Read and concatenate all audio chunks. The server delivers audio as multiple binary frames. Receive all frames and concatenate them:
# Python: Concatenate audio chunks
with open("output.mp3", "ab") as f:  # Append mode
    f.write(audio_chunk)  # audio_chunk is each received binary audio chunk
// JavaScript: Concatenate audio chunks
const audioChunks = [];
ws.onmessage = (event) => {
  if (event.data instanceof Blob) {
    audioChunks.push(event.data);  // Collect all audio chunks
  }
};
// Merge audio after task completes
const audioBlob = new Blob(audioChunks, { type: 'audio/mp3' });
  2. Maintain a complete WebSocket lifecycle. Do not disconnect during the task, from sending run-task to receiving task-finished. Common mistakes:
    • Closing the connection before all audio chunks arrive, resulting in incomplete audio.
    • Forgetting to send finish-task, leaving text buffered and unprocessed.
    • Failing to handle WebSocket keepalive during page navigation or app backgrounding.
    Mobile apps (Flutter, iOS, Android) need special network handling when entering the background. Maintain the WebSocket connection in a background task or service, or reinitialize it when returning to the foreground.
  3. Text integrity in ASR-to-LLM-to-TTS workflows. Ensure the text passed to TTS is complete:
    • Wait for the LLM to generate a full sentence before sending continue-task, rather than streaming character-by-character.
    • For streaming synthesis, send text at natural sentence boundaries (periods, question marks).
    • After the LLM finishes generating, always send finish-task to avoid missing trailing content.
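These rules can be sketched as a sentence-boundary buffer. This is an illustrative sketch; the punctuation set and function name are assumptions, not part of the API.

```python
SENTENCE_END = set("。！？.!?")

def sentence_chunks(token_stream):
    """Buffer LLM tokens and yield text only at sentence boundaries."""
    buf = ""
    for token in token_stream:
        buf += token
        while any(ch in SENTENCE_END for ch in buf):
            # Emit up to and including the first sentence-ending mark
            idx = next(i for i, ch in enumerate(buf) if ch in SENTENCE_END)
            yield buf[: idx + 1]
            buf = buf[idx + 1:]
    if buf.strip():
        yield buf  # trailing fragment: send it before finish-task
```

Each yielded chunk becomes the text of one continue-task instruction; send finish-task once the stream is exhausted.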

Platform-specific tips

  • Flutter: Close the connection in the dispose method to prevent memory leaks when using web_socket_channel. Handle app lifecycle events (such as AppLifecycleState.paused) for background transitions.
  • Web (browser): Some browsers limit WebSocket connections. Reuse a single connection for multiple tasks. Use beforeunload to close the connection before the page closes.
  • Mobile (iOS/Android native): The OS may pause or terminate network connections when the app enters the background. Use a background task or foreground service to keep the WebSocket active, or reinitialize the task on foreground return.

URL

wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
Common URL errors:
  • Wrong protocol: Use wss://, not http:// or https://.
  • Auth in query string: Do not put Authorization in the URL (such as ?Authorization=bearer YOUR_API_KEY). Set it in the HTTP handshake headers. See Headers.
  • Extra path segments: Do not append model names or other parameters to the URL. Specify the model in payload.model in the run-task instruction.

Headers

Parameter | Type | Required | Description
Authorization | string | Yes | Authentication token. Format: Bearer $DASHSCOPE_API_KEY.
user-agent | string | No | Client identifier for source tracking.
X-DashScope-WorkSpace | string | No | Your workspace ID.
X-DashScope-DataInspection | string | No | Data compliance inspection. Default: enable. Do not set unless necessary.
Authentication timing: Authentication occurs during the WebSocket handshake, not when sending run-task. If the Authorization header is missing or invalid, the server rejects the handshake with an HTTP 401 or 403 error. Client libraries typically report this as a WebSocketBadStatus exception.

Troubleshoot authentication failures

If the WebSocket connection fails:
  1. Check API key format: Confirm the Authorization header uses bearer YOUR_API_KEY with a space between bearer and the key.
  2. Verify API key validity: Check your API keys page to confirm the key is active and authorized for CosyVoice models.
  3. Check header placement: Set the Authorization header during the WebSocket handshake. Examples by language:
    • Python (websockets): extra_headers={"Authorization": f"bearer {api_key}"}
    • JavaScript: The browser WebSocket API does not support custom headers. Use a server-side proxy or another library such as ws.
    • Go (gorilla/websocket): header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))
  4. Test network connectivity: Use curl or Postman to verify the API key by calling other HTTP-supported DashScope APIs.

Using WebSocket in browsers

The browser new WebSocket(url) API does not support custom request headers (including Authorization) during the handshake, so you cannot authenticate directly from frontend code. Solution: use a backend proxy.
  1. Connect to CosyVoice from your backend (Node.js, Java, or Python), where you can set the Authorization header.
  2. Have the frontend connect to your backend via WebSocket, which forwards messages to CosyVoice.
  3. This keeps the API key hidden and lets you add authentication, logging, or rate limiting.
Never hardcode your API key in frontend code. A leaked API key can lead to account compromise, unexpected charges, or data breaches. Example code: For other languages, implement the same logic or use AI tools to convert these examples.

Instructions (client to server)

Instructions are JSON messages sent as WebSocket text frames. They control the task lifecycle. Send instructions in this order:
  1. Send run-task
  2. Send continue-task
    • Sends text to synthesize.
    • Send only after receiving task-started.
  3. Send finish-task
    • Ends the task.
    • Send after all continue-task instructions are sent.

1. run-task instruction: Start a task

Starts a text to speech task. Configure voice, sample rate, and other parameters here.
  • Timing: Send after the WebSocket connection is established.
  • Do not send text here. Send text using continue-task instead.
  • The input field is required but must be {}. Omitting it causes the "task can not be null" error.
Example:
{
  "header": {
    "action": "run-task",
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "streaming": "duplex"
  },
  "payload": {
    "task_group": "audio",
    "task": "tts",
    "function": "SpeechSynthesizer",
    "model": "cosyvoice-v3-flash",
    "parameters": {
      "text_type": "PlainText",
      "voice": "longanyang",
      "format": "mp3",
      "sample_rate": 22050,
      "volume": 50,
      "rate": 1,
      "pitch": 1
    },
    "input": {}
  }
}
header parameters:
Parameter | Type | Required | Description
header.action | string | Yes | Fixed value: "run-task".
header.task_id | string | Yes | A 32-character UUID. Hyphens are optional (such as "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx" or "2bf83b9abaeb4fda8d9axxxxxxxxxxxx"). Most languages provide built-in UUID APIs.
header.streaming | string | Yes | Fixed value: "duplex".
Python example for generating a task ID:
import uuid

def generate_task_id():
    # Generate a random 32-character hex UUID (no hyphens)
    return uuid.uuid4().hex
Use the same task_id for all subsequent continue-task and finish-task instructions.
payload parameters:
Parameter | Type | Required | Description
payload.task_group | string | Yes | Fixed value: "audio".
payload.task | string | Yes | Fixed value: "tts".
payload.function | string | Yes | Fixed value: "SpeechSynthesizer".
payload.model | string | Yes | The text to speech model. See Voice list.
payload.input | object | Yes | Required but must be empty ({}) in run-task. Send text using continue-task.
Common error: Omitting the input field or adding unexpected fields (like mode or content) causes "InvalidParameter: task can not be null" or connection close (WebSocket code 1007).
payload.parameters:
Parameter | Type | Required | Description
text_type | string | Yes | Fixed value: "PlainText".
voice | string | Yes | Voice for synthesis. See Voice list for available system voices.
format | string | No | Audio format. Supports pcm, wav, mp3 (default), and opus. For opus, adjust the bitrate with bit_rate.
sample_rate | integer | No | Sample rate in Hz. Default: 22050. Valid values: 8000, 16000, 22050, 24000, 44100, 48000.
volume | integer | No | Volume. Default: 50. Range: [0, 100]. Scales linearly; 0 is silent, 100 is maximum.
rate | float | No | Speech rate. Default: 1.0. Range: [0.5, 2.0]. Below 1.0 slows speech; above 1.0 speeds it up.
pitch | float | No | Pitch multiplier. Default: 1.0. Range: [0.5, 2.0]. The relationship with perceived pitch is not strictly linear; test to find a suitable value.
enable_ssml | boolean | No | Enable SSML. When true, only one continue-task instruction is allowed.
bit_rate | integer | No | Audio bitrate in kbps (for the opus format). Default: 32. Range: [6, 510].
word_timestamp_enabled | boolean | No | Enable word-level timestamps. Default: false. Available for system voices marked as supported in the voice list.
seed | integer | No | Random seed for generation. The same seed with identical parameters reproduces the same output. Default: 0. Range: [0, 65535].
language_hints | array[string] | No | Target language for synthesis. Valid values: zh, en, fr, de, ja, ko, ru, pt, th, id, vi. This is an array, but only the first element is processed.
instruction | string | No | Controls synthesis effects such as dialect, emotion, or speaking style. Available for system voices marked as supporting Instruct in the voice list. Max length: 100 characters.
enable_aigc_tag | boolean | No | Add an invisible AIGC identifier to generated audio. When true, an identifier is embedded in WAV, MP3, and Opus formats. Default: false. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus.
aigc_propagator | string | No | Sets the ContentPropagator field in the AIGC identifier. Takes effect only when enable_aigc_tag is true. Default: UID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus.
aigc_propagate_id | string | No | Sets the PropagateID field in the AIGC identifier. Takes effect only when enable_aigc_tag is true. Default: the current request ID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus.
hot_fix | object | No | Text hotpatching configuration. Customize pronunciation or replace text before synthesis. Available only for cosyvoice-v3-flash.
enable_markdown_filter | boolean | No | Enable Markdown filtering. Removes Markdown symbols from input text before synthesis. Default: false. Available only for cosyvoice-v3-flash.
When word_timestamp_enabled is enabled, timestamps appear in the result-generated event:
{
  "header": {
    "task_id": "3f39be22-efbd-4844-91d5-xxxxxxxxxxxx",
    "event": "result-generated",
    "attributes": {}
  },
  "payload": {
    "output": {
      "sentence": {
        "index": 0,
        "words": [
          {
            "text": "bed",
            "begin_index": 0,
            "end_index": 1,
            "begin_time": 280,
            "end_time": 640
          }
        ]
      }
    }
  }
}
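A small sketch of reading these timestamps out of the event payload (extract_words is an illustrative helper, not an SDK function):

```python
def extract_words(event: dict) -> list:
    # Returns (text, begin_time_ms, end_time_ms) for each word in the sentence
    sentence = event.get("payload", {}).get("output", {}).get("sentence", {})
    return [(w["text"], w["begin_time"], w["end_time"])
            for w in sentence.get("words", [])]
```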
Instruction examples for cosyvoice-v3-flash with cloned voices:
Please speak in Cantonese. (Supported dialects: Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuan, Tianjin, Yunnan.)
Please say a sentence as loudly as possible.
Please say a sentence as slowly as possible.
Please say a sentence as quickly as possible.
Please say a sentence very softly.
Can you speak a little slower?
Can you speak very quickly?
Can you speak very slowly?
Can you speak a little faster?
Please say a sentence very angrily.
Please say a sentence very happily.
Please say a sentence very fearfully.
Please say a sentence very sadly.
Please say a sentence very surprisedly.
Please try to sound as firm as possible.
Please try to sound as angry as possible.
Please try an approachable tone.
Please speak in a cold tone.
Please speak in a majestic tone.
I want to experience a natural tone.
I want to see how you express a threat.
I want to see how you express wisdom.
I want to see how you express seduction.
I want to hear you speak in a lively way.
I want to hear you speak with passion.
I want to hear you speak in a steady manner.
I want to hear you speak with confidence.
Can you talk to me with excitement?
Can you show an arrogant emotion?
Can you show an elegant emotion?
Can you answer the question happily?
Can you give a gentle emotional demonstration?
Can you talk to me in a calm tone?
Can you answer me in a deep way?
Can you talk to me with a gruff attitude?
Tell me the answer in a sinister voice.
Tell me the answer in a resilient voice.
Narrate in a natural and friendly chat style.
Speak in the tone of a radio drama podcaster.
For cosyvoice-v3-flash with system voices, the instruction must use a fixed format. See the voice list.
hot_fix example:
"hot_fix": {
  "pronunciation": [
    {"weather": "tian1 qi4"}
  ],
  "replace": [
    {"today": "jin1 tian1"}
  ]
}

2. continue-task instruction

Sends text to synthesize. Send all text in one instruction, or split it across multiple instructions in order. When to send: After receiving task-started.
Do not wait longer than 23 seconds between text fragments, or a "request timeout after 23 seconds" error occurs. If no more text remains, send finish-task to end the task. The 23-second timeout is server-enforced and cannot be modified.
Example:
{
  "header": {
    "action": "continue-task",
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "streaming": "duplex"
  },
  "payload": {
    "input": {
      "text": "Before my bed, moonlight shines bright, I suspect it's frost upon the ground."
    }
  }
}
header parameters:
Parameter | Type | Required | Description
header.action | string | Yes | Fixed value: "continue-task".
header.task_id | string | Yes | Must match the task_id from run-task.
header.streaming | string | Yes | Fixed value: "duplex".
payload parameters:
Parameter | Type | Required | Description
input.text | string | Yes | Text to synthesize.

3. finish-task instruction: End task

Ends the task. Always send this instruction. Otherwise:
  • Incomplete audio: The server won't force-synthesize cached sentences, causing missing endings.
  • Connection timeout: Waiting more than 23 seconds after the last continue-task triggers a timeout.
  • Billing issues: Usage information may be inaccurate.
When to send: Send immediately after all continue-task instructions. Do not wait for audio to finish -- this may trigger timeouts. Example:
{
  "header": {
    "action": "finish-task",
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "streaming": "duplex"
  },
  "payload": {
    "input": {}
  }
}
header parameters:
Parameter | Type | Required | Description
header.action | string | Yes | Fixed value: "finish-task".
header.task_id | string | Yes | Must match the task_id from run-task.
header.streaming | string | Yes | Fixed value: "duplex".
payload parameters:
Parameter | Type | Required | Description
payload.input | object | Yes | Fixed value: {}.

Events (server to client)

Events are JSON messages from the server. Each marks a stage in the task lifecycle.
Binary audio is sent separately -- not included in any event.

1. task-started event: Task started

Confirms the task has started. Send continue-task or finish-task only after receiving this event. Otherwise, the task fails. The task-started event's payload is empty. Example:
{
  "header": {
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "event": "task-started",
    "attributes": {}
  },
  "payload": {}
}
header parameters:
Parameter | Type | Description
header.event | string | Fixed value: "task-started".
header.task_id | string | Task ID generated by the client.

2. result-generated event

While you send continue-task and finish-task instructions, the server returns result-generated events and binary audio frames. Each result-generated event contains the current sentence index. Audio data arrives as binary frames between events. One sentence produces multiple binary audio frames. Receive frames in order and append to the same file. Example:
{
  "header": {
    "task_id": "3f2d5c86-0550-45c0-801f-xxxxxxxxxx",
    "event": "result-generated",
    "attributes": {}
  },
  "payload": {
    "output": {
      "sentence": {
        "index": 0,
        "words": []
      }
    },
    "usage": {
      "characters": 11
    }
  }
}
header parameters:
Parameter | Type | Description
header.event | string | Fixed value: "result-generated".
header.task_id | string | Task ID generated by the client.
header.attributes | object | Additional attributes -- usually empty.
payload parameters:
Parameter | Type | Description
payload.output.sentence.index | integer | Sentence number, starting from 0.
payload.output.sentence.words | array | Array of word information.
payload.output.sentence.words.text | string | Word text.
payload.output.sentence.words.begin_index | integer | Starting position of the word in the sentence, counting from 0.
payload.output.sentence.words.end_index | integer | Ending position of the word in the sentence, counting from 1.
payload.output.sentence.words.begin_time | integer | Start timestamp of the word's audio, in milliseconds.
payload.output.sentence.words.end_time | integer | End timestamp of the word's audio, in milliseconds.
payload.usage.characters | integer | Cumulative billed characters so far. The usage field appears in some result-generated events; use the last occurrence.
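Because usage.characters is cumulative, a client should keep only the latest value seen. A minimal sketch (billed_characters is an illustrative helper):

```python
def billed_characters(events) -> int:
    total = 0
    for ev in events:
        usage = ev.get("payload", {}).get("usage") or {}
        if "characters" in usage:
            total = usage["characters"]  # cumulative, so the last value wins
    return total
```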

3. task-finished event: Task finished

Marks the end of the task. After the task ends, close the WebSocket connection or reuse it to send a new run-task instruction (see Connection overhead and reuse). Example:
{
  "header": {
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "event": "task-finished",
    "attributes": {
      "request_uuid": "0a9dba9e-d3a6-45a4-be6d-xxxxxxxxxxxx"
    }
  },
  "payload": {
    "output": {}
  }
}
header parameters:
Parameter | Type | Description
header.event | string | Fixed value: "task-finished".
header.task_id | string | Task ID generated by the client.
header.attributes.request_uuid | string | Request ID. Provide this to CosyVoice developers for diagnosis.

4. task-failed event: Task failed

Indicates the task has failed. Close the WebSocket connection and review the error message. Example:
{
  "header": {
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "event": "task-failed",
    "error_code": "InvalidParameter",
    "error_message": "[tts:]Engine return error code: 418",
    "attributes": {}
  },
  "payload": {}
}
header parameters:
Parameter | Type | Description
header.event | string | Fixed value: "task-failed".
header.task_id | string | Task ID generated by the client.
header.error_code | string | Error type.
header.error_message | string | Detailed error reason.

Task interruption

During streaming synthesis, you can interrupt the current task early (for example, if the user cancels playback) using one of these methods:
Interrupt method | Server behavior | Use case
Close the connection | Stops synthesis immediately. Discards unsent audio. No task-finished event. Connection cannot be reused. | Immediate stop: the user cancels playback, switches content, or exits the app.
Send finish-task | Forces synthesis of cached text. Returns the remaining audio and the task-finished event. Connection stays reusable. | Graceful end: stop sending text but receive all cached audio.

Connection overhead and reuse

The WebSocket service supports connection reuse. Send run-task to start a task and finish-task to end it. After task-finished, reuse the same connection by sending a new run-task instruction.
  1. Send a new run-task only after receiving task-finished.
  2. Use different task_ids for different tasks on the same connection.
  3. Failed tasks trigger task-failed and close the connection (cannot reuse).
  4. Connections time out after 60 seconds of inactivity.
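The rules above can be sketched as a small client-side state machine (illustrative only; the actual sending and receiving is up to your WebSocket code):

```python
class ReusableConnectionState:
    """Tracks whether it is safe to start a new task on this connection."""

    def __init__(self):
        self.idle = True
        self.used_task_ids = set()

    def on_run_task(self, task_id: str):
        if not self.idle:
            raise RuntimeError("send run-task only after task-finished")
        if task_id in self.used_task_ids:
            raise RuntimeError("use a different task_id for each task")
        self.idle = False
        self.used_task_ids.add(task_id)

    def on_task_finished(self):
        self.idle = True
```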

Performance and concurrency

Concurrency limits

See Rate limits. To increase your concurrency quota, contact customer support. Quota adjustments require review and typically take 1 to 3 business days.
Best practice: Reuse a WebSocket connection for multiple tasks. See Connection overhead and reuse.

Connection latency

Typical connection time:
  • Cross-border connections: 1 to 3 seconds. In rare cases, 10 to 30 seconds.
Troubleshoot slow connections (>30 seconds):
  1. Network latency: Check cross-border connection quality or ISP performance.
  2. Slow DNS: Try public DNS (8.8.8.8) or configure a local hosts file for dashscope-intl.aliyuncs.com.
  3. TLS handshake: Update to TLS 1.2 or later.
  4. Proxy/firewall: Corporate networks may block or slow WebSocket connections.
Troubleshooting tools:
  • Use Wireshark or tcpdump to analyze TCP handshake, TLS handshake, and WebSocket Upgrade timing.
  • Test HTTP latency with curl: curl -w "@curl-format.txt" -o /dev/null -s https://dashscope-intl.aliyuncs.com

Audio generation speed

  • Real-time factor (RTF): 0.1 to 0.5x real-time (1 second of audio takes 0.1 to 0.5 seconds to generate). Actual speed varies by model, text length, and server load.
  • First packet latency: 200 to 800 ms from sending continue-task to receiving the first audio chunk.

Example code

Basic connectivity example. Implement production-ready logic for your use case. Use asynchronous programming to send and receive simultaneously:
  1. Connect: Call your WebSocket library's connect function with Headers and URL.
  2. Listen for messages: The server sends JSON events and binary audio. Events:
    • task-started: Task started. Send continue-task or finish-task only after this.
    • result-generated: Returned continuously after you send continue-task or finish-task.
    • task-finished: Task complete. Close connection.
    • task-failed: Task failed. Close connection and check error.
    Binary audio:
    • For MP3/Opus streaming: Use a streaming player (FFmpeg, PyAudio, AudioFormat, MediaSource). Do not play frame by frame.
    • To save complete audio: Write frames to the same file in append mode.
    • For WAV/MP3: Only the first frame has header info; subsequent frames are audio data only.
  3. Send instructions: From a separate thread, send instructions to the server.
  4. Close connection: Close when done, on error, or after task-finished/task-failed.
  • Go
  • C#
  • PHP
  • Node.js
  • Java
  • Python
package main

import (
  "encoding/json"
  "fmt"
  "net/http"
  "os"
  "strings"
  "time"

  "github.com/google/uuid"
  "github.com/gorilla/websocket"
)

const (
  wsURL      = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/"
  outputFile = "output.mp3"
)

func main() {
  // If no environment variable is set, replace next line with: apiKey := "YOUR_API_KEY"
  apiKey := os.Getenv("DASHSCOPE_API_KEY")

  // Remove any previous output file; it is recreated in append mode on first write
  os.Remove(outputFile)

  // Connect WebSocket
  header := make(http.Header)
  header.Add("X-DashScope-DataInspection", "enable")
  header.Add("Authorization", fmt.Sprintf("bearer %s", apiKey))

  conn, resp, err := websocket.DefaultDialer.Dial(wsURL, header)
  if err != nil {
    if resp != nil {
      fmt.Printf("Connection failed HTTP status code: %d\n", resp.StatusCode)
    }
    fmt.Println("Connection failed:", err)
    return
  }
  defer conn.Close()

  // Generate task ID
  taskID := uuid.New().String()
  fmt.Printf("Generated task ID: %s\n", taskID)

  // Send run-task instruction
  runTaskCmd := map[string]interface{}{
    "header": map[string]interface{}{
      "action":    "run-task",
      "task_id":   taskID,
      "streaming": "duplex",
    },
    "payload": map[string]interface{}{
      "task_group": "audio",
      "task":       "tts",
      "function":   "SpeechSynthesizer",
      "model":      "cosyvoice-v3-flash",
      "parameters": map[string]interface{}{
        "text_type":   "PlainText",
        "voice":       "longanyang",
        "format":      "mp3",
        "sample_rate": 22050,
        "volume":      50,
        "rate":        1,
        "pitch":       1,
        // If enable_ssml is true, only one continue-task instruction is allowed. Otherwise, it returns "Text request limit violated, expected 1."
        "enable_ssml": false,
      },
      "input": map[string]interface{}{},
    },
  }

  runTaskJSON, _ := json.Marshal(runTaskCmd)
  fmt.Printf("Sent run-task instruction: %s\n", string(runTaskJSON))

  err = conn.WriteMessage(websocket.TextMessage, runTaskJSON)
  if err != nil {
    fmt.Println("Failed to send run-task:", err)
    return
  }

  textSent := false

  // Process messages
  for {
    messageType, message, err := conn.ReadMessage()
    if err != nil {
      fmt.Println("Failed to read message:", err)
      break
    }

    // Process binary message
    if messageType == websocket.BinaryMessage {
      fmt.Printf("Received binary message, length: %d\n", len(message))
      file, _ := os.OpenFile(outputFile, os.O_APPEND|os.O_WRONLY|os.O_CREATE, 0644)
      file.Write(message)
      file.Close()
      continue
    }

    // Process text message
    messageStr := string(message)
    fmt.Printf("Received text message: %s\n", strings.ReplaceAll(messageStr, "\n", ""))

    // Simple JSON parse to get event type
    var msgMap map[string]interface{}
    if json.Unmarshal(message, &msgMap) == nil {
      if header, ok := msgMap["header"].(map[string]interface{}); ok {
        if event, ok := header["event"].(string); ok {
          fmt.Printf("Event type: %s\n", event)

          switch event {
          case "task-started":
            fmt.Println("=== Received task-started event ===")

            if !textSent {
              // Send continue-task instruction

              texts := []string{"Before my bed, moonlight shines bright, I suspect it's frost upon the ground.", "I raise my eyes to gaze at the bright moon, then bow my head, thinking of home."}

              for _, text := range texts {
                continueTaskCmd := map[string]interface{}{
                  "header": map[string]interface{}{
                    "action":    "continue-task",
                    "task_id":   taskID,
                    "streaming": "duplex",
                  },
                  "payload": map[string]interface{}{
                    "input": map[string]interface{}{
                      "text": text,
                    },
                  },
                }

                continueTaskJSON, _ := json.Marshal(continueTaskCmd)
                fmt.Printf("Sent continue-task instruction: %s\n", string(continueTaskJSON))

                err = conn.WriteMessage(websocket.TextMessage, continueTaskJSON)
                if err != nil {
                  fmt.Println("Failed to send continue-task:", err)
                  return
                }
              }

              textSent = true

              // Delay before sending finish-task
              time.Sleep(500 * time.Millisecond)

              // Send finish-task instruction
              finishTaskCmd := map[string]interface{}{
                "header": map[string]interface{}{
                  "action":    "finish-task",
                  "task_id":   taskID,
                  "streaming": "duplex",
                },
                "payload": map[string]interface{}{
                  "input": map[string]interface{}{},
                },
              }

              finishTaskJSON, _ := json.Marshal(finishTaskCmd)
              fmt.Printf("Sent finish-task instruction: %s\n", string(finishTaskJSON))

              err = conn.WriteMessage(websocket.TextMessage, finishTaskJSON)
              if err != nil {
                fmt.Println("Failed to send finish-task:", err)
                return
              }
            }

          case "task-finished":
            fmt.Println("=== Task completed ===")
            return

          case "task-failed":
            fmt.Println("=== Task failed ===")
            if header["error_message"] != nil {
              fmt.Printf("Error message: %s\n", header["error_message"])
            }
            return

          case "result-generated":
            fmt.Println("Received result-generated event")
          }
        }
      }
    }
  }
}
