CosyVoice Python SDK

User guide: For model overviews and voice selection, see Text-to-speech models.

Prerequisites

For temporary access by third-party apps or users, or to control high-risk operations such as accessing or deleting sensitive data, use a temporary authentication token. Temporary tokens expire after 60 seconds, which reduces the leakage risk compared with long-lived API keys. To use one, replace the API key in your authentication code with the temporary token.

Models and pricing

See Text-to-speech models.

Text and format limits

Text length limits

Character counting rules

  • Chinese characters (simplified/traditional, Japanese Kanji, Korean Hanja) count as 2. All other characters (punctuation, letters, numbers, Kana, Hangul) count as 1.
  • SSML tags are excluded from the character count.
  • Examples:
    • "你好" → 2 (你) + 2 (好) = 4 characters
    • "中A文123" → 2 (中) + 1 (A) + 2 (文) + 1 (1) + 1 (2) + 1 (3) = 8 characters
    • "中文。" → 2 (中) + 2 (文) + 1 (。) = 5 characters
    • "中 文。" → 2 (中) + 1 (space) + 2 (文) + 1 (。) = 6 characters
    • "<speak>你好</speak>" → 2 (你) + 2 (好) = 4 characters
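These rules can be approximated locally with a short helper. The sketch below is illustrative only (the function name and the CJK range it uses are our own simplification; the service's own count is authoritative for billing):

```python
import re

def count_billable_chars(text: str) -> int:
    """Approximate the billable character count under the rules above.

    CJK unified ideographs count as 2; everything else (letters, digits,
    punctuation, spaces, Kana, Hangul) counts as 1. SSML tags are stripped
    first so they are excluded from the count.
    """
    plain = re.sub(r"<[^>]+>", "", text)  # drop SSML tags such as <speak>
    return sum(2 if "\u4e00" <= ch <= "\u9fff" else 1 for ch in plain)

# Matches the worked examples above:
# count_billable_chars("你好")                → 4
# count_billable_chars("中A文123")            → 8
# count_billable_chars("<speak>你好</speak>") → 4
```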

Encoding format

Use UTF-8 encoding.

Math expression support

Math expression parsing (basic operations, algebra, geometry) is available for cosyvoice-v3-flash and cosyvoice-v3-plus. It supports expressions at the primary and secondary school level, and Chinese text only. See Convert LaTeX formulas to speech.

SSML support

SSML is available for custom voices (voice design or voice cloning) on cosyvoice-v3-flash and cosyvoice-v3-plus, and for system voices marked as SSML-supported in the voice list.

Getting started

The SpeechSynthesizer class supports three call methods:
  • Non-streaming: Blocking call. Sends full text, returns complete audio. Best for short text.
  • Unidirectional streaming: Non-blocking. Sends full text, receives audio via callback. Best for short text with low latency.
  • Bidirectional streaming: Non-blocking. Sends text fragments incrementally, receives audio in real time via callback. Best for long text with low latency.

Non-streaming

Send full text at once without a callback. Returns complete audio in one response.
Instantiate SpeechSynthesizer with request parameters, then call the call method to get binary audio. Maximum text length: 20,000 characters (see the SpeechSynthesizer call method).
Re-initialize the SpeechSynthesizer instance before each call.
# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *
import os

# If you have not configured environment variables, replace the next line with your API key: dashscope.api_key = "YOUR_API_KEY"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Model
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send text for synthesis and get binary audio
audio = synthesizer.call("How is the weather today?")
# First request includes WebSocket connection setup time
print('[Metric] requestId: {}, first-package delay: {} ms'.format(
  synthesizer.get_last_request_id(),
  synthesizer.get_first_package_delay()))

# Save audio to local file
with open('output.mp3', 'wb') as f:
  f.write(audio)

Unidirectional streaming

Send full text, stream audio via ResultCallback. Receive results in real time.
Instantiate SpeechSynthesizer with request parameters and a callback (ResultCallback), then call the call method. Results arrive through the on_data callback. Maximum text length: 20,000 characters (see the SpeechSynthesizer call method).
Re-initialize the SpeechSynthesizer instance before each call.
# coding=utf-8

import os
import dashscope
from dashscope.audio.tts_v2 import *

from datetime import datetime

def get_timestamp():
  now = datetime.now()
  formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
  return formatted_timestamp

# If you have not configured environment variables, replace the next line with your API key: dashscope.api_key = "YOUR_API_KEY"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Model
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"


# Define callback interface
class Callback(ResultCallback):

  def on_open(self):
    self.file = open("output.mp3", "wb")
    print("Connection established: " + get_timestamp())

  def on_complete(self):
    print("Speech synthesis completed, all results received: " + get_timestamp())
    # Call get_first_package_delay only after on_complete triggers
    # First request includes WebSocket connection setup time
    print('[Metric] requestId: {}, first-package delay: {} ms'.format(
      synthesizer.get_last_request_id(),
      synthesizer.get_first_package_delay()))

  def on_error(self, message: str):
    print(f"Speech synthesis error: {message}")

  def on_close(self):
    print("Connection closed: " + get_timestamp())
    self.file.close()

  def on_event(self, message):
    pass

  def on_data(self, data: bytes) -> None:
    print(get_timestamp() + " Binary audio length: " + str(len(data)))
    self.file.write(data)


callback = Callback()

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor
synthesizer = SpeechSynthesizer(
  model=model,
  voice=voice,
  callback=callback,
)

# Send text for synthesis and retrieve binary audio in real time through the on_data method of the callback interface
synthesizer.call("How is the weather today?")

Bidirectional streaming

Submit text in multiple parts within a single task and receive results in real time through callbacks.
  • Call streaming_call multiple times to submit text fragments in order. The server auto-splits into sentences:
    • Complete sentences: synthesized immediately
    • Incomplete sentences: cached until complete, then synthesized
    Call streaming_complete to force synthesis of all remaining fragments.
  • Fragment submission interval: max 23 seconds (fixed server timeout, non-configurable). Exceeding this throws a "request timeout after 23 seconds" error. Call streaming_complete promptly when done.
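The buffering behavior can be pictured with a small client-side sketch (illustrative only; the actual sentence splitting happens on the server, and its boundary rules are not published):

```python
import re

class SentenceBuffer:
    """Toy model of the server's buffering: complete sentences are released
    immediately, incomplete tails are cached until more text arrives."""

    def __init__(self):
        self._buf = ""

    def feed(self, fragment: str) -> list:
        self._buf += fragment
        # Split after sentence-ending punctuation, keeping the delimiter.
        parts = re.split(r"(?<=[。！？.!?])", self._buf)
        self._buf = parts.pop()  # last piece may be an incomplete sentence
        return [p for p in parts if p]

    def flush(self) -> list:
        # Analogous to streaming_complete: force out whatever remains.
        out = [self._buf] if self._buf else []
        self._buf = ""
        return out
```

For example, feeding "How is the wea" releases nothing, and feeding "ther? Fine." then releases both completed sentences.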
1. Instantiate SpeechSynthesizer.
2. Stream text: call streaming_call multiple times to submit text fragments. The server returns audio in real time via the on_data callback. Each fragment must not exceed 20,000 characters, and the cumulative total must not exceed 200,000 characters.
3. Finish: call streaming_complete to end the synthesis. This call blocks until on_complete or on_error triggers. Always call this method; otherwise, trailing text may not be converted to speech.
# coding=utf-8
#
# PyAudio installation instructions:
# For macOS, run:
#   brew install portaudio
#   pip install pyaudio
# For Debian/Ubuntu, run:
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# For CentOS, run:
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# For Microsoft Windows, run:
#   python -m pip install pyaudio

import os
import time
import pyaudio
import dashscope
from dashscope.audio.tts_v2 import *

from datetime import datetime

def get_timestamp():
  now = datetime.now()
  formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
  return formatted_timestamp

# If you have not configured environment variables, replace the next line with your API key: dashscope.api_key = "YOUR_API_KEY"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Model
model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"


# Define callback interface
class Callback(ResultCallback):
  _player = None
  _stream = None

  def on_open(self):
    print("Connection established: " + get_timestamp())
    self._player = pyaudio.PyAudio()
    self._stream = self._player.open(
      format=pyaudio.paInt16, channels=1, rate=22050, output=True
    )

  def on_complete(self):
    print("Speech synthesis completed, all results received: " + get_timestamp())

  def on_error(self, message: str):
    print(f"Speech synthesis error: {message}")

  def on_close(self):
    print("Connection closed: " + get_timestamp())
    # Stop player
    self._stream.stop_stream()
    self._stream.close()
    self._player.terminate()

  def on_event(self, message):
    pass

  def on_data(self, data: bytes) -> None:
    print(get_timestamp() + " Binary audio length: " + str(len(data)))
    self._stream.write(data)


callback = Callback()

test_text = [
  "Streaming text-to-speech SDK,",
  "converts input text",
  "into binary audio data.",
  "Compared with non-streaming speech synthesis,",
  "streaming synthesis offers better real-time performance.",
  "Users hear near-synchronous audio output while typing,",
  "greatly improving interaction experience",
  "and reducing wait time.",
  "Ideal for large language model (LLM) integration,",
  "where text is streamed for speech synthesis.",
]

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor
synthesizer = SpeechSynthesizer(
  model=model,
  voice=voice,
  format=AudioFormat.PCM_22050HZ_MONO_16BIT,
  callback=callback,
)


# Stream text for synthesis. Retrieve binary audio in real time through the on_data method of the callback interface
for text in test_text:
  synthesizer.streaming_call(text)
  time.sleep(0.1)
# End streaming speech synthesis
synthesizer.streaming_complete()

# First request includes WebSocket connection setup time
print('[Metric] requestId: {}, first-package delay: {} ms'.format(
  synthesizer.get_last_request_id(),
  synthesizer.get_first_package_delay()))

Request parameters

Set parameters via the SpeechSynthesizer constructor.
  • model (str, required): The model for text-to-speech. See Voice list for all options.
  • voice (str, required): The voice for synthesis. See Voice list for available system voices.
  • format (enum, optional): Audio format and sample rate. Default: MP3 at 22.05 kHz. The default rate is optimal for the selected voice; downsampling and upsampling are supported. All models support WAV, MP3, and PCM at 8/16/22.05/24/44.1/48 kHz, and OPUS (OGG_OPUS) at 8/16/24/48 kHz with configurable bitrate (requires SDK 1.24.0 or later). See Format reference.
  • volume (int, optional): Volume. Default: 50. Range: [0, 100]. Scales linearly (0 = silent, 50 = default, 100 = maximum). In SDK 1.20.10 and later, the field name is volume.
  • speech_rate (float, optional): Speech rate. Default: 1.0. Range: [0.5, 2.0]. Values below 1.0 slow down speech; values above 1.0 speed it up.
  • pitch_rate (float, optional): Pitch multiplier. Default: 1.0. Range: [0.5, 2.0]. Values above 1.0 raise the pitch; values below 1.0 lower it. The relationship with perceived pitch is non-linear; test to find a suitable value.
  • bit_rate (int, optional): Audio bitrate in kbps for the OPUS format. Default: 32. Range: [6, 510]. Set via additional_params (see example below).
  • word_timestamp_enabled (bool, optional): Enable word-level timestamps. Default: False. Supported for system voices marked as supported in the voice list. Timestamps are available only through the callback interface. Set via additional_params (see example below).
  • seed (int, optional): Random seed for generation. Default: 0. Range: [0, 65535]. Different seeds produce different results; with identical model, text, voice, and other parameters, the same seed reproduces the same output.
  • language_hints (list[str], optional): Target language. Valid values: zh, en, fr, de, ja, ko, ru, pt, th, id, vi. Although this is an array, only the first element is used.
  • instruction (str, optional): Controls dialect, emotion, or speaking style. Available only for system voices marked as supporting Instruct in the voice list. Maximum length: 100 characters. See Instruction examples.
  • enable_aigc_tag (bool, optional): Add an invisible AIGC identifier to generated audio. Default: False. When True, the identifier is embedded in supported formats (WAV, MP3, OPUS). Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. Set via additional_params (see example below).
  • aigc_propagator (str, optional): Set the ContentPropagator field in the AIGC identifier. Effective only when enable_aigc_tag is True. Default: UID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. Set via additional_params.
  • aigc_propagate_id (str, optional): Set the PropagateID field in the AIGC identifier. Effective only when enable_aigc_tag is True. Default: the request ID. Supported by cosyvoice-v3-flash and cosyvoice-v3-plus. Set via additional_params.
  • hot_fix (dict, optional): Text hotpatching. Customize the pronunciation of specific words or replace text before synthesis. Available only for cosyvoice-v3-flash. See hot_fix example.
  • enable_markdown_filter (bool, optional): Enable Markdown filtering. Default: False. When enabled, Markdown symbols are removed from the input text before synthesis. Available only for cosyvoice-v3-flash. Set via additional_params.
  • callback (ResultCallback, optional): Callback interface (ResultCallback).

additional_params examples

Setting bit_rate:
synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                voice="longanyang",
                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                additional_params={"bit_rate": 32})
Setting word_timestamp_enabled:
synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                voice="longanyang",
                callback=callback, # Timestamps are available only through the callback interface
                additional_params={'word_timestamp_enabled': True})
# coding=utf-8

import os
import dashscope
from dashscope.audio.tts_v2 import *
import json
from datetime import datetime


def get_timestamp():
  now = datetime.now()
  formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
  return formatted_timestamp


# If you have not configured environment variables, replace the next line with your API key: dashscope.api_key = "YOUR_API_KEY"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

model = "cosyvoice-v3-flash"
# Voice
voice = "longanyang"


# Define callback interface
class Callback(ResultCallback):

  def on_open(self):
    self.file = open("output.mp3", "wb")
    print("Connection established: " + get_timestamp())

  def on_complete(self):
    print("Speech synthesis completed, all results received: " + get_timestamp())

  def on_error(self, message: str):
    print(f"Speech synthesis error: {message}")

  def on_close(self):
    print("Connection closed: " + get_timestamp())
    self.file.close()

  def on_event(self, message):
    json_data = json.loads(message)
    # Guard against missing fields instead of indexing directly
    payload = json_data.get('payload') or {}
    output = payload.get('output') or {}
    sentence = output.get('sentence')
    if sentence:
      print(f'sentence: {sentence}')
      # Sentence index: sentence['index']
      for word in sentence.get('words') or []:
        print(f'word: {word}')
        # Example: word: {'text': 'T', 'begin_index': 0, 'end_index': 1, 'begin_time': 80, 'end_time': 200}

  def on_data(self, data: bytes) -> None:
    print(get_timestamp() + " Binary audio length: " + str(len(data)))
    self.file.write(data)


callback = Callback()

# Instantiate SpeechSynthesizer and pass request parameters such as model and voice in the constructor
synthesizer = SpeechSynthesizer(
  model=model,
  voice=voice,
  callback=callback,
  additional_params={'word_timestamp_enabled': True}
)

# Send text for synthesis and retrieve binary audio in real time through the on_data method of the callback interface
synthesizer.call("How is the weather today?")
# First request includes WebSocket connection setup time
print('[Metric] requestId: {}, first-package delay: {} ms'.format(
  synthesizer.get_last_request_id(),
  synthesizer.get_first_package_delay()))
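The word entries printed by on_event can be post-processed as needed. For example, a minimal helper (our own, not part of the SDK) that renders them as a time-aligned transcript, assuming the field names shown in the sample output above:

```python
def words_to_transcript(words):
    """Format word-timestamp dicts (begin_time/end_time in ms) as lines."""
    return [
        f"{w['begin_time']:6d}ms - {w['end_time']:6d}ms  {w['text']}"
        for w in words
    ]

# Using the sample entry from the callback above:
# words_to_transcript([{'text': 'T', 'begin_index': 0, 'end_index': 1,
#                       'begin_time': 80, 'end_time': 200}])
# → ['    80ms -    200ms  T']
```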
Setting enable_aigc_tag:
synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                voice="longanyang",
                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                additional_params={"enable_aigc_tag": True})
Setting aigc_propagator:
synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                voice="longanyang",
                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                additional_params={"enable_aigc_tag": True, "aigc_propagator": "xxxx"})
Setting aigc_propagate_id:
synthesizer = SpeechSynthesizer(model="cosyvoice-v3-flash",
                voice="longanyang",
                format=AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS,
                additional_params={"enable_aigc_tag": True, "aigc_propagate_id": "xxxx"})
Setting enable_markdown_filter:
synthesizer = SpeechSynthesizer(
  model="cosyvoice-v3-flash",
  voice="your_voice", # Replace with a cosyvoice-v3-flash cloned voice
  additional_params={"enable_markdown_filter": True}
)

hot_fix example

synthesizer = SpeechSynthesizer(
  model="cosyvoice-v3-flash",
  voice="your_voice", # Replace with a cosyvoice-v3-flash cloned voice
  hot_fix={
    "pronunciation": [{"weather": "tian1 qi4"}],
    "replace": [{"today": "jin1 tian1"}]
  }
)

Instruction examples

For cosyvoice-v3-flash (cloned voices and system voices), use any natural language instruction to control synthesis effects.
Please speak in Cantonese. (Supported dialects: Cantonese, Northeastern, Gansu, Guizhou, Henan, Hubei, Jiangxi, Minnan, Ningxia, Shanxi, Shaanxi, Shandong, Shanghainese, Sichuan, Tianjin, Yunnan.)
Please say a sentence as loudly as possible.
Please say a sentence as slowly as possible.
Please say a sentence as quickly as possible.
Please say a sentence very softly.
Can you speak a little slower?
Can you speak very quickly?
Can you speak very slowly?
Can you speak a little faster?
Please say a sentence very angrily.
Please say a sentence very happily.
Please say a sentence very fearfully.
Please say a sentence very sadly.
Please say a sentence with great surprise.
Please try to sound as firm as possible.
Please try to sound as angry as possible.
Please try an approachable tone.
Please speak in a cold tone.
Please speak in a majestic tone.
I want to experience a natural tone.
I want to see how you express a threat.
I want to see how you express wisdom.
I want to see how you express seduction.
I want to hear you speak in a lively way.
I want to hear you speak with passion.
I want to hear you speak in a steady manner.
I want to hear you speak with confidence.
Can you talk to me with excitement?
Can you show an arrogant emotion?
Can you show an elegant emotion?
Can you answer the question happily?
Can you give a gentle emotional demonstration?
Can you talk to me in a calm tone?
Can you answer me in a deep way?
Can you talk to me with a gruff attitude?
Tell me the answer in a sinister voice.
Tell me the answer in a resilient voice.
Narrate in a natural and friendly chat style.
Speak in the tone of a radio drama podcaster.

Format reference

All models support the following formats and sample rates:
  • AudioFormat.WAV_8000HZ_MONO_16BIT: WAV format, 8 kHz sample rate
  • AudioFormat.WAV_16000HZ_MONO_16BIT: WAV format, 16 kHz sample rate
  • AudioFormat.WAV_22050HZ_MONO_16BIT: WAV format, 22.05 kHz sample rate
  • AudioFormat.WAV_24000HZ_MONO_16BIT: WAV format, 24 kHz sample rate
  • AudioFormat.WAV_44100HZ_MONO_16BIT: WAV format, 44.1 kHz sample rate
  • AudioFormat.WAV_48000HZ_MONO_16BIT: WAV format, 48 kHz sample rate
  • AudioFormat.MP3_8000HZ_MONO_128KBPS: MP3 format, 8 kHz sample rate
  • AudioFormat.MP3_16000HZ_MONO_128KBPS: MP3 format, 16 kHz sample rate
  • AudioFormat.MP3_22050HZ_MONO_256KBPS: MP3 format, 22.05 kHz sample rate
  • AudioFormat.MP3_24000HZ_MONO_256KBPS: MP3 format, 24 kHz sample rate
  • AudioFormat.MP3_44100HZ_MONO_256KBPS: MP3 format, 44.1 kHz sample rate
  • AudioFormat.MP3_48000HZ_MONO_256KBPS: MP3 format, 48 kHz sample rate
  • AudioFormat.PCM_8000HZ_MONO_16BIT: PCM format, 8 kHz sample rate
  • AudioFormat.PCM_16000HZ_MONO_16BIT: PCM format, 16 kHz sample rate
  • AudioFormat.PCM_22050HZ_MONO_16BIT: PCM format, 22.05 kHz sample rate
  • AudioFormat.PCM_24000HZ_MONO_16BIT: PCM format, 24 kHz sample rate
  • AudioFormat.PCM_44100HZ_MONO_16BIT: PCM format, 44.1 kHz sample rate
  • AudioFormat.PCM_48000HZ_MONO_16BIT: PCM format, 48 kHz sample rate
For OPUS format, adjust bitrate with the bit_rate parameter. Requires DashScope SDK 1.24.0 or later:
  • AudioFormat.OGG_OPUS_8KHZ_MONO_32KBPS: OPUS format, 8 kHz sample rate, 32 kbps bitrate
  • AudioFormat.OGG_OPUS_16KHZ_MONO_16KBPS: OPUS format, 16 kHz sample rate, 16 kbps bitrate
  • AudioFormat.OGG_OPUS_16KHZ_MONO_32KBPS: OPUS format, 16 kHz sample rate, 32 kbps bitrate
  • AudioFormat.OGG_OPUS_16KHZ_MONO_64KBPS: OPUS format, 16 kHz sample rate, 64 kbps bitrate
  • AudioFormat.OGG_OPUS_24KHZ_MONO_16KBPS: OPUS format, 24 kHz sample rate, 16 kbps bitrate
  • AudioFormat.OGG_OPUS_24KHZ_MONO_32KBPS: OPUS format, 24 kHz sample rate, 32 kbps bitrate
  • AudioFormat.OGG_OPUS_24KHZ_MONO_64KBPS: OPUS format, 24 kHz sample rate, 64 kbps bitrate
  • AudioFormat.OGG_OPUS_48KHZ_MONO_16KBPS: OPUS format, 48 kHz sample rate, 16 kbps bitrate
  • AudioFormat.OGG_OPUS_48KHZ_MONO_32KBPS: OPUS format, 48 kHz sample rate, 32 kbps bitrate
  • AudioFormat.OGG_OPUS_48KHZ_MONO_64KBPS: OPUS format, 48 kHz sample rate, 64 kbps bitrate

Key interfaces

SpeechSynthesizer class

Import SpeechSynthesizer with from dashscope.audio.tts_v2 import *. This class provides the core text-to-speech interfaces.
  • call(self, text: str, timeout_millis=None)
    Parameters: text: text to synthesize (plain or SSML). timeout_millis: timeout in milliseconds; no effect if unset or 0.
    Returns: binary audio data if no ResultCallback is set; otherwise None.
    Converts text to speech. Without a callback, blocks until synthesis completes and returns the binary audio (see Non-streaming). With a callback, returns None immediately and delivers results through on_data (see Unidirectional streaming).
    Important: Re-initialize SpeechSynthesizer before each call.
  • streaming_call(self, text: str)
    Parameters: text: text fragment to synthesize (SSML not supported).
    Returns: None.
    Streams text fragments for synthesis. Call multiple times to send fragments in order; results arrive through on_data in ResultCallback. See Bidirectional streaming.
  • streaming_complete(self, complete_timeout_millis=600000)
    Parameters: complete_timeout_millis: wait time in milliseconds. 0 = wait indefinitely. Default: 600,000 (10 minutes).
    Returns: None.
    Ends streaming synthesis, blocking until the task ends or the timeout elapses. See Bidirectional streaming.
    Important: Always call this method in bidirectional streaming to avoid missing trailing speech.
  • get_last_request_id(self)
    Returns: the request ID of the previous task.
  • get_first_package_delay(self)
    Returns: the first-packet delay in milliseconds (time from sending text to receiving the first audio packet). Call after the task completes.
    Contributing factors: WebSocket connection setup (first call), voice loading, service load, and network latency. Typical range: about 500 ms when reusing a connection and voice; 1,500-2,000 ms on the first connection or after a voice switch. If the delay is consistently above 2,000 ms: use connection pooling for high concurrency, check network quality, and avoid peak hours.
  • get_response(self)
    Returns: the last message (JSON), useful for detecting task-failed errors.

Callback interface (ResultCallback)

In unidirectional streaming or bidirectional streaming, the server returns process information and data via callbacks. Implement these methods to handle server responses. Import using from dashscope.audio.tts_v2 import *.
class Callback(ResultCallback):
  def on_open(self) -> None:
    print('Connection successful')

  def on_data(self, data: bytes) -> None:
    # Implement logic to handle binary audio data
    pass

  def on_complete(self) -> None:
    print('Synthesis complete')

  def on_error(self, message) -> None:
    print('Error: ', message)

  def on_close(self) -> None:
    print('Connection closed')


callback = Callback()
  • on_open(self) -> None: Called when the client connects to the server.
  • on_event(self, message: str) -> None: Called when the server sends a message (a JSON string). Parse message to get the task ID (task_id) and the billed character count (characters).
  • on_complete(self) -> None: Called when all audio data has been returned.
  • on_error(self, message) -> None: Called when an error occurs; message contains the error details.
  • on_data(self, data: bytes) -> None: Called when synthesized audio arrives. Combine the binary chunks into a complete file, or play them with a streaming player. Important: For compressed formats (MP3, OPUS), use a streaming player (such as ffmpeg, pyaudio, AudioFormat, or MediaSource); do not decode frame by frame, which causes decoding failures. When writing to a file, open it in append mode; for WAV and MP3, only the first frame contains header information.
  • on_close(self) -> None: Called when the server closes the connection.

Response

The server returns binary audio data.

More examples

For more examples, see GitHub.