Speech synthesis - Qwen Cloud

POST /api/v1/services/aigc/multimodal-generation/generation

# Requires the latest version of the DashScope SDK: pip install -U dashscope
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

text = "Let me recommend a T-shirt to everyone. This one is really super nice. The color is very elegant, and it's also a perfect item to match. Everyone can buy it without hesitation. It's truly beautiful and very forgiving on the figure. No matter what body type you have, it will look great. I recommend everyone to place an order."
# SpeechSynthesizer interface usage: dashscope.audio.qwen_tts.SpeechSynthesizer.call(...)
response = dashscope.MultiModalConversation.call(
  # To use the instruction control feature, replace the model with qwen3-tts-instruct-flash
  model="qwen3-tts-flash",
  # If the environment variable is not configured, replace the following line with your API key: api_key="sk-xxx"
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  text=text,
  voice="Cherry",
  # To use the instruction control feature, uncomment the following lines and replace the model with qwen3-tts-instruct-flash
  # instructions='Fast speech rate, with a clear rising intonation, suitable for introducing fashion products.',
  # optimize_instructions=True,
)
print(response)

{
  "status_code": 200,
  "request_id": "5c63c65c-cad8-4bf4-959d-xxxxxxxxxxxx",
  "code": "",
  "message": "",
  "output": {
    "text": null,
    "choices": null,
    "finish_reason": "stop",
    "audio": {
      "url": "https://example.oss.aliyuncs.com/audio-result.wav?Expires=1766113409&OSSAccessKeyId=LTAIxxxx&Signature=xxxx",
      "data": "",
      "id": "audio_5c63c65c-cad8-4bf4-959d-xxxxxxxxxxxx",
      "expires_at": 1766113409
    }
  },
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 1121,
    "characters": 195,
    "input_tokens_details": {
      "text_tokens": 76
    },
    "output_tokens_details": {
      "audio_tokens": 1045,
      "text_tokens": 0
    }
  }
}

The DashScope Python SDK example calls MultiModalConversation rather than SpeechSynthesizer (dashscope.audio.qwen_tts.SpeechSynthesizer.call). Usage and parameters are identical for both interfaces.
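
As a minimal sketch of handling the sample response above, the helper below checks status_code and returns output.audio.url. It assumes the response has already been converted to a plain dict with the same keys as the sample. The names SynthesisError and audio_url_from_response are hypothetical and are not part of the DashScope SDK.

```python
class SynthesisError(RuntimeError):
    """Hypothetical error type for a failed synthesis request."""


def audio_url_from_response(response: dict) -> str:
    """Return output.audio.url, or raise SynthesisError if the request failed."""
    if response.get("status_code") != 200:
        raise SynthesisError(
            f"request {response.get('request_id')} failed: "
            f"{response.get('code')} {response.get('message')}"
        )
    # "output" or "audio" may be missing or null on unexpected responses.
    audio = (response.get("output") or {}).get("audio") or {}
    url = audio.get("url")
    if not url:
        raise SynthesisError("response contains no audio URL")
    return url
```

Including request_id in the error message keeps the value that the reference recommends for troubleshooting close to the failure.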

Authorizations

string

header

required

DashScope API key. Get your API key from the Qwen Cloud console.

Header Parameters

X-DashScope-SSE

enum<string>

Set to enable for streaming output via HTTP. The Python SDK uses the stream parameter instead. The Java SDK uses the streamCall interface instead.

Available options: enable

Body

application/json

model

string

required

The model name.

input

object

required

The input for speech synthesis.

text

string

required

The text to synthesize. Mixed-language input is supported. Qwen-TTS supports a maximum input of 512 tokens. Other models support a maximum input of 600 characters.

voice

string

required

The voice to use. See Supported voices.
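
Because text is limited to 600 characters for models other than Qwen-TTS, longer input has to be split before it is sent. The following is a minimal sketch of one way to do that, not an official utility. It assumes that the service counts characters the same way Python's len() does and that sentences are separated by whitespace; split_text is a hypothetical helper name.

```python
import re

MAX_CHARS = 600  # limit quoted above for models other than Qwen-TTS


def split_text(text: str, limit: int = MAX_CHARS) -> list:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    # Assumption: sentences end with . ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as the text of a separate request.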

language_type

enum<string>

default: "Auto"

The language of the synthesized audio. Default: Auto. When the text is in a single language, specifying that language significantly improves synthesis quality over Auto.

Available options: Auto, Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian

instructions

string

Instructions that guide speech synthesis. Defaults to None. Must not exceed 1600 tokens. Supported languages: Chinese and English only. This feature applies only to the Qwen3-TTS-Instruct-Flash model series.

optimize_instructions

boolean

default: false

Optimize the instructions to improve the naturalness and expressiveness of speech synthesis. Default: false. When set to true, the system semantically enhances and rewrites the content of instructions to generate internal instructions better suited for speech synthesis. Enable this for high-quality, fine-grained speech expression. This parameter depends on instructions being set. This feature applies only to the Qwen3-TTS-Instruct-Flash model series.

stream

boolean

default: false

Stream the response. Default: false. When false, returns the URL of the audio after the model finishes generating. When true, outputs Base64-encoded audio data as it is generated.

Note: The stream parameter is supported only by the Python SDK. To achieve streaming output with the Java SDK, call the streamCall interface. To achieve streaming output with HTTP, specify X-DashScope-SSE as enable in the header.
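
For callers that use HTTP directly, the sketch below assembles the URL, headers, and JSON body for a request and adds the X-DashScope-SSE header only when streaming is requested. It is an illustration under stated assumptions: the API key is assumed to be sent as a Bearer token in the Authorization header, text and voice are assumed to be nested under an input object, and build_tts_request is a hypothetical helper name. The base URL is the one used in the SDK sample.

```python
import json

BASE_URL = "https://dashscope-intl.aliyuncs.com/api/v1"  # same base URL as the SDK sample
ENDPOINT = "/services/aigc/multimodal-generation/generation"


def build_tts_request(api_key, text, voice, model="qwen3-tts-flash", stream=False):
    """Return (url, headers, body_bytes) for a speech synthesis request."""
    headers = {
        # Assumption: the API key is passed with the Bearer scheme.
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if stream:
        # HTTP streaming is switched on through this header, not a body field.
        headers["X-DashScope-SSE"] = "enable"
    # Assumption: text and voice are child attributes of the "input" object.
    body = {"model": model, "input": {"text": text, "voice": voice}}
    return BASE_URL + ENDPOINT, headers, json.dumps(body).encode("utf-8")
```

The returned values can be passed to any HTTP client, for example urllib.request.Request(url, data=body, headers=headers, method="POST").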

Response

200 - application/json

status_code

integer

HTTP status code. Examples: 200 (success), 400 (client error), 401 (unauthorized), 404 (not found), 500 (server error).

Example: 200

request_id

string

The unique ID of the request. Use it to locate and troubleshoot issues.

Example: 5c63c65c-cad8-4bf4-959d-xxxxxxxxxxxx

code

string

The error code returned when the request fails. See Error codes.

Example: ""

message

string

The error message returned when the request fails. See Error codes.

Example: ""

output

object

The model's output.

text

string | null

Always null. Ignore this field.

Example: null

choices

unknown

Always null. Ignore this field.

Example: null

finish_reason

enum<string>

null while generation is in progress. "stop" when the model output ends naturally or a stop condition is triggered.

Available options: stop, null

Example: stop

audio

object

Audio information from the model's output.

url

string

The URL of the complete audio file output by the model, valid for 24 hours.

Example: https://example.oss.aliyuncs.com/audio-result.wav?Expires=1766113409&OSSAccessKeyId=LTAIxxxx&Signature=xxxx

data

string

Base64-encoded audio data for streaming output.

Example: ""

id

string

The ID corresponding to the audio information output by the model.

Example: audio_5c63c65c-cad8-4bf4-959d-xxxxxxxxxxxx

expires_at

integer

The UNIX timestamp when the URL expires.

Example: 1766113409
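
The audio fields above support two patterns: decoding Base64 data as it streams in, and checking expires_at before reusing a URL. The sketch below illustrates both, assuming each streamed chunk has been converted to a plain dict shaped like the sample response. collect_streamed_audio and url_is_expired are hypothetical helper names, and the sketch makes no assumption about the container or sample format of the decoded bytes.

```python
import base64
import time


def collect_streamed_audio(chunks) -> bytes:
    """Concatenate the decoded audio bytes from an iterable of response dicts."""
    audio = bytearray()
    for chunk in chunks:
        info = (chunk.get("output") or {}).get("audio") or {}
        data = info.get("data")
        if data:  # skip chunks whose data field is empty
            audio.extend(base64.b64decode(data))
    return bytes(audio)


def url_is_expired(audio_info: dict, now=None) -> bool:
    """Return True once the UNIX timestamp in expires_at has passed."""
    current = time.time() if now is None else now
    return current >= audio_info["expires_at"]
```

Passing now explicitly makes the expiry check easy to exercise without depending on the system clock.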

usage

object

Token or character consumption information. Qwen-TTS returns token consumption. Qwen3-TTS-Flash returns character consumption.

input_tokens

integer

The number of tokens consumed by the input text. For Qwen3-TTS-Flash, this field is always 0.

Example: 0

output_tokens

integer

The number of tokens consumed by the output audio. For Qwen3-TTS-Flash, this field is always 0.

Example: 0

total_tokens

integer

The total number of tokens consumed by this request. Only Qwen-TTS returns this field.

Example: 1121

characters

integer

The number of characters in the input text. Only Qwen3-TTS-Flash returns this field.

Example: 195

input_tokens_details

object

Token consumption for the input text. Only Qwen-TTS returns this field.

text_tokens

integer

The number of tokens consumed by the input text.

Example: 76

output_tokens_details

object

Token consumption for the output. Only Qwen-TTS returns this field.

audio_tokens

integer

The number of tokens consumed by the output audio.

Example: 1045

text_tokens

integer

The number of tokens consumed by the output text (currently fixed at 0).

Example: 0
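
Since usage reports characters for Qwen3-TTS-Flash and tokens for Qwen-TTS, client code that logs consumption has to pick the right field. The following is a minimal sketch, assuming usage is available as a plain dict; billed_usage is a hypothetical helper name, and preferring characters whenever that key is present is an assumption of this sketch rather than documented behavior.

```python
def billed_usage(usage: dict) -> tuple:
    """Return ("characters", n) or ("tokens", n), depending on the fields present."""
    # Assumption: a "characters" key marks a character-metered model.
    if "characters" in usage:
        return ("characters", usage["characters"])
    return ("tokens", usage.get("total_tokens", 0))
```

For the usage object in the sample response, this returns ("characters", 195).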