Non-real-time speech synthesis with Qwen3-TTS
Non-real-time speech synthesis converts text to speech (TTS) through an HTTP API. It suits latency-tolerant scenarios such as audiobook production, e-learning narration, and content production.
In streaming output, audio data is delivered in Base64 format, and the last packet contains the URL for the complete audio file.
Overview
Convert complete text to speech files through an HTTP API. Two output modes are available: non-streaming and streaming.
- Non-streaming mode returns an audio file URL that expires after 24 hours. Streaming mode returns PCM audio data in chunks.
- Supports multiple languages, including Chinese dialects.
- Supports voice cloning and voice design for custom voice creation.
- Supports instruction control, which lets you control speech expressiveness through natural-language instructions.
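As a rough illustration of handling streaming output, the sketch below joins audio chunks and wraps them in a WAV container so that ordinary players can open the result. It assumes each chunk arrives as a Base64 string of 16-bit mono PCM at 24000 Hz; the function name and these audio parameters are assumptions for illustration, not values confirmed by this page.

```python
import base64
import io
import wave


def pcm_chunks_to_wav(b64_chunks, sample_rate=24000, channels=1, sample_width=2):
    """Decode Base64 PCM chunks and return the bytes of a WAV file.

    sample_rate, channels and sample_width are assumed defaults;
    adjust them to whatever the service actually returns.
    """
    # Decode every chunk and join the raw PCM in arrival order.
    pcm = b"".join(base64.b64decode(chunk) for chunk in b64_chunks)

    # Write a WAV header followed by the PCM frames into memory.
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```

Collecting chunks in memory keeps the sketch short; for long texts, writing each decoded chunk to disk as it arrives would use less memory.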
Prerequisites
- Get an API key and set it as an environment variable.
- To call the API through an SDK, install the SDK first. The Java SDK must be version 2.21.9 or later, and the Python SDK must be version 1.24.6 or later.
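The prerequisites above can be verified with a small preflight check before any synthesis request is sent. This is a minimal sketch: the environment variable name DASHSCOPE_API_KEY and the package name dashscope are assumptions not stated on this page, so replace them with the names used in your setup.

```python
import os
import re
from importlib import metadata

MIN_PYTHON_SDK = "1.24.6"  # minimum Python SDK version stated above


def version_tuple(version):
    """Turn '1.24.6' into (1, 24, 6), keeping only leading digits per part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_PYTHON_SDK):
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment(env_var="DASHSCOPE_API_KEY", package="dashscope"):
    """Return a list of problems; an empty list means the setup looks ready.

    env_var and package are assumed names, not confirmed by this page.
    """
    problems = []
    if not os.environ.get(env_var):
        problems.append(f"environment variable {env_var} is not set")
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        problems.append(f"package {package} is not installed")
    else:
        if not meets_minimum(installed):
            problems.append(f"{package} {installed} is older than {MIN_PYTHON_SDK}")
    return problems
```

Calling check_environment() at program start and printing the returned list gives a quick view of what is missing.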
Quick start
The following examples demonstrate how to synthesize speech with the Qwen-TTS model family. For detailed parameter descriptions, see the API reference.
Use system voice
Use a built-in voice for speech synthesis.
Non-streaming output
Use the returned url to retrieve the synthesized audio. The URL is valid for 24 hours.
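Because the URL stops working after 24 hours, it is usually safest to save the audio locally as soon as the response arrives. The sketch below uses only the Python standard library; the function names are illustrative, and treating the moment the response is received as the start of the 24-hour window is an assumption.

```python
import shutil
import urllib.request
from datetime import timedelta

URL_LIFETIME = timedelta(hours=24)  # validity window stated above


def url_expires_at(received_at):
    """Estimate when the URL stops working.

    Assumes the 24-hour window starts when the response is received.
    """
    return received_at + URL_LIFETIME


def download_audio(url, destination):
    """Copy the audio behind `url` into the local file `destination`."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

For example, download_audio(audio_url, "output.wav") stores the file, where audio_url stands for the url value taken from the response.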
You must import the Gson dependency for Java. If you use Maven or Gradle, add the dependency as follows:
- Maven
- Gradle
Add the following content to pom.xml:
Use cloned voice
Voice cloning does not provide preview audio. Apply the cloned voice to speech synthesis to evaluate the result.
These examples adapt the non-streaming output code, replacing the voice parameter with a cloned voice.
- Key principle: The model used for voice cloning (target_model) must match the model used for speech synthesis (model). Otherwise, synthesis fails.
- This example uses the local audio file voice.mp3 for voice cloning. Replace this path when running the code.
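The key principle above can be enforced on the client side before a request is sent, which fails faster than waiting for the service to reject it. The sketch below is illustrative only: the CustomVoice class, the synthesis_params helper, and the text key are hypothetical names, while voice, model, and target_model follow the parameter names used on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomVoice:
    """A custom voice together with the model it was created for."""
    voice: str         # voice identifier returned when the voice was created
    target_model: str  # model specified when the voice was created


def synthesis_params(custom_voice, model, text):
    """Build request parameters, refusing a model mismatch up front."""
    if custom_voice.target_model != model:
        raise ValueError(
            f"voice was created for {custom_voice.target_model!r}, "
            f"but synthesis uses {model!r}"
        )
    # "text" is a hypothetical key name for the input text.
    return {"model": model, "voice": custom_voice.voice, "text": text}
```

The same check applies to voices created through voice design, since they also carry a target_model.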
- Maven
- Gradle
Add the following content to your pom.xml:
When using a custom voice generated by voice cloning for speech synthesis, set the voice as follows:
Use designed voice
Voice design returns preview audio. Listen to the preview to confirm it meets your expectations before using it for synthesis to reduce costs.
1. Generate a custom voice and preview the result
If you are satisfied with the result, proceed to the next step. Otherwise, generate it again.
You need to import the Gson dependency for Java. If you are using Maven or Gradle, add the dependency as follows:
- Maven
- Gradle
Add the following content to pom.xml:
When using a custom voice generated by voice design for speech synthesis, you must set the voice as follows:
2. Use the custom voice for speech synthesis
Use the custom voice generated in the previous step for non-streaming speech synthesis.
This example adapts the non-streaming output code, replacing the voice parameter with the custom voice generated by voice design. For streaming synthesis, see Quick start.
Key principle: The model used for voice design (target_model) must be the same as the model used for subsequent speech synthesis (model). Otherwise, the synthesis will fail.
Instruction control
Control pitch, speed, emotion, and timbre using natural language instructions instead of audio parameters.
Supported models: Qwen3-TTS-Instruct-Flash series only.
Usage: Specify instructions in the instructions parameter. Example: "Fast-paced with rising intonation, suitable for fashion products."
Supported languages: Chinese and English only.
Length limit: Maximum 1600 tokens.
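A request that uses instruction control might be assembled as in the sketch below. The instructions parameter name comes from this page; the helper name, the text key, and the assumption that model IDs in the supported series contain "tts-instruct-flash" are illustrative guesses. The 1600-token limit is not checked here because counting tokens requires the service's tokenizer.

```python
def instruct_params(model, text, voice, instructions):
    """Build request parameters for instruction-controlled synthesis."""
    # Assumed naming pattern for the Qwen3-TTS-Instruct-Flash series.
    if "tts-instruct-flash" not in model.lower():
        raise ValueError(f"{model!r} does not look like an Instruct-Flash model")
    if not instructions.strip():
        raise ValueError("instructions must not be empty")
    return {
        "model": model,
        "text": text,  # hypothetical key name for the input text
        "voice": voice,
        "instructions": instructions.strip(),
    }
```

Rejecting an unsupported model locally avoids sending a request whose instructions would not be applied.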
Scenarios:
- Audiobook and radio drama voice-overs
- Advertising and promotional video voice-overs
- Game role and animation voice-overs
- Emotionally intelligent voice assistants
- Documentary and news broadcasting
Guidelines for writing instructions:
- Be specific: Use descriptive words such as "deep," "crisp," or "fast-paced." Avoid vague words such as "nice" or "normal."
- Be multi-dimensional: Combine multiple dimensions such as pitch, speed, and emotion. Single-dimension descriptions such as "high-pitched" are too broad.
- Be objective: Focus on physical and perceptual features, not personal preferences. Use "high-pitched and energetic" instead of "my favorite sound."
- Be original: Describe sound qualities instead of requesting imitation of specific people. The model does not support direct imitation.
- Be concise: Ensure every word serves a purpose. Avoid repetitive synonyms or meaningless intensifiers.
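Some of these guidelines can be checked mechanically before an instruction is submitted. The sketch below is a toy reviewer, not part of any SDK: it flags only the vague words and the subjective phrase quoted in the list above, and the word lists are meant to be extended.

```python
import re

VAGUE_WORDS = {"nice", "normal"}        # vague words named in the guidelines
SUBJECTIVE_PHRASES = ("my favorite",)   # subjective phrase named in the guidelines


def review_instruction(instruction):
    """Return a list of warnings for an instruction; empty means no findings."""
    lowered = instruction.lower()
    words = set(re.findall(r"[a-z']+", lowered))
    warnings = []

    vague = sorted(words & VAGUE_WORDS)
    if vague:
        warnings.append("vague wording: " + ", ".join(vague))

    for phrase in SUBJECTIVE_PHRASES:
        if phrase in lowered:
            warnings.append("subjective phrasing: " + phrase)
    return warnings
```

A check like this catches only literal matches; it cannot judge whether an instruction is specific or multi-dimensional enough.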
Description dimensions:
| Dimension | Example |
|---|---|
| Pitch | High, medium, low, high-pitched, low-pitched |
| Speed | Fast, medium, slow, fast-paced, slow-paced |
| Emotion | Cheerful, calm, gentle, serious, lively, composed, soothing |
| Characteristics | Magnetic, crisp, hoarse, mellow, sweet, deep, powerful |
| Usage | News broadcast, ad voice-over, audiobook, animation role, voice assistant, documentary narration |
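One way to apply the dimensions in the table is to assemble an instruction from one value per dimension. The helper below is a hypothetical sketch; requiring at least two dimensions is a design choice that follows the multi-dimensional guideline, not a rule enforced by the service.

```python
def compose_instruction(pitch=None, speed=None, emotion=None,
                        characteristics=None, usage=None):
    """Join the supplied dimensions into a single instruction sentence."""
    parts = [value for value in (pitch, speed, emotion, characteristics) if value]
    if usage:
        parts.append(f"suitable for {usage}")
    if len(parts) < 2:
        raise ValueError("combine at least two dimensions")
    sentence = ", ".join(parts)
    # Capitalize only the first character so the rest keeps its casing.
    return sentence[0].upper() + sentence[1:] + "."


print(compose_instruction(
    pitch="high-pitched",
    speed="medium speed",
    emotion="full of energy and appeal",
    usage="ad voice-overs",
))
# prints: High-pitched, medium speed, full of energy and appeal, suitable for ad voice-overs.
```

The printed sentence reproduces the ad voice-over style instruction listed in this section.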
Examples:
- Standard broadcast style: Clear and precise articulation, well-rounded pronunciation.
- Progressive emotional effect: Volume rapidly increases from normal conversation to a shout, with a straightforward personality and easily excited, expressive emotions.
- Special emotional state: A sobbing tone causes slightly slurred and hoarse pronunciation, with noticeable tension in the crying voice.
- Ad voice-over style: High-pitched, medium speed, full of energy and appeal, suitable for ad voice-overs.
- Gentle and soothing style: Slow-paced, with a gentle and sweet pitch, and a soothing, warm tone, like a caring friend.
Voice customization
Qwen3-TTS supports both voice cloning (Qwen3-TTS-VC) and voice design (Qwen3-TTS-VD). See Voice cloning (Qwen) and Voice design (Qwen) for the API reference.
API reference
Built-in voices
See Qwen-TTS voice list for the list of supported voices, model compatibility, and audio samples.
FAQ
Q: How long is the audio file URL valid?
The audio file URL expires after 24 hours.
Learn more
- Real-time speech synthesis — Real-time streaming speech synthesis with WebSocket
- CosyVoice voice list
- Qwen-TTS voice list