Clone a voice from 10-20 seconds of audio. The API returns a voice identifier instantly -- no training required.
How it works
- Clone a voice -- Call the voice cloning API with an audio sample. The API returns a voice identifier instantly.
- Synthesize speech -- Pass the voice identifier to a synthesis endpoint. The synthesis model must match the target_model from step 1.
Set target_model during voice creation to match the synthesis model. Mismatched models cause synthesis to fail.
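To make the matching rule concrete, here is a minimal sketch of how the two steps relate. The payload fields mirror the end-to-end example later on this page; `<BASE64_AUDIO>` is a placeholder, not real data:

```python
TARGET_MODEL = "qwen3-tts-vc-realtime-2026-01-15"

# Step 1: the voice-creation payload records which synthesis model the
# cloned voice is bound to, via target_model.
enrollment_input = {
    "action": "create",
    "target_model": TARGET_MODEL,
    "preferred_name": "myvoice",
    "audio": {"data": "data:audio/mpeg;base64,<BASE64_AUDIO>"},
}

# Step 2: every synthesis request for the returned voice must use the
# same model, or synthesis fails.
synthesis_model = TARGET_MODEL
assert synthesis_model == enrollment_input["target_model"]
```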
Choose a model
- Voice cloning model: qwen-voice-enrollment (fixed for all requests)
- Speech synthesis model (target_model): choose based on latency and streaming needs:
| Model series | Model ID | Streaming | Latency | Use case |
|---|---|---|---|---|
| Qwen3-TTS-VC-Realtime | qwen3-tts-vc-realtime-2026-01-15 | Bidirectional (WebSocket) | Low | Real-time applications, conversational AI, live audio |
| Qwen3-TTS-VC-Realtime | qwen3-tts-vc-realtime-2025-11-27 | Bidirectional (WebSocket) | Low | Real-time applications (previous version) |
| Qwen3-TTS-VC | qwen3-tts-vc-2026-01-22 | Non-streaming / unidirectional | Standard | Batch processing, pre-recorded content, offline generation |
These models only support custom cloned voices, not system voices like Chelsie, Serena, Ethan, or Cherry.
For model details, see Realtime streaming TTS or Qwen TTS.
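If you select the model programmatically, the table collapses into a small helper. This is a sketch only; the IDs are the ones listed above, and the helper itself is not part of the API:

```python
def pick_target_model(realtime: bool, latest: bool = True) -> str:
    """Map streaming needs to a synthesis model ID from the table above."""
    if realtime:
        # Bidirectional WebSocket streaming, low latency
        return ("qwen3-tts-vc-realtime-2026-01-15" if latest
                else "qwen3-tts-vc-realtime-2025-11-27")
    # Non-streaming batch synthesis
    return "qwen3-tts-vc-2026-01-22"
```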
Audio requirements
| Item | Requirement |
|---|---|
| Format | WAV (16-bit), MP3, M4A |
| Duration | 10 -- 20 seconds recommended. 60 seconds maximum. |
| File size | Less than 10 MB |
| Sample rate | 24 kHz or higher |
| Channels | Mono |
| Content | At least 3 seconds of continuous, clear speech. Short pauses (up to 2 seconds) are acceptable. No background music, ambient noise, or overlapping voices. Do not use singing or song audio. |
| Language | Chinese (zh), English (en), German (de), Italian (it), Portuguese (pt), Spanish (es), Japanese (ja), Korean (ko), French (fr), Russian (ru) |
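Before uploading a sample, you can sanity-check it against these limits locally. This sketch uses Python's standard wave module, so it covers WAV files only (MP3 or M4A would need a third-party reader), and the duration threshold below treats under 10 seconds as a warning rather than a hard failure:

```python
import os
import wave

def check_wav(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file meets the specs above."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getsampwidth() != 2:
            problems.append("not 16-bit PCM")
        if w.getnchannels() != 1:
            problems.append("not mono")
        if w.getframerate() < 24000:
            problems.append("sample rate below 24 kHz")
        duration = w.getnframes() / w.getframerate()
    if duration > 60:
        problems.append("longer than 60 seconds")
    elif duration < 10:
        problems.append("shorter than the recommended 10 seconds")
    if os.path.getsize(path) >= 10 * 1024 * 1024:
        problems.append("10 MB or larger")
    return problems
```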
Recording tips
Quick-start checklist
Use this checklist in a standard bedroom or similar small room:
- Close all windows and doors to block external noise.
- Turn off air conditioners, fans, and other electrical devices.
- Draw curtains to reduce glass reflections.
- Cover your desk with clothing or a blanket to reduce surface reflections.
- Read through your script. Define your character's tone and practice delivering naturally.
- Position the recording device approximately 10 cm from your mouth. Too close causes plosive distortion; too far produces a weak signal.
- Start recording.
Recording devices
Use a smartphone, digital voice recorder, or professional audio recorder.
Set up your recording environment
Choose the right room:
| Requirement | Details |
|---|---|
| Room size | Record in a small enclosed space (max 10 m²). |
| Acoustic treatment | Choose a room with sound-absorbing materials: acoustic foam, carpets, or curtains. |
| Spaces to avoid | Avoid auditoriums, conference rooms, and classrooms -- these large spaces cause strong reverberation that degrades clone quality. |
Control noise:
| Noise source | Mitigation |
|---|---|
| Outdoor noise | Close all windows and doors. Avoid recording near traffic or construction. |
| Indoor noise | Turn off air conditioners, fans, and fluorescent lamp ballasts before recording. |
Record a few seconds of ambient sound on your smartphone, then play it back at high volume to identify hidden noise sources.
Reduce reverberation:
Reverberation blurs speech and reduces definition, directly impacting clone fidelity.
- Draw curtains, open closet doors, or cover desks/cabinets with clothing or bed sheets to reduce reflections from smooth surfaces.
- Place irregular objects (bookshelves, upholstered furniture) to scatter sound waves.
Prepare your script
| Guideline | Details |
|---|---|
| Content | No strict restrictions apply. Align content with your target use case. |
| Sentence structure | Use complete sentences. Avoid short phrases ("Hello", "Yes") that lack vocal information for cloning. |
| Continuity | Maintain semantic continuity -- pause infrequently and aim for 3+ seconds of uninterrupted speech per segment. |
| Emotional expression | Add appropriate emotional expression (warmth, friendliness, seriousness). Monotone delivery reduces clone naturalness. |
| Content restrictions | Do not include sensitive words (politics, pornography, violence). Recordings with this content will fail cloning. |
End-to-end example
Create a cloned voice from a local audio file, then use it for speech synthesis. Both steps use the same target_model.
Replace voice.mp3 with the path to your own audio file.
Bidirectional streaming (real-time)
Applies to Qwen3-TTS-VC-Realtime models. For parameter details, see Realtime streaming TTS.
Python

```python
# pyaudio installation:
# macOS: brew install portaudio && pip install pyaudio
# Ubuntu: sudo apt-get install python3-pyaudio (or pip install pyaudio)
# CentOS: sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Windows: python -m pip install pyaudio
import base64
import os
import pathlib
import threading
import time

import pyaudio
import requests

import dashscope
from dashscope.audio.qwen_tts_realtime import QwenTtsRealtime, QwenTtsRealtimeCallback, AudioFormat

TARGET_MODEL = "qwen3-tts-vc-realtime-2026-01-15"
VOICE_FILE = "voice.mp3"  # Replace with your audio file
TEXT_TO_SYNTHESIZE = [
    "Today we explore the wonders of speech synthesis.",
    "Each voice carries a unique character.",
    "With voice cloning, you can bring any text to life.",
    "Let's create something amazing together.",
]


def create_voice(file_path: str) -> str:
    """Create a cloned voice and return the voice identifier."""
    api_key = os.getenv("DASHSCOPE_API_KEY")
    base64_str = base64.b64encode(pathlib.Path(file_path).read_bytes()).decode()
    data_uri = f"data:audio/mpeg;base64,{base64_str}"
    response = requests.post(
        "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        json={
            "model": "qwen-voice-enrollment",
            "input": {
                "action": "create",
                "target_model": TARGET_MODEL,
                "preferred_name": "myvoice",
                "audio": {"data": data_uri},
            },
        },
    )
    return response.json()["output"]["voice"]


class MyCallback(QwenTtsRealtimeCallback):
    """Plays each audio delta as it arrives and signals when the session finishes."""

    def __init__(self):
        self.complete_event = threading.Event()
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

    def on_event(self, response: dict) -> None:
        if response.get("type") == "response.audio.delta":
            audio_data = base64.b64decode(response["delta"])
            self._stream.write(audio_data)
        elif response.get("type") == "session.finished":
            self.complete_event.set()


if __name__ == "__main__":
    dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
    callback = MyCallback()
    tts = QwenTtsRealtime(
        model=TARGET_MODEL,
        callback=callback,
        url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
    )
    tts.connect()
    tts.update_session(
        voice=create_voice(VOICE_FILE),
        response_format=AudioFormat.PCM_24000HZ_MONO_16BIT,
        mode="server_commit",
    )
    for text in TEXT_TO_SYNTHESIZE:
        tts.append_text(text)
        time.sleep(0.1)
    tts.finish()
    callback.complete_event.wait()
```
Java

```java
import com.alibaba.dashscope.audio.qwen_tts_realtime.*;
import com.google.gson.Gson;
import com.google.gson.JsonObject;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.*;
import java.util.Base64;
import java.util.concurrent.CountDownLatch;

public class Main {
    private static final String TARGET_MODEL = "qwen3-tts-vc-realtime-2026-01-15";
    private static final String AUDIO_FILE = "voice.mp3"; // Replace with your audio file

    public static String createVoice() throws Exception {
        String apiKey = System.getenv("DASHSCOPE_API_KEY");
        byte[] bytes = Files.readAllBytes(Paths.get(AUDIO_FILE));
        String encoded = Base64.getEncoder().encodeToString(bytes);
        String dataUri = "data:audio/mpeg;base64," + encoded;
        String jsonPayload = "{\"model\":\"qwen-voice-enrollment\",\"input\":{"
                + "\"action\":\"create\",\"target_model\":\"" + TARGET_MODEL + "\","
                + "\"preferred_name\":\"myvoice\",\"audio\":{\"data\":\"" + dataUri + "\"}}}";
        HttpURLConnection con = (HttpURLConnection) new URL(
                "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization").openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Authorization", "Bearer " + apiKey);
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);
        try (OutputStream os = con.getOutputStream()) {
            os.write(jsonPayload.getBytes("UTF-8"));
        }
        BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream(), "UTF-8"));
        StringBuilder response = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) response.append(line);
        return new Gson().fromJson(response.toString(), JsonObject.class)
                .getAsJsonObject("output").get("voice").getAsString();
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch latch = new CountDownLatch(1);
        QwenTtsRealtimeParam param = QwenTtsRealtimeParam.builder()
                .model(TARGET_MODEL)
                .url("wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
                .apikey(System.getenv("DASHSCOPE_API_KEY"))
                .build();
        QwenTtsRealtime tts = new QwenTtsRealtime(param, new QwenTtsRealtimeCallback() {
            public void onEvent(JsonObject msg) {
                if (msg.get("type").getAsString().equals("session.finished")) latch.countDown();
            }
        });
        tts.connect();
        QwenTtsRealtimeConfig config = QwenTtsRealtimeConfig.builder()
                .voice(createVoice())
                .responseFormat(QwenTtsRealtimeAudioFormat.PCM_24000HZ_MONO_16BIT)
                .mode("server_commit").build();
        tts.updateSession(config);
        for (String text : new String[]{
                "Today we explore the wonders of speech synthesis.",
                "Each voice carries a unique character.",
                "With voice cloning, you can bring any text to life.",
                "Let's create something amazing together."}) {
            tts.appendText(text);
            Thread.sleep(100);
        }
        tts.finish();
        latch.await();
    }
}
```
Non-streaming synthesis
Applies to Qwen3-TTS-VC models. For details, see Qwen TTS.
Python

```python
import base64
import os
import pathlib

import requests

import dashscope

TARGET_MODEL = "qwen3-tts-vc-2026-01-22"
VOICE_FILE = "voice.mp3"  # Replace with your audio file


def create_voice(file_path: str) -> str:
    """Create a cloned voice and return the voice identifier."""
    api_key = os.getenv("DASHSCOPE_API_KEY")
    base64_str = base64.b64encode(pathlib.Path(file_path).read_bytes()).decode()
    data_uri = f"data:audio/mpeg;base64,{base64_str}"
    response = requests.post(
        "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        json={
            "model": "qwen-voice-enrollment",
            "input": {"action": "create", "target_model": TARGET_MODEL,
                      "preferred_name": "myvoice", "audio": {"data": data_uri}},
        },
    )
    return response.json()["output"]["voice"]


if __name__ == "__main__":
    dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"
    response = dashscope.MultiModalConversation.call(
        model=TARGET_MODEL,
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        text="Today we explore the wonders of speech synthesis.",
        voice=create_voice(VOICE_FILE),
        stream=False,
    )
    print(response)
```
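The Python example above only prints the response; as in the Java version, the output carries a URL to the generated audio rather than raw bytes. A minimal sketch for saving it locally, assuming you have already extracted the URL from the response object (the exact field path is not shown here):

```python
import requests

def save_audio(audio_url: str, out_path: str = "output.wav") -> str:
    """Download the synthesized audio from the returned URL and save it to a file."""
    resp = requests.get(audio_url, timeout=60)
    resp.raise_for_status()  # surface HTTP errors instead of saving an error page
    with open(out_path, "wb") as f:
        f.write(resp.content)
    return out_path
```

Note that the returned URLs are typically short-lived, so download the file promptly rather than storing the URL.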
Java

```java
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.Gson;
import com.google.gson.JsonObject;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.Base64;

public class Main {
    private static final String TARGET_MODEL = "qwen3-tts-vc-2026-01-22";
    private static final String AUDIO_FILE = "voice.mp3"; // Replace with your audio file

    public static String createVoice() throws Exception {
        String apiKey = System.getenv("DASHSCOPE_API_KEY");
        byte[] bytes = Files.readAllBytes(Paths.get(AUDIO_FILE));
        String encoded = Base64.getEncoder().encodeToString(bytes);
        String dataUri = "data:audio/mpeg;base64," + encoded;
        String jsonPayload = "{\"model\":\"qwen-voice-enrollment\",\"input\":{"
                + "\"action\":\"create\",\"target_model\":\"" + TARGET_MODEL + "\","
                + "\"preferred_name\":\"myvoice\",\"audio\":{\"data\":\"" + dataUri + "\"}}}";
        HttpURLConnection con = (HttpURLConnection) new URL(
                "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization").openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Authorization", "Bearer " + apiKey);
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);
        try (OutputStream os = con.getOutputStream()) {
            os.write(jsonPayload.getBytes(StandardCharsets.UTF_8));
        }
        BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8));
        StringBuilder response = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) response.append(line);
        return new Gson().fromJson(response.toString(), JsonObject.class)
                .getAsJsonObject("output").get("voice").getAsString();
    }

    public static void main(String[] args) {
        try {
            Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
            MultiModalConversation conv = new MultiModalConversation();
            MultiModalConversationParam param = MultiModalConversationParam.builder()
                    .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                    .model(TARGET_MODEL)
                    .text("Today we explore the wonders of speech synthesis.")
                    .parameter("voice", createVoice())
                    .build();
            MultiModalConversationResult result = conv.call(param);
            String audioUrl = result.getOutput().getAudio().getUrl();
            System.out.println("Audio URL: " + audioUrl);
            // Download the generated audio to a local file
            try (InputStream in = new URL(audioUrl).openStream();
                 FileOutputStream out = new FileOutputStream("output.wav")) {
                byte[] buffer = new byte[1024];
                int bytesRead;
                while ((bytesRead = in.read(buffer)) != -1) {
                    out.write(buffer, 0, bytesRead);
                }
                System.out.println("Audio saved to output.wav");
            }
        } catch (Exception e) {
            System.out.println("Error: " + e.getMessage());
        }
        System.exit(0);
    }
}
Troubleshooting
If you encounter errors, see Error messages.
Next steps