
Voice cloning

Clone a voice from audio samples for use with Qwen TTS models.

Clone a voice from 10-20 seconds of audio. The API returns a voice identifier instantly -- no training required.

How it works

  1. Clone a voice -- Call the voice cloning API with an audio sample. The API returns a voice identifier instantly.
  2. Synthesize speech -- Pass the voice identifier to a synthesis endpoint. The synthesis model must match the target_model from step 1.
Set target_model during voice creation to match the synthesis model. Mismatched models cause synthesis to fail.
Workflow diagram

Choose a model

  • Voice cloning model: qwen-voice-enrollment (fixed for all requests)
  • Speech synthesis model (target_model): Choose based on latency and streaming needs:
| Model series | Model ID | Streaming | Latency | Use case |
| --- | --- | --- | --- | --- |
| Qwen3-TTS-VC-Realtime | qwen3-tts-vc-realtime-2026-01-15 | Bidirectional (WebSocket) | Low | Real-time applications, conversational AI, live audio |
| Qwen3-TTS-VC-Realtime | qwen3-tts-vc-realtime-2025-11-27 | Bidirectional (WebSocket) | Low | Real-time applications (previous version) |
| Qwen3-TTS-VC | qwen3-tts-vc-2026-01-22 | Non-streaming / unidirectional | Standard | Batch processing, pre-recorded content, offline generation |
These models only support custom cloned voices, not system voices like Chelsie, Serena, Ethan, or Cherry.
For model details, see Realtime streaming TTS or Qwen TTS.

Audio requirements

| Item | Requirement |
| --- | --- |
| Format | WAV (16-bit), MP3, M4A |
| Duration | 10-20 seconds recommended; 60 seconds maximum |
| File size | Less than 10 MB |
| Sample rate | 24 kHz or higher |
| Channels | Mono |
| Content | At least 3 seconds of continuous, clear speech. Short pauses (up to 2 seconds) are acceptable. No background music, ambient noise, or overlapping voices. Do not use singing or song audio. |
| Language | Chinese (zh), English (en), German (de), Italian (it), Portuguese (pt), Spanish (es), Japanese (ja), Korean (ko), French (fr), Russian (ru) |
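For WAV input, you can check a file against these limits locally before uploading it. The sketch below uses only Python's standard-library `wave` module; the thresholds mirror the table above, with the 10-second floor treated as the recommended minimum rather than a hard API limit. MP3 and M4A files need third-party tooling to inspect and are not covered here.

```python
import os
import wave

def check_wav(path, max_bytes=10 * 1024 * 1024, min_rate=24000,
              min_sec=10.0, max_sec=60.0):
  """Return a list of requirement violations for a WAV file (empty = OK)."""
  issues = []
  if os.path.getsize(path) >= max_bytes:
    issues.append("file is 10 MB or larger")
  with wave.open(path, "rb") as w:
    if w.getnchannels() != 1:
      issues.append("audio is not mono")
    if w.getsampwidth() != 2:
      issues.append("samples are not 16-bit")
    if w.getframerate() < min_rate:
      issues.append(f"sample rate {w.getframerate()} Hz is below 24 kHz")
    duration = w.getnframes() / w.getframerate()
    if not (min_sec <= duration <= max_sec):
      issues.append(f"duration {duration:.1f} s is outside the 10-60 s window")
  return issues
```

Run it on your sample before calling the enrollment API; an empty list means the file meets the measurable requirements (content quality still needs a listen).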

Recording tips

Quick-start checklist

Use this checklist in a standard bedroom or similar small room:
  1. Close all windows and doors to block external noise.
  2. Turn off air conditioners, fans, and other electrical devices.
  3. Draw curtains to reduce glass reflections.
  4. Cover your desk with clothing or a blanket to reduce surface reflections.
  5. Read through your script. Define your character's tone and practice delivering naturally.
  6. Position the recording device approximately 10 cm from your mouth. Too close causes plosive distortion; too far produces a weak signal.
  7. Start recording.

Recording devices

Use a smartphone, digital voice recorder, or professional audio recorder.

Set up your recording environment

Choose the right room:
| Requirement | Details |
| --- | --- |
| Room size | Record in a small enclosed space (max 10 m²). |
| Acoustic treatment | Choose a room with sound-absorbing materials: acoustic foam, carpets, or curtains. |
| Spaces to avoid | Avoid auditoriums, conference rooms, and classrooms; these large spaces cause strong reverberation that degrades clone quality. |
Control noise:
| Noise source | Mitigation |
| --- | --- |
| Outdoor noise | Close all windows and doors. Avoid recording near traffic or construction. |
| Indoor noise | Turn off air conditioners, fans, and fluorescent lamp ballasts before recording. |
Record a few seconds of ambient sound on your smartphone, then play it back at high volume to identify hidden noise sources.
Reduce reverberation: Reverberation blurs speech and reduces definition, directly impacting clone fidelity.
  • Draw curtains, open closet doors, or cover desks/cabinets with clothing or bed sheets to reduce reflections from smooth surfaces.
  • Place irregular objects (bookshelves, upholstered furniture) to scatter sound waves.

Prepare your script

| Guideline | Details |
| --- | --- |
| Content | No strict restrictions apply. Align content with your target use case. |
| Sentence structure | Use complete sentences. Avoid short phrases ("Hello", "Yes") that lack vocal information for cloning. |
| Continuity | Maintain semantic continuity: pause infrequently and aim for 3+ seconds of uninterrupted speech per segment. |
| Emotional expression | Add appropriate emotional expression (warmth, friendliness, seriousness). Monotone delivery reduces clone naturalness. |
| Content restrictions | Do not include sensitive words (politics, pornography, violence). Recordings with such content will fail cloning. |

End-to-end example

Create a cloned voice from a local audio file, then use it for speech synthesis. Both steps use the same target_model. Replace voice.mp3 with the path to your own audio file.

Bidirectional streaming (real-time)

Applies to Qwen3-TTS-VC-Realtime models. For parameter details, see Realtime streaming TTS.
# pyaudio installation:
#   macOS:   brew install portaudio && pip install pyaudio
#   Ubuntu:  sudo apt-get install python3-pyaudio  (or pip install pyaudio)
#   CentOS:  sudo yum install -y portaudio portaudio-devel && pip install pyaudio
#   Windows: python -m pip install pyaudio

import pyaudio
import os
import requests
import base64
import pathlib
import threading
import time
import dashscope
from dashscope.audio.qwen_tts_realtime import QwenTtsRealtime, QwenTtsRealtimeCallback, AudioFormat

TARGET_MODEL = "qwen3-tts-vc-realtime-2026-01-15"
VOICE_FILE = "voice.mp3"  # Replace with your audio file

TEXT_TO_SYNTHESIZE = [
  'Today we explore the wonders of speech synthesis.',
  'Each voice carries a unique character.',
  'With voice cloning, you can bring any text to life.',
  "Let's create something amazing together."
]

def create_voice(file_path: str) -> str:
  """Create a cloned voice and return the voice identifier."""
  api_key = os.getenv("DASHSCOPE_API_KEY")
  file_path_obj = pathlib.Path(file_path)
  base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
  data_uri = f"data:audio/mpeg;base64,{base64_str}"  # use audio/wav for WAV input

  response = requests.post(
    "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
    headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    json={
      "model": "qwen-voice-enrollment",
      "input": {
        "action": "create",
        "target_model": TARGET_MODEL,
        "preferred_name": "myvoice",
        "audio": {"data": data_uri}
      }
    }
  )
  response.raise_for_status()
  return response.json()["output"]["voice"]

class MyCallback(QwenTtsRealtimeCallback):
  def __init__(self):
    self.complete_event = threading.Event()
    self._player = pyaudio.PyAudio()
    self._stream = self._player.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

  def on_event(self, response: dict) -> None:
    if response.get("type") == "response.audio.delta":
      audio_data = base64.b64decode(response["delta"])
      self._stream.write(audio_data)
    elif response.get("type") == "session.finished":
      self._stream.stop_stream()
      self._stream.close()
      self._player.terminate()
      self.complete_event.set()

if __name__ == "__main__":
  dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
  callback = MyCallback()
  tts = QwenTtsRealtime(model=TARGET_MODEL, callback=callback,
                        url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
  tts.connect()
  tts.update_session(voice=create_voice(VOICE_FILE),
                     response_format=AudioFormat.PCM_24000HZ_MONO_16BIT, mode="server_commit")

  for text in TEXT_TO_SYNTHESIZE:
    tts.append_text(text)
    time.sleep(0.1)

  tts.finish()
  callback.complete_event.wait()
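On a headless machine (server, CI) there is no audio device for pyaudio to open, so a common alternative is to write the PCM deltas to a WAV file instead of playing them. The class below is a standalone sketch with the same `on_event` shape as the playback callback above; in a real program it would subclass `QwenTtsRealtimeCallback` and be passed to `QwenTtsRealtime` in the same way. The 24 kHz mono 16-bit parameters assume the `PCM_24000HZ_MONO_16BIT` response format.

```python
import base64
import threading
import wave

class WavFileCallback:
  """Collect response.audio.delta events into a WAV file (sketch)."""

  def __init__(self, path="output.wav"):
    self.complete_event = threading.Event()
    self._wav = wave.open(path, "wb")
    self._wav.setnchannels(1)      # mono
    self._wav.setsampwidth(2)      # 16-bit PCM
    self._wav.setframerate(24000)  # matches PCM_24000HZ_MONO_16BIT

  def on_event(self, response: dict) -> None:
    if response.get("type") == "response.audio.delta":
      # Deltas arrive base64-encoded; decode and append the raw PCM frames.
      self._wav.writeframes(base64.b64decode(response["delta"]))
    elif response.get("type") == "session.finished":
      self._wav.close()
      self.complete_event.set()
```

Swap this in for `MyCallback` when you want a file artifact rather than live playback.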

Non-streaming synthesis

Applies to Qwen3-TTS-VC models. For details, see Qwen TTS.
import os
import requests
import base64
import pathlib
import dashscope

TARGET_MODEL = "qwen3-tts-vc-2026-01-22"
VOICE_FILE = "voice.mp3"

def create_voice(file_path: str) -> str:
  api_key = os.getenv("DASHSCOPE_API_KEY")
  base64_str = base64.b64encode(pathlib.Path(file_path).read_bytes()).decode()
  data_uri = f"data:audio/mpeg;base64,{base64_str}"  # use audio/wav for WAV input

  response = requests.post(
    "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
    headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    json={
      "model": "qwen-voice-enrollment",
      "input": {"action": "create", "target_model": TARGET_MODEL,
                "preferred_name": "myvoice", "audio": {"data": data_uri}}
    }
  )
  response.raise_for_status()
  return response.json()["output"]["voice"]

if __name__ == "__main__":
  dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
  response = dashscope.MultiModalConversation.call(
    model=TARGET_MODEL,
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    text="Today we explore the wonders of speech synthesis.",
    voice=create_voice(VOICE_FILE),
    stream=False
  )
  print(response)
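Non-streaming calls typically return the synthesized audio as a downloadable URL rather than raw bytes. The helper below is a sketch for saving that result to disk; it assumes the URL sits at `response.output.audio["url"]`, which matches current Qwen TTS responses but may vary by SDK version, so verify against the response you actually receive.

```python
import requests

def save_audio(response, path="output.wav"):
  """Download synthesized audio from a non-streaming TTS response (sketch).

  Assumes the audio URL is at response.output.audio["url"]; check your
  SDK version's response shape before relying on this.
  """
  url = response.output.audio["url"]
  resp = requests.get(url, timeout=30)
  resp.raise_for_status()
  with open(path, "wb") as f:
    f.write(resp.content)
  return path
```

For example, replace `print(response)` above with `save_audio(response)` to keep the audio instead of only inspecting the response.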

Troubleshooting

If you encounter errors, see Error messages.

Next steps