
Audio & speech FAQ

CosyVoice TTS, Qwen-Omni Realtime, and Fun-ASR — common questions about synthesis, real-time conversation, and speech recognition.

CosyVoice real-time synthesis

APIs: Python SDK, Java SDK, WebSocket API

How do I fix inaccurate pronunciation?

Use SSML, including the phoneme tag when needed.

How do I get the billed character count?

Why WebSocket instead of HTTP for TTS? (WebSocket)

The server pushes audio and progress; WebSocket fits low-latency streaming synthesis.

How do I get the request ID?

Why does SSML fail?

  1. Check the SSML limits (supported tags and length).
  2. Upgrade to the latest SDK.
  3. SDK: SSML is supported only on the call / non-streaming interface, not in streaming_call-only flows.
  4. WebSocket: verify SSML support for your configuration.

Why can't the audio play?

  1. File: Match output format to extension; use a compatible player.
  2. Stream: For MP3/Opus use a streaming player (FFmpeg, PyAudio, AudioFormat, MediaSource). Append chunks in arrival order; the container header (WAV/MP3) arrives only once, in the first chunk.
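The appending rule above can be sketched as a minimal buffer. This is a hypothetical helper, not part of the SDK: `on_data` stands in for whatever callback receives each binary WebSocket frame.

```python
import io

class StreamBuffer:
    """Accumulates audio chunks pushed by a synthesis callback.

    Illustrative sketch: `on_data` is assumed to be invoked once per
    binary frame; chunk boundaries are arbitrary, so the buffer must
    simply append bytes in arrival order.
    """

    def __init__(self):
        self._buf = io.BytesIO()

    def on_data(self, chunk: bytes) -> None:
        # Append in order; do NOT re-add container headers per chunk --
        # the first chunk may already carry the WAV/MP3 header.
        self._buf.write(chunk)

    def bytes(self) -> bytes:
        return self._buf.getvalue()
```

Feed the accumulated bytes to a streaming decoder (ffplay, PyAudio after decoding, or a browser MediaSource) rather than a file player that expects a complete container.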

Why does playback stutter?

Send text segments with shorter gaps between them; keep callbacks light (offload heavy work off the WebSocket thread); ensure a stable network.

Why does synthesis take a long time?

Avoid long pauses between sending segments; a healthy session shows roughly 500 ms to the first packet and a real-time factor (RTF) below 1.

Why is trailing text missing or no speech returned?

Why is audio scrambled or garbled? (WebSocket)

Use one task_id across run-task, continue-task, and finish-task, and do not reorder writes; mixed IDs or out-of-order frames produce scrambled audio.
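The ordering constraint above can be made concrete by building all frames up front. The frame names follow the WebSocket API; the payload fields here are illustrative, not the full schema.

```python
import json
import uuid

def build_task_frames(text_segments):
    """Build the ordered run-task / continue-task / finish-task frames
    for one synthesis task, sharing a single task_id.

    Sketch only: real frames carry additional header and payload
    fields defined by the WebSocket API.
    """
    task_id = uuid.uuid4().hex  # one id for the whole task
    frames = [json.dumps({"header": {"action": "run-task",
                                     "task_id": task_id}})]
    for seg in text_segments:
        frames.append(json.dumps(
            {"header": {"action": "continue-task", "task_id": task_id},
             "payload": {"input": {"text": seg}}}))
    frames.append(json.dumps({"header": {"action": "finish-task",
                                         "task_id": task_id}}))
    return frames
```

Writing the frames strictly in this order, on one connection, avoids the interleaving that garbles output audio.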

WebSocket closes with code 1007 or auth errors (WebSocket)

SSL or WebSocketApp errors (Python)

Configure a CA bundle (set SSL_CERT_FILE) or fix the macOS certificate path. For an AttributeError on WebSocketApp, reinstall websocket-client (pip uninstall websocket-client websocket, then pip install websocket-client). Details are in the CosyVoice Python SDK docs.

How do I restrict my API key to CosyVoice only?

Manage workspaces.

More CosyVoice questions

CosyVoice Q&A on GitHub.

Qwen-Omni Realtime

APIs: Python SDK, Java SDK, Client events, Server events

How are input audio and images aligned?

Audio is the timeline. Images attach at send time; you can turn video on or off during the session. About 2 fps for images and ~100 ms audio packets for real-time use.

What is the difference between turn_detection on and off?

With turn_detection enabled (server VAD):
  • End of utterance is detected; inference runs automatically; text/audio responses return.
  • Input can continue during the model response; then the session returns to listening.
  • Barge-in: speaking during playback stops the response and returns to input.
With turn_detection disabled:
  • You end the turn and call commit and create_response / createResponse yourself.
  • Pause audio and video while the model is responding; resume after it finishes.
  • Use cancel_response / cancelResponse to interrupt.
With turn_detection on, you can still use commit, create_response, and cancel_response manually.
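With turn_detection off, the manual flow above amounts to sending two client events at end of turn. The event type strings below mirror the commit / create_response naming the docs reference and are assumptions; your SDK may wrap them differently.

```python
import json

def end_of_turn_events():
    """Events the client sends when it decides the turn is over
    (turn_detection disabled). Event names are illustrative."""
    return [
        # close the audio input turn
        json.dumps({"type": "input_audio_buffer.commit"}),
        # ask the model to start responding
        json.dumps({"type": "response.create"}),
    ]

def interrupt_event():
    # cancel_response equivalent: stop an in-flight response
    return json.dumps({"type": "response.cancel"})
```

Remember to pause audio/video sends after `end_of_turn_events()` and resume only after the response finishes, as described above.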

Why use another model for input_audio_transcription?

Omni responds to input; it is not a dedicated ASR transcript pipeline. Use a separate speech-to-text model for verbatim transcription.

Common audio issues

Recognition issues often come from container vs codec mismatch or wrong sample_rate / format. Verify the real encoding, not only the file extension.

Convert audio with FFmpeg

Use FFmpeg to transcode into a supported format.
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac number_of_channels output.ext

# WAV → MP3 at the highest VBR quality
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# MP3 → 16-bit PCM WAV, 44.1 kHz stereo
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# M4A → AAC without re-encoding (stream copy)
ffmpeg -i input.m4a -c:a copy output.aac
# M4A → AAC re-encoded at 256 kbps
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac
# FLAC → Opus at 128 kbps VBR
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Inspect container, codec, sample rate, and channels

Use ffprobe:
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
Use the reported values for format, sample_rate / sampleRate, and channels where applicable.
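The `default=noprint_wrappers=1` output above is plain `key=value` lines, which are easy to turn into request parameters. A small parser, assuming single-stream audio files:

```python
def parse_ffprobe(output: str) -> dict:
    """Parse ffprobe's default=noprint_wrappers=1 key=value lines.

    Duplicate keys (from extra streams) are ignored so the first
    stream's values win -- a simple policy for single-stream audio.
    """
    info = {}
    for line in output.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            info.setdefault(key.strip(), value.strip())
    return info
```

Use the parsed `sample_rate` and `channels` values directly when filling in the corresponding request parameters, instead of trusting the file extension.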

Fun-ASR realtime

APIs: Python SDK, Java SDK, WebSocket

How do I keep a connection alive during long silence?

Set heartbeat to true and send silent audio continuously. Generate silent audio with FFmpeg:
# Generate 1 second of silent audio at 16kHz, mono
ffmpeg -f lavfi -i anullsrc=r=16000:cl=mono -t 1 -acodec pcm_s16le silent.wav
Alternatively, use Audacity or Adobe Audition to create silent segments.
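If you would rather stay in Python, the standard-library wave module can write the same silent clip; silence in 16-bit PCM is simply all-zero samples.

```python
import wave

def write_silence(path: str, seconds: float = 1.0, rate: int = 16000) -> None:
    """Write a silent 16-bit mono PCM WAV file (FFmpeg-free alternative)."""
    n_frames = int(seconds * rate)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(rate)
        # all-zero samples = silence
        wav.writeframes(b"\x00\x00" * n_frames)
```

Send this clip on a loop during long silences to keep the connection alive.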

How do I recognize a local audio file?

Why use WebSocket instead of HTTP? (WebSocket)

WebSocket is full-duplex; HTTP request/response alone is not suited to continuous real-time audio.

Why is speech not recognized?

  1. Match format and sample_rate / sampleRate to the real audio. See Common audio issues.
  2. Match language_hints (Python) / languageHints (Java) to the spoken language.
  3. Use custom vocabulary / hotwords for business terms, product names, or proper nouns that need precise recognition. See Custom hotwords.

Fun-ASR file transcription

APIs: Python SDK, RESTful API, Java SDK

Is Base64-encoded audio supported?

No. Use public HTTP(S) URLs only — not Base64, raw binary uploads, or local paths as the file reference.
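A quick client-side check can reject unusable references before submission. This is a hypothetical helper, not part of the API; it only enforces the HTTP(S)-URL rule above.

```python
from urllib.parse import urlparse

def is_valid_file_reference(ref: str) -> bool:
    """Reject references the file-transcription API cannot fetch:
    Base64 data URIs, local paths, and non-HTTP(S) schemes."""
    parsed = urlparse(ref)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Run it on each file reference before creating a task to fail fast on local paths or data: URIs.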

How do I host audio at a public URL?

  • Object storage (recommended): create a bucket, upload the file, and use public-read or a temporary signed URL; CDN-friendly. URL form: https://<bucket>.<region>.aliyuncs.com/<object-key> or a custom domain.
  • Web server: serve a directory (for example /var/www/html/audio/) over HTTPS for small tests, e.g. https://your-domain.com/audio/file.mp3.
  • CDN: for high concurrency, e.g. https://cdn.your-domain.com/audio/file.mp3.
Verify with a browser or curl/Postman: expect HTTP 200 and playable audio.
See also Limitations.

How long does recognition take?

Tasks go PENDING → RUNNING → SUCCEEDED or FAILED. Queue time depends on load and file length.

Why can't I get a result after polling?

Possible rate limiting: inspect the error responses and back off between polls.
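Backing off can be as simple as doubling the delay between status queries. In this sketch, `fetch_status` is a caller-supplied placeholder for the task-query call, returning one of the states above.

```python
import time

def poll_with_backoff(fetch_status, max_attempts=8, base_delay=1.0):
    """Poll a task until it reaches a terminal state, doubling the
    delay after each attempt.

    `fetch_status` is a hypothetical wrapper around the task-query
    endpoint returning PENDING / RUNNING / SUCCEEDED / FAILED.
    """
    delay = base_delay
    for _ in range(max_attempts):
        status = fetch_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(delay)
        delay = min(delay * 2, 30.0)  # cap the backoff at 30 s
    return "TIMEOUT"
```

Spacing polls this way keeps you under rate limits while long files sit in the queue.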

What if an OSS temporary URL is inaccessible? (REST)

Set the X-DashScope-OssResourceResolve header to enable resolution. This is intended for raw REST calls; the official Java/Python SDKs may not support this header, so prefer stable public URLs where possible.

Why is the audio not recognized?

Check the format and sample rate; inspect the file with ffprobe as described in Common audio issues.