Improve speech recognition accuracy using custom hotwords and context enhancement.
Qwen Cloud provides two methods to improve ASR accuracy: custom hotwords for term-level biasing and context enhancement for conversation-aware recognition.
Hotwords are supported by Fun-ASR models. The following models are available:
Workflow:
Submit a JSON array of hotword objects.
Example: Improve movie title recognition (Fun-ASR and Paraformer series models)
Field descriptions:
Hotword text length rules:
Weight controls how strongly the model favors a hotword. Set it appropriately to improve target word accuracy without introducing false recognitions.
Start with
API reference: Custom Hotword API Reference
Context enhancement is supported by:
Use case: Best suited for scenarios that combine ASR with large language models. Pass the preceding conversation context (model responses and user speech recognition results) to the ASR model. This significantly improves transcription accuracy for specialized terms such as names, locations, and product terminology, and is more flexible than traditional hotwords.
Usage: Pass conversation history through
The
In the example above, adding a word list or natural language paragraph containing terms like "Bulge Bracket" to the
Check the following in order:
Hotword lists are created the same way. The calling method differs:
In addition to hotwords and context enhancement, consider the following:
| Feature | How it works | Best for |
|---|---|---|
| Custom hotwords | Boost specific terms with priority weights | Fixed terminology: product names, proper nouns, medical terms |
| Context enhancement | Pass conversation history to the ASR model | Dynamic context: names, locations, domain terms from ongoing conversations |
Prerequisites
- Get your API key and set it as an environment variable.
- Install the DashScope SDK.
Custom hotwords
Supported scope
Hotwords are supported by Fun-ASR models. The following models are available:
- Real-time speech recognition: fun-asr-realtime, fun-asr-realtime-2025-11-07
- Non-real-time speech recognition: fun-asr, fun-asr-2025-11-07, fun-asr-2025-08-25, fun-asr-mtl, fun-asr-mtl-2025-08-25, fun-asr-flash-2026-06-15
Quick start
Workflow:
- Create a hotword list: Call the Create API to define a list of hotwords and set
target_modelto the speech recognition model you plan to use. - Use the hotword list: Pass the hotword list ID (
vocabulary_id) in the speech recognition request parameters. Ensure thattarget_modelmatches the model being called.
- Python
- Java
Hotword format
Submit a JSON array of hotword objects.
Example: Improve movie title recognition (Fun-ASR and Paraformer series models)
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The hotword text. Must be supported by the selected model. Use actual words, not random characters. See length rules below. |
| weight | int | Yes | Priority weight, an integer from 1 to 5. Start with 4. Increase if results are weak, but too high a weight can hurt recognition of other words. |
| lang | string | No | Language code. Boosts hotwords for a specific language. Leave empty for auto-detection. See the model's API reference for supported codes. If you set language_hints, only matching hotwords take effect. |
-
Contains non-ASCII characters: Maximum 15 characters total, including non-ASCII characters (Chinese, Japanese kana, Korean Hangul, Russian Cyrillic) and ASCII characters.
Examples:
"厄洛替尼盐酸盐"(7 Chinese characters)"EGFR抑制剂"(3 Chinese characters and 4 ASCII characters, for a total of 7 characters)"こんにちは"(5 characters)"Фенибут Белфарм"(15 characters, including the space)"Клофелин Белмедпрепараты"(24 characters) -- exceeds limit
-
Contains only ASCII characters: Maximum 7 segments. A segment is a sequence of characters separated by spaces.
Examples:
"Exothermic reaction"-- 2 segments"Human immunodeficiency virus type 1"-- 5 segments"The effect of temperature variations on enzyme activity in biochemical reactions"-- 11 segments, exceeds limit
Tune hotword performance
Adjust hotword weights
Weight controls how strongly the model favors a hotword. Set it appropriately to improve target word accuracy without introducing false recognitions.
| Weight | Effect | Best for |
|---|---|---|
| 1-2 | Slight preference | Hotwords that sound similar to common words, where overcorrection must be avoided |
| 3-4 | Clear preference (recommended) | The best starting point for most scenarios |
| 5 | Forced preference | Use only when the term appears frequently in the audio and is unlikely to be confused with other words. An excessively high weight can cause phonetically similar words to be misrecognized as the hotword. |
weight=4 and adjust incrementally based on recognition results.
Design hotword lists
- Group by scenario: Create separate vocabulary lists for different business scenarios (for example, one for medical terms and another for product names) to simplify maintenance and reuse.
- Mix multiple languages: A single vocabulary list can contain terms in different languages. Use the
langfield to distinguish them. Whenlanguage_hintsis specified during speech recognition, only hotwords that match the specified language take effect. - Clean up regularly: Delete unused vocabulary lists to free up quota. Each account supports up to 10 lists.
Limits and billing
| Limit | Description |
|---|---|
| Number of vocabulary lists | 10 per account, shared across all models. |
| Hotwords per list | Up to 500 hotwords per vocabulary list. |
| Billing | Free of charge. |
Context enhancement
Supported scope
Context enhancement is supported by:
- Non-real-time speech recognition: fun-asr-flash-2026-06-15
Quick start
Use case: Best suited for scenarios that combine ASR with large language models. Pass the preceding conversation context (model responses and user speech recognition results) to the ASR model. This significantly improves transcription accuracy for specialized terms such as names, locations, and product terminology, and is more flexible than traditional hotwords.
Usage: Pass conversation history through input.messages. Use the assistant role for prior model responses and the user role with input_text type for prior speech recognition results. Context pairs must appear before the current audio message. For details, see DashScope (Fun-ASR).
Request body structure example:
Effect example
The text field content format is flexible -- it can be a word list, natural language paragraph, or a mix of both. It has high tolerance for unrelated text.
An audio clip should be correctly recognized as: "The jargon within investment banking circles, how much do you know? First, the nine major foreign investment banks, Bulge Bracket, BB ..."
| Without context enhancement | With context enhancement |
|---|---|
| Without context enhancement, some investment bank names are recognized incorrectly. For example, "Bird Rock" should be "Bulge Bracket". Recognition result: "...the nine major foreign investment banks, Bird Rock, BB ..." | With context enhancement, investment bank names are recognized correctly. Recognition result: "...the nine major foreign investment banks, Bulge Bracket, BB ..." |
text field achieves the enhancement effect.
FAQ
Why don't hotwords improve recognition accuracy?
Check the following in order:
- Model mismatch: The
target_modelspecified when creating the list must match the model used by the speech recognition API. A mismatch doesn't cause an error, and recognition still returns results, but the hotwords don't take effect. If the results don't contain expected hotwords, check this first. - Unsupported model: The model must belong to the Fun-ASR or Paraformer family. Other families don't support hotwords. Calling the API with an unsupported model doesn't return an error, but the results may be empty or lack hotword enhancement. If using a model such as SenseVoice, check this first.
- Inappropriate weight: Increase the weight from 4 to 5 and observe the results. If phonetically similar words start being misrecognized as the hotword, reduce it back to 4.
- Hotword list status: Use the Query API to confirm that
statusisOK.
Are hotwords used differently in real-time and file-based recognition?
Hotword lists are created the same way. The calling method differs:
- Real-time speech recognition: Pass
vocabulary_idin the Recognition or WebSocket connection parameters. - File-based speech recognition: Pass
vocabulary_idin the Transcription request parameters.
target_model must match the speech recognition model used in the API call.
How to improve recognition accuracy beyond hotwords?
In addition to hotwords and context enhancement, consider the following:
- Audio quality: Match the sample rate to the model requirements (16 kHz or 8 kHz) and reduce background noise.
- Choose the right model: Different scenarios call for different models. For details, see the Speech-to-text model selection guide.
- Specify the language: Declare the audio language through
language_hintsto improve accuracy in single-language scenarios.