Improve recognition accuracy

Improve speech recognition accuracy using custom hotwords and context enhancement.

Qwen Cloud provides two methods to improve ASR accuracy: custom hotwords for term-level biasing and context enhancement for conversation-aware recognition.
| Feature | How it works | Best for |
| --- | --- | --- |
| Custom hotwords | Boost specific terms with priority weights | Fixed terminology: product names, proper nouns, medical terms |
| Context enhancement | Pass conversation history to the ASR model | Dynamic context: names, locations, domain terms from ongoing conversations |

Prerequisites

  1. Get your API key and set it as an environment variable.
  2. Install the DashScope SDK.

Custom hotwords

Supported scope

Hotwords are supported by Fun-ASR models. The following models are available:
  • Real-time speech recognition: fun-asr-realtime, fun-asr-realtime-2025-11-07
  • Non-real-time speech recognition: fun-asr, fun-asr-2025-11-07, fun-asr-2025-08-25, fun-asr-mtl, fun-asr-mtl-2025-08-25, fun-asr-flash-2026-06-15
For the full model list, see Speech-to-text models.

Quick start

Workflow:
  1. Create a hotword list: Call the Create API to define a list of hotwords and set target_model to the speech recognition model you plan to use.
  2. Use the hotword list: Pass the hotword list ID (vocabulary_id) in the speech recognition request parameters. Ensure that target_model matches the model being called.
Audio file used in the examples: asr_example.wav.
Python example:
import os

import dashscope
from dashscope.audio.asr import Recognition, VocabularyService

# Read the API key from the environment instead of hard-coding it.
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# Service endpoints (the -intl hosts) used by this example.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

prefix = 'testpfx'
target_model = 'fun-asr-realtime'  # must match the model passed to Recognition below

my_vocabulary = [
    {"text": "Speech Laboratory", "weight": 4}
]

# Step 1: create the hotword list for the target model.
service = VocabularyService()
vocabulary_id = service.create_vocabulary(
    prefix=prefix,
    target_model=target_model,
    vocabulary=my_vocabulary)

try:
    # Step 2: run recognition only if the list reports status OK.
    if service.query_vocabulary(vocabulary_id)['status'] == 'OK':
        recognition = Recognition(model=target_model,
                                  format='wav',
                                  sample_rate=16000,
                                  callback=None,
                                  vocabulary_id=vocabulary_id)
        result = recognition.call('asr_example.wav')
        print(result.output)
finally:
    # Delete the example list so it does not count against the list quota.
    service.delete_vocabulary(vocabulary_id)

Hotword format

Submit a JSON array of hotword objects.
Example: improve movie title recognition (Fun-ASR and Paraformer series models):
[
  {"text": "赛德克巴莱", "weight": 4, "lang": "zh"},
  {"text": "Seediq Bale", "weight": 4, "lang": "en"},
  {"text": "夏洛特烦恼", "weight": 4, "lang": "zh"},
  {"text": "Goodbye Mr. Loser", "weight": 4, "lang": "en"},
  {"text": "阙里人家", "weight": 4, "lang": "zh"},
  {"text": "Confucius' Family", "weight": 4, "lang": "en"}
]
Field descriptions:
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The hotword text. Must be supported by the selected model. Use actual words, not random characters. See length rules below. |
| weight | int | Yes | Priority weight, an integer from 1 to 5. Start with 4. Increase if results are weak, but too high a weight can hurt recognition of other words. |
| lang | string | No | Language code. Boosts hotwords for a specific language. Leave empty for auto-detection. See the model's API reference for supported codes. If you set language_hints, only matching hotwords take effect. |
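The field rules above can be checked locally before a list is submitted. The following is a minimal sketch, not part of the DashScope SDK: the helper name validate_hotword and its messages are hypothetical, and it covers only the type and range rules stated in the table, not the language codes a given model accepts.

```python
def validate_hotword(entry: dict) -> list[str]:
    """Return the problems found in one hotword object (empty list if none)."""
    problems = []

    text = entry.get("text")
    if not isinstance(text, str) or not text.strip():
        problems.append("text must be a non-empty string")

    weight = entry.get("weight")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(weight, bool) or not isinstance(weight, int) or not 1 <= weight <= 5:
        problems.append("weight must be an integer from 1 to 5")

    # lang is optional; when present it must be a string.
    lang = entry.get("lang")
    if lang is not None and not isinstance(lang, str):
        problems.append("lang must be a string when present")

    return problems
```

Running this check over every entry before calling the Create API keeps malformed objects out of the request.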
Hotword text length rules:
  • Contains non-ASCII characters: Maximum 15 characters total, including non-ASCII characters (Chinese, Japanese kana, Korean Hangul, Russian Cyrillic) and ASCII characters. Examples:
    • "厄洛替尼盐酸盐" (7 Chinese characters)
    • "EGFR抑制剂" (3 Chinese characters and 4 ASCII characters, for a total of 7 characters)
    • "こんにちは" (5 characters)
    • "Фенибут Белфарм" (15 characters, including the space)
    • "Клофелин Белмедпрепараты" (24 characters) -- exceeds limit
  • Contains only ASCII characters: Maximum 7 segments. A segment is a sequence of characters separated by spaces. Examples:
    • "Exothermic reaction" -- 2 segments
    • "Human immunodeficiency virus type 1" -- 5 segments
    • "The effect of temperature variations on enzyme activity in biochemical reactions" -- 11 segments, exceeds limit
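The two length rules can be expressed as a small local check. This is a sketch under assumptions: the function name hotword_length_ok is hypothetical, characters are counted as Unicode code points with len(), and segments are split on whitespace. The service may count differently in edge cases such as combining characters.

```python
MAX_CHARS_NON_ASCII = 15   # limit when the text contains any non-ASCII character
MAX_SEGMENTS_ASCII = 7     # limit on space-separated segments for ASCII-only text


def hotword_length_ok(text: str) -> bool:
    """Apply the documented length rule that matches the text's character set."""
    if text.isascii():
        # ASCII-only text: count space-separated segments.
        return len(text.split()) <= MAX_SEGMENTS_ASCII
    # Mixed or non-ASCII text: count every character, spaces included.
    return len(text) <= MAX_CHARS_NON_ASCII


if __name__ == "__main__":
    for sample in ["EGFR抑制剂", "Фенибут Белфарм", "Клофелин Белмедпрепараты"]:
        print(sample, hotword_length_ok(sample))
```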

Tune hotword performance

Adjust hotword weights

Weight controls how strongly the model favors a hotword. Set it appropriately to improve target word accuracy without introducing false recognitions.
| Weight | Effect | Best for |
| --- | --- | --- |
| 1-2 | Slight preference | Hotwords that sound similar to common words, where overcorrection must be avoided |
| 3-4 | Clear preference (recommended) | The best starting point for most scenarios |
| 5 | Forced preference | Use only when the term appears frequently in the audio and is unlikely to be confused with other words. An excessively high weight can cause phonetically similar words to be misrecognized as the hotword. |
Start with weight=4 and adjust incrementally based on recognition results.
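One way to make the incremental adjustment concrete is a one-step rule applied after each test run. This is an illustrative heuristic rather than a documented algorithm: the function next_weight and its two boolean inputs are hypothetical, and judging whether a hotword was missed or falsely triggered still requires comparing transcripts by hand or against a reference.

```python
MIN_WEIGHT, MAX_WEIGHT = 1, 5


def next_weight(current: int, missed: bool, false_hits: bool) -> int:
    """Suggest the weight to try next, moving one step at a time.

    missed:     the hotword was spoken but not recognized.
    false_hits: similar-sounding words were recognized as the hotword.
    """
    if false_hits:
        # Overcorrection is the costlier failure, so back off first.
        return max(MIN_WEIGHT, current - 1)
    if missed:
        return min(MAX_WEIGHT, current + 1)
    return current
```

Starting from 4, a missed term moves to 5; if false hits then appear, the rule steps back to 4.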

Design hotword lists

  • Group by scenario: Create separate vocabulary lists for different business scenarios (for example, one for medical terms and another for product names) to simplify maintenance and reuse.
  • Mix multiple languages: A single vocabulary list can contain terms in different languages. Use the lang field to distinguish them. When language_hints is specified during speech recognition, only hotwords that match the specified language take effect.
  • Clean up regularly: Delete unused vocabulary lists to free up quota. Each account supports up to 10 lists.

Limits and billing

| Limit | Description |
| --- | --- |
| Number of vocabulary lists | 10 per account, shared across all models. |
| Hotwords per list | Up to 500 hotwords per vocabulary list. |
| Billing | Free of charge. |
API reference: Custom Hotword API Reference
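The limits above can be checked on the client before a Create call is attempted. The sketch below is an assumption-laden helper, not an SDK function: can_create_list is a hypothetical name, and the caller is expected to supply the current number of lists on the account.

```python
MAX_LISTS_PER_ACCOUNT = 10
MAX_HOTWORDS_PER_LIST = 500


def can_create_list(existing_list_count: int, new_hotwords: list) -> tuple[bool, str]:
    """Check a planned list against the documented account and list limits."""
    if existing_list_count >= MAX_LISTS_PER_ACCOUNT:
        return False, "list quota reached; delete an unused list first"
    if len(new_hotwords) > MAX_HOTWORDS_PER_LIST:
        return False, "too many hotwords for one list"
    return True, "ok"
```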

Context enhancement

Supported scope

Context enhancement is supported by:
  • Non-real-time speech recognition: fun-asr-flash-2026-06-15

Quick start

Use case: Best suited for scenarios that combine ASR with large language models. Pass the preceding conversation context (model responses and user speech recognition results) to the ASR model. This significantly improves transcription accuracy for specialized terms such as names, locations, and product terminology, and is more flexible than traditional hotwords.

Usage: Pass conversation history through input.messages. Use the assistant role for prior model responses and the user role with input_text type for prior speech recognition results. Context pairs must appear before the current audio message. For details, see DashScope (Fun-ASR).

Request body structure example:
{
  "model": "fun-asr-flash-2026-06-15",
  "input": {
    "messages": [
      {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "Prior model response content"
          }
        ]
      },
      {
        "role": "user",
        "content": [
          {
            "type": "input_text",
            "text": "Prior user speech recognition result"
          }
        ]
      },
      {
        "role": "user",
        "content": [
          {
            "type": "input_audio",
            "input_audio": {
              "data": "Audio URL or Base64 of the current audio to recognize"
            }
          }
        ]
      }
    ]
  },
  "parameters": {}
}
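The request body above can be assembled from a running conversation log. The following sketch only builds the JSON structure shown in the example; it does not send the request. The function names build_messages and build_request_body and the (role, text) history format are hypothetical choices for illustration.

```python
import json


def build_messages(history, audio):
    """Turn ordered (role, text) turns plus the current audio into input.messages.

    history: list of (role, text) tuples, role being "assistant" or "user".
    audio:   audio URL or Base64 string of the clip to recognize.
    """
    messages = []
    for role, text in history:
        if role == "assistant":
            part = {"type": "text", "text": text}
        elif role == "user":
            part = {"type": "input_text", "text": text}
        else:
            raise ValueError(f"unsupported role: {role!r}")
        messages.append({"role": role, "content": [part]})
    # The current audio message goes last, after all context turns.
    messages.append({
        "role": "user",
        "content": [{"type": "input_audio", "input_audio": {"data": audio}}],
    })
    return messages


def build_request_body(history, audio, model="fun-asr-flash-2026-06-15"):
    return {
        "model": model,
        "input": {"messages": build_messages(history, audio)},
        "parameters": {},
    }


if __name__ == "__main__":
    turns = [("assistant", "Which investment banks are Bulge Bracket firms?")]
    print(json.dumps(build_request_body(turns, "https://example.com/clip.wav"), indent=2))
```

Keeping the audio message in a fixed final position means the ordering rule holds regardless of how many context turns are passed.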

Effect example

The format of the text field is flexible: it can be a word list, a natural language paragraph, or a mix of both, and it tolerates unrelated text well.

Example: an audio clip whose correct transcription is "The jargon within investment banking circles, how much do you know? First, the nine major foreign investment banks, Bulge Bracket, BB ..."
| Without context enhancement | With context enhancement |
| --- | --- |
| Without context enhancement, some investment bank names are recognized incorrectly. For example, "Bird Rock" should be "Bulge Bracket". Recognition result: "...the nine major foreign investment banks, Bird Rock, BB ..." | With context enhancement, investment bank names are recognized correctly. Recognition result: "...the nine major foreign investment banks, Bulge Bracket, BB ..." |
In the example above, adding a word list or natural language paragraph containing terms like "Bulge Bracket" to the text field achieves the enhancement effect.

FAQ

Why don't hotwords improve recognition accuracy?

Check the following in order:
  1. Model mismatch: The target_model specified when creating the list must match the model used by the speech recognition API. A mismatch doesn't cause an error, and recognition still returns results, but the hotwords don't take effect. If the results don't contain expected hotwords, check this first.
  2. Unsupported model: The model must belong to the Fun-ASR or Paraformer family. Other families don't support hotwords. Calling the API with an unsupported model doesn't return an error, but the results may be empty or lack hotword enhancement. This is the likely cause if you are using a model such as SenseVoice.
  3. Inappropriate weight: Increase the weight from 4 to 5 and observe the results. If phonetically similar words start being misrecognized as the hotword, reduce it back to 4.
  4. Hotword list status: Use the Query API to confirm that status is OK.
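Because a mismatch or an unsupported model fails silently, the static checks in this list are worth automating. The sketch below covers items 1, 2, and 4; item 3 (weight) needs listening tests and is left out. The function diagnose_hotwords is hypothetical, and detecting the model family by name prefix is an assumption made for illustration.

```python
# Assumption: the model family can be inferred from the model name prefix.
SUPPORTED_PREFIXES = ("fun-asr", "paraformer")


def diagnose_hotwords(list_target_model: str, request_model: str, list_status: str) -> list[str]:
    """Return likely reasons why a hotword list has no effect, in checklist order."""
    findings = []
    if list_target_model != request_model:
        findings.append("model mismatch: target_model differs from the request model")
    if not request_model.startswith(SUPPORTED_PREFIXES):
        findings.append("unsupported model family for hotwords")
    if list_status != "OK":
        findings.append("hotword list status is not OK")
    return findings
```

An empty result means the static setup looks consistent, leaving weight tuning as the remaining thing to try.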

Are hotwords used differently in real-time and file-based recognition?

Hotword lists are created the same way. The calling method differs:
  • Real-time speech recognition: Pass vocabulary_id in the Recognition or WebSocket connection parameters.
  • File-based speech recognition: Pass vocabulary_id in the Transcription request parameters.
In both cases, target_model must match the speech recognition model used in the API call.

How can I improve recognition accuracy beyond hotwords?

In addition to hotwords and context enhancement, consider the following:
  • Audio quality: Match the sample rate to the model requirements (16 kHz or 8 kHz) and reduce background noise.
  • Choose the right model: Different scenarios call for different models. For details, see the Speech-to-text model selection guide.
  • Specify the language: Declare the audio language through language_hints to improve accuracy in single-language scenarios.
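For the audio quality point, the sample rate of a WAV file can be read from its header with Python's standard wave module before the file is submitted. This is a minimal sketch: the helper names are hypothetical, it handles uncompressed WAV only, and which of the two rates a given model expects should be confirmed in that model's reference.

```python
import wave

# Rates named in the audio quality guidance above.
SUPPORTED_SAMPLE_RATES = (8000, 16000)


def wav_sample_rate(path: str) -> int:
    """Read the sample rate (Hz) from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def sample_rate_supported(path: str) -> bool:
    return wav_sample_rate(path) in SUPPORTED_SAMPLE_RATES
```

A file that fails this check can be resampled before recognition, and the sample_rate parameter in the request should match the file's actual rate.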