Improve recognition accuracy

Qwen Cloud provides two methods to improve ASR accuracy: custom hotwords for term-level biasing and context enhancement for conversation-aware recognition.

Feature	How it works	Best for
Custom hotwords	Boost specific terms with priority weights	Fixed terminology: product names, proper nouns, medical terms
Context enhancement	Pass conversation history to the ASR model	Dynamic context: names, locations, domain terms from ongoing conversations

Prerequisites

Get your API key and set it as an environment variable.
Install the DashScope SDK.

Custom hotwords

Supported scope

Hotwords are supported by Fun-ASR models. The following models are available:

Real-time speech recognition: fun-asr-realtime, fun-asr-realtime-2025-11-07
Non-real-time speech recognition: fun-asr, fun-asr-2025-11-07, fun-asr-2025-08-25, fun-asr-mtl, fun-asr-mtl-2025-08-25, fun-asr-flash-2026-06-15

For the full model list, see Speech-to-text models.

Quick start

Workflow:

Create a hotword list: Call the Create API to define a list of hotwords and set target_model to the speech recognition model you plan to use.
Use the hotword list: Pass the hotword list ID (vocabulary_id) in the speech recognition request parameters. Ensure that target_model matches the model being called.

Audio file used in the examples: asr_example.wav.

Python

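import os

import dashscope
from dashscope.audio.asr import Recognition, VocabularyService

# Read the API key from the environment variable set in Prerequisites.
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

# Endpoints for HTTP and WebSocket calls.
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

prefix = 'testpfx'                 # custom prefix for the hotword list
target_model = "fun-asr-realtime"  # must match the model used for recognition

my_vocabulary = [
    {"text": "Speech Laboratory", "weight": 4}
]

# Step 1: create the hotword list and get its ID.
service = VocabularyService()
vocabulary_id = service.create_vocabulary(
    prefix=prefix,
    target_model=target_model,
    vocabulary=my_vocabulary)

try:
    # Step 2: use the list only once its status is OK.
    if service.query_vocabulary(vocabulary_id)['status'] == 'OK':
        recognition = Recognition(model=target_model,
                                  format='wav',
                                  sample_rate=16000,
                                  callback=None,
                                  vocabulary_id=vocabulary_id)
        result = recognition.call('asr_example.wav')
        print(result.output)
    else:
        print('Hotword list is not ready; recognition skipped.')
finally:
    # Delete the example list so it doesn't count against the quota.
    service.delete_vocabulary(vocabulary_id)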

Java

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.vocabulary.Vocabulary;
import com.alibaba.dashscope.audio.asr.vocabulary.VocabularyService;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.JsonArray;
import com.google.gson.JsonObject;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class Main {
  public static String apiKey = System.getenv("DASHSCOPE_API_KEY");

  public static void main(String[] args) throws NoApiKeyException, InputRequiredException {
    Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
    Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";

    String targetModel = "fun-asr-realtime";

    JsonArray vocabularyJson = new JsonArray();
    List<Hotword> wordList = new ArrayList<>();
    wordList.add(new Hotword("Speech Laboratory", 4));

    for (Hotword word : wordList) {
      JsonObject jsonObject = new JsonObject();
      jsonObject.addProperty("text", word.text);
      jsonObject.addProperty("weight", word.weight);
      vocabularyJson.add(jsonObject);
    }

    VocabularyService service = new VocabularyService(apiKey);
    Vocabulary vocabulary = service.createVocabulary(targetModel, "testpfx", vocabularyJson);

    try {
      if ("OK".equals(service.queryVocabulary(vocabulary.getVocabularyId()).getStatus())) {
        Recognition recognizer = new Recognition();
        RecognitionParam param =
            RecognitionParam.builder()
                .model(targetModel)
                .apiKey(apiKey)
                .format("wav")
                .sampleRate(16000)
                .vocabularyId(vocabulary.getVocabularyId())
                .build();

        try {
          System.out.println("Recognition result: " + recognizer.call(param, new File("asr_example.wav")));
        } catch (Exception e) {
          e.printStackTrace();
        } finally {
          recognizer.getDuplexApi().close(1000, "bye");
        }
      }
    } finally {
      service.deleteVocabulary(vocabulary.getVocabularyId());
    }
    System.exit(0);
  }
}

class Hotword {
  String text;
  int weight;

  public Hotword(String text, int weight) {
    this.text = text;
    this.weight = weight;
  }
}

Hotword format

Submit hotwords as a JSON array of objects. The following example improves recognition of movie titles (applies to Fun-ASR and Paraformer series models):

[
  {"text": "赛德克巴莱", "weight": 4, "lang": "zh"},
  {"text": "Seediq Bale", "weight": 4, "lang": "en"},
  {"text": "夏洛特烦恼", "weight": 4, "lang": "zh"},
  {"text": "Goodbye Mr. Loser", "weight": 4, "lang": "en"},
  {"text": "阙里人家", "weight": 4, "lang": "zh"},
  {"text": "Confucius' Family", "weight": 4, "lang": "en"}
]

Field descriptions:

Field	Type	Required	Description
text	string	Yes	The hotword text. Must be supported by the selected model. Use actual words, not random characters. See length rules below.
weight	int	Yes	Priority weight, an integer from 1 to 5. Start with 4. Increase if results are weak, but too high a weight can hurt recognition of other words.
lang	string	No	Language code. Boosts hotwords for a specific language. Leave empty for auto-detection. See the model's API reference for supported codes. If you set `language_hints`, only matching hotwords take effect.
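The field rules in the table can be checked locally before a list is submitted. The following is a minimal sketch, not part of the SDK: validate_hotword is a hypothetical helper name, and it checks only what the table states (required text, integer weight from 1 to 5, optional string lang).

```python
def validate_hotword(entry: dict) -> list[str]:
    """Return a list of problems found in one hotword entry (empty if none)."""
    problems = []

    text = entry.get("text")
    if not isinstance(text, str) or not text.strip():
        problems.append("text is required and must be a non-empty string")

    weight = entry.get("weight")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(weight, bool) or not isinstance(weight, int):
        problems.append("weight is required and must be an integer")
    elif not 1 <= weight <= 5:
        problems.append("weight must be between 1 and 5")

    lang = entry.get("lang")
    if lang is not None and not isinstance(lang, str):
        problems.append("lang, when present, must be a string")

    return problems


print(validate_hotword({"text": "Seediq Bale", "weight": 4, "lang": "en"}))
print(validate_hotword({"text": "Seediq Bale", "weight": 7}))
```

The helper does not verify that a lang code is supported by the selected model; that still depends on the model's API reference.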

Hotword text length rules:

Contains non-ASCII characters: Maximum 15 characters total, including non-ASCII characters (Chinese, Japanese kana, Korean Hangul, Russian Cyrillic) and ASCII characters. Examples:
- "厄洛替尼盐酸盐" (7 Chinese characters)
- "EGFR抑制剂" (3 Chinese characters and 4 ASCII characters, for a total of 7 characters)
- "こんにちは" (5 characters)
- "Фенибут Белфарм" (15 characters, including the space)
- "Клофелин Белмедпрепараты" (24 characters) -- exceeds limit
Contains only ASCII characters: Maximum 7 segments. A segment is a sequence of characters separated by spaces. Examples:
- "Exothermic reaction" -- 2 segments
- "Human immunodeficiency virus type 1" -- 5 segments
- "The effect of temperature variations on enzyme activity in biochemical reactions" -- 11 segments, exceeds limit
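The two length rules can be expressed as a short local check. This is a sketch under one assumption: a "character" is counted as a Unicode code point (Python's len), which agrees with the counts given in the examples above. The function name hotword_length_ok is hypothetical.

```python
MAX_CHARS_NON_ASCII = 15  # limit when the text contains any non-ASCII character
MAX_SEGMENTS_ASCII = 7    # limit on space-separated segments for ASCII-only text


def hotword_length_ok(text: str) -> bool:
    """Check a hotword against the length rules described above."""
    if text.isascii():
        # ASCII-only text: count the segments separated by spaces.
        return len(text.split()) <= MAX_SEGMENTS_ASCII
    # Any non-ASCII character present: count every character, spaces included.
    return len(text) <= MAX_CHARS_NON_ASCII


for sample in ["EGFR抑制剂", "Клофелин Белмедпрепараты", "Exothermic reaction"]:
    print(sample, hotword_length_ok(sample))
```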

Tune hotword performance

Adjust hotword weights

Weight controls how strongly the model favors a hotword. Set it high enough to improve recognition of the target term, but not so high that similar-sounding words are misrecognized as the hotword.

Weight	Effect	Best for
1-2	Slight preference	Hotwords that sound similar to common words, where overcorrection must be avoided
3-4	Clear preference (recommended)	The best starting point for most scenarios
5	Forced preference	Use only when the term appears frequently in the audio and is unlikely to be confused with other words. An excessively high weight can cause phonetically similar words to be misrecognized as the hotword.

Start with weight=4 and adjust incrementally based on recognition results.
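One way to make the incremental adjustment concrete is a small rule that moves the weight one step at a time. This is only an illustrative sketch: adjust_weight is a hypothetical helper, and the two flags are assumed to come from your own review of recognition results, not from any API.

```python
def adjust_weight(current: int, target_missed: bool, false_hits: bool) -> int:
    """Suggest the next weight to try, staying within the 1-5 range."""
    if false_hits:
        # Similar-sounding words are being recognized as the hotword: back off.
        return max(1, current - 1)
    if target_missed:
        # The hotword is still not recognized: raise the weight one step.
        return min(5, current + 1)
    return current


print(adjust_weight(4, target_missed=True, false_hits=False))
print(adjust_weight(5, target_missed=False, false_hits=True))
```

False recognitions take priority in this sketch, on the reasoning that an excessive weight harms other words as well as the hotword.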

Design hotword lists

Group by scenario: Create separate vocabulary lists for different business scenarios (for example, one for medical terms and another for product names) to simplify maintenance and reuse.
Mix multiple languages: A single vocabulary list can contain terms in different languages. Use the lang field to distinguish them. When language_hints is specified during speech recognition, only hotwords that match the specified language take effect.
Clean up regularly: Delete unused vocabulary lists to free up quota. Each account supports up to 10 lists.
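To preview which entries of a mixed-language list would take effect for a given language_hints value, a local filter can help. This sketch rests on an assumption the text does not settle: entries without a lang value are treated as not matching. The actual service behavior for such entries may differ. hotwords_in_effect is a hypothetical name.

```python
def hotwords_in_effect(vocabulary, language_hints=None):
    """Return the hotword entries expected to apply for the given language_hints."""
    if not language_hints:
        # No language_hints set: every hotword in the list is a candidate.
        return list(vocabulary)
    hints = set(language_hints)
    # Assumption: an entry with no "lang" field is treated as non-matching.
    return [word for word in vocabulary if word.get("lang") in hints]


vocabulary = [
    {"text": "赛德克巴莱", "weight": 4, "lang": "zh"},
    {"text": "Seediq Bale", "weight": 4, "lang": "en"},
]
print(hotwords_in_effect(vocabulary, ["en"]))
```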

Limits and billing

Limit	Description
Number of vocabulary lists	10 per account, shared across all models.
Hotwords per list	Up to 500 hotwords per vocabulary list.
Billing	Free of charge.
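When a term collection is large, it can be split ahead of time so that each list stays within the limits above. The sketch below uses only the two limits from the table. plan_vocabulary_lists and the existing_lists parameter are hypothetical names, and the number of lists already in the account has to be supplied by the caller.

```python
MAX_LISTS_PER_ACCOUNT = 10   # from the limits table
MAX_HOTWORDS_PER_LIST = 500  # from the limits table


def plan_vocabulary_lists(hotwords, existing_lists=0):
    """Split hotwords into lists of at most 500 and check the 10-list quota."""
    chunks = [
        hotwords[i:i + MAX_HOTWORDS_PER_LIST]
        for i in range(0, len(hotwords), MAX_HOTWORDS_PER_LIST)
    ]
    if existing_lists + len(chunks) > MAX_LISTS_PER_ACCOUNT:
        raise ValueError(
            f"{len(chunks)} new lists plus {existing_lists} existing lists "
            f"would exceed the limit of {MAX_LISTS_PER_ACCOUNT}"
        )
    return chunks


# Placeholder terms, used only to show the chunk sizes.
terms = [{"text": f"term {n}", "weight": 4} for n in range(1200)]
print([len(chunk) for chunk in plan_vocabulary_lists(terms)])
```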

API reference: Custom Hotword API Reference

Context enhancement

Supported scope

Context enhancement is supported by:

Non-real-time speech recognition: fun-asr-flash-2026-06-15

Quick start

Use case: Best suited for scenarios that combine ASR with large language models. Pass the preceding conversation context (model responses and user speech recognition results) to the ASR model. This significantly improves transcription accuracy for specialized terms such as names, locations, and product terminology, and is more flexible than traditional hotwords.

Usage: Pass conversation history through input.messages. Use the assistant role for prior model responses and the user role with the input_text type for prior speech recognition results. Context pairs must appear before the current audio message. For details, see DashScope (Fun-ASR).

Request body structure example:

{
  "model": "fun-asr-flash-2026-06-15",
  "input": {
    "messages": [
      {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "Prior model response content"
          }
        ]
      },
      {
        "role": "user",
        "content": [
          {
            "type": "input_text",
            "text": "Prior user speech recognition result"
          }
        ]
      },
      {
        "role": "user",
        "content": [
          {
            "type": "input_audio",
            "input_audio": {
              "data": "Audio URL or Base64 of the current audio to recognize"
            }
          }
        ]
      }
    ]
  },
  "parameters": {}
}
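The request body above can be assembled from a conversation history in a few lines. This is a minimal sketch that only builds the JSON structure shown. It does not send the request, and build_context_request, its parameters, and the example URL are placeholders, not part of any SDK.

```python
import json


def build_context_request(model, history, audio):
    """Build the request body.

    history: list of (assistant_text, user_transcript) pairs, oldest first.
    audio:   audio URL or Base64 string of the current audio to recognize.
    """
    messages = []
    for assistant_text, user_transcript in history:
        messages.append({
            "role": "assistant",
            "content": [{"type": "text", "text": assistant_text}],
        })
        messages.append({
            "role": "user",
            "content": [{"type": "input_text", "text": user_transcript}],
        })
    # The current audio message goes last, after all context pairs.
    messages.append({
        "role": "user",
        "content": [{"type": "input_audio", "input_audio": {"data": audio}}],
    })
    return {"model": model, "input": {"messages": messages}, "parameters": {}}


body = build_context_request(
    "fun-asr-flash-2026-06-15",
    [("Prior model response content", "Prior user speech recognition result")],
    "https://example.com/asr_example.wav",  # placeholder URL
)
print(json.dumps(body, indent=2))
```

Keeping the audio message as the final element preserves the ordering requirement that context pairs appear before the current audio.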

Effect example

The content of the text field is flexible: it can be a word list, a natural-language paragraph, or a mix of both. The model tolerates unrelated text in this field well.

In this example, the audio clip should be recognized as: "The jargon within investment banking circles, how much do you know? First, the nine major foreign investment banks, Bulge Bracket, BB ..."

Without context enhancement	With context enhancement
Without context enhancement, some investment bank names are recognized incorrectly. For example, "Bird Rock" should be "Bulge Bracket". Recognition result: "...the nine major foreign investment banks, Bird Rock, BB ..."	With context enhancement, investment bank names are recognized correctly. Recognition result: "...the nine major foreign investment banks, Bulge Bracket, BB ..."

In the example above, adding a word list or natural language paragraph containing terms like "Bulge Bracket" to the text field achieves the enhancement effect.

FAQ

Why don't hotwords improve recognition accuracy?

Check the following in order:

Model mismatch: The target_model specified when creating the list must match the model used by the speech recognition API. A mismatch doesn't cause an error, and recognition still returns results, but the hotwords don't take effect. If the results don't contain expected hotwords, check this first.
Unsupported model: The model must belong to the Fun-ASR or Paraformer family. Other families, such as SenseVoice, don't support hotwords. Calling the API with an unsupported model doesn't return an error, but the results may be empty or lack hotword enhancement.
Inappropriate weight: Increase the weight from 4 to 5 and observe the results. If phonetically similar words start being misrecognized as the hotword, reduce it back to 4.
Hotword list status: Use the Query API to confirm that status is OK.
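The model and status checks in this list can be automated against the information returned by the Query API. The sketch below makes several assumptions: the query result is a dict with a status key (as in the quick start) and a target_model key (not confirmed here), the model family can be inferred from the model name prefix, and diagnose_hotwords is a hypothetical helper. The weight check is left out because it requires reviewing recognition results.

```python
# Assumption: the model family is inferred from the model name prefix.
SUPPORTED_PREFIXES = ("fun-asr", "paraformer")


def diagnose_hotwords(vocab_info: dict, request_model: str) -> list[str]:
    """Return findings in the order of the checklist above (empty if none)."""
    findings = []
    # Assumption: the query result exposes the list's model as "target_model".
    if vocab_info.get("target_model") != request_model:
        findings.append("model mismatch: target_model differs from the request model")
    if not request_model.startswith(SUPPORTED_PREFIXES):
        findings.append("unsupported model: not in the Fun-ASR or Paraformer family")
    if vocab_info.get("status") != "OK":
        findings.append("hotword list status is not OK")
    return findings


info = {"target_model": "fun-asr", "status": "OK"}
print(diagnose_hotwords(info, "fun-asr-realtime"))
```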

Are hotwords used differently in real-time and file-based recognition?

Hotword lists are created the same way. The calling method differs:

Real-time speech recognition: Pass vocabulary_id in the Recognition or WebSocket connection parameters.
File-based speech recognition: Pass vocabulary_id in the Transcription request parameters.

In both cases, target_model must match the speech recognition model used in the API call.

How can I improve recognition accuracy beyond hotwords?

In addition to hotwords and context enhancement, consider the following:

Audio quality: Match the sample rate to the model requirements (16 kHz or 8 kHz) and reduce background noise.
Choose the right model: Different scenarios call for different models. For details, see the Speech-to-text model selection guide.
Specify the language: Declare the audio language through language_hints to improve accuracy in single-language scenarios.
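For WAV input, the sample rate can be read with Python's standard wave module before a file is sent for recognition. This is a small sketch with hypothetical helper names. It covers only uncompressed WAV files, and the expected rate should come from the requirements of the model you call.

```python
import wave


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def sample_rate_matches(path: str, expected: int = 16000) -> bool:
    """Check whether the file's sample rate equals the rate the model expects."""
    return wav_sample_rate(path) == expected
```

For example, sample_rate_matches('asr_example.wav', 16000) would confirm that the file matches the sample_rate value passed in the recognition request.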
