Speech recognition accuracy depends on three things: audio quality, vocabulary customization, and post-processing. Poor audio is the most common cause of errors. Vocabulary customization handles domain-specific terms. Post-processing catches what the other two miss.
This guide covers all three.
Audio quality
The best ASR model in the world cannot transcribe audio it cannot hear clearly. Audio quality is the single biggest factor in recognition accuracy -- and the cheapest to fix.
Recording environment
| Factor | Recommendation |
|---|---|
| Background noise | Record in a quiet room. Avoid air conditioning hum, keyboard clicks, and conversations nearby. If noise is unavoidable, use a directional microphone. |
| Echo and reverb | Hard walls, glass, and large empty rooms cause echo. Add soft furnishings, use smaller rooms, or use a close-talking microphone. |
| Speaker distance | Keep 10-30 cm from the microphone. Too close causes plosive distortion ("p" and "b" sounds). Too far picks up room noise. |
| Speaking style | Speak at a natural pace. Avoid mumbling, trailing off, or speaking over others. For multi-speaker scenarios, leave clear pauses between speakers. |
Microphone selection
Different scenarios call for different microphones. The right choice eliminates noise at the source.
| Scenario | Recommended mic | Why |
|---|---|---|
| Call center / dictation | Headset with boom mic (e.g., Jabra Evolve2, Plantronics) | Close-talking design rejects background noise. Consistent mouth-to-mic distance. Best single-speaker accuracy. |
| Interviews / presentations | Lavalier (lapel) mic (e.g., Rode Wireless GO, DJI Mic) | Clips to clothing, hands-free. Good for one speaker moving around. |
| Meeting rooms | Conference array mic (e.g., Jabra Speak, Poly Sync, XMOS array) | Multiple microphones with beamforming. Picks up speakers from all directions while reducing echo. |
| Mobile apps | Built-in device mic | Acceptable for casual use. For accuracy-critical mobile apps, prompt users to use earbuds with a built-in mic. |
| Telephony (8 kHz) | N/A (carrier audio) | Use 8 kHz models. Audio quality is fixed by the phone network. |
If you control the recording hardware, invest in a USB headset with noise cancellation. A $50 headset improves accuracy more than any software optimization.
Audio format
| Parameter | Recommendation |
|---|---|
| Sample rate | 16 kHz or higher. Use 8 kHz only for telephony recordings. Higher sample rates (44.1 kHz, 48 kHz) are downsampled automatically -- they work fine but cost more bandwidth. |
| Channels | Mono. Stereo is accepted but the second channel is ignored. |
| Format | PCM or WAV (lossless). These preserve the full signal. |
| Compression | Avoid lossy codecs (MP3, AAC, OGG) for accuracy-critical paths. If you must compress, use OPUS at 32 kbps or higher. |
| Bit depth | 16-bit. |
Do not apply noise reduction, AGC (automatic gain control), or audio enhancement filters before sending audio to ASR. These filters can remove speech signal along with noise, degrading accuracy. Send the raw audio and let the ASR model handle noise internally.
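These parameters can be verified programmatically before upload. A minimal pre-flight check using Python's standard-library wave module (the file name and helper are illustrative, not part of any SDK):

```python
import wave

def audio_issues(path):
    """Return a list of parameter problems for a WAV file, based on
    the recommendations above (16 kHz or higher, mono, 16-bit PCM)."""
    issues = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() < 16000:
            issues.append(f"sample rate {wf.getframerate()} Hz is below 16 kHz")
        if wf.getnchannels() != 1:
            issues.append(f"{wf.getnchannels()} channels; mono is preferred")
        if wf.getsampwidth() != 2:
            issues.append(f"{wf.getsampwidth() * 8}-bit samples; 16-bit is recommended")
    return issues

# Demo: write one second of 8 kHz stereo silence, then check it
with wave.open("check_me.wav", "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(8000)
    wf.writeframes(b"\x00\x00" * 2 * 8000)

for problem in audio_issues("check_me.wav"):
    print("WARNING:", problem)
```

Running this flags the demo file twice: once for the 8 kHz sample rate and once for the stereo channel count.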
Vocabulary customization
When your ASR output consistently misrecognizes specific terms -- brand names, product names, technical jargon, proper nouns -- vocabulary customization fixes it. The technique depends on which model family you use.
| Technique | Models | How it works |
|---|---|---|
| Prompt context | Qwen-Omni | Describe your domain in the system prompt. The model is an LLM that understands audio, so it adapts on every request. Most flexible, but higher per-request latency than dedicated ASR. |
| Context enhancement | Qwen3-ASR | Pass domain context as a system message. The model adapts dynamically. |
| Hotwords | Fun-ASR | Create a vocabulary list with weighted terms. Pass the list ID in the ASR call. |
Prompt context (Qwen-Omni)
Qwen-Omni is not traditional ASR -- it is an LLM that understands audio. Describe your domain in the system prompt and the model adapts without any pre-configuration. This is the most flexible approach but has higher per-request latency than dedicated ASR models.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-flash",
    messages=[
        {
            "role": "system",
            "content": "You are transcribing investment banking discussions. "
                       "Key terms: Bulge Bracket, Boutique, Middle Market, LBO, DCF."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://example.com/meeting.mp3",
                        "format": "mp3"
                    }
                },
                {"type": "text", "text": "Transcribe this audio."}
            ]
        }
    ],
    stream=True,
    modalities=["text"],
)

for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

const completion = await openai.chat.completions.create({
  model: "qwen3-omni-flash",
  messages: [
    {
      role: "system",
      content: "You are transcribing investment banking discussions. "
        + "Key terms: Bulge Bracket, Boutique, Middle Market, LBO, DCF."
    },
    {
      role: "user",
      content: [
        {
          type: "input_audio",
          input_audio: {
            data: "https://example.com/meeting.mp3",
            format: "mp3"
          }
        },
        { type: "text", text: "Transcribe this audio." }
      ]
    }
  ],
  stream: true,
  modalities: ["text"],
});

for await (const chunk of completion) {
  if (chunk.choices?.[0]?.delta?.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-omni-flash",
    "messages": [
      {
        "role": "system",
        "content": "You are transcribing investment banking discussions. Key terms: Bulge Bracket, Boutique, Middle Market, LBO, DCF."
      },
      {
        "role": "user",
        "content": [
          {
            "type": "input_audio",
            "input_audio": {
              "data": "https://example.com/meeting.mp3",
              "format": "mp3"
            }
          },
          {
            "type": "text",
            "text": "Transcribe this audio."
          }
        ]
      }
    ],
    "stream": true,
    "modalities": ["text"]
  }'
Use prompt context when your domain terminology changes frequently or when you need the model to understand intent beyond raw transcription. For stable term lists with predictable audio, hotwords or context enhancement offer lower latency.
Context enhancement (Qwen3-ASR)
Pass domain-specific context in the system message. The model uses this context to improve recognition of specialized terms -- no vocabulary lists, no weight tuning. It accepts word lists, paragraphs, or a mix of both.
Before and after:
| Audio content | Without context | With context |
|---|---|---|
| Investment banking terminology | "...the top banks, Bird Rock, BB..." | "...the top banks, Bulge Bracket, BB..." |
The context used: "Bulge Bracket, Boutique, Middle Market"
Supported context formats -- all are equally effective:
- Word list: "Bulge Bracket, Boutique, Middle Market" (terms separated by commas or spaces, or supplied as a JSON array)
- Paragraph: a natural-language description of the domain (e.g., a paragraph about investment banking terminology)
- Mixed: a word list and a paragraph combined in the same context
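These formats can also be assembled programmatically. A small sketch (build_context is a hypothetical helper, not part of the SDK) that produces the word-list, paragraph, or mixed format for the system message:

```python
def build_context(terms=None, description=None):
    """Assemble a context string for the Qwen3-ASR system message from
    a word list, a domain paragraph, or both (the mixed format)."""
    parts = []
    if terms:
        parts.append(", ".join(terms))
    if description:
        parts.append(description)
    return "\n".join(parts)

# Word-list format
print(build_context(terms=["Bulge Bracket", "Boutique", "Middle Market"]))
# Mixed format: word list plus a domain paragraph
print(build_context(
    terms=["Bulge Bracket", "Boutique"],
    description="A discussion of investment banking tiers and deal structures."
))
```

The returned string goes directly into the system message's text field, as in the examples below.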
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "system",
        "content": [{"text": "Bulge Bracket, Boutique, Middle Market"}]
    },
    {
        "role": "user",
        "content": [{"audio": "https://example.com/investment-banking-audio.mp3"}]
    }
]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message"
)
print(response.output.choices[0].message.content[0]["text"])
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import java.util.*;

public class ContextEnhancementExample {
    public static void main(String[] args) throws Exception {
        MultiModalConversation conv = new MultiModalConversation();

        // System message with domain context
        MultiModalMessage sysMessage = MultiModalMessage.builder()
                .role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text", "Bulge Bracket, Boutique, Middle Market")
                ))
                .build();

        // User message with audio
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio",
                                "https://example.com/investment-banking-audio.mp3")
                ))
                .build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-asr-flash")
                .message(sysMessage)
                .message(userMessage)
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0)
                .getMessage().getContent().get(0).get("text"));
    }
}
Context is limited to 10,000 tokens. The model is robust to irrelevant text in the context -- including unrelated names or descriptions -- with almost no negative impact on accuracy.
Hotwords (Fun-ASR)
Create a vocabulary list of terms you want the model to prioritize, then pass the list ID when calling ASR. This is a two-step workflow:
1. Create a hotword list via the Hotwords API, specifying target_model to declare which ASR model will use the list.
2. Pass the list ID in the ASR call. The model must match the target_model from step 1.
import os
import dashscope
from dashscope.audio.asr import *

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

target_model = "fun-asr-realtime"

# Step 1: Create hotword list
service = VocabularyService()
vocabulary_id = service.create_vocabulary(
    prefix='my-app',
    target_model=target_model,
    vocabulary=[
        {"text": "Bulge Bracket", "weight": 4},
        {"text": "Middle Market", "weight": 4}
    ]
)

# Step 2: Use hotword list for recognition
if service.query_vocabulary(vocabulary_id)['status'] == 'OK':
    recognition = Recognition(
        model=target_model,
        format='wav',
        sample_rate=16000,
        callback=None,
        vocabulary_id=vocabulary_id
    )
    result = recognition.call('audio.wav')
    print(result.output)

# Clean up when no longer needed
service.delete_vocabulary(vocabulary_id)
import com.alibaba.dashscope.audio.asr.recognition.*;
import com.alibaba.dashscope.audio.asr.vocabulary.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.*;
import java.io.File;

public class HotwordExample {
    public static void main(String[] args) throws Exception {
        String apiKey = System.getenv("DASHSCOPE_API_KEY");
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        String targetModel = "fun-asr-realtime";

        // Step 1: Create hotword list
        JsonArray vocab = new JsonArray();
        JsonObject word1 = new JsonObject();
        word1.addProperty("text", "Bulge Bracket");
        word1.addProperty("weight", 4);
        vocab.add(word1);

        VocabularyService service = new VocabularyService(apiKey);
        Vocabulary vocabulary = service.createVocabulary(targetModel, "my-app", vocab);

        // Step 2: Use hotword list for recognition
        if ("OK".equals(service.queryVocabulary(vocabulary.getVocabularyId()).getStatus())) {
            Recognition recognizer = new Recognition();
            RecognitionParam param = RecognitionParam.builder()
                    .model(targetModel)
                    .apiKey(apiKey)
                    .format("wav")
                    .sampleRate(16000)
                    // Attach the hotword list created in step 1
                    .vocabularyId(vocabulary.getVocabularyId())
                    .build();
            try {
                System.out.println(recognizer.call(param, new File("audio.wav")));
            } finally {
                recognizer.getDuplexApi().close(1000, "bye");
            }
        }
        service.deleteVocabulary(vocabulary.getVocabularyId());
        System.exit(0);
    }
}
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The hotword. Must be a real word or phrase. See length limits below. |
| weight | int | Yes | Priority weight. Range: 1-5. Default: 4. |
| lang | string | No | Language code (e.g., en, zh). Omit to auto-detect. |
Text length limits:
- Contains non-ASCII characters: Total character count must not exceed 15.
- Pure ASCII: Space-separated word count must not exceed 7.
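These limits can be checked client-side before creating the list. A sketch of such a check (validate_hotword is a hypothetical helper, not an SDK function):

```python
def validate_hotword(entry):
    """Check one hotword dict against the documented limits:
    non-ASCII text <= 15 characters, pure-ASCII text <= 7 words,
    weight in the range 1-5. Returns a list of violations."""
    problems = []
    text = entry.get("text", "")
    weight = entry.get("weight", 4)  # 4 is the documented default
    if text.isascii():
        if len(text.split()) > 7:
            problems.append("ASCII hotword exceeds 7 space-separated words")
    else:
        if len(text) > 15:
            problems.append("non-ASCII hotword exceeds 15 characters")
    if not 1 <= weight <= 5:
        problems.append("weight must be between 1 and 5")
    return problems

print(validate_hotword({"text": "Bulge Bracket", "weight": 4}))  # []
print(validate_hotword({"text": "one two three four five six seven eight", "weight": 6}))
```

Running the check on each entry before the create call avoids a round trip to the API for malformed lists.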
Weight tuning
- Start at weight 4. This works for most terms.
- Increase to 5 only if a term is still missed after testing. Higher weight can cause the model to hallucinate the hotword where it doesn't belong.
- Don't set all hotwords to weight 5. When everything is high priority, nothing is. Differentiate by actual importance.
The target_model specified when creating the hotword list must match the model used in the ASR call. A mismatch causes the hotwords to have no effect.
For full API details on creating, querying, updating, and deleting hotword vocabularies, see the Hotwords API reference.
Post-processing correction
Vocabulary customization handles known terms. Post-processing catches everything else -- grammar errors, context-dependent words, and ambiguous homophones.
Dual-track pattern
For accuracy-critical applications (medical transcription, legal proceedings, financial reports), combine real-time ASR with LLM post-correction:
     Audio Stream
          |
          v
+-------------------+     +-------------------+
|   Realtime ASR    | --> |  LLM Post-Correct |
|   (low latency)   |     |  (high accuracy)  |
+-------------------+     +-------------------+
          |                         |
          v                         v
 Display immediately      Replace with corrected
 (draft transcript)       text (final transcript)
- Fast track: Stream audio to a realtime ASR endpoint (qwen3-asr-flash-realtime or fun-asr-realtime). Display intermediate results immediately.
- Correction track: Collect sentence-level ASR output. Send it to an LLM (e.g., qwen-plus) with a domain-aware correction prompt:
Fix any speech recognition errors in the following transcript.
Preserve the original meaning. Domain: [medical/legal/financial].
Transcript: "{asr_output}"
- Replace the displayed text with the corrected version once the LLM responds (typically 1-2 seconds).
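The correction prompt on the correction track can be filled in with a small helper (build_correction_prompt is a hypothetical name; the template is the one shown above):

```python
def build_correction_prompt(asr_output: str, domain: str) -> str:
    """Fill the domain-aware correction template with the ASR draft."""
    return (
        "Fix any speech recognition errors in the following transcript. "
        f"Preserve the original meaning. Domain: {domain}. "
        f'Transcript: "{asr_output}"'
    )

# Draft from the fast track, sent to the LLM on the correction track
prompt = build_correction_prompt("the top banks, Bird Rock, BB", "financial")
print(prompt)
```

The returned string becomes the user message in a standard chat-completion call to the correcting LLM; the LLM's reply replaces the draft transcript in the UI.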
This pattern is especially valuable when:
- Domain-specific terminology is frequently misrecognized even with vocabulary customization.
- Accuracy requirements exceed 98%.
- A 1-2 second correction delay is acceptable for the user experience.
Measuring accuracy
- WER (Word Error Rate): Measure before and after each optimization. Compare a representative set of audio samples against human-verified transcripts.
- A/B test: Compare vocabulary customization alone vs. vocabulary customization + dual-track correction.
- Per-term tracking: Monitor accuracy for each term in your hotword list or context. Some terms may need weight adjustment or additional context.
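WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal reference implementation for measuring it against human-verified transcripts:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level edit distance (dynamic programming)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # d[i][j] = edit distance between first i reference words
    # and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions over four reference words:
print(wer("the bulge bracket banks", "the bird rock banks"))  # 0.5
```

For production evaluation, normalize case and punctuation consistently on both sides before scoring, and aggregate over a representative sample rather than a single file.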
High-concurrency ASR
When running ASR at scale, use object pooling to manage Recognition instances. This avoids the overhead of creating a new WebSocket connection per request.
Java: Object pool for recognition
Use an Apache Commons Pool2 object pool to manage Recognition instances. Requires Java SDK >= 2.16.9.
Environment variables:
| Variable | Default | Recommendation |
|---|---|---|
| DASHSCOPE_CONNECTION_POOL_SIZE | 32 | 2x peak concurrency |
| DASHSCOPE_MAXIMUM_ASYNC_REQUESTS | 32 | Match connection pool size |
| DASHSCOPE_MAXIMUM_ASYNC_REQUESTS_PER_HOST | 32 | Match connection pool size |
| RECOGNITION_OBJECTPOOL_SIZE | 500 | 1.5x-2x peak concurrency |
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

class RecognitionFactory extends BasePooledObjectFactory<Recognition> {
    @Override
    public Recognition create() {
        return new Recognition();
    }

    @Override
    public PooledObject<Recognition> wrap(Recognition obj) {
        return new DefaultPooledObject<>(obj);
    }
}

// Create a global singleton pool
GenericObjectPoolConfig<Recognition> config = new GenericObjectPoolConfig<>();
int poolSize = 500; // Or read from RECOGNITION_OBJECTPOOL_SIZE
config.setMaxTotal(poolSize);
config.setMaxIdle(poolSize);
config.setMinIdle(poolSize);
GenericObjectPool<Recognition> recognitionPool =
        new GenericObjectPool<>(new RecognitionFactory(), config);

// In each task thread:
Recognition recognizer = recognitionPool.borrowObject();
try {
    // ... run recognition task
    recognitionPool.returnObject(recognizer);
} catch (Exception e) {
    // Do not return failed objects to the pool
}
The object pool size must be less than or equal to the connection pool size. Otherwise, threads will block while waiting for available connections.
Production checklist
Supported models
| Technique | Supported models |
|---|---|
| Prompt context | qwen3-omni-flash, qwen3-omni-flash-realtime |
| Context enhancement | qwen3-asr-flash, qwen3-asr-flash-realtime, qwen3-asr-plus, qwen3-asr-plus-realtime |
| Hotwords | fun-asr-realtime, fun-asr (and dated variants) |
For the full model list, see Speech-to-text models.
Pricing
- Context enhancement: No additional cost. Standard Qwen3-ASR per-token pricing applies.
- Hotwords: Free. No charge for creating or using hotword vocabularies.
Limits
- Context enhancement: Maximum 10,000 tokens per system message.
- Hotwords: Up to 10 hotword lists per account (shared across models). Up to 500 hotwords per list. Request an increase if needed.