Accuracy tuning

Speech recognition best practices

Optimize ASR quality through audio best practices, vocabulary customization, and post-processing correction.

Speech recognition accuracy depends on three things: audio quality, vocabulary customization, and post-processing. Poor audio is the most common cause of errors. Vocabulary customization handles domain-specific terms. Post-processing catches what the other two miss. This guide covers all three.

Audio quality

The best ASR model in the world cannot transcribe audio it cannot hear clearly. Audio quality is the single biggest factor in recognition accuracy -- and the cheapest to fix.

Recording environment

| Factor | Recommendation |
| --- | --- |
| Background noise | Record in a quiet room. Avoid air conditioning hum, keyboard clicks, and nearby conversations. If noise is unavoidable, use a directional microphone. |
| Echo and reverb | Hard walls, glass, and large empty rooms cause echo. Add soft furnishings, use smaller rooms, or use a close-talking microphone. |
| Speaker distance | Keep 10-30 cm from the microphone. Too close causes plosive distortion ("p" and "b" sounds). Too far picks up room noise. |
| Speaking style | Speak at a natural pace. Avoid mumbling, trailing off, or speaking over others. For multi-speaker scenarios, leave clear pauses between speakers. |

Microphone selection

Different scenarios call for different microphones. The right choice eliminates noise at the source.
| Scenario | Recommended mic | Why |
| --- | --- | --- |
| Call center / dictation | Headset with boom mic (e.g., Jabra Evolve2, Plantronics) | Close-talking design rejects background noise. Consistent mouth-to-mic distance. Best single-speaker accuracy. |
| Interviews / presentations | Lavalier (lapel) mic (e.g., Rode Wireless GO, DJI Mic) | Clips to clothing, hands-free. Good for one speaker moving around. |
| Meeting rooms | Conference array mic (e.g., Jabra Speak, Poly Sync, XMOS array) | Multiple microphones with beamforming. Picks up speakers from all directions while reducing echo. |
| Mobile apps | Built-in device mic | Acceptable for casual use. For accuracy-critical mobile apps, prompt users to use earbuds with a built-in mic. |
| Telephony (8 kHz) | N/A (carrier audio) | Use 8 kHz models. Audio quality is fixed by the phone network. |
If you control the recording hardware, invest in a USB headset with noise cancellation. A $50 headset improves accuracy more than any software optimization.

Audio format

| Parameter | Recommendation |
| --- | --- |
| Sample rate | 16 kHz or higher. Use 8 kHz only for telephony recordings. Higher sample rates (44.1 kHz, 48 kHz) are downsampled automatically -- they work fine but cost more bandwidth. |
| Channels | Mono. Stereo is accepted but the second channel is ignored. |
| Format | PCM or WAV (lossless). These preserve the full signal. |
| Compression | Avoid lossy codecs (MP3, AAC, OGG) for accuracy-critical paths. If you must compress, use OPUS at 32 kbps or higher. |
| Bit depth | 16-bit. |
Do not apply noise reduction, AGC (automatic gain control), or audio enhancement filters before sending audio to ASR. These filters can remove speech signal along with noise, degrading accuracy. Send the raw audio and let the ASR model handle noise internally.
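As a pre-flight check, the format recommendations above can be verified with Python's standard library before audio is uploaded. A minimal sketch (`check_wav_format` is an illustrative helper, not part of any SDK):

```python
import wave

def check_wav_format(path):
    """Return warnings for WAV parameters that deviate from the
    recommended ASR settings: mono, 16-bit, 16 kHz or higher."""
    warnings = []
    with wave.open(path, "rb") as f:
        if f.getnchannels() != 1:
            warnings.append("use mono; extra channels are ignored")
        if f.getsampwidth() != 2:  # sample width in bytes; 2 == 16-bit
            warnings.append("use 16-bit samples")
        if f.getframerate() < 16000:
            warnings.append("sample rate below 16 kHz (8 kHz is for telephony only)")
    return warnings
```

An empty list means the file already matches the recommended parameters; otherwise convert the audio before sending it.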

Vocabulary customization

When your ASR output consistently misrecognizes specific terms -- brand names, product names, technical jargon, proper nouns -- vocabulary customization fixes it. The technique depends on which model family you use.
| Technique | Models | How it works |
| --- | --- | --- |
| Prompt context | Qwen-Omni | Describe your domain in the system prompt. The model is an LLM that understands audio, so it adapts on every request. Most flexible, but higher per-request latency than dedicated ASR. |
| Context enhancement | Qwen3-ASR | Pass domain context as a system message. The model adapts dynamically. |
| Hotwords | Fun-ASR | Create a vocabulary list with weighted terms. Pass the list ID in the ASR call. |

Prompt context (Qwen-Omni)

Qwen-Omni is not traditional ASR -- it is an LLM that understands audio. Describe your domain in the system prompt and the model adapts without any pre-configuration. This is the most flexible approach but has higher per-request latency than dedicated ASR models.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3-omni-flash",
  messages=[
    {
      "role": "system",
      "content": "You are transcribing investment banking discussions. "
           "Key terms: Bulge Bracket, Boutique, Middle Market, LBO, DCF."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://example.com/meeting.mp3",
            "format": "mp3"
          }
        },
        {"type": "text", "text": "Transcribe this audio."}
      ]
    }
  ],
  stream=True,
  modalities=["text"],
)

for chunk in completion:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")
Use prompt context when your domain terminology changes frequently or when you need the model to understand intent beyond raw transcription. For stable term lists with predictable audio, hotwords or context enhancement offer lower latency.

Context enhancement (Qwen3-ASR)

Pass domain-specific context in the system message. The model uses this context to improve recognition of specialized terms -- no vocabulary lists, no weight tuning. It accepts word lists, paragraphs, or a mix of both. Before and after:
| Audio content | Without context | With context |
| --- | --- | --- |
| Investment banking terminology | "...the top banks, Bird Rock, BB..." | "...the top banks, Bulge Bracket, BB..." |

The context used: "Bulge Bracket, Boutique, Middle Market"

Supported context formats -- all are equally effective:
  • Word list: "Bulge Bracket, Boutique, Middle Market" (separated by commas or spaces, or given as a JSON array)
  • Paragraph: A natural-language description of the domain (e.g., a paragraph about investment banking terminology)
  • Mixed: Word list + paragraph combined in the same context
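Because the formats are interchangeable, a context string can be assembled from whatever domain data is at hand. A minimal sketch (the helper name and example terms are illustrative):

```python
def build_context(terms, description=None):
    """Compose a context string: a comma-separated word list,
    optionally followed by a natural-language paragraph
    (the "mixed" format)."""
    context = ", ".join(terms)
    if description:
        context = context + ". " + description
    return context

context = build_context(
    ["Bulge Bracket", "Boutique", "Middle Market"],
    "Terminology from investment banking discussions.",
)
```

The resulting string is passed as the system message text in the Qwen3-ASR call.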
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
  {
    "role": "system",
    "content": [{"text": "Bulge Bracket, Boutique, Middle Market"}]
  },
  {
    "role": "user",
    "content": [{"audio": "https://example.com/investment-banking-audio.mp3"}]
  }
]

response = dashscope.MultiModalConversation.call(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  model="qwen3-asr-flash",
  messages=messages,
  result_format="message"
)
print(response.output.choices[0].message.content[0]["text"])
Context is limited to 10,000 tokens. The model is robust to irrelevant text in the context -- including unrelated names or descriptions -- with almost no negative impact on accuracy.

Hotwords (Fun-ASR)

Create a vocabulary list of terms you want the model to prioritize, then pass the list ID when calling ASR. This follows a two-step workflow:
  1. Create a hotword list via the Hotwords API. Specify target_model to declare which ASR model will use this list.
  2. Pass the list ID in the ASR call. The model must match the target_model from step 1.
import dashscope
from dashscope.audio.asr import *
import os

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

target_model = "fun-asr-realtime"

# Step 1: Create hotword list
service = VocabularyService()
vocabulary_id = service.create_vocabulary(
  prefix='my-app',
  target_model=target_model,
  vocabulary=[
    {"text": "Bulge Bracket", "weight": 4},
    {"text": "Middle Market", "weight": 4}
  ]
)

# Step 2: Use hotword list for recognition
if service.query_vocabulary(vocabulary_id)['status'] == 'OK':
  recognition = Recognition(
    model=target_model,
    format='wav',
    sample_rate=16000,
    callback=None,
    vocabulary_id=vocabulary_id
  )
  result = recognition.call('audio.wav')
  print(result.output)

# Clean up when no longer needed
service.delete_vocabulary(vocabulary_id)

Hotword format

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | The hotword. Must be a real word or phrase. See length limits below. |
| weight | int | Yes | Priority weight. Range: 1-5. Default: 4. |
| lang | string | No | Language code (e.g., en, zh). Omit to auto-detect. |
Text length limits:
  • Contains non-ASCII characters: Total character count must not exceed 15.
  • Pure ASCII: Space-separated word count must not exceed 7.
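A client-side check of these limits avoids rejected list entries. A minimal sketch (the function name is illustrative):

```python
def is_valid_hotword(text):
    """Apply the documented length limits: pure-ASCII hotwords may
    have at most 7 space-separated words; hotwords containing
    non-ASCII characters may have at most 15 characters total."""
    if text.isascii():
        return len(text.split()) <= 7
    return len(text) <= 15
```

Run this over a candidate vocabulary before calling the Hotwords API, and shorten or split any entry that fails.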

Weight tuning

  • Start at weight 4. This works for most terms.
  • Increase to 5 only if a term is still missed after testing. Higher weight can cause the model to hallucinate the hotword where it doesn't belong.
  • Don't set all hotwords to weight 5. When everything is high priority, nothing is. Differentiate by actual importance.
The target_model specified when creating the hotword list must match the model used in the ASR call. A mismatch causes the hotwords to have no effect.
For full API details on creating, querying, updating, and deleting hotword vocabularies, see the Hotwords API reference.

Post-processing correction

Vocabulary customization handles known terms. Post-processing catches everything else -- grammar errors, context-dependent words, and ambiguous homophones.

Dual-track pattern

For accuracy-critical applications (medical transcription, legal proceedings, financial reports), combine real-time ASR with LLM post-correction:
Audio Stream
    |
    v
+-------------------+     +-------------------+
| Realtime ASR      | --> | LLM Post-Correct  |
| (low latency)     |     | (high accuracy)   |
+-------------------+     +-------------------+
    |                           |
    v                           v
Display immediately       Replace with corrected text
(draft transcript)        (final transcript)
  1. Fast track: Stream audio to the realtime ASR endpoint (qwen3-asr-flash-realtime or fun-asr-realtime). Display intermediate results immediately.
  2. Correction track: Collect sentence-level ASR output. Send it to an LLM (e.g., qwen-plus) with a domain-aware correction prompt:
Fix any speech recognition errors in the following transcript.
Preserve the original meaning. Domain: [medical/legal/financial].
Transcript: "{asr_output}"
  3. Replace the displayed text with the corrected version once the LLM responds (typically 1-2 seconds).
This pattern is especially valuable when:
  • Domain-specific terminology is frequently misrecognized even with vocabulary customization.
  • Accuracy requirements exceed 98%.
  • A 1-2 second correction delay is acceptable for the user experience.
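The correction track can be kept independent of any particular LLM backend by injecting the call as a parameter. A minimal sketch of step 2, where `llm_call` stands in for a wrapper around qwen-plus (both names are illustrative):

```python
CORRECTION_PROMPT = (
    "Fix any speech recognition errors in the following transcript.\n"
    "Preserve the original meaning. Domain: {domain}.\n"
    'Transcript: "{asr_output}"'
)

def correct_transcript(asr_output, domain, llm_call):
    """Format the correction prompt for one sentence-level ASR
    result and return the LLM's corrected text."""
    prompt = CORRECTION_PROMPT.format(domain=domain, asr_output=asr_output)
    return llm_call(prompt)
```

Injecting `llm_call` also makes the correction step easy to test with a stub before wiring up the real model.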

Measuring accuracy

  • WER (Word Error Rate): Measure before and after each optimization. Compare a representative set of audio samples against human-verified transcripts.
  • A/B test: Compare vocabulary customization alone vs. vocabulary customization + dual-track correction.
  • Per-term tracking: Monitor accuracy for each term in your hotword list or context. Some terms may need weight adjustment or additional context.
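For the WER measurement above, word-level Levenshtein distance is the standard definition. A minimal reference implementation (production pipelines usually normalize case and punctuation first):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / number of
    reference words, via dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis inserts many extra words.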

High-concurrency ASR

When running ASR at scale, use object pooling to manage Recognition instances. This avoids the overhead of creating a new WebSocket connection per request.

Java: Object pool for recognition

Use an Apache Commons Pool2 object pool to manage Recognition instances. Requires Java SDK >= 2.16.9. Environment variables:
| Variable | Default | Recommendation |
| --- | --- | --- |
| DASHSCOPE_CONNECTION_POOL_SIZE | 32 | 2x peak concurrency |
| DASHSCOPE_MAXIMUM_ASYNC_REQUESTS | 32 | Match connection pool size |
| DASHSCOPE_MAXIMUM_ASYNC_REQUESTS_PER_HOST | 32 | Match connection pool size |
| RECOGNITION_OBJECTPOOL_SIZE | 500 | 1.5x-2x peak concurrency |
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

class RecognitionFactory extends BasePooledObjectFactory<Recognition> {
  public Recognition create() { return new Recognition(); }
  public PooledObject<Recognition> wrap(Recognition obj) {
    return new DefaultPooledObject<>(obj);
  }
}

// Create a global singleton pool
GenericObjectPoolConfig<Recognition> config = new GenericObjectPoolConfig<>();
int poolSize = 500; // Or read from RECOGNITION_OBJECTPOOL_SIZE
config.setMaxTotal(poolSize);
config.setMaxIdle(poolSize);
config.setMinIdle(poolSize);
GenericObjectPool<Recognition> recognitionPool =
  new GenericObjectPool<>(new RecognitionFactory(), config);

// In each task thread:
Recognition recognizer = recognitionPool.borrowObject();
try {
  // ... run recognition task
  recognitionPool.returnObject(recognizer);
} catch (Exception e) {
  // Do not return failed objects to the pool; discard them
  recognitionPool.invalidateObject(recognizer);
}
The object pool size must be less than or equal to the connection pool size. Otherwise, threads will block while waiting for available connections.

Production checklist

  • Audio captured with appropriate microphone for the scenario
  • Audio format is lossless (PCM/WAV), mono, 16 kHz+
  • No pre-processing filters (noise reduction, AGC) applied to audio
  • Vocabulary customization configured (context enhancement or hotwords, matching your model family)
  • For hotwords: target_model matches the ASR model in use
  • For hotwords: weights tuned incrementally, not all set to maximum
  • WER measured on representative audio samples
  • Dual-track pattern evaluated for accuracy-critical paths
  • Connection pool and object pool sizes configured for expected peak load
  • Error handling discards failed objects instead of returning them to the pool
  • Fallback handling for ASR service unavailability
  • Monitoring tracks recognition accuracy over time

Supported models

| Technique | Supported models |
| --- | --- |
| Prompt context | qwen3-omni-flash, qwen3-omni-flash-realtime |
| Context enhancement | qwen3-asr-flash, qwen3-asr-flash-realtime, qwen3-asr-plus, qwen3-asr-plus-realtime |
| Hotwords | fun-asr-realtime, fun-asr (and dated variants) |
For the full model list, see Speech-to-text models.

Pricing

  • Context enhancement: No additional cost. Standard Qwen3-ASR per-token pricing applies.
  • Hotwords: Free. No charge for creating or using hotword vocabularies.

Limits

  • Context enhancement: Maximum 10,000 tokens per system message.
  • Hotwords: Up to 10 hotword lists per account (shared across models). Up to 500 hotwords per list. Request an increase if needed.