Speech recognition accuracy depends on three things: audio quality, vocabulary customization, and post-processing. Poor audio is the most common cause of errors. Vocabulary customization handles domain-specific terms. Post-processing catches what the other two miss.
This guide covers all three.
Audio quality
The best ASR model in the world cannot transcribe audio it cannot hear clearly. Audio quality is the single biggest factor in recognition accuracy -- and the cheapest to fix.
Recording environment
| Factor | Recommendation |
|---|---|
| Background noise | Record in a quiet room. Avoid air conditioning hum, keyboard clicks, and conversations nearby. If noise is unavoidable, use a directional microphone. |
| Echo and reverb | Hard walls, glass, and large empty rooms cause echo. Add soft furnishings, use smaller rooms, or use a close-talking microphone. |
| Speaker distance | Keep 10-30 cm from the microphone. Too close causes plosive distortion ("p" and "b" sounds). Too far picks up room noise. |
| Speaking style | Speak at a natural pace. Avoid mumbling, trailing off, or speaking over others. For multi-speaker scenarios, leave clear pauses between speakers. |
Microphone selection
Different scenarios call for different microphones. The right choice eliminates noise at the source.
| Scenario | Recommended mic | Why |
|---|---|---|
| Call center / dictation | Headset with boom mic (e.g., Jabra Evolve2, Plantronics) | Close-talking design rejects background noise. Consistent mouth-to-mic distance. Best single-speaker accuracy. |
| Interviews / presentations | Lavalier (lapel) mic (e.g., Rode Wireless GO, DJI Mic) | Clips to clothing, hands-free. Good for one speaker moving around. |
| Meeting rooms | Conference array mic (e.g., Jabra Speak, Poly Sync, XMOS array) | Multiple microphones with beamforming. Picks up speakers from all directions while reducing echo. |
| Mobile apps | Built-in device mic | Acceptable for casual use. For accuracy-critical mobile apps, prompt users to use earbuds with a built-in mic. |
| Telephony (8 kHz) | N/A (carrier audio) | Use 8 kHz models. Audio quality is fixed by the phone network. |
If you control the recording hardware, invest in a USB headset with noise cancellation. A $50 headset improves accuracy more than any software optimization.
Audio format
| Parameter | Recommendation |
|---|---|
| Sample rate | 16 kHz or higher. Use 8 kHz only for telephony recordings. Higher sample rates (44.1 kHz, 48 kHz) are downsampled automatically -- they work fine but cost more bandwidth. |
| Channels | Mono. Stereo is accepted but the second channel is ignored. |
| Format | PCM or WAV (lossless). These preserve the full signal. |
| Compression | Avoid lossy codecs (MP3, AAC, OGG) for accuracy-critical paths. If you must compress, use OPUS at 32 kbps or higher. |
| Bit depth | 16-bit. |
Do not apply noise reduction, AGC (automatic gain control), or audio enhancement filters before sending audio to ASR. These filters can remove speech signal along with noise, degrading accuracy. Send the raw audio and let the ASR model handle noise internally.
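These parameters can be verified programmatically before upload. A minimal pre-flight check using Python's standard-library wave module (the file name and helper are illustrative, not part of any SDK):

```python
import wave

def audio_issues(path):
    """Return a list of parameter problems for a WAV file, based on
    the recommendations above (16 kHz or higher, mono, 16-bit PCM)."""
    issues = []
    with wave.open(path, "rb") as wf:
        if wf.getframerate() < 16000:
            issues.append(f"sample rate {wf.getframerate()} Hz is below 16 kHz")
        if wf.getnchannels() != 1:
            issues.append(f"{wf.getnchannels()} channels; mono is preferred")
        if wf.getsampwidth() != 2:
            issues.append(f"{wf.getsampwidth() * 8}-bit samples; 16-bit is recommended")
    return issues

# Demo: write one second of 8 kHz stereo silence, then check it
with wave.open("check_me.wav", "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(8000)
    wf.writeframes(b"\x00\x00" * 2 * 8000)

for problem in audio_issues("check_me.wav"):
    print("WARNING:", problem)
```

Running this flags the demo file twice: once for the 8 kHz sample rate and once for the stereo channel count.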
Vocabulary customization
When your ASR output consistently misrecognizes specific terms -- brand names, product names, technical jargon, proper nouns -- vocabulary customization fixes it. The technique depends on which model family you use.
| Technique | Models | How it works |
|---|---|---|
| Prompt context | Qwen-Omni | Describe your domain in the system prompt. The model is an LLM that understands audio, so it adapts on every request. Most flexible, but higher per-request latency than dedicated ASR. |
| Context enhancement | Qwen3-ASR | Pass domain context as a system message. The model adapts dynamically. |
| Hotwords | Fun-ASR | Create a vocabulary list with weighted terms. Pass the list ID in the ASR call. |
Prompt context (Qwen-Omni)
Qwen-Omni is not traditional ASR -- it is an LLM that understands audio. Describe your domain in the system prompt and the model adapts without any pre-configuration. This is the most flexible approach but has higher per-request latency than dedicated ASR models.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-flash",
    messages=[
        {
            "role": "system",
            "content": "You are transcribing investment banking discussions. "
                       "Key terms: Bulge Bracket, Boutique, Middle Market, LBO, DCF."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": "https://example.com/meeting.mp3",
                        "format": "mp3"
                    }
                },
                {"type": "text", "text": "Transcribe this audio."}
            ]
        }
    ],
    stream=True,
    modalities=["text"],
)

for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
});

const completion = await openai.chat.completions.create({
  model: "qwen3-omni-flash",
  messages: [
    {
      role: "system",
      content: "You are transcribing investment banking discussions. "
        + "Key terms: Bulge Bracket, Boutique, Middle Market, LBO, DCF."
    },
    {
      role: "user",
      content: [
        {
          type: "input_audio",
          input_audio: {
            data: "https://example.com/meeting.mp3",
            format: "mp3"
          }
        },
        { type: "text", text: "Transcribe this audio." }
      ]
    }
  ],
  stream: true,
  modalities: ["text"],
});

for await (const chunk of completion) {
  if (chunk.choices?.[0]?.delta?.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}
curl -X POST https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-omni-flash",
    "messages": [
      {
        "role": "system",
        "content": "You are transcribing investment banking discussions. Key terms: Bulge Bracket, Boutique, Middle Market, LBO, DCF."
      },
      {
        "role": "user",
        "content": [
          {
            "type": "input_audio",
            "input_audio": {
              "data": "https://example.com/meeting.mp3",
              "format": "mp3"
            }
          },
          {
            "type": "text",
            "text": "Transcribe this audio."
          }
        ]
      }
    ],
    "stream": true,
    "modalities": ["text"]
  }'
Use prompt context when your domain terminology changes frequently or when you need the model to understand intent beyond raw transcription. For stable term lists with predictable audio, hotwords or context enhancement offer lower latency.
Context enhancement (Qwen3-ASR)
Pass domain-specific context in the system message. The model uses this context to improve recognition of specialized terms -- no vocabulary lists, no weight tuning. It accepts word lists, paragraphs, or a mix of both.
Before and after:
| Audio content | Without context | With context |
|---|---|---|
| Investment banking terminology | "...the top banks, Bird Rock, BB..." | "...the top banks, Bulge Bracket, BB..." |
The context used: "Bulge Bracket, Boutique, Middle Market"
Supported context formats -- all are equally effective:
- Word list: "Bulge Bracket, Boutique, Middle Market" (terms separated by commas or spaces, or supplied as a JSON array)
- Paragraph: a natural-language description of the domain (e.g., a paragraph about investment banking terminology)
- Mixed: a word list and a paragraph combined in the same context
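These formats can also be assembled programmatically. A small sketch (build_context is a hypothetical helper, not part of the SDK) that produces the word-list, paragraph, or mixed format for the system message:

```python
def build_context(terms=None, description=None):
    """Assemble a context string for the Qwen3-ASR system message from
    a word list, a domain paragraph, or both (the mixed format)."""
    parts = []
    if terms:
        parts.append(", ".join(terms))
    if description:
        parts.append(description)
    return "\n".join(parts)

# Word-list format
print(build_context(terms=["Bulge Bracket", "Boutique", "Middle Market"]))
# Mixed format: word list plus a domain paragraph
print(build_context(
    terms=["Bulge Bracket", "Boutique"],
    description="A discussion of investment banking tiers and deal structures."
))
```

The returned string goes directly into the system message's text field, as in the examples below.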
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

messages = [
    {
        "role": "system",
        "content": [{"text": "Bulge Bracket, Boutique, Middle Market"}]
    },
    {
        "role": "user",
        "content": [{"audio": "https://example.com/investment-banking-audio.mp3"}]
    }
]

response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model="qwen3-asr-flash",
    messages=messages,
    result_format="message"
)
print(response.output.choices[0].message.content[0]["text"])
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import java.util.*;

public class ContextEnhancementExample {
    public static void main(String[] args) throws Exception {
        MultiModalConversation conv = new MultiModalConversation();

        // System message with domain context
        MultiModalMessage sysMessage = MultiModalMessage.builder()
                .role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text", "Bulge Bracket, Boutique, Middle Market")
                ))
                .build();

        // User message with audio
        MultiModalMessage userMessage = MultiModalMessage.builder()
                .role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("audio",
                                "https://example.com/investment-banking-audio.mp3")
                ))
                .build();

        MultiModalConversationParam param = MultiModalConversationParam.builder()
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("qwen3-asr-flash")
                .message(sysMessage)
                .message(userMessage)
                .build();

        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0)
                .getMessage().getContent().get(0).get("text"));
    }
}
Context is limited to 10,000 tokens. The model is robust to irrelevant text in the context -- including unrelated names or descriptions -- with almost no negative impact on accuracy.
Hotwords (Fun-ASR)
Create a vocabulary list of terms you want the model to prioritize, then pass the list ID when calling ASR. This is a two-step workflow:
1. Create a hotword list via the Hotwords API, specifying target_model to declare which ASR model will use the list.
2. Pass the list ID in the ASR call. The model must match the target_model from step 1.
import os
import dashscope
from dashscope.audio.asr import *

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

target_model = "fun-asr-realtime"

# Step 1: Create hotword list
service = VocabularyService()
vocabulary_id = service.create_vocabulary(
    prefix='my-app',
    target_model=target_model,
    vocabulary=[
        {"text": "Bulge Bracket", "weight": 4},
        {"text": "Middle Market", "weight": 4}
    ]
)

# Step 2: Use hotword list for recognition
if service.query_vocabulary(vocabulary_id)['status'] == 'OK':
    recognition = Recognition(
        model=target_model,
        format='wav',
        sample_rate=16000,
        callback=None,
        vocabulary_id=vocabulary_id
    )
    result = recognition.call('audio.wav')
    print(result.output)

# Clean up when no longer needed
service.delete_vocabulary(vocabulary_id)
import com.alibaba.dashscope.audio.asr.recognition.*;
import com.alibaba.dashscope.audio.asr.vocabulary.*;
import com.alibaba.dashscope.utils.Constants;
import com.google.gson.*;
import java.io.File;

public class HotwordExample {
    public static void main(String[] args) throws Exception {
        String apiKey = System.getenv("DASHSCOPE_API_KEY");
        Constants.baseHttpApiUrl = "https://dashscope-intl.aliyuncs.com/api/v1";
        Constants.baseWebsocketApiUrl = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference";
        String targetModel = "fun-asr-realtime";

        // Step 1: Create hotword list
        JsonArray vocab = new JsonArray();
        JsonObject word1 = new JsonObject();
        word1.addProperty("text", "Bulge Bracket");
        word1.addProperty("weight", 4);
        vocab.add(word1);

        VocabularyService service = new VocabularyService(apiKey);
        Vocabulary vocabulary = service.createVocabulary(targetModel, "my-app", vocab);

        // Step 2: Use hotword list for recognition
        if ("OK".equals(service.queryVocabulary(vocabulary.getVocabularyId()).getStatus())) {
            Recognition recognizer = new Recognition();
            RecognitionParam param = RecognitionParam.builder()
                    .model(targetModel)
                    .apiKey(apiKey)
                    .format("wav")
                    .sampleRate(16000)
                    // Attach the hotword list created in step 1
                    .vocabularyId(vocabulary.getVocabularyId())
                    .build();
            try {
                System.out.println(recognizer.call(param, new File("audio.wav")));
            } finally {
                recognizer.getDuplexApi().close(1000, "bye");
            }
        }
        service.deleteVocabulary(vocabulary.getVocabularyId());
        System.exit(0);
    }
}
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The hotword. Must be a real word or phrase. See length limits below. |
| weight | int | Yes | Priority weight. Range: 1-5. Default: 4. |
| lang | string | No | Language code (e.g., en, zh). Omit to auto-detect. |
Text length limits:
- Contains non-ASCII characters: Total character count must not exceed 15.
- Pure ASCII: Space-separated word count must not exceed 7.
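These limits can be checked client-side before creating the list. A sketch of such a check (validate_hotword is a hypothetical helper, not an SDK function):

```python
def validate_hotword(entry):
    """Check one hotword dict against the documented limits:
    non-ASCII text <= 15 characters, pure-ASCII text <= 7 words,
    weight in the range 1-5. Returns a list of violations."""
    problems = []
    text = entry.get("text", "")
    weight = entry.get("weight", 4)  # 4 is the documented default
    if text.isascii():
        if len(text.split()) > 7:
            problems.append("ASCII hotword exceeds 7 space-separated words")
    else:
        if len(text) > 15:
            problems.append("non-ASCII hotword exceeds 15 characters")
    if not 1 <= weight <= 5:
        problems.append("weight must be between 1 and 5")
    return problems

print(validate_hotword({"text": "Bulge Bracket", "weight": 4}))  # []
print(validate_hotword({"text": "one two three four five six seven eight", "weight": 6}))
```

Running the check on each entry before the create call avoids a round trip to the API for malformed lists.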
Weight tuning
- Start at weight 4. This works for most terms.
- Increase to 5 only if a term is still missed after testing. Higher weight can cause the model to hallucinate the hotword where it doesn't belong.
- Don't set all hotwords to weight 5. When everything is high priority, nothing is. Differentiate by actual importance.
The target_model specified when creating the hotword list must match the model used in the ASR call. A mismatch causes the hotwords to have no effect.
For full API details on creating, querying, updating, and deleting hotword vocabularies, see the Hotwords API reference.
Post-processing correction
Vocabulary customization handles known terms. Post-processing catches everything else -- grammar errors, context-dependent words, and ambiguous homophones.
Dual-track pattern
For accuracy-critical applications (medical transcription, legal proceedings, financial reports), combine real-time ASR with LLM post-correction:
     Audio Stream
          |
          v
+-------------------+     +-------------------+
|   Realtime ASR    | --> |  LLM Post-Correct |
|   (low latency)   |     |  (high accuracy)  |
+-------------------+     +-------------------+
          |                         |
          v                         v
 Display immediately      Replace with corrected
 (draft transcript)       text (final transcript)
- Fast track: Stream audio to a realtime ASR endpoint (qwen3-asr-flash-realtime or fun-asr-realtime). Display intermediate results immediately.
- Correction track: Collect sentence-level ASR output. Send it to an LLM (e.g., qwen-plus) with a domain-aware correction prompt:
Fix any speech recognition errors in the following transcript.
Preserve the original meaning. Domain: [medical/legal/financial].
Transcript: "{asr_output}"
- Replace the displayed text with the corrected version once the LLM responds (typically 1-2 seconds).
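The correction prompt on the correction track can be filled in with a small helper (build_correction_prompt is a hypothetical name; the template is the one shown above):

```python
def build_correction_prompt(asr_output: str, domain: str) -> str:
    """Fill the domain-aware correction template with the ASR draft."""
    return (
        "Fix any speech recognition errors in the following transcript. "
        f"Preserve the original meaning. Domain: {domain}. "
        f'Transcript: "{asr_output}"'
    )

# Draft from the fast track, sent to the LLM on the correction track
prompt = build_correction_prompt("the top banks, Bird Rock, BB", "financial")
print(prompt)
```

The returned string becomes the user message in a standard chat-completion call to the correcting LLM; the LLM's reply replaces the draft transcript in the UI.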
This pattern is especially valuable when:
- Domain-specific terminology is frequently misrecognized even with vocabulary customization.
- Accuracy requirements exceed 98%.
- A 1-2 second correction delay is acceptable for the user experience.
Measuring accuracy
- WER (Word Error Rate): Measure before and after each optimization. Compare a representative set of audio samples against human-verified transcripts.
- A/B test: Compare vocabulary customization alone vs. vocabulary customization + dual-track correction.
- Per-term tracking: Monitor accuracy for each term in your hotword list or context. Some terms may need weight adjustment or additional context.
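WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal reference implementation for measuring it against human-verified transcripts:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level edit distance (dynamic programming)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # d[i][j] = edit distance between first i reference words
    # and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions over four reference words:
print(wer("the bulge bracket banks", "the bird rock banks"))  # 0.5
```

For production evaluation, normalize case and punctuation consistently on both sides before scoring, and aggregate over a representative sample rather than a single file.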
High-concurrency ASR
When running ASR at scale, use object pooling to manage Recognition instances. This avoids the overhead of creating a new WebSocket connection per request.
Java: Object pool for recognition
Use an Apache Commons Pool2 object pool to manage Recognition instances. Requires Java SDK >= 2.16.9.
Environment variables:
| Variable | Default | Recommendation |
|---|---|---|
| DASHSCOPE_CONNECTION_POOL_SIZE | 32 | 2x peak concurrency |
| DASHSCOPE_MAXIMUM_ASYNC_REQUESTS | 32 | Match connection pool size |
| DASHSCOPE_MAXIMUM_ASYNC_REQUESTS_PER_HOST | 32 | Match connection pool size |
| RECOGNITION_OBJECTPOOL_SIZE | 500 | 1.5x-2x peak concurrency |
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

class RecognitionFactory extends BasePooledObjectFactory<Recognition> {
    @Override
    public Recognition create() {
        return new Recognition();
    }

    @Override
    public PooledObject<Recognition> wrap(Recognition obj) {
        return new DefaultPooledObject<>(obj);
    }
}

// Create a global singleton pool
GenericObjectPoolConfig<Recognition> config = new GenericObjectPoolConfig<>();
int poolSize = 500; // Or read from RECOGNITION_OBJECTPOOL_SIZE
config.setMaxTotal(poolSize);
config.setMaxIdle(poolSize);
config.setMinIdle(poolSize);
GenericObjectPool<Recognition> recognitionPool =
        new GenericObjectPool<>(new RecognitionFactory(), config);

// In each task thread:
Recognition recognizer = recognitionPool.borrowObject();
try {
    // ... run recognition task
    recognitionPool.returnObject(recognizer);
} catch (Exception e) {
    // Do not return failed objects to the pool
}
The object pool size must be less than or equal to the connection pool size. Otherwise, threads will block while waiting for available connections.
Production checklist
Supported models
| Technique | Supported models |
|---|---|
| Prompt context | qwen3-omni-flash, qwen3-omni-flash-realtime |
| Context enhancement | qwen3-asr-flash, qwen3-asr-flash-realtime, qwen3-asr-plus, qwen3-asr-plus-realtime |
| Hotwords | fun-asr-realtime, fun-asr (and dated variants) |
For the full model list, see Speech-to-text models.
Pricing
- Context enhancement: No additional cost. Standard Qwen3-ASR per-token pricing applies.
- Hotwords: Free. No charge for creating or using hotword vocabularies.
Limits
- Context enhancement: Maximum 10,000 tokens per system message.
- Hotwords: Up to 10 hotword lists per account (shared across models). Up to 500 hotwords per list. Request an increase if needed.