Accuracy tuning

Maximize correctness and consistent behavior across text, image, video, speech, and vision models on Qwen Cloud

"Accuracy" means different things for different modalities: for text generation it means factually correct and well-formatted answers; for image and video generation it means faithfully matching the prompt's intent with high visual quality; for speech synthesis it means natural-sounding audio with the right voice characteristics; and for vision or omni models it means correctly interpreting visual inputs. The underlying optimization approach, however, is the same across all of them. This guide walks through a structured approach to diagnosing and fixing accuracy issues — from prompt engineering to RAG to fine-tuning — so you can move from a working prototype to a reliable production system. While much of the detailed guidance focuses on text generation, each section includes strategies for other modalities where applicable.

The optimization framework

Think of accuracy optimization along two axes:
  • Context optimization — the model lacks knowledge or guidance. For text models, fix this with better prompts or retrieval-augmented generation. For image/video models, fix this with more descriptive prompts, reference images, or style descriptors. For speech, fix this with model selection and voice configuration. This maximizes correctness.
  • Model optimization — the model behaves inconsistently. For text models, fix this with few-shot examples or fine-tuning. For generation models, fix this with seed pinning and consistent parameter sets. This maximizes consistency.
Most real-world applications need improvements on both axes. A typical optimization journey looks like this:
  1. Start with prompt engineering — get the best results possible with clear instructions and examples
  2. Build an evaluation set to measure progress objectively
  3. Diagnose whether the problem is missing knowledge, inconsistent behavior, or both
  4. Add RAG if the model needs external information
  5. Add fine-tuning if the model needs to learn new patterns
  6. Combine techniques and iterate

Prompt engineering

Prompt engineering is always the first step. It requires no infrastructure, gives immediate feedback, and often solves more than you expect.
| Strategy | When to use | Example |
| --- | --- | --- |
| Be specific | Model gives vague or off-target answers | Instead of "Summarize this", say "Summarize this article in 3 bullet points, each under 20 words" |
| Add context | Model lacks domain knowledge | Include relevant reference text, definitions, or constraints in the system prompt |
| Use few-shot examples | Model format or style is inconsistent | Provide 2-5 input/output examples that demonstrate the exact behavior you want |
| Chain of thought | Model makes reasoning errors | Add "Think step by step" or provide a worked example showing the reasoning process |
| Set constraints | Model produces unwanted content | Explicitly state what to include, exclude, and how to handle edge cases |
For detailed prompt engineering techniques, see Text generation prompt engineering.
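As a minimal sketch of the few-shot strategy, the examples can be interleaved as chat messages before the real query, assuming you call the model through an OpenAI-compatible chat interface (the sentiment task and message contents below are illustrative):

```python
# Sketch: a few-shot prompt for a sentiment-labeling task, expressed as
# OpenAI-compatible chat messages. The task and examples are illustrative.
def build_few_shot_messages(examples, user_input, system_prompt):
    """Interleave input/output examples before the real query."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_in, example_out in examples:
        messages.append({"role": "user", "content": example_in})
        messages.append({"role": "assistant", "content": example_out})
    messages.append({"role": "user", "content": user_input})
    return messages

examples = [
    ("The checkout flow is so slow.", "negative"),
    ("Love the new dashboard!", "positive"),
]
messages = build_few_shot_messages(
    examples,
    "The docs were easy to follow.",
    "Classify sentiment as exactly one word: positive, negative, or neutral.",
)
```

The assistant turns show the model the exact output shape you expect, which is usually more effective than describing the format in prose alone.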
Long-context models like qwen3-max support inputs up to 256K tokens, but information placed in the middle of very long contexts may receive less attention than information at the beginning or end. Put your most important context — instructions, key reference material — at the start or end of the prompt.

Prompt engineering for other modalities

Image generation
  • Use a structured prompt formula: Subject + Setting + Style + Mood + Technical details. For example, "A golden retriever sitting in a sunlit garden, watercolor style, warm and peaceful, soft focus background." See Image generation prompt engineering for detailed techniques.
  • Use negative prompts to exclude common defects such as blurriness, extra limbs, or text artifacts.
  • The prompt_extend (prompt rewriting) feature can enrich short prompts automatically, but adds 3-5 seconds of latency. Enable it during exploration, disable it in production when you have well-crafted prompts.
  • Model selection matters: Qwen-Image excels at text rendering within images, while Wan produces more photorealistic results. See Text-to-image for model comparisons.
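To keep the prompt formula consistent across many requests, it can help to assemble prompts from named parts; a minimal sketch, where the field breakdown is illustrative rather than a Qwen Cloud API schema:

```python
# Sketch: Subject + Setting + Style + Mood + Technical-details prompt formula.
# The field names are illustrative, not an API schema.
def build_image_prompt(subject, setting, style, mood, technical):
    return ", ".join([subject, setting, style, mood, technical])

prompt = build_image_prompt(
    "A golden retriever sitting",
    "in a sunlit garden",
    "watercolor style",
    "warm and peaceful",
    "soft focus background",
)
# Pair the prompt with a negative prompt that excludes common defects.
negative_prompt = "blurry, extra limbs, text artifacts"
```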
Video generation
  • Structure prompts as Entity + Scene + Motion. Explicitly describe the motion you want — the model will not infer movement from a static scene description. See Video generation prompt engineering.
  • Use multi-shot mode to maintain subject consistency across scenes. See Text-to-video.
  • Use negative prompts to suppress visual artifacts and fix a seed value to ensure reproducible results during iteration.
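Seed pinning can be sketched as a frozen base parameter set reused across runs, so that differences between outputs come from prompt changes only; parameter names below are illustrative rather than an exact API schema:

```python
# Sketch: pin the seed and negative prompt while iterating on the prompt text.
# Parameter names are illustrative, not an exact Qwen Cloud request schema.
BASE_PARAMS = {"seed": 42, "negative_prompt": "flicker, distortion, watermark"}

def make_request(prompt, overrides=None):
    request = dict(BASE_PARAMS)
    request["prompt"] = prompt
    if overrides:
        request.update(overrides)
    return request

a = make_request("A sailboat drifting across a calm bay, camera slowly pans right")
b = make_request("A sailboat drifting across a calm bay, camera slowly pans right")
assert a == b  # identical inputs build identical requests, for reproducible iteration
```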
Vision and omni models
  • Optimize input image and video quality: crop to the region of interest and ensure adequate resolution for the details you need the model to analyze.
  • Place the most critical question at the beginning or end of your prompt for best attention.
  • For text extraction from images, use the dedicated OCR endpoint for higher accuracy. See OCR and text extraction.
Speech
  • For text-to-speech (TTS), match the model to your scenario: use instruct models for emotional expression and fine-grained control, flash models for low-latency short text, and vd models when you need to design a custom voice from a text description. See Speech synthesis.
  • With instruct TTS models, use natural language instructions to control speech rate, emotion, and character (such as "Speak slowly in a warm, friendly tone").
  • For automatic speech recognition (ASR), ensure audio input is clear with minimal background noise. For large files, pass a URL rather than Base64-encoded data to avoid request size limits. See Speech recognition.

Build an evaluation set

Before optimizing, you need a way to measure improvement. Build an evaluation set of 30-50 question-answer pairs that represent your real use case. Each pair should include:
  • Input: The exact prompt (including system message and any context) you would send to the model
  • Expected output: The correct or ideal response
  • Evaluation criteria: How you judge whether the response is acceptable
Run your evaluation set against the current model to establish a baseline score. Then re-run it after each change to confirm the change actually helps.

Automated evaluation approaches

Manual review does not scale. Consider these automated approaches:
| Approach | Best for | How it works |
| --- | --- | --- |
| Exact match | Classification, extraction, yes/no tasks | Compare model output directly against the expected answer |
| ROUGE / BERTScore | Summarization, translation | Measure token overlap or semantic similarity between model output and reference |
| LLM-as-judge | Open-ended generation, complex tasks | Use a capable model like qwen3-max to score outputs against your rubric |
| Task-specific metrics | Domain-specific tasks | Custom metrics (such as valid JSON, correct SQL syntax, entity coverage) |
For tasks where correctness is nuanced, LLM-as-judge with a clear scoring rubric often provides the best signal-to-noise ratio.
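A minimal harness for the exact-match approach is sketched below, with a stand-in model function where your real model call would go:

```python
# Sketch: exact-match scoring over a small evaluation set.
# `run_model` is a stand-in for your actual model call.
def exact_match_score(eval_set, run_model):
    correct = sum(
        1 for case in eval_set
        if run_model(case["input"]).strip().lower() == case["expected"].strip().lower()
    )
    return correct / len(eval_set)

eval_set = [
    {"input": "Is the sky blue on a clear day?", "expected": "yes"},
    {"input": "Is 7 an even number?", "expected": "no"},
]

# Fake model used for the sketch; replace with a real API call.
fake_model = lambda prompt: "yes" if "sky" in prompt else "no"
baseline = exact_match_score(eval_set, fake_model)  # re-run after each change
```

Record the baseline once, then re-run the same harness after every prompt, retrieval, or model change to confirm the score actually moved.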

Evaluating non-text modalities

  • Image and video generation: Score outputs on a 1-5 scale across dimensions such as prompt faithfulness, visual quality, text clarity (if applicable), and absence of artifacts. Build a rubric with reference examples for each score level.
  • Speech (TTS): Use Mean Opinion Score (MOS) testing across naturalness, intelligibility, and voice similarity (for cloned voices). Have multiple raters evaluate each sample.
  • Speech (ASR): Measure Word Error Rate (WER) against ground-truth transcriptions on a representative audio test set.
  • Vision and omni: Use an LLM-as-judge approach — send the model's output along with the source image and expected answer to a capable text model for automated scoring.
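Word Error Rate can be computed directly with a word-level edit distance; a self-contained sketch:

```python
# Sketch: Word Error Rate (WER) via word-level Levenshtein distance.
# WER = (substitutions + insertions + deletions) / reference word count.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)

wer = word_error_rate("turn the lights off", "turn the light off")  # one substitution
```

Average this over your whole audio test set; libraries such as jiwer implement the same metric with extra text normalization.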

Diagnose the problem

Once you have evaluation results, categorize failures into two types:
  • Missing knowledge (context problem) — The model does not have the information it needs. For text, it may hallucinate facts or give generic answers. For image/video generation, this manifests as prompts that lack sufficient detail, producing outputs that miss the intended subject, style, or composition. Fix this with RAG (for text/vision) or richer prompts and reference media (for generation tasks).
  • Inconsistent behavior (learning problem) — The model has the information but does not reliably use it. For text, it may format outputs inconsistently or ignore instructions. For image/video generation, this manifests as varying styles, compositions, or quality across runs with the same prompt. Fix this with few-shot examples, fine-tuning (text), or seed pinning and parameter stabilization (generation).
These categories are complementary, not mutually exclusive. A single application can have both problems — and often does. Diagnose each failure case individually to choose the right fix.

Retrieval-augmented generation (RAG)

RAG injects relevant external knowledge into each request at inference time. Instead of relying solely on what the model learned during training, you retrieve the most relevant documents from your knowledge base and include them in the prompt. RAG primarily applies to text and vision tasks; for image and video generation, the equivalent approach is providing reference images, detailed style descriptions, or audio samples to guide the output.

Use RAG when:
  • The model needs information that changes frequently (product catalogs, knowledge bases, policies)
  • The model needs proprietary or domain-specific data not in its training set
  • You need to cite sources or ground responses in specific documents
  • For vision tasks, the model needs to analyze images against a reference knowledge base (such as identifying products from a catalog, verifying document formats, or comparing visual assets against brand guidelines)
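The core retrieve-then-prompt loop can be sketched as follows, with a toy bag-of-words "embedding" standing in for a real embedding model (in production, embed queries and documents with the same embedding service):

```python
import math
from collections import Counter

# Minimal RAG sketch. The bag-of-words "embedding" below is a toy stand-in
# for a real embedding model; the documents and query are illustrative.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, top_k=1):
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

docs = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is open Monday through Friday.",
]
context = retrieve("How long do refunds take?", docs)[0]
prompt = f"Answer using only this context:\n{context}\n\nQuestion: How long do refunds take?"
```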

Evaluate your RAG pipeline

RAG accuracy depends on two things: retrieval quality and generation quality. Evaluate them separately:
| Metric | What it measures | How to check |
| --- | --- | --- |
| Retrieval precision | Are the retrieved documents relevant? | Sample 50 queries, check if top-3 results contain the needed information |
| Retrieval recall | Are all relevant documents found? | For known-answer questions, verify the source document appears in results |
| Generation faithfulness | Does the model stick to retrieved content? | Check if answers are grounded in the provided context, not hallucinated |
| End-to-end accuracy | Does the full pipeline give correct answers? | Run your evaluation set through the complete RAG pipeline |
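The top-3 retrieval check can be automated once you have labeled which documents are relevant for each sampled query; a sketch with hypothetical document IDs:

```python
# Sketch: fraction of sampled queries where the top-k retrieved documents
# contain at least one relevant document. IDs below are illustrative.
def hit_rate_at_k(cases, k=3):
    hits = sum(
        1 for case in cases
        if any(doc_id in case["relevant"] for doc_id in case["retrieved"][:k])
    )
    return hits / len(cases)

cases = [
    {"retrieved": ["d4", "d1", "d9"], "relevant": {"d1"}},  # hit
    {"retrieved": ["d7", "d2", "d5"], "relevant": {"d3"}},  # miss
]
score = hit_rate_at_k(cases)
```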

Retrieval optimization tips

  • Chunk size matters — Chunks that are too small lose context; chunks that are too large dilute relevance. Start with 200-500 tokens and experiment.
  • Use reranking — A two-stage pipeline (fast retrieval then precise reranking) significantly improves precision without sacrificing recall.
  • Embed queries and documents consistently — Use the same embedding model and preprocessing for both indexing and querying.
  • Add metadata filters — Filter by date, category, or source before semantic search to narrow the candidate set.
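The chunk-size tip can be sketched as fixed-size chunking with overlap, using whitespace-delimited words as a rough proxy for tokens (real token counts depend on your tokenizer):

```python
# Sketch: fixed-size chunking with overlap so sentences split at a boundary
# still appear whole in at least one chunk. Words approximate tokens here.
def chunk_words(text, chunk_size=300, overlap=50):
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

# Small demo with toy sizes: two chunks sharing one overlapping word.
demo = chunk_words("one two three four five six", chunk_size=4, overlap=1)
```

Tune `chunk_size` and `overlap` against your retrieval precision and recall numbers rather than fixing them once.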
To build RAG pipelines with Qwen Cloud:
  • Text embedding — Generate embeddings for semantic search
  • Reranking — Improve retrieval precision with a reranking model

Combining techniques

The most effective production systems combine prompt engineering and RAG:
  • Prompt engineering + RAG — Clear instructions tell the model how to use the retrieved context. Few-shot examples show the expected format.
  • Structured output + RAG — Use structured output to enforce consistent response formats when processing retrieved documents.
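The prompt engineering + RAG combination amounts to message assembly: instructions that tell the model how to use the retrieved context, plus a few-shot exchange that fixes the citation format. A sketch, where the documents and the example exchange are illustrative:

```python
# Sketch: combine retrieved context with explicit instructions and one
# few-shot exchange. Content is illustrative; `retrieved_docs` is assumed
# to come from your retrieval step.
def build_rag_messages(question, retrieved_docs):
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    system = (
        "Answer using only the numbered context below. "
        "Cite sources like [1]. If the context is insufficient, say so.\n\n"
        + context
    )
    return [
        {"role": "system", "content": system},
        # One few-shot exchange demonstrating the citation format:
        {"role": "user", "content": "What is the refund window?"},
        {"role": "assistant", "content": "14 days from purchase [1]."},
        {"role": "user", "content": question},
    ]

messages = build_rag_messages(
    "Can I return an opened item?",
    ["Refunds are issued within 14 days.", "Opened items can be exchanged only."],
)
```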
When combining techniques, keep two principles in mind:
  1. More context can add noise. Retrieving too many documents or stuffing too much information into the prompt can confuse the model. Always validate that adding context actually improves your evaluation scores.
  2. Exhaust simpler methods first. Prompt engineering is fast and free. RAG requires moderate effort. Move to the next technique only when the current one plateaus.

How much accuracy is good enough

Business perspective

Perfect accuracy is rarely achievable or necessary. The right target depends on the cost of errors versus the cost of improvement. Consider a customer service scenario:
| Accuracy level | Behavior | Business impact |
| --- | --- | --- |
| 80% | Handles common questions correctly, fails on edge cases | Reduces support volume but requires human review for ~20% of queries |
| 90% | Handles most questions including some edge cases | Significant cost savings, occasional escalations |
| 95% | Handles nearly all questions correctly | Near-full automation, rare human intervention |
| 99% | Near-perfect responses | Diminishing returns — the last 4% may cost more than the first 95% |
The jump from 90% to 95% often requires more effort than the jump from 0% to 90%. Decide where the marginal cost of improvement exceeds the marginal value.

Technical perspective

Design your system to fail gracefully when the model gets it wrong:
  • Confidence thresholds — If the model is uncertain, escalate to a human or ask for clarification rather than guessing.
  • Output validation — Check model outputs against schemas, business rules, or sanity checks before acting on them.
  • Fallback paths — Provide a graceful degradation path (such as "I'm not sure, let me connect you with a specialist") rather than a wrong answer.
  • Monitoring — Track accuracy metrics in production and set alerts for degradation so you can catch regressions early.
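The output-validation idea can be sketched as a parse-and-check step with a fallback to human review; the field names and business rules below are illustrative:

```python
import json

# Sketch: validate a model's JSON output against simple business rules before
# acting on it. Returning None routes the case to a fallback path (for
# example, human review). Field names and rules are illustrative.
def safe_parse_ticket(raw_output):
    try:
        ticket = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # malformed JSON: fall back
    if ticket.get("priority") not in {"low", "medium", "high"}:
        return None  # business-rule check
    if not isinstance(ticket.get("summary"), str) or not ticket["summary"]:
        return None  # sanity check
    return ticket

ok = safe_parse_ticket('{"priority": "high", "summary": "Login page down"}')
bad = safe_parse_ticket('{"priority": "urgent!!", "summary": "Login page down"}')
```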

Next steps