Maximize correctness and consistent behavior across text, image, video, speech, and vision models on Qwen Cloud
"Accuracy" means different things for different modalities: for text generation it means factually correct and well-formatted answers; for image and video generation it means faithfully matching the prompt's intent with high visual quality; for speech synthesis it means natural-sounding audio with the right voice characteristics; and for vision or omni models it means correctly interpreting visual inputs. The underlying optimization approach, however, is the same across all of them.
This guide walks through a structured approach to diagnosing and fixing accuracy issues — from prompt engineering to RAG to fine-tuning — so you can move from a working prototype to a reliable production system. While much of the detailed guidance focuses on text generation, each section includes strategies for other modalities where applicable.
The optimization framework
Think of accuracy optimization along two axes:
- Context optimization — the model lacks knowledge or guidance. For text models, fix this with better prompts or retrieval-augmented generation. For image/video models, fix this with more descriptive prompts, reference images, or style descriptors. For speech, fix this with model selection and voice configuration. This maximizes correctness.
- Model optimization — the model behaves inconsistently. For text models, fix this with few-shot examples or fine-tuning. For generation models, fix this with seed pinning and consistent parameter sets. This maximizes consistency.
- Start with prompt engineering — get the best results possible with clear instructions and examples
- Build an evaluation set to measure progress objectively
- Diagnose whether the problem is missing knowledge, inconsistent behavior, or both
- Add RAG if the model needs external information
- Add fine-tuning if the model needs to learn new patterns
- Combine techniques and iterate
Prompt engineering
Prompt engineering is always the first step. It requires no infrastructure, gives immediate feedback, and often solves more than you expect.
| Strategy | When to use | Example |
|---|---|---|
| Be specific | Model gives vague or off-target answers | Instead of "Summarize this", say "Summarize this article in 3 bullet points, each under 20 words" |
| Add context | Model lacks domain knowledge | Include relevant reference text, definitions, or constraints in the system prompt |
| Use few-shot examples | Model format or style is inconsistent | Provide 2-5 input/output examples that demonstrate the exact behavior you want |
| Chain of thought | Model makes reasoning errors | Add "Think step by step" or provide a worked example showing the reasoning process |
| Set constraints | Model produces unwanted content | Explicitly state what to include, exclude, and how to handle edge cases |
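The few-shot strategy in the table above can be sketched as a message list in the OpenAI-compatible chat format. This is a minimal sketch: the sentiment-classification task and example texts are illustrative, and sending the messages to an endpoint is left out so the structure stays self-contained.

```python
# A minimal sketch of the "few-shot examples" strategy: system prompt,
# then 2-5 worked input/output pairs, then the real request. The task
# and example texts are illustrative, not from this guide.

def build_few_shot_messages(system_prompt, examples, user_input):
    """Build a chat message list in the OpenAI-compatible format."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages

messages = build_few_shot_messages(
    system_prompt="Classify the sentiment of each review as positive or negative.",
    examples=[
        ("Great product, works perfectly.", "positive"),
        ("Broke after two days.", "negative"),
    ],
    user_input="Arrived late but does the job.",
)
```

Because the examples are sent as real user/assistant turns rather than described in prose, the model imitates the exact output format you demonstrated.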
Long-context models like qwen3-max support inputs up to 256K tokens, but information placed in the middle of very long contexts may receive less attention than information at the beginning or end. Put your most important context — instructions, key reference material — at the start or end of the prompt.
Prompt engineering for other modalities
Image generation
- Use a structured prompt formula: Subject + Setting + Style + Mood + Technical details. For example, "A golden retriever sitting in a sunlit garden, watercolor style, warm and peaceful, soft focus background." See Image generation prompt engineering for detailed techniques.
- Use negative prompts to exclude common defects such as blurriness, extra limbs, or text artifacts.
- The `prompt_extend` (prompt rewriting) feature can enrich short prompts automatically, but adds 3-5 seconds of latency. Enable it during exploration, and disable it in production once you have well-crafted prompts.
- Model selection matters: Qwen-Image excels at text rendering within images, while Wan produces more photorealistic results. See Text-to-image for model comparisons.
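The Subject + Setting + Style + Mood + Technical details formula can be captured as a small, reusable prompt builder. A minimal sketch; the field values are the illustrative example from above:

```python
# A minimal sketch of the image-prompt formula as a builder function.
# Empty fields are skipped so partial prompts still come out clean.

def build_image_prompt(subject, setting, style, mood, technical=""):
    parts = [subject, setting, style, mood, technical]
    return ", ".join(p.strip() for p in parts if p.strip())

prompt = build_image_prompt(
    subject="A golden retriever sitting",
    setting="in a sunlit garden",
    style="watercolor style",
    mood="warm and peaceful",
    technical="soft focus background",
)
# -> "A golden retriever sitting, in a sunlit garden, watercolor style,
#     warm and peaceful, soft focus background"
```

Keeping the formula in code makes it easy to vary one dimension (say, style) while holding the others fixed during iteration.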
Video generation
- Structure prompts as Entity + Scene + Motion. Explicitly describe the motion you want; the model will not infer movement from a static scene description. See Video generation prompt engineering.
- Use multi-shot mode to maintain subject consistency across scenes. See Text-to-video.
- Use negative prompts to suppress visual artifacts, and fix a `seed` value to ensure reproducible results during iteration.
Vision understanding
- Optimize input image and video quality: crop to the region of interest and ensure adequate resolution for the details you need the model to analyze.
- Place the most critical question at the beginning or end of your prompt for best attention.
- For text extraction from images, use the dedicated OCR endpoint for higher accuracy. See OCR and text extraction.
Speech
- For text-to-speech (TTS), match the model to your scenario: use `instruct` models for emotional expression and fine-grained control, `flash` models for low-latency short text, and `vd` models when you need to design a custom voice from a text description. See Speech synthesis.
- With instruct TTS models, use natural language instructions to control speech rate, emotion, and character (such as "Speak slowly in a warm, friendly tone").
- For automatic speech recognition (ASR), ensure audio input is clear with minimal background noise. For large files, pass a URL rather than Base64-encoded data to avoid request size limits. See Speech recognition.
Build an evaluation set
Before optimizing, you need a way to measure improvement. Build an evaluation set of 30-50 question-answer pairs that represent your real use case. Each pair should include:
- Input: The exact prompt (including system message and any context) you would send to the model
- Expected output: The correct or ideal response
- Evaluation criteria: How you judge whether the response is acceptable
| Approach | Best for | How it works |
|---|---|---|
| Exact match | Classification, extraction, yes/no tasks | Compare model output directly against the expected answer |
| ROUGE / BERTScore | Summarization, translation | Measure token overlap or semantic similarity between model output and reference |
| LLM-as-judge | Open-ended generation, complex tasks | Use a capable model like qwen3-max to score outputs against your rubric |
| Task-specific metrics | Domain-specific tasks | Custom metrics (such as valid JSON, correct SQL syntax, entity coverage) |
For tasks where correctness is nuanced, LLM-as-judge with a clear scoring rubric often provides the best signal-to-noise ratio.
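For the exact-match approach, the evaluation loop is small enough to sketch in full. This is a minimal harness under one assumption: `call_model` is a stub standing in for your real model call, and the two eval items are illustrative.

```python
# A minimal exact-match evaluation harness. Each eval item carries the
# input and expected output; outputs are compared after whitespace and
# case normalization. call_model is a stub, not a real model call.

def normalize(text):
    return " ".join(text.lower().split())

def evaluate(eval_set, call_model):
    """Return overall accuracy and the failed items for diagnosis."""
    failures = []
    for item in eval_set:
        output = call_model(item["input"])
        if normalize(output) != normalize(item["expected"]):
            failures.append({"item": item, "output": output})
    accuracy = 1 - len(failures) / len(eval_set)
    return accuracy, failures

eval_set = [
    {"input": "Is the sky blue? Answer yes or no.", "expected": "yes"},
    {"input": "Is fire cold? Answer yes or no.", "expected": "no"},
]
# Stub model that always answers "yes": one of the two items fails.
accuracy, failures = evaluate(eval_set, call_model=lambda prompt: "yes")
# accuracy -> 0.5
```

Keeping the failures list (not just the score) is what makes the diagnosis step below possible: you inspect what failed, not only how often.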
Evaluating non-text modalities
- Image and video generation: Score outputs on a 1-5 scale across dimensions such as prompt faithfulness, visual quality, text clarity (if applicable), and absence of artifacts. Build a rubric with reference examples for each score level.
- Speech (TTS): Use Mean Opinion Score (MOS) testing across naturalness, intelligibility, and voice similarity (for cloned voices). Have multiple raters evaluate each sample.
- Speech (ASR): Measure Word Error Rate (WER) against ground-truth transcriptions on a representative audio test set.
- Vision and omni: Use an LLM-as-judge approach, sending the model's output along with the source image and expected answer to a capable text model for automated scoring.
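Word Error Rate is simple enough to compute yourself: it is the word-level edit distance (substitutions + insertions + deletions) between the hypothesis and the reference, divided by the reference length. A minimal sketch with an illustrative transcript pair:

```python
# Word Error Rate (WER) via Levenshtein distance over words.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word ("the") and one substitution ("lights" -> "light")
# against a 6-word reference: WER = 2/6.
wer = word_error_rate("turn on the living room lights",
                      "turn on living room light")
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, so report it as a rate, not a percentage capped at 100.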
Diagnose the problem
Once you have evaluation results, categorize failures into two types:
- Missing knowledge (context problem) — The model does not have the information it needs. For text, it may hallucinate facts or give generic answers. For image/video generation, this manifests as prompts that lack sufficient detail, producing outputs that miss the intended subject, style, or composition. Fix this with RAG (for text/vision) or richer prompts and reference media (for generation tasks).
- Inconsistent behavior (learning problem) — The model has the information but does not reliably use it. For text, it may format outputs inconsistently or ignore instructions. For image/video generation, this manifests as varying styles, compositions, or quality across runs with the same prompt. Fix this with few-shot examples, fine-tuning (text), or seed pinning and parameter stabilization (generation).
Retrieval-augmented generation (RAG)
RAG injects relevant external knowledge into each request at inference time. Instead of relying solely on what the model learned during training, you retrieve the most relevant documents from your knowledge base and include them in the prompt. RAG primarily applies to text and vision tasks. For image and video generation, the equivalent approach is providing reference images, detailed style descriptions, or audio samples to guide the output.
Use RAG when:
- The model needs information that changes frequently (product catalogs, knowledge bases, policies)
- The model needs proprietary or domain-specific data not in its training set
- You need to cite sources or ground responses in specific documents
- For vision tasks, the model needs to analyze images against a reference knowledge base (such as identifying products from a catalog, verifying document formats, or comparing visual assets against brand guidelines)
Evaluate your RAG pipeline
RAG accuracy depends on two things: retrieval quality and generation quality. Evaluate them separately:
| Metric | What it measures | How to check |
|---|---|---|
| Retrieval precision | Are the retrieved documents relevant? | Sample 50 queries, check if top-3 results contain the needed information |
| Retrieval recall | Are all relevant documents found? | For known-answer questions, verify the source document appears in results |
| Generation faithfulness | Does the model stick to retrieved content? | Check if answers are grounded in the provided context, not hallucinated |
| End-to-end accuracy | Does the full pipeline give correct answers? | Run your evaluation set through the complete RAG pipeline |
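The two retrieval checks in the table reduce to precision@k and recall@k over a labeled sample of queries. A minimal sketch, where `retrieved` stands in for your retriever's output and the document ids are illustrative:

```python
# Retrieval precision@k and recall@k over one labeled query:
# retrieved_ids is the retriever's ranked output, relevant_ids is the
# hand-labeled set of documents that actually contain the answer.

def precision_at_k(retrieved_ids, relevant_ids, k=3):
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(retrieved_ids, relevant_ids, k=3):
    """Fraction of the relevant documents that appear in the top-k."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in relevant_ids if doc_id in top_k) / len(relevant_ids)

retrieved = ["doc7", "doc2", "doc9", "doc4"]
relevant = {"doc2", "doc4"}
p = precision_at_k(retrieved, relevant, k=3)  # 1 relevant doc in top 3 -> 1/3
r = recall_at_k(retrieved, relevant, k=3)     # 1 of 2 relevant docs found -> 0.5
```

Average both metrics over the ~50 sampled queries; low precision points at ranking or chunking problems, low recall at indexing or query-formulation problems.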
Retrieval optimization tips
- Chunk size matters — Chunks that are too small lose context; chunks that are too large dilute relevance. Start with 200-500 tokens and experiment.
- Use reranking — A two-stage pipeline (fast retrieval then precise reranking) significantly improves precision without sacrificing recall.
- Embed queries and documents consistently — Use the same embedding model and preprocessing for both indexing and querying.
- Add metadata filters — Filter by date, category, or source before semantic search to narrow the candidate set.
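The chunk-size starting point above can be tried with a sliding-window chunker. A minimal sketch under one simplification: tokens are approximated by whitespace-split words here, whereas in production you would count with your embedding model's tokenizer.

```python
# A minimal sliding-window chunker: fixed-size chunks with overlap so
# sentences near a boundary appear in two chunks. Word-based token
# approximation; swap in a real tokenizer for production.

def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into overlapping word-window chunks."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# 1000 words, 300-word chunks, 50-word overlap -> 4 chunks.
chunks = chunk_text("word " * 1000, chunk_size=300, overlap=50)
```

The overlap is what keeps a fact that straddles a boundary retrievable; without it, boundary sentences can be split across two chunks and match neither.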
- Text embedding — Generate embeddings for semantic search
- Reranking — Improve retrieval precision with a reranking model
Combining techniques
The most effective production systems combine prompt engineering and RAG:
- Prompt engineering + RAG — Clear instructions tell the model how to use the retrieved context. Few-shot examples show the expected format.
- Structured output + RAG — Use structured output to enforce consistent response formats when processing retrieved documents.
- More context can add noise. Retrieving too many documents or stuffing too much information into the prompt can confuse the model. Always validate that adding context actually improves your evaluation scores.
- Exhaust simpler methods first. Prompt engineering is fast and free. RAG requires moderate effort. Move to the next technique only when the current one plateaus.
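The "prompt engineering + RAG" combination comes down to how the final prompt is assembled: instructions first, a worked example, then the retrieved context, then the question with grounding constraints. A minimal sketch; the policy snippets, question, and source-citation convention are illustrative placeholders, not from this guide:

```python
# A minimal sketch of assembling a combined prompt: clear instructions,
# one few-shot example, retrieved context with source tags, and
# explicit grounding constraints. All content strings are illustrative.

def build_rag_prompt(instructions, few_shot, retrieved_docs, question):
    context = "\n\n".join(
        f"[Source {i + 1}] {doc}" for i, doc in enumerate(retrieved_docs)
    )
    return (
        f"{instructions}\n\n"
        f"Example:\n{few_shot}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above. Cite sources as [Source N]. "
        "If the context is insufficient, say so."
    )

prompt = build_rag_prompt(
    instructions="You are a support assistant. Answer in at most 3 sentences.",
    few_shot="Q: What is the return window?\nA: 30 days from delivery [Source 1].",
    retrieved_docs=["Returns are accepted within 30 days of delivery."],
    question="Can I return an opened item?",
)
```

The closing constraint line is the prompt-engineering half of the combination: it tells the model how to use the retrieved half, and gives it a sanctioned way out when retrieval misses.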
How much accuracy is good enough
Business perspective
Perfect accuracy is rarely achievable or necessary. The right target depends on the cost of errors versus the cost of improvement.
Consider a customer service scenario:
| Accuracy level | Behavior | Business impact |
|---|---|---|
| 80% | Handles common questions correctly, fails on edge cases | Reduces support volume but requires human review for ~20% of queries |
| 90% | Handles most questions including some edge cases | Significant cost savings, occasional escalations |
| 95% | Handles nearly all questions correctly | Near-full automation, rare human intervention |
| 99% | Near-perfect responses | Diminishing returns — the last 4% may cost more than the first 95% |
The jump from 90% to 95% often requires more effort than the jump from 0% to 90%. Decide where the marginal cost of improvement exceeds the marginal value.
Technical perspective
Design your system to fail gracefully when the model gets it wrong:
- Confidence thresholds — If the model is uncertain, escalate to a human or ask for clarification rather than guessing.
- Output validation — Check model outputs against schemas, business rules, or sanity checks before acting on them.
- Fallback paths — Provide a graceful degradation path (such as "I'm not sure, let me connect you with a specialist") rather than a wrong answer.
- Monitoring — Track accuracy metrics in production and set alerts for degradation so you can catch regressions early.
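Output validation and a fallback path fit in a few lines. A minimal sketch combining three of the safeguards above: parse the model's reply, check it against a simple schema and confidence threshold, and return an escalation message instead of acting on a bad answer. The field names, threshold, and fallback text are illustrative assumptions.

```python
# A minimal sketch of graceful failure: validate the model's JSON reply
# (schema + confidence threshold) and fall back to escalation text
# rather than acting on malformed or low-confidence output.
import json

FALLBACK = "I'm not sure, let me connect you with a specialist."

def safe_answer(raw_model_output, min_confidence=0.7):
    try:
        data = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return FALLBACK  # malformed output: never act on it
    answer = data.get("answer")
    confidence = data.get("confidence", 0.0)
    if not isinstance(answer, str) or confidence < min_confidence:
        return FALLBACK  # schema or confidence check failed
    return answer

reply = safe_answer('{"answer": "Your order ships Monday.", "confidence": 0.92}')
# -> "Your order ships Monday."
```

The same wrapper is a natural place for the monitoring hook: count how often the fallback fires, and alert when the rate climbs.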
Next steps
- Text generation prompt engineering — Detailed prompt engineering techniques
- Image generation prompt engineering — Craft effective prompts for image generation
- Video generation prompt engineering — Craft effective prompts for video generation
- Speech synthesis — TTS model selection and voice configuration
- Vision understanding — Image and video understanding with vision models
- Text embedding — Build RAG pipelines with embedding models
- Reranking — Improve retrieval quality
- Structured output — Get consistent, parseable model outputs
- Model selection — Pick the right model for your task
- Cost optimization — Reduce spending while maintaining quality