Spend less on text, image, video, and speech API calls while maintaining output quality
Qwen Cloud models use different billing models depending on the modality: text models charge per token, image generation charges per image, video generation charges per second of output, TTS charges per character, and ASR charges per audio duration. This guide covers practical strategies for reducing costs across all modalities while maintaining output quality. Each strategy can be used independently or combined for greater savings.
Optimize your API usage
The most direct way to cut costs is to use fewer resources per task:
- Consolidate API calls: If you're making multiple small requests, consider merging them into a single call. For example, classify ten items in one prompt instead of sending ten separate requests.
- Keep prompts lean: Remove redundant instructions from system prompts, limit conversation history to recent turns, and set `max_tokens` to prevent unnecessarily long outputs.
- Match model to task complexity: Reserve high-capability models (qwen3-max) for tasks that genuinely need advanced reasoning. For straightforward jobs like classification, extraction, or summarization, lighter models such as qwen3.5-flash deliver good results at a fraction of the price. See Model overview for model comparisons.
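The consolidation tip above can be sketched as a prompt builder: the instructions and label set are sent (and billed) once instead of ten times. The function name, labels, and sample inputs are illustrative, not part of any real API.

```python
def build_batch_prompt(items, labels):
    """Pack several classification inputs into one prompt so the shared
    instructions are sent in a single request rather than one per item."""
    header = (
        "Classify each numbered item as one of: "
        + ", ".join(labels)
        + ".\nAnswer with one label per line.\n\n"
    )
    body = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(items))
    return header + body

reviews = ["Great battery life", "Screen cracked on day one", "Does the job"]
# One request carries all three items instead of three separate calls.
prompt = build_batch_prompt(reviews, ["positive", "negative", "neutral"])
```

Beyond saving the repeated system-prompt tokens, fewer requests also means less per-call overhead and simpler rate-limit handling.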
Optimize non-text API usage
Image generation
- Image generation is billed per image regardless of resolution, so resolution does not affect cost. However, lower resolutions reduce generation time.
- Minimize `n` (the number of images per request), since each generated image is billed separately.
- Disable prompt rewriting (`prompt_extend`) in production when you already have well-crafted prompts to avoid unnecessary processing costs.
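The two tips above can be sketched as a request payload. The parameter names `n` and `prompt_extend` come from this guide; the model name and remaining fields are hypothetical, not a real schema.

```python
def image_request(prompt):
    """Build an image-generation payload following the cost tips above:
    request a single image and skip prompt rewriting in production."""
    return {
        "model": "example-image-model",  # hypothetical model name
        "prompt": prompt,
        "n": 1,                  # each generated image is billed separately
        "prompt_extend": False,  # skip rewriting for well-crafted prompts
    }

payload = image_request("a watercolor fox in a pine forest")
```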
Video generation
- During iteration, use short preview durations and lower resolutions (480P/720P). Most models require durations of 5 seconds or more; wan2.6 models support durations as short as 2 seconds. Render at full resolution and duration (e.g., 1080P, 15 seconds) only for the final output.
- 480P or 720P is often sufficient for previews and social media content; skip 1080P unless necessary.
- Reuse reference images and videos across multiple generations to avoid redundant uploads.
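The preview-then-final workflow above can be sketched as two payload variants. Since video is billed per second of output, the preview settings keep both duration and render time low. All field names here are assumptions for illustration, not a real schema.

```python
def video_request(prompt, final=False):
    """Illustrative payload for the iterate-cheap, render-once workflow:
    720P / 5 s while tuning the prompt, 1080P / 15 s only for the final cut."""
    return {
        "model": "example-video-model",  # hypothetical model name
        "prompt": prompt,
        "resolution": "1080P" if final else "720P",
        "duration": 15 if final else 5,  # seconds; billing is per second
    }

preview = video_request("drone shot over a rocky coastline")
final = video_request("drone shot over a rocky coastline", final=True)
```

With per-second billing, ten 5-second previews plus two 15-second finals cost far less than twelve full-length renders.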
Speech synthesis and recognition
- Choose the right TTS model for your scenario: `flash` variants are the most cost-effective for simple narration; reserve `instruct` models for scenarios that require emotional expression or fine-grained control. See Speech synthesis.
- Batch short text segments into a single TTS call rather than making many small requests.
- For ASR, compress audio files before submission to reduce upload time. Note that ASR is billed by audio duration, so compression does not affect cost.
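The TTS batching tip above can be sketched as simple text merging. TTS is billed per character either way, so the character count is unchanged; the gain is one request (and one job to track) instead of many small ones. The function and separator choice are illustrative.

```python
def batch_tts_text(segments, separator="\n"):
    """Merge short text segments into one TTS request body, dropping
    empty entries, so a single call replaces many small requests."""
    return separator.join(s.strip() for s in segments if s.strip())

captions = ["Welcome back.", "Today we cover caching.", "Let's begin."]
text = batch_tts_text(captions)  # one TTS call instead of three
```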
Batch calling
Batch calling applies to text generation models only. If your workload doesn't require real-time responses, batch calling cuts token costs by 50%. You package requests into a JSONL file, submit them as a single job, and download the results once processing finishes — usually within 24 hours.
Common use cases:
- Offline evaluation and benchmarking
- Bulk classification or data labeling
- Large-scale embedding generation
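Packaging requests into a JSONL file can be sketched as below. The per-line schema shown (`custom_id`, `method`, `url`, `body`) is an assumption modeled on the common OpenAI-style batch format; check the Batch calling guide for the exact fields Qwen Cloud expects.

```python
import json

def write_batch_file(prompts, path, model="example-model"):
    """Write one JSONL line per request. Schema is an assumption modeled
    on OpenAI-style batch files; see the Batch calling guide for the
    exact format."""
    with open(path, "w", encoding="utf-8") as f:
        for i, prompt in enumerate(prompts):
            line = {
                "custom_id": f"req-{i}",  # ties each result back to its input
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(line, ensure_ascii=False) + "\n")

write_batch_file(["Label: great phone", "Label: slow shipping"], "batch.jsonl")
```

The `custom_id` matters: batch results may not come back in submission order, so it is the only reliable way to join outputs to inputs.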
Context cache
Context cache applies to text and vision models only. Reduce input token costs when your requests share common prefixes — such as long system prompts, reference documents, or conversation history. Context cache avoids redundant computation on repeated content, improving both speed and cost.
Qwen Cloud offers three cache modes:
| Mode | How it works | Cache hit cost |
|---|---|---|
| Explicit cache | Manually mark content to cache. Guaranteed hit for 5 minutes. | 10% of standard input price |
| Implicit cache | Automatic — no configuration needed. System identifies common prefixes. | 20% of standard input price |
| Session cache | Designed for multi-turn conversations with the Responses API. | 10% of standard input price |
Batch calling and context cache discounts cannot be combined on the same request.
Cost comparison
Here's how these strategies stack up for a hypothetical workload of 10 million input tokens + 5 million output tokens using qwen3.5-plus:
| Strategy | Input cost | Output cost | Total |
|---|---|---|---|
| Standard | 10M × $0.4/M = $4 | 5M × $2.4/M = $12 | $16 |
| Batch calling (50% off) | 10M × $0.2/M = $2 | 5M × $1.2/M = $6 | $8 |
| Context cache (80% hit rate, implicit) | 8M × $0.08/M + 2M × $0.4/M = $1.44 | 5M × $2.4/M = $12 | $13.44 |
Actual savings depend on your cache hit rate, prompt structure, and model choice. See Pricing for full rate tables.
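The comparison above reduces to simple arithmetic. The per-million-token rates used here ($0.4 input, $2.4 output) are back-solved from the table's totals for illustration only; confirm current rates on the Pricing page.

```python
def input_cost(tokens_m, rate_per_m, hit_rate=0.0, cache_multiplier=1.0):
    """Blended input cost in dollars. `tokens_m` is millions of input
    tokens, `hit_rate` the fraction served from cache, and
    `cache_multiplier` the cache-hit price as a fraction of the standard
    input price (0.2 for implicit, 0.1 for explicit)."""
    cached = tokens_m * hit_rate * cache_multiplier * rate_per_m
    uncached = tokens_m * (1 - hit_rate) * rate_per_m
    return cached + uncached

IN_RATE, OUT_RATE = 0.4, 2.4  # $/million tokens, assumed for illustration

standard = input_cost(10, IN_RATE) + 5 * OUT_RATE             # ~16.0
batched = 0.5 * standard                                      # ~8.0
implicit = input_cost(10, IN_RATE, 0.8, 0.2) + 5 * OUT_RATE   # ~13.44
```

Note that caching only discounts the input side, which is why batch calling wins here: this workload's cost is dominated by output tokens, which no cache touches.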
Non-text cost optimization examples
The table below illustrates how parameter choices affect cost for non-text modalities:
| Modality | Baseline scenario | Optimized scenario | Savings approach |
|---|---|---|---|
| Image generation | 100 requests with n=4 (400 billed images) | 100 requests with n=1 (100 billed images) | Generate only the images you need; disable prompt rewriting |
| Video generation | 10 videos at 1080P, 15s each | 10 videos at 720P, 5s during iteration + 2 final at 1080P 15s | Low-res previews; full render only for finals |
| TTS | 10,000 characters via instruct model | 10,000 characters via flash model | Use cheapest model that meets quality needs |
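Because image generation bills per image, the first row of the table is plain multiplication; a trivial sketch:

```python
def billed_images(requests, n):
    """Per-image billing: total billed images = requests * n."""
    return requests * n

baseline = billed_images(100, 4)    # 400 billed images
optimized = billed_images(100, 1)   # 100 billed images
savings = 1 - optimized / baseline  # 75% fewer billed images
```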
Next steps
- Pricing — Understand billing modes and model rates
- Batch calling — Step-by-step batch API guide
- Context cache — Cache modes and usage details
- Image models — Image generation model options and pricing
- Video models — Video generation model options and pricing
- Speech synthesis — TTS model selection and cost considerations
- Latency optimization — Strategies that also reduce costs