Pay-as-you-go pricing for API usage
Billing varies by model type: text models charge per token, image generation per image, video generation per second, and speech models per character or per second of audio.
For complete text model pricing, see Model Marketplace.
For vision understanding, image and video inputs are automatically converted to tokens, and the conversion varies by model (see the conversion table under Understanding).
Video tokens = sampled frames × tokens per frame. See Token counting → for details.
For all image and video model pricing, see Model Marketplace.
Speech-to-speech (Qwen-Omni) calls are billed per million tokens, with different rates per modality. The Speech to speech section below gives the token conversion rates and qwen3-omni-flash pricing.
For all speech model pricing, see Model Marketplace.
For all embedding and reranking model pricing, see Model Marketplace.
Function calling and MCP have no tool fees — tool descriptions count as input tokens.
For worked examples and advanced strategies, see Cost optimization →.
Text generation
Billed per million tokens. Models with long-context support use tiered pricing — longer prompts cost more per token.
| Model | Context tier | Input (USD per 1M tokens) | Output (USD per 1M tokens) |
|---|---|---|---|
| qwen3.6-plus | ≤ 256K | $0.50 | $3.00 |
| | 256K – 1M | $2.00 | $6.00 |
| qwen3.5-plus | ≤ 256K | $0.40 | $2.40 |
| | 256K – 1M | $0.50 | $3.00 |
| qwen3.5-flash | ≤ 1M | $0.10 | $0.40 |
| qwen3-max | ≤ 32K | $1.20 | $6.00 |
| | 32K – 128K | $2.40 | $12.00 |
| | 128K – 252K | $3.00 | $15.00 |
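To make the tiered pricing concrete, here is a minimal Python sketch that estimates the cost of one qwen3.6-plus request from the rates in the table above. It rests on two assumptions that the table does not confirm: the tier is selected by the request's input token count and then applies to all input and output tokens, and "256K" and "1M" mean 256,000 and 1,000,000 tokens. The function and constant names are illustrative only.

```python
# Rates copied from the table above, in USD per 1M tokens.
# Each tier is (max_input_tokens, input_rate, output_rate).
# Assumption: "256K" = 256,000 tokens and "1M" = 1,000,000 tokens.
QWEN36_PLUS_TIERS = [
    (256_000, 0.50, 3.00),
    (1_000_000, 2.00, 6.00),
]


def text_cost(input_tokens, output_tokens, tiers=QWEN36_PLUS_TIERS):
    """Estimate the USD cost of a single request.

    Assumption: the tier is chosen by the input token count, and that
    tier's rates apply to every input and output token in the request.
    """
    for max_input, input_rate, output_rate in tiers:
        if input_tokens <= max_input:
            total = input_tokens * input_rate + output_tokens * output_rate
            return total / 1_000_000
    raise ValueError("input exceeds the largest context tier")


short_prompt = text_cost(100_000, 2_000)  # falls in the <= 256K tier
long_prompt = text_cost(300_000, 2_000)   # falls in the 256K - 1M tier
print(f"short: ${short_prompt:.4f}, long: ${long_prompt:.4f}")
```

Under these assumptions the 100,000-token prompt costs about $0.056 and the 300,000-token prompt about $0.612, which shows how crossing a tier boundary raises the price of the whole request.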
Images & videos
Understanding
Vision understanding is billed per token. Qwen text generation models (qwen3.6-plus, etc.) support vision input at the same token price listed above. Dedicated vision models have separate pricing:
| Model | Context tier | Input (USD per 1M tokens) | Output (USD per 1M tokens) |
|---|---|---|---|
| qwen3-vl-plus | ≤ 32K | $0.20 | $1.60 |
| | 32K – 128K | $0.30 | $2.40 |
| | 128K – 256K | $0.60 | $4.80 |
| qwen3-vl-flash | ≤ 32K | $0.05 | $0.40 |
| | 32K – 128K | $0.075 | $0.60 |
| | 128K – 256K | $0.12 | $0.96 |
| Model family | Image conversion | Example (1024×1024) |
|---|---|---|
| Qwen (qwen3.6-plus, etc.) | 1 token per 32×32 pixels | ≈ 1,024 tokens |
| Qwen-VL (qwen3-vl, etc.) | 1 token per 32×32 pixels | ≈ 1,024 tokens |
| Qwen3-Omni-Flash | 1 token per 32×32 pixels | ≈ 1,024 tokens |
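The conversion rule in the table can be sketched in a few lines of Python. This is an approximation under stated assumptions: partial 32×32 blocks are rounded up, and any resizing, pixel caps, or marker tokens the service may apply are ignored. The video helper follows the rule that video tokens equal sampled frames times tokens per frame. The function names are hypothetical.

```python
import math


def image_tokens(width, height, block=32):
    """Approximate token count: one token per 32x32 pixel block.

    Assumption: partial blocks are rounded up; server-side resizing
    and any extra marker tokens are not modeled.
    """
    return math.ceil(width / block) * math.ceil(height / block)


def video_tokens(sampled_frames, tokens_per_frame):
    """Video tokens = sampled frames x tokens per frame."""
    return sampled_frames * tokens_per_frame


frame = image_tokens(640, 480)   # 20 blocks wide x 15 blocks high
clip = video_tokens(16, frame)   # 16 sampled frames at that size
print(frame, clip)
```

For a 640×480 frame this gives 300 tokens, and 16 sampled frames of that size give 4,800 tokens. Multiply by the model's input rate to estimate the cost.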
Generation
Image generation is billed per image (resolution-independent). Video generation is billed per second of output video.
Image generation
| Model | Price per image |
|---|---|
| qwen-image-2.0-pro | $0.075 |
| qwen-image-2.0 | $0.035 |
| qwen-image-edit | $0.045 |
| wan2.6-t2i | $0.03 |
| wan2.6-image | $0.03 |
| z-image-turbo | $0.015 (prompt rewrite off) / $0.03 (on) |
Video generation
| Model | Price per second |
|---|---|
| wan2.6-t2v | $0.10 |
| wan2.6-i2v | $0.10 |
| wan2.6-i2v-flash | $0.05 |
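Because generation is billed per unit of output, the estimate is a single multiplication. The sketch below copies a few prices from the two tables above; the dictionary and function names are illustrative, and it assumes that fractional seconds of video are billed proportionally, which the tables do not state.

```python
# Prices in USD, copied from the tables above.
IMAGE_PRICE = {
    "qwen-image-2.0-pro": 0.075,
    "qwen-image-2.0": 0.035,
    "wan2.6-t2i": 0.03,
}
VIDEO_PRICE_PER_SECOND = {
    "wan2.6-t2v": 0.10,
    "wan2.6-i2v-flash": 0.05,
}


def image_cost(model, count):
    """Cost of generating `count` images, independent of resolution."""
    return IMAGE_PRICE[model] * count


def video_cost(model, seconds):
    """Cost of `seconds` of output video (assumed to be prorated)."""
    return VIDEO_PRICE_PER_SECOND[model] * seconds


print(f"{image_cost('qwen-image-2.0', 20):.2f}")  # 20 images
print(f"{video_cost('wan2.6-t2v', 5):.2f}")       # 5-second clip
```

With these prices, 20 images from qwen-image-2.0 cost about $0.70 and a 5-second wan2.6-t2v clip costs about $0.50.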
Audio & speech
Text to speech
Billed per 10,000 characters of input text.
| Model | Price per 10K chars |
|---|---|
| cosyvoice-v3-plus | $0.22 |
| cosyvoice-v3-flash | $0.116 |
| qwen3-tts-flash | $0.10 |
Speech to text
Billed per second of audio input.
| Model | Price per second |
|---|---|
| fun-asr | $0.000035 |
| fun-asr-realtime | $0.00009 |
| qwen3-asr-flash | $0.000035 |
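The two speech billing units can be compared with a short sketch. It assumes charges are proportional to usage, with no rounding up to a full 10,000-character block or to a whole second; the page does not say how partial units are billed. The function names are hypothetical.

```python
def tts_cost(characters, price_per_10k_chars):
    """Text-to-speech cost, billed per 10,000 input characters.

    Assumption: partial blocks are prorated, not rounded up.
    """
    return characters / 10_000 * price_per_10k_chars


def stt_cost(audio_seconds, price_per_second):
    """Speech-to-text cost, billed per second of input audio."""
    return audio_seconds * price_per_second


# qwen3-tts-flash at $0.10 per 10K characters, 25,000 characters.
print(f"{tts_cost(25_000, 0.10):.3f}")
# fun-asr at $0.000035 per second, one hour of audio.
print(f"{stt_cost(3_600, 0.000035):.3f}")
```

Under this assumption, 25,000 characters of qwen3-tts-flash output cost about $0.25, and transcribing one hour of audio with fun-asr costs about $0.126.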
Speech to speech
Qwen-Omni is a multimodal model that handles text, audio, and image/video in a single call. The first table below shows how each input type converts to tokens; the second lists qwen3-omni-flash prices per modality.
| Input type | Conversion rate |
|---|---|
| Text | Standard tokenizer |
| Audio | ≈ 12.5 tokens/sec (Qwen3-Omni-Flash) or 25 tokens/sec (Qwen-Omni-Turbo) |
| Image/Video | See Understanding section above |
| Modality | Price per 1M tokens |
|---|---|
| Text input (pure text) | $0.43 |
| Text input (multimodal context) | $1.66 |
| Audio input | $3.81 |
| Image/Video input | $0.78 |
| Text output | $3.06 |
| Text + Audio output | $15.11 |
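As a rough illustration of per-modality billing, the sketch below prices a request with audio input, a short text prompt, and a text-only reply, using the conversion and price tables above. It assumes the "multimodal context" text rate applies whenever the request contains non-text input, and that audio converts at exactly 12.5 tokens per second for Qwen3-Omni-Flash; both are readings of the tables, not confirmed behavior. All names are illustrative.

```python
# USD per 1M tokens, copied from the price table above.
PRICE = {
    "text_in": 0.43,
    "text_in_multimodal": 1.66,
    "audio_in": 3.81,
    "text_out": 3.06,
}
AUDIO_TOKENS_PER_SECOND = 12.5  # approximate rate for Qwen3-Omni-Flash


def omni_cost(text_tokens, audio_seconds, output_text_tokens):
    """Estimate USD cost for audio + text input with text-only output."""
    audio_tokens = audio_seconds * AUDIO_TOKENS_PER_SECOND
    # Assumption: text is billed at the multimodal rate when audio is present.
    text_rate = PRICE["text_in_multimodal"] if audio_seconds > 0 else PRICE["text_in"]
    total = (
        text_tokens * text_rate
        + audio_tokens * PRICE["audio_in"]
        + output_text_tokens * PRICE["text_out"]
    )
    return total / 1_000_000


print(f"{omni_cost(200, 60, 500):.7f}")
```

For a 60-second audio clip (about 750 tokens), a 200-token prompt, and a 500-token text reply, the estimate is roughly $0.0047.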
Embedding & reranking
Billed per million input tokens (output is not charged). Multimodal embedding models may charge different rates for image vs text input. Image/video token conversion for embedding models is handled internally — check the usage field in the API response for actual token counts.
| Model | Modality | Price per 1M tokens |
|---|---|---|
| text-embedding-v4 | Text | $0.07 |
| tongyi-embedding-vision-plus | All | $0.09 |
| tongyi-embedding-vision-flash | Image/Video | $0.03 |
| | Text | $0.09 |
| qwen3-rerank | Text | $0.10 |
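For a multimodal embedding model with split rates, the cost is a weighted sum of the token counts per modality. The sketch below uses the tongyi-embedding-vision-flash rates from the table and takes the token counts as plain integers; reading them from the API response's usage field is left to the caller, since its exact shape is not described here. The function name is hypothetical.

```python
# USD per 1M input tokens for tongyi-embedding-vision-flash (table above).
IMAGE_VIDEO_RATE = 0.03
TEXT_RATE = 0.09


def embedding_cost(image_video_tokens, text_tokens):
    """Estimate USD cost; only input tokens are billed for embeddings."""
    total = image_video_tokens * IMAGE_VIDEO_RATE + text_tokens * TEXT_RATE
    return total / 1_000_000


print(f"{embedding_cost(10_000, 2_000):.5f}")
```

With 10,000 image tokens and 2,000 text tokens, the estimate is about $0.00048.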
Built-in tools
Some built-in tools incur per-call fees in addition to model token costs.
| Tool | Fee | Notes |
|---|---|---|
| Web Search | $10 / 1K calls | |
| Web Extractor | FREE | Limited time |
| Code Interpreter | FREE | Limited time |
| Image Search | $8 / 1K calls | Text-to-image and image-to-image |
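Tool fees are added to the model's token cost. The sketch below adds them up from the table above, assuming fees are prorated per call instead of charged in blocks of 1,000 calls; the page quotes prices per 1K calls and does not say which applies. The names are illustrative.

```python
# USD per 1,000 calls, copied from the table above.
TOOL_FEE_PER_1K = {
    "web_search": 10.0,
    "image_search": 8.0,
    "web_extractor": 0.0,     # free for a limited time
    "code_interpreter": 0.0,  # free for a limited time
}


def tool_fees(calls_by_tool):
    """Sum per-call tool fees (assumed to be prorated per call)."""
    return sum(
        TOOL_FEE_PER_1K[tool] * calls / 1_000
        for tool, calls in calls_by_tool.items()
    )


usage = {"web_search": 2_500, "image_search": 500, "code_interpreter": 300}
print(f"{tool_fees(usage):.2f}")
```

For 2,500 web searches, 500 image searches, and 300 code interpreter calls, the tool fees come to about $29.00 on top of token costs.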
Free quota
New users get 1 million free tokens per model for 90 days. Applies to real-time API calls only. Learn more →
Save on costs
- Batch API — 50% off for async workloads. Learn more →
- Context caching — Reuse long prompts at reduced cost. Learn more →
- Model selection — Match model tier to task complexity. Compare models →
Batch and cache discounts cannot be combined on the same request.
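The rule that batch and cache discounts are mutually exclusive can be expressed as a small guard. In this sketch the 50% batch discount comes from the list above, while `cache_multiplier` is a hypothetical parameter: this page does not state the cached-token rate, and real caching discounts apply to cached input tokens only, not to the whole request cost.

```python
BATCH_DISCOUNT = 0.50  # 50% off for async batch workloads


def discounted_cost(base_cost, use_batch=False, cache_multiplier=None):
    """Apply at most one discount to a request's base cost.

    `cache_multiplier` is a hypothetical stand-in for the caching
    discount, whose actual rate is not given on this page.
    """
    if use_batch and cache_multiplier is not None:
        raise ValueError("batch and cache discounts cannot be combined")
    if use_batch:
        return base_cost * (1 - BATCH_DISCOUNT)
    if cache_multiplier is not None:
        return base_cost * cache_multiplier
    return base_cost


print(f"{discounted_cost(0.612, use_batch=True):.3f}")
```

A request that would cost $0.612 in real time costs about $0.306 through the Batch API under this model.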
Learn more
- Model Marketplace — Complete pricing for all models
- Free quota — Eligibility and activation
- Cost optimization — Advanced strategies
- Coding Plan — Fixed monthly pricing for AI coding tools
- Billing FAQ — Common questions
- Bill management — View usage and invoices