Tokens are the fundamental billing and context-management unit for text and vision models on Qwen Cloud. Understanding how tokens are counted helps you estimate costs, stay within context limits, and optimize your prompts. Audio and image generation models use different billing units (seconds, characters, or images) — this guide covers those too.
Text tokens
Text models tokenize input and output into subword units. A rough rule of thumb: 1 token ≈ 4 characters in English or 1 token ≈ 1.5 Chinese characters. The exact count depends on the tokenizer and vocabulary.
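The rule of thumb above can be turned into a quick heuristic estimator (an approximation only; for exact counts use the tokenizer or the `usage` field in the API response):

```python
import math

def rough_token_estimate(text: str) -> int:
    """Heuristic: ~1.5 chars/token for Chinese (CJK), ~4 chars/token otherwise."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return math.ceil(cjk / 1.5 + other / 4)

print(rough_token_estimate("Hello, world!"))  # 4
```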
Read token usage from API responses
Every text generation response includes a usage object with the exact token count:
```json
{
  "usage": {
    "prompt_tokens": 34,
    "completion_tokens": 89,
    "total_tokens": 123
  }
}
```
When using context cache, additional detail is available:
```json
{
  "usage": {
    "prompt_tokens": 1520,
    "completion_tokens": 85,
    "total_tokens": 1605,
    "prompt_tokens_details": {
      "cached_tokens": 1480,
      "cache_creation_input_tokens": 0
    }
  }
}
```
When using reasoning mode, the response includes reasoning tokens:
```json
{
  "usage": {
    "prompt_tokens": 50,
    "completion_tokens": 300,
    "total_tokens": 350,
    "completion_tokens_details": {
      "reasoning_tokens": 245
    }
  }
}
```
Reasoning tokens count toward completion_tokens and are billed at the output token rate. They can significantly increase the total token usage for complex reasoning tasks.
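For example, the length of the visible output can be derived by subtracting reasoning tokens from `completion_tokens`, as sketched below over the usage payload shown above (defaulting to 0 when reasoning mode is off):

```python
usage = {
    "prompt_tokens": 50,
    "completion_tokens": 300,
    "total_tokens": 350,
    "completion_tokens_details": {"reasoning_tokens": 245},
}

# completion_tokens_details is absent when reasoning mode is off, so default to 0
reasoning = usage.get("completion_tokens_details", {}).get("reasoning_tokens", 0)
visible = usage["completion_tokens"] - reasoning
print(visible)  # 55
```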
Estimate tokens before sending a request
To estimate token counts before making an API call, use the tokenizer directly. Qwen models use a tiktoken-compatible tokenizer:
```python
# pip install tiktoken
import tiktoken

# Use the Qwen tokenizer
encoding = tiktoken.get_encoding("o200k_base")
tokens = encoding.encode("Your prompt text here")
print(f"Token count: {len(tokens)}")
```
Token estimation is useful for staying within context windows and budgeting costs. For exact counts, always rely on the usage field in the API response.
Vision tokens
Vision models (Qwen-VL series) convert images and video frames into tokens alongside text. The token count depends on image resolution.
```
image_tokens = ceil(height / 28) × ceil(width / 28) / 4 + 2
```
Where:
- The image is resized to fit within `max_pixels` (default: 1003520 pixels) while maintaining aspect ratio, with dimensions rounded to the nearest multiple of 28
- `/ 4` accounts for the 2×2 pixel merge in the vision encoder
- `+ 2` adds the `<vision_bos>` and `<vision_eos>` special tokens
Example: A 1024×1024 image slightly exceeds `max_pixels`, so it is first resized to 1008×1008, giving (1008/28) × (1008/28) / 4 + 2 = 326 tokens.
Estimate image tokens in Python
```python
import math

def estimate_image_tokens(width, height, max_pixels=1003520, min_pixels=3136):
    """Estimate token count for an image in Qwen vision models."""
    # Resize to fit within the pixel budget, preserving aspect ratio
    total_pixels = width * height
    if total_pixels > max_pixels:
        scale = math.sqrt(max_pixels / total_pixels)
        width = int(width * scale)
        height = int(height * scale)
    elif total_pixels < min_pixels:
        # Upscale tiny images to the minimum pixel budget
        scale = math.sqrt(min_pixels / total_pixels)
        width = int(width * scale)
        height = int(height * scale)
    # Round to nearest multiple of 28
    width = max(28, round(width / 28) * 28)
    height = max(28, round(height / 28) * 28)
    # Calculate tokens
    return (height // 28) * (width // 28) // 4 + 2

# Examples — large images all converge near the max_pixels cap
print(estimate_image_tokens(1024, 1024))  # 326 tokens
print(estimate_image_tokens(1920, 1080))  # 326 tokens (resized down)
print(estimate_image_tokens(4096, 4096))  # 326 tokens (resized down)
```
High-resolution mode
Enable vl_high_resolution_images to process images at higher fidelity (28×28 pixels per token block instead of the default effective rate). This increases token count — up to 16,384 tokens per image — but improves detail recognition for tasks like OCR or small-text reading.
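A minimal request-payload sketch with the flag set is shown below. Only `vl_high_resolution_images` comes from this guide; the model name and message shape are illustrative assumptions:

```python
payload = {
    "model": "qwen-vl-max",  # hypothetical model name, for illustration only
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/scan.png"}},
                {"type": "text", "text": "Transcribe all text in this image."},
            ],
        }
    ],
    # Documented flag: higher fidelity at the cost of more image tokens
    "vl_high_resolution_images": True,
}
```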
Video tokens
Video inputs are sampled as individual frames, each tokenized using the same image formula. The total video token count equals the sum of tokens across all sampled frames. Frame sampling rate depends on the model and video duration.
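Summing the per-frame formula gives a rough video estimate. The sketch below assumes a fixed sampling rate of 2 frames per second, which is an assumption for illustration; the actual rate varies by model and duration:

```python
import math

def estimate_image_tokens(width, height, max_pixels=1003520):
    """Per-frame token estimate (condensed version of the image formula above)."""
    total = width * height
    if total > max_pixels:
        scale = math.sqrt(max_pixels / total)
        width, height = int(width * scale), int(height * scale)
    width = max(28, round(width / 28) * 28)
    height = max(28, round(height / 28) * 28)
    return (height // 28) * (width // 28) // 4 + 2

def estimate_video_tokens(width, height, duration_s, fps=2.0):
    """Sum per-frame tokens over sampled frames (fps=2.0 is an assumed rate)."""
    frames = max(1, int(duration_s * fps))
    return frames * estimate_image_tokens(width, height)

print(estimate_video_tokens(1920, 1080, duration_s=10))  # 20 frames × 326 = 6520
```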
Audio billing units
Audio APIs do not use tokens. Instead, they bill by duration or character count:
| API | Billing unit | Details |
|---|---|---|
| Speech-to-text (ASR) | Seconds of audio | Billed per second of input audio |
| Text-to-speech (TTS) | Characters | Billed per character of input text |
| Speech-to-speech | Seconds of audio | Varies by model |
ASR and TTS responses do not include a usage.prompt_tokens field. Check the Pricing page for current per-unit rates.
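Estimating audio costs is plain arithmetic on duration or length. The per-unit rates below are placeholders, not real prices; always check the Pricing page:

```python
# Placeholder rates for illustration only -- not actual prices
ASR_PRICE_PER_SECOND = 0.0001
TTS_PRICE_PER_CHAR = 0.00002

def asr_cost(audio_seconds: float) -> float:
    """Speech-to-text bills per second of input audio."""
    return audio_seconds * ASR_PRICE_PER_SECOND

def tts_cost(text: str) -> float:
    """Text-to-speech bills per character of input text."""
    return len(text) * TTS_PRICE_PER_CHAR

print(f"1 min ASR: ${asr_cost(60):.4f}")
print(f"TTS of 500 chars: ${tts_cost('x' * 500):.4f}")
```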
Image and video generation billing
Image and video generation APIs do not use tokens either:
| API | Billing unit | Details |
|---|---|---|
| Image generation | Per image | Each generated image is billed individually, regardless of resolution |
| Video generation | Per second of video | Billed by output video duration and resolution |
Image generation API responses include usage fields with image_count, not token counts. The input_tokens and output_tokens fields may appear as 0.
Cost estimation
To estimate the cost of a text/vision API call:
```
cost = (input_tokens × input_price_per_token) + (output_tokens × output_price_per_token)
```
Where cached tokens are billed at a reduced rate (10% for explicit cache, 20% for implicit cache). See Pricing for per-model rates and Cost optimization for strategies to reduce spending.
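Combining the formula with the cache discount gives a small estimator. The per-token prices below are placeholders (see Pricing for real rates); `cache_rate` is 0.10 for explicit cache and 0.20 for implicit cache, as stated above:

```python
def estimate_cost(prompt_tokens, completion_tokens, cached_tokens=0,
                  input_price=2e-6, output_price=6e-6, cache_rate=0.10):
    """Placeholder per-token prices; cached tokens pay cache_rate × input rate."""
    uncached = prompt_tokens - cached_tokens
    return (uncached * input_price
            + cached_tokens * input_price * cache_rate
            + completion_tokens * output_price)

# Using the context-cache usage example above: 1520 prompt (1480 cached), 85 output
print(f"${estimate_cost(1520, 85, cached_tokens=1480):.6f}")
```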
Context window limits
Each model has a maximum context window that limits the total input tokens:
| Model | Max input tokens | Max output tokens |
|---|---|---|
| qwen3-max | 128K (256K with long context) | 16K |
| qwen3.5-plus | 128K | 16K |
| qwen3.5-flash | 128K | 16K |
When your input approaches the context limit, consider:
- Trimming conversation history in multi-turn conversations
- Using context cache to reduce re-computation (does not reduce token count, but reduces cost and latency)
- Summarizing older context into a condensed system message
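The first strategy can be sketched as a simple budget loop. `count_tokens` can be any counter (e.g. the tiktoken estimate above); the default below is the chars/4 heuristic, and dropping whole user/assistant pairs together is left as a refinement:

```python
def trim_history(messages, max_tokens, count_tokens=lambda s: len(s) // 4):
    """Drop the oldest non-system turns until the estimated total fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    def total():
        return sum(count_tokens(m["content"]) for m in system + turns)

    while len(turns) > 1 and total() > max_tokens:
        turns.pop(0)  # oldest turn first; keep at least the latest turn
    return system + turns

history = [{"role": "system", "content": "You are concise."},
           {"role": "user", "content": "x" * 400},
           {"role": "assistant", "content": "y" * 400},
           {"role": "user", "content": "z" * 400}]
trimmed = trim_history(history, max_tokens=250)
print(len(trimmed))  # 3 — the oldest user turn was dropped
```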
Next steps