Tokens are the fundamental billing and context-management unit for text and vision models on Qwen Cloud. Understanding how tokens are counted helps you estimate costs, stay within context limits, and optimize your prompts. Audio and image generation models use different billing units (seconds, characters, or images) — this guide covers those too.
Text tokens
Text models tokenize input and output into subword units. A rough rule of thumb: 1 token ≈ 4 characters in English or 1 token ≈ 1.5 Chinese characters. The exact count depends on the tokenizer and vocabulary.
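The rule of thumb above can be turned into a quick heuristic estimator (an approximation only; for exact counts use the tokenizer or the `usage` field in the API response):

```python
import math

def rough_token_estimate(text: str) -> int:
    """Heuristic: ~1.5 chars/token for Chinese (CJK), ~4 chars/token otherwise."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return math.ceil(cjk / 1.5 + other / 4)

print(rough_token_estimate("Hello, world!"))  # 4
```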
Read token usage from API responses
Every text generation response includes a usage object with the exact token count:
```json
{
  "usage": {
    "prompt_tokens": 34,
    "completion_tokens": 89,
    "total_tokens": 123
  }
}
```
When using context cache, additional detail is available:
```json
{
  "usage": {
    "prompt_tokens": 1520,
    "completion_tokens": 85,
    "total_tokens": 1605,
    "prompt_tokens_details": {
      "cached_tokens": 1480,
      "cache_creation_input_tokens": 0
    }
  }
}
```
When using reasoning mode, the response includes reasoning tokens:
```json
{
  "usage": {
    "prompt_tokens": 50,
    "completion_tokens": 300,
    "total_tokens": 350,
    "completion_tokens_details": {
      "reasoning_tokens": 245
    }
  }
}
```
Reasoning tokens count toward completion_tokens and are billed at the output token rate. They can significantly increase the total token usage for complex reasoning tasks.
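For example, the length of the visible output can be derived by subtracting reasoning tokens from `completion_tokens`, as sketched below over the usage payload shown above (defaulting to 0 when reasoning mode is off):

```python
usage = {
    "prompt_tokens": 50,
    "completion_tokens": 300,
    "total_tokens": 350,
    "completion_tokens_details": {"reasoning_tokens": 245},
}

# completion_tokens_details is absent when reasoning mode is off, so default to 0
reasoning = usage.get("completion_tokens_details", {}).get("reasoning_tokens", 0)
visible = usage["completion_tokens"] - reasoning
print(visible)  # 55
```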
Estimate tokens before sending a request
To estimate token counts before making an API call, use the tokenizer directly. Qwen models use a tiktoken-compatible tokenizer:
```python
# pip install tiktoken
import tiktoken

# Use the Qwen tokenizer
encoding = tiktoken.get_encoding("o200k_base")
tokens = encoding.encode("Your prompt text here")
print(f"Token count: {len(tokens)}")
```
Token estimation is useful for staying within context windows and budgeting costs. For exact counts, always rely on the usage field in the API response.
Vision tokens
Vision models (Qwen-VL series) convert images and video frames into tokens alongside text. The token count depends on image resolution.
```
image_tokens = ceil(height / 28) × ceil(width / 28) / 4 + 2
```
Where:
- The image is resized to fit within `max_pixels` (default: 1003520 pixels) while maintaining aspect ratio, with dimensions rounded to the nearest multiple of 28
- `/ 4` accounts for the 2×2 pixel merge in the vision encoder
- `+ 2` adds the `<vision_bos>` and `<vision_eos>` special tokens
Example: A 1024×1024 image slightly exceeds `max_pixels`, so it is first resized to 1008×1008, giving (1008/28) × (1008/28) / 4 + 2 = 326 tokens.
Estimate image tokens in Python
```python
import math

def estimate_image_tokens(width, height, max_pixels=1003520, min_pixels=3136):
    """Estimate token count for an image in Qwen vision models."""
    # Resize to fit within the pixel budget, preserving aspect ratio
    total_pixels = width * height
    if total_pixels > max_pixels:
        scale = math.sqrt(max_pixels / total_pixels)
        width = int(width * scale)
        height = int(height * scale)
    elif total_pixels < min_pixels:
        # Upscale tiny images to the minimum pixel budget
        scale = math.sqrt(min_pixels / total_pixels)
        width = int(width * scale)
        height = int(height * scale)
    # Round to nearest multiple of 28
    width = max(28, round(width / 28) * 28)
    height = max(28, round(height / 28) * 28)
    # Calculate tokens
    return (height // 28) * (width // 28) // 4 + 2

# Examples — large images all converge near the max_pixels cap
print(estimate_image_tokens(1024, 1024))  # 326 tokens
print(estimate_image_tokens(1920, 1080))  # 326 tokens (resized down)
print(estimate_image_tokens(4096, 4096))  # 326 tokens (resized down)
```
High-resolution mode
Enable vl_high_resolution_images to process images at higher fidelity (28×28 pixels per token block instead of the default effective rate). This increases token count — up to 16,384 tokens per image — but improves detail recognition for tasks like OCR or small-text reading.
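A minimal request-payload sketch with the flag set is shown below. Only `vl_high_resolution_images` comes from this guide; the model name and message shape are illustrative assumptions:

```python
payload = {
    "model": "qwen-vl-max",  # hypothetical model name, for illustration only
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/scan.png"}},
                {"type": "text", "text": "Transcribe all text in this image."},
            ],
        }
    ],
    # Documented flag: higher fidelity at the cost of more image tokens
    "vl_high_resolution_images": True,
}
```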
Video tokens
Video inputs are sampled as individual frames, each tokenized using the same image formula. The total video token count equals the sum of tokens across all sampled frames. Frame sampling rate depends on the model and video duration.
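Summing the per-frame formula gives a rough video estimate. The sketch below assumes a fixed sampling rate of 2 frames per second, which is an assumption for illustration; the actual rate varies by model and duration:

```python
import math

def estimate_image_tokens(width, height, max_pixels=1003520):
    """Per-frame token estimate (condensed version of the image formula above)."""
    total = width * height
    if total > max_pixels:
        scale = math.sqrt(max_pixels / total)
        width, height = int(width * scale), int(height * scale)
    width = max(28, round(width / 28) * 28)
    height = max(28, round(height / 28) * 28)
    return (height // 28) * (width // 28) // 4 + 2

def estimate_video_tokens(width, height, duration_s, fps=2.0):
    """Sum per-frame tokens over sampled frames (fps=2.0 is an assumed rate)."""
    frames = max(1, int(duration_s * fps))
    return frames * estimate_image_tokens(width, height)

print(estimate_video_tokens(1920, 1080, duration_s=10))  # 20 frames × 326 = 6520
```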
Audio billing units
Audio APIs do not use tokens. Instead, they bill by duration or character count:
| API | Billing unit | Details |
|---|---|---|
| Speech-to-text (ASR) | Seconds of audio | Billed per second of input audio |
| Text-to-speech (TTS) | Characters | Billed per character of input text |
| Speech-to-speech | Seconds of audio | Varies by model |
ASR and TTS responses do not include a usage.prompt_tokens field. Check the Pricing page for current per-unit rates.
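Estimating audio costs is plain arithmetic on duration or length. The per-unit rates below are placeholders, not real prices; always check the Pricing page:

```python
# Placeholder rates for illustration only -- not actual prices
ASR_PRICE_PER_SECOND = 0.0001
TTS_PRICE_PER_CHAR = 0.00002

def asr_cost(audio_seconds: float) -> float:
    """Speech-to-text bills per second of input audio."""
    return audio_seconds * ASR_PRICE_PER_SECOND

def tts_cost(text: str) -> float:
    """Text-to-speech bills per character of input text."""
    return len(text) * TTS_PRICE_PER_CHAR

print(f"1 min ASR: ${asr_cost(60):.4f}")
print(f"TTS of 500 chars: ${tts_cost('x' * 500):.4f}")
```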
Image and video generation billing
Image and video generation APIs do not use tokens either:
| API | Billing unit | Details |
|---|---|---|
| Image generation | Per image | Each generated image is billed individually, regardless of resolution |
| Video generation | Per second of video | Billed by output video duration and resolution |
Image generation API responses include usage fields with image_count, not token counts. The input_tokens and output_tokens fields may appear as 0.
Cost estimation
To estimate the cost of a text/vision API call:
```
cost = (input_tokens × input_price_per_token) + (output_tokens × output_price_per_token)
```
Where cached tokens are billed at a reduced rate (10% for explicit cache, 20% for implicit cache). See Pricing for per-model rates and Cost optimization for strategies to reduce spending.
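Combining the formula with the cache discount gives a small estimator. The per-token prices below are placeholders (see Pricing for real rates); `cache_rate` is 0.10 for explicit cache and 0.20 for implicit cache, as stated above:

```python
def estimate_cost(prompt_tokens, completion_tokens, cached_tokens=0,
                  input_price=2e-6, output_price=6e-6, cache_rate=0.10):
    """Placeholder per-token prices; cached tokens pay cache_rate × input rate."""
    uncached = prompt_tokens - cached_tokens
    return (uncached * input_price
            + cached_tokens * input_price * cache_rate
            + completion_tokens * output_price)

# Using the context-cache usage example above: 1520 prompt (1480 cached), 85 output
print(f"${estimate_cost(1520, 85, cached_tokens=1480):.6f}")
```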
Context window limits
Each model has a maximum context window that limits the total input tokens:
| Model | Max input tokens | Max output tokens |
|---|---|---|
| qwen3-max | 128K (256K with long context) | 16K |
| qwen3.5-plus | 128K | 16K |
| qwen3.5-flash | 128K | 16K |
When your input approaches the context limit, consider:
- Trimming conversation history in multi-turn conversations
- Using context cache to reduce re-computation (does not reduce token count, but reduces cost and latency)
- Summarizing older context into a condensed system message
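The first strategy can be sketched as a simple budget loop. `count_tokens` can be any counter (e.g. the tiktoken estimate above); the default below is the chars/4 heuristic, and dropping whole user/assistant pairs together is left as a refinement:

```python
def trim_history(messages, max_tokens, count_tokens=lambda s: len(s) // 4):
    """Drop the oldest non-system turns until the estimated total fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    def total():
        return sum(count_tokens(m["content"]) for m in system + turns)

    while len(turns) > 1 and total() > max_tokens:
        turns.pop(0)  # oldest turn first; keep at least the latest turn
    return system + turns

history = [{"role": "system", "content": "You are concise."},
           {"role": "user", "content": "x" * 400},
           {"role": "assistant", "content": "y" * 400},
           {"role": "user", "content": "z" * 400}]
trimmed = trim_history(history, max_tokens=250)
print(len(trimmed))  # 3 — the oldest user turn was dropped
```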
Next steps