Token counting

Understand and estimate token usage across text, vision, and audio models

Tokens are the fundamental billing and context-management unit for text and vision models on Qwen Cloud. Understanding how tokens are counted helps you estimate costs, stay within context limits, and optimize your prompts. Audio and image generation models use different billing units (seconds, characters, or images) — this guide covers those too.

Text tokens

Text models tokenize input and output into subword units. A rough rule of thumb: 1 token ≈ 4 characters in English or 1 token ≈ 1.5 Chinese characters. The exact count depends on the tokenizer and vocabulary.

Read token usage from API responses

Every text generation response includes a usage object with the exact token count:
{
  "usage": {
    "prompt_tokens": 34,
    "completion_tokens": 89,
    "total_tokens": 123
  }
}
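Reading the counts out of a parsed response is straightforward; a minimal sketch, assuming the response has already been decoded into a dict with the shape shown above:

```python
def read_usage(response: dict) -> tuple[int, int, int]:
    """Pull token counts out of a parsed API response."""
    usage = response["usage"]
    return (usage["prompt_tokens"],
            usage["completion_tokens"],
            usage["total_tokens"])

# Example response body matching the shape above
response = {"usage": {"prompt_tokens": 34, "completion_tokens": 89, "total_tokens": 123}}
prompt, completion, total = read_usage(response)
print(prompt, completion, total)  # 34 89 123
```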
When using context cache, additional detail is available:
{
  "usage": {
    "prompt_tokens": 1520,
    "completion_tokens": 85,
    "total_tokens": 1605,
    "prompt_tokens_details": {
      "cached_tokens": 1480,
      "cache_creation_input_tokens": 0
    }
  }
}
When using reasoning mode, the response includes reasoning tokens:
{
  "usage": {
    "prompt_tokens": 50,
    "completion_tokens": 300,
    "total_tokens": 350,
    "completion_tokens_details": {
      "reasoning_tokens": 245
    }
  }
}
Reasoning tokens count toward completion_tokens and are billed at the output token rate. They can significantly increase the total token usage for complex reasoning tasks.
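Because reasoning tokens are a subset of completion_tokens, the length of the visible answer can be recovered by subtraction; a small sketch using the usage shape above:

```python
usage = {
    "prompt_tokens": 50,
    "completion_tokens": 300,
    "total_tokens": 350,
    "completion_tokens_details": {"reasoning_tokens": 245},
}

# completion_tokens_details may be absent when reasoning mode is off
details = usage.get("completion_tokens_details", {})
reasoning = details.get("reasoning_tokens", 0)
visible = usage["completion_tokens"] - reasoning
print(f"reasoning: {reasoning}, visible answer: {visible}")  # reasoning: 245, visible answer: 55
```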

Estimate tokens before sending a request

To estimate token counts before making an API call, you can run a tokenizer locally. The exact Qwen vocabulary varies by model, but a tiktoken encoding gives a usable approximation:
# pip install tiktoken
import tiktoken

# Approximate Qwen tokenization with a tiktoken encoding
encoding = tiktoken.get_encoding("o200k_base")
tokens = encoding.encode("Your prompt text here")
print(f"Token count: {len(tokens)}")
Token estimation is useful for staying within context windows and budgeting costs. For exact counts, always rely on the usage field in the API response.

Vision tokens

Vision models (Qwen-VL series) convert images and video frames into tokens alongside text. The token count depends on image resolution.

Image token formula

image_tokens = ceil(height / 28) × ceil(width / 28) / 4 + 2
Where:
  • The image is resized to fit within max_pixels (default: 1003520 pixels) while maintaining aspect ratio, with dimensions rounded to the nearest multiple of 28
  • / 4 accounts for the 2×2 patch merge in the vision encoder
  • + 2 adds the <vision_bos> and <vision_eos> special tokens
Example: A 1024×1024 image ≈ (1008/28) × (1008/28) / 4 + 2 = 326 tokens.

Estimate image tokens in Python

import math

def estimate_image_tokens(width, height, max_pixels=1003520, min_pixels=3136):
  """Estimate token count for an image in Qwen vision models."""
  # Resize to fit within pixel budget
  total_pixels = width * height
  if total_pixels > max_pixels:
    scale = math.sqrt(max_pixels / total_pixels)
    width = int(width * scale)
    height = int(height * scale)

  # Round to nearest multiple of 28
  width = max(28, round(width / 28) * 28)
  height = max(28, round(height / 28) * 28)

  # Calculate tokens
  return (height // 28) * (width // 28) // 4 + 2

# Examples
print(estimate_image_tokens(1024, 1024))   # ~326 tokens
print(estimate_image_tokens(1920, 1080))   # ~326 tokens (resized down to the pixel budget)
print(estimate_image_tokens(4096, 4096))   # ~326 tokens (resized down)

High-resolution mode

Enable vl_high_resolution_images to process images at higher fidelity: the pixel budget is raised, so more 28×28 blocks are retained per image. This increases token count — up to 16,384 tokens per image — but improves detail recognition for tasks like OCR or small-text reading.

Video tokens

Video inputs are sampled as individual frames, each tokenized using the same image formula. The total video token count equals the sum of tokens across all sampled frames. Frame sampling rate depends on the model and video duration.
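Under that model, a video estimate is just the per-frame image estimate summed over sampled frames. A sketch repeating the image formula for self-containment; the 2 fps sampling rate here is an illustrative assumption, since the actual rate depends on the model and video duration:

```python
import math

def estimate_image_tokens(width, height, max_pixels=1003520):
    # Same per-frame formula as for still images
    if width * height > max_pixels:
        scale = math.sqrt(max_pixels / (width * height))
        width, height = int(width * scale), int(height * scale)
    width = max(28, round(width / 28) * 28)
    height = max(28, round(height / 28) * 28)
    return (height // 28) * (width // 28) // 4 + 2

def estimate_video_tokens(width, height, duration_s, fps=2.0):
    """Sum the per-frame token estimate over sampled frames.

    fps=2.0 is an illustrative sampling rate, not a documented value.
    """
    frames = max(1, int(duration_s * fps))
    return frames * estimate_image_tokens(width, height)

print(estimate_video_tokens(1280, 720, duration_s=10))  # 6020 (20 frames × 301 tokens)
```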

Audio billing units

Audio APIs do not use tokens. Instead, they bill by duration or character count:
| API | Billing unit | Details |
| --- | --- | --- |
| Speech-to-text (ASR) | Seconds of audio | Billed per second of input audio |
| Text-to-speech (TTS) | Characters | Billed per character of input text |
| Speech-to-speech | Seconds of audio | Varies by model |
ASR and TTS responses do not include a usage.prompt_tokens field. Check the Pricing page for current per-unit rates.
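Estimating audio costs is therefore simple arithmetic on duration or text length; a sketch with placeholder rates (the prices below are invented for illustration — check the Pricing page for real per-unit rates):

```python
# Hypothetical per-unit rates, for illustration only
ASR_PRICE_PER_SECOND = 0.0001   # currency units per second of input audio
TTS_PRICE_PER_CHAR = 0.00002    # currency units per character of input text

def asr_cost(audio_seconds: float) -> float:
    """ASR bills by seconds of input audio."""
    return audio_seconds * ASR_PRICE_PER_SECOND

def tts_cost(text: str) -> float:
    """TTS bills by characters of input text."""
    return len(text) * TTS_PRICE_PER_CHAR

print(asr_cost(90))               # cost of a 90-second clip
print(tts_cost("Hello, world!"))  # cost of a 13-character string
```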

Image and video generation billing

Image and video generation APIs do not use tokens either:
| API | Billing unit | Details |
| --- | --- | --- |
| Image generation | Per image | Each generated image is billed individually, regardless of resolution |
| Video generation | Per second of video | Billed by output video duration and resolution |
Image generation API responses include usage fields with image_count, not token counts. The input_tokens and output_tokens fields may appear as 0.

Cost estimation

To estimate the cost of a text/vision API call:
cost = (input_tokens × input_price_per_token) + (output_tokens × output_price_per_token)
Cached tokens are billed at a reduced rate: 10% of the input price for explicit cache, 20% for implicit cache. See Pricing for per-model rates and Cost optimization for strategies to reduce spending.
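Putting the formula and the cache discount together; a sketch using the usage shape from the context-cache example above, with invented prices (see Pricing for real rates) and a 10% cached rate that assumes explicit cache:

```python
def estimate_cost(prompt_tokens, completion_tokens, cached_tokens,
                  input_price, output_price, cached_rate=0.10):
    """Estimate request cost; cached input tokens are billed at a
    fraction of the input rate (0.10 for explicit cache, 0.20 for
    implicit cache)."""
    uncached = prompt_tokens - cached_tokens
    return (uncached * input_price
            + cached_tokens * input_price * cached_rate
            + completion_tokens * output_price)

# Usage values from the context-cache example; prices are placeholders
cost = estimate_cost(prompt_tokens=1520, completion_tokens=85,
                     cached_tokens=1480,
                     input_price=2e-6, output_price=6e-6)
print(f"{cost:.8f}")
```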

Context window limits

Each model has a maximum context window that limits the total input tokens:
| Model | Max input tokens | Max output tokens |
| --- | --- | --- |
| qwen3-max | 128K (256K with long context) | 16K |
| qwen3.5-plus | 128K | 16K |
| qwen3.5-flash | 128K | 16K |
When your input approaches the context limit, consider:
  • Trimming conversation history in multi-turn conversations
  • Using context cache to reduce re-computation (does not reduce token count, but reduces cost and latency)
  • Summarizing older context into a condensed system message
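The first strategy can be sketched as dropping the oldest non-system turns until the estimate fits. Token counts here use the rough 4-characters-per-token heuristic from earlier, and the budget is an illustrative number:

```python
def rough_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def trim_history(messages, budget_tokens):
    """Drop the oldest non-system turns until the estimate fits."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and sum(rough_tokens(m["content"]) for m in system + turns) > budget_tokens:
        turns.pop(0)  # drop the oldest turn first
    return system + turns

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "First question " * 50},
    {"role": "assistant", "content": "First answer " * 50},
    {"role": "user", "content": "Latest question?"},
]
trimmed = trim_history(history, budget_tokens=100)
print(len(trimmed))  # 2: the system message and the latest user turn survive
```

Keeping the system message pinned while evicting old turns preserves behavior instructions across long conversations.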

Next steps