Visual understanding models

Image and video understanding

Start with qwen3.7-plus — strongest accuracy, 1M context, 2-hour video support, and the full feature set including function calling and built-in tools. Once your use case works well, try qwen3.6-flash to reduce cost — near-flagship quality with the same context and features.

Image resolution

Most models support up to 16M pixels per image. Higher resolution costs more tokens: each image uses h × w / (32 × 32) + 2 tokens.

Video support

Up to 2 hours / 2GB → qwen3.7-plus, qwen3.6-plus, qwen3.6-flash, qwen3.5-plus, qwen3.5-flash
Up to 1 hour / 2GB → qwen3.5-omni-plus, qwen3.5-omni-flash (also accepts audio — see Speech models)

Function calling + built-in tools

Let the model take actions based on what it sees in images or video.

Function calling: Qwen3.6, Qwen3.5, Qwen3.5-Omni (including realtime models), and Qwen3-VL models
Built-in tools (web search, code execution — no setup): qwen3.7-plus, qwen3.6-plus, qwen3.6-flash, qwen3.5-plus, qwen3.5-flash only

Structured output

Get valid JSON from visual input — e.g., extract product info from a photo. Available on Qwen3.6 and Qwen3.5 in non-thinking mode.

Recommended models

Model	Context	Max pixels/image	Max video duration	Max video size	Max images (URL)	Max images (Base64)	Max videos	Function calling	Built-in tools	Structured output
`qwen3.7-plus`	1M	16M	2h	2GB	2,048	250	64	✓	✓	✓
`qwen3.6-plus`	1M	16M	2h	2GB	256	250	64	✓	✓	✓
`qwen3.6-flash`	1M	16M	2h	2GB	256	250	64	✓	✓	✓
`qwen3.5-flash`	1M	16M	2h	2GB	256	250	64	✓	✓	✓
`qwen3.5-omni-plus`	256k	—	1h	2GB	2,048	250	1	✓	—	—
`qwen3.5-omni-flash`	256k	—	1h	2GB	2,048	250	1	✓	—	—

All models

Qwen3.7

Model ID	Input	Output	Context	Max Output	Max images (URL)	Max images (Base64)	Max videos	Function calling	Built-in tools	Structured output
`qwen3.7-max-2026-06-08`	Text, image, video	Text	1M	64k	2,048	250	64	✓	✓	—
`qwen3.7-plus`	Text, image, video	Text	1M	64k	2,048	250	64	✓	✓	✓
`qwen3.7-plus-2026-05-26`	Text, image, video	Text	1M	64k	2,048	250	64	✓	✓	✓

Qwen3.6

Model ID	Input	Output	Context	Max Output	Max images (URL)	Max images (Base64)	Max videos	Function calling	Built-in tools	Structured output
`qwen3.6-plus`	Text, image, video	Text	1M	64k	256	250	64	✓	✓	✓
`qwen3.6-plus-2026-04-02`	Text, image, video	Text	1M	64k	256	250	64	✓	✓	✓
`qwen3.6-flash`	Text, image, video	Text	1M	64k	256	250	64	✓	✓	✓
`qwen3.6-35b-a3b`	Text, image, video	Text	32k	8k	256	250	64	✓	✓	✓
`qwen3.6-27b`	Text, image, video	Text	32k	8k	256	250	64	✓	—	✓

Qwen3.5

Model ID	Input	Output	Context	Max Output	Max images (URL)	Max images (Base64)	Max videos	Function calling	Built-in tools	Structured output
`qwen3.5-plus`	Text, image, video	Text	1M	64k	256	250	64	✓	✓	✓
`qwen3.5-plus-2026-04-20`	Text, image, video	Text	1M	64k	256	250	64	✓	✓	✓
`qwen3.5-plus-2026-02-15`	Text, image, video	Text	1M	64k	256	250	64	✓	✓	✓
`qwen3.5-flash`	Text, image, video	Text	1M	64k	256	250	64	✓	✓	✓
`qwen3.5-flash-2026-02-23`	Text, image, video	Text	1M	64k	256	250	64	✓	✓	✓
`qwen3.5-397b-a17b`	Text, image, video	Text	32k	8k	256	250	64	✓	✓	✓
`qwen3.5-122b-a10b`	Text, image, video	Text	32k	8k	256	250	64	✓	✓	✓
`qwen3.5-27b`	Text, image, video	Text	32k	8k	256	250	64	✓	✓	✓
`qwen3.5-35b-a3b`	Text, image, video	Text	32k	8k	256	250	64	✓	✓	✓

Qwen3.5-Omni

Unlike other models on this page, Qwen3.5-Omni accepts audio input and can output both text and speech.Standard

Model ID	Input	Output	Context	Max Output	Max images (URL)	Max images (Base64)	Max videos	Function calling	Built-in tools	Structured output
`qwen3.5-omni-plus`	Text, image, audio, video	Text, audio	256k	64k	2,048	250	1	✓	—	—
`qwen3.5-omni-plus-2026-03-15`	Text, image, audio, video	Text, audio	256k	64k	2,048	250	1	✓	—	—
`qwen3.5-omni-flash`	Text, image, audio, video	Text, audio	256k	64k	2,048	250	1	✓	—	—
`qwen3.5-omni-flash-2026-03-15`	Text, image, audio, video	Text, audio	256k	64k	2,048	250	1	✓	—	—

Realtime — streaming audio input with built-in Voice Activity Detection (VAD).

Model ID	Input	Output	Context	Max Output	Function calling
`qwen3.5-omni-plus-realtime`	Text, image, audio (streaming)	Text, audio	256k	64k	✓
`qwen3.5-omni-plus-realtime-2026-03-15`	Text, image, audio (streaming)	Text, audio	256k	64k	✓
`qwen3.5-omni-flash-realtime`	Text, image, audio (streaming)	Text, audio	256k	64k	✓
`qwen3.5-omni-flash-realtime-2026-03-15`	Text, image, audio (streaming)	Text, audio	256k	64k	✓

Captioner (open source) — audio captioning model.

Model ID	Input	Output	Context	Max Output	Max images (URL)	Max images (Base64)	Max videos	Function calling	Built-in tools	Structured output
`qwen3-omni-30b-a3b-captioner`	Audio	Text	64k	32k	—	—	—	—	—	—

Legacy

Older model versions retained for backward compatibility. We recommend Qwen3.5 or Qwen3.5-Omni for new projects.

Model ID	Input	Output	Context	Max Output	Max images (URL)	Max images (Base64)	Max videos	Function calling	Built-in tools	Structured output
`qwen-vl-ocr`	Text, image	Text	38k	8k	256	250	—	—	—	—
`qwen-vl-ocr-2025-11-20`	Text, image	Text	38k	8k	256	250	—	—	—	—
`qwen3-vl-plus`	Text, image, video	Text	256k	32k	256	250	64	✓	—	✓
`qwen3-vl-plus-2025-12-19`	Text, image, video	Text	256k	32k	256	250	64	✓	—	✓
`qwen3-vl-plus-2025-09-23`	Text, image, video	Text	256k	32k	256	250	64	✓	—	✓
`qwen3-vl-flash`	Text, image, video	Text	256k	32k	256	250	64	✓	—	✓
`qwen3-vl-flash-2026-01-22`	Text, image, video	Text	256k	32k	256	250	64	✓	—	✓
`qwen3-vl-flash-2025-10-15`	Text, image, video	Text	256k	32k	256	250	64	✓	—	✓
`qwen3-omni-flash`	Text, image, audio, video	Text, audio	64k	16k	2,048	250	1	✓	—	—
`qwen3-omni-flash-2025-12-01`	Text, image, audio, video	Text, audio	64k	16k	2,048	250	1	✓	—	—
`qwen3-omni-flash-2025-09-15`	Text, image, audio, video	Text, audio	64k	16k	2,048	250	1	✓	—	—
`qwen3-omni-flash-realtime`	Text, image, audio (streaming)	Text, audio	64k	16k	—	—	—	—	—	—
`qwen3-omni-flash-realtime-2025-12-01`	Text, image, audio (streaming)	Text, audio	64k	16k	—	—	—	—	—	—
`qwen3-omni-flash-realtime-2025-09-15`	Text, image, audio (streaming)	Text, audio	64k	16k	—	—	—	—	—	—
`qwen-omni-turbo`	Text, image, audio, video	Text, audio	32k	2k	2,048	250	1	—	—	—
`qwen-omni-turbo-latest`	Text, image, audio, video	Text, audio	32k	2k	2,048	250	1	—	—	—
`qwen-omni-turbo-2025-03-26`	Text, image, audio, video	Text, audio	32k	2k	2,048	250	1	—	—	—
`qwen-omni-turbo-realtime`	Text, audio (streaming)	Text, audio	32k	2k	—	—	—	—	—	—
`qwen-omni-turbo-realtime-latest`	Text, audio (streaming)	Text, audio	32k	2k	—	—	—	—	—	—
`qwen-omni-turbo-realtime-2025-05-08`	Text, audio (streaming)	Text, audio	32k	2k	—	—	—	—	—	—
`qwen3-vl-235b-a22b-thinking`	Text, image, video	Text	128k	8k	256	250	64	✓	—	—
`qwen3-vl-235b-a22b-instruct`	Text, image, video	Text	128k	8k	256	250	64	✓	—	✓
`qwen3-vl-32b-thinking`	Text, image, video	Text	128k	8k	256	250	64	✓	—	—
`qwen3-vl-32b-instruct`	Text, image, video	Text	128k	8k	256	250	64	✓	—	✓
`qwen3-vl-30b-a3b-thinking`	Text, image, video	Text	128k	8k	256	250	64	✓	—	—
`qwen3-vl-30b-a3b-instruct`	Text, image, video	Text	128k	8k	256	250	64	✓	—	✓
`qwen3-vl-8b-thinking`	Text, image, video	Text	128k	8k	256	250	64	✓	—	—
`qwen3-vl-8b-instruct`	Text, image, video	Text	128k	8k	256	250	64	✓	—	✓
`qwen2.5-vl-72b-instruct`	Text, image, video	Text	128k	8k	256	250	64	✓	—	✓
`qwen2.5-vl-32b-instruct`	Text, image, video	Text	128k	8k	256	250	64	✓	—	✓
`qwen2.5-vl-7b-instruct`	Text, image, video	Text	128k	8k	256	250	64	✓	—	✓
`qwen2.5-vl-3b-instruct`	Text, image, video	Text	128k	8k	256	250	64	✓	—	✓
`qwen2.5-omni-7b`	Text, image, audio, video	Text, audio	32k	2k	2,048	250	1	—	—	—
`qwen-vl-max`	Text, image	Text	32k	8k	256	250	—	—	—	—
`qwen-vl-max-latest`	Text, image	Text	128k	8k	256	250	—	—	—	—
`qwen-vl-max-2025-08-13`	Text, image	Text	128k	8k	256	250	—	—	—	—
`qwen-vl-max-2025-04-08`	Text, image	Text	128k	8k	256	250	—	—	—	—
`qwen-vl-plus`	Text, image	Text	128k	8k	256	250	—	—	—	—
`qwen-vl-plus-latest`	Text, image	Text	128k	8k	256	250	—	—	—	—
`qwen-vl-plus-2025-08-15`	Text, image	Text	128k	8k	256	250	—	—	—	—
`qwen-vl-plus-2025-05-07`	Text, image	Text	128k	8k	256	250	—	—	—	—
`qwen-vl-plus-2025-01-25`	Text, image	Text	128k	8k	256	250	—	—	—	—
`qvq-max`	Text, image	Text	128k	8k	256	250	—	—	—	—
`qvq-max-latest`	Text, image	Text	128k	8k	256	250	—	—	—	—
`qvq-max-2025-03-25`	Text, image	Text	128k	8k	256	250	—	—	—	—

Visual understanding models

Image and video understanding

Image resolution

Video support

Function calling + built-in tools

Structured output

Recommended models

All models

Learn more

Vision understanding guide

Try free

​Image and video understanding

​Image resolution

​Video support

​Function calling + built-in tools

​Structured output

​Recommended models

​All models

​Learn more

Vision understanding guide

Try free

Image and video understanding

Image resolution

Video support

Function calling + built-in tools

Structured output

Recommended models

All models

Learn more