DashScope multimodal embedding

POST

/services/embeddings/multimodal-embedding/multimodal-embedding

curl --location --request POST \
  'https://dashscope-intl.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding' \
--header "Authorization: Bearer $DASHSCOPE_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
    "model": "tongyi-embedding-vision-plus",
    "input": {
        "contents": [
            {"text": "Multimodal embedding model"},
            {"image": "https://example.com/image.jpg"},
            {"video": "https://example.com/video.mp4"}
        ]
    }
}'

{
  "output": {
    "embeddings": [
      {
        "index": 0,
        "embedding": [
          0
        ],
        "type": "text"
      }
    ]
  },
  "usage": {
    "input_tokens": 0,
    "input_tokens_details": {
      "image_tokens": 0,
      "text_tokens": 0
    },
    "output_tokens": 0,
    "total_tokens": 0,
    "image_tokens": 0
  },
  "request_id": "1fff9502-a6c5-9472-9ee1-73930fdd04c5"
}

Convert text, images, and video into numerical vectors in a unified semantic space for cross-modal retrieval, similarity search, and content classification.

Before you begin: get an API key, set it as an environment variable (the examples on this page read DASHSCOPE_API_KEY), and install the DashScope SDK if you plan to call the API through the SDK.

Endpoint

HTTP: POST https://dashscope-intl.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding
SDK base_http_api_url: https://dashscope-intl.aliyuncs.com/api/v1
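
For readers calling the HTTP endpoint without the SDK, the curl request at the top of this page can be approximated in Python with only the standard library. This is a minimal sketch, not the official client: build_request is a hypothetical helper name and "sk-example" is a placeholder key. The request is built but not sent.

```python
import json
import urllib.request

# Endpoint copied from this page.
ENDPOINT = (
    "https://dashscope-intl.aliyuncs.com/api/v1"
    "/services/embeddings/multimodal-embedding/multimodal-embedding"
)


def build_request(
    api_key: str,
    contents: list[dict],
    model: str = "tongyi-embedding-vision-plus",
) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example.

    build_request is a hypothetical helper, not part of any SDK.
    """
    body = {"model": model, "input": {"contents": contents}}
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "sk-example" is a placeholder, not a real key.
req = build_request("sk-example", [{"text": "Multimodal embedding model"}])
```

To send the request, pass req to urllib.request.urlopen and decode the response body with json.load.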

Model overview

Model	Modalities	Dimensions	Max size per image
tongyi-embedding-vision-plus	Text, Image, Video, Multi-images	64, 128, 256, 512, 1024, 1152 (default)	10 MB
tongyi-embedding-vision-flash	Text, Image, Video, Multi-images	64, 128, 256, 512, 768 (default)	5 MB

Notes

Image input: Public URL or Base64 data URI (data:image/{format};base64,{data}).
Multi-images: Key multi_images. Value is a list of image URLs, max 8 images.
Video input: Must be a public URL. Set fps in the parameters object to control the frame sampling rate (range [0, 1], default 1.0).
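
The Base64 data URI form for image input can be produced from raw image bytes as sketched below. image_to_data_uri is a hypothetical helper name; the caller supplies the format string (for example png or jpeg), and the sketch does not check the per-image size limit from the model overview table.

```python
import base64


def image_to_data_uri(image_bytes: bytes, image_format: str) -> str:
    """Encode raw image bytes as data:image/{format};base64,{data}.

    image_to_data_uri is a hypothetical helper; it does not verify that
    the bytes really are an image of the given format.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image/{image_format};base64,{encoded}"


# Typical use, assuming a local file named photo.png exists:
#     with open("photo.png", "rb") as f:
#         item = {"image": image_to_data_uri(f.read(), "png")}
```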

Authorizations

Authorization

string

header

required

DashScope API key, sent as a Bearer token (Authorization: Bearer $DASHSCOPE_API_KEY). Create one in the Qwen Cloud console. Alternatively, pass the key in the X-DashScope-ApiKey request header.

Body

application/json

model

enum<string>

required

Model name for multimodal embedding.

Available options: tongyi-embedding-vision-plus, tongyi-embedding-vision-flash

Example: tongyi-embedding-vision-plus

input

object

required

Input data containing the content items.

input.contents

object[]

required

Content items. Each item is an object with one or more modality keys (text, image, video, multi_images). For independent vectors, use one modality per object. For fused vectors, combine modalities in a single object.

input.contents[].text

string

Text content to embed.

input.contents[].image

string

Image URL (public HTTP/HTTPS) or Base64 data URI (data:image/{format};base64,{data}).

input.contents[].video

string

Video URL (must be a public URL).

input.contents[].multi_images

string[]

List of image URLs for multi-image embedding. Max 8 images. Only supported by tongyi-embedding-vision-plus and tongyi-embedding-vision-flash.

Required range: items <= 8
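
The difference between independent and fused vectors comes down to how modality keys are grouped into content items. The sketch below shows both shapes and enforces the 8-image cap on multi_images. independent_items and fused_item are hypothetical helper names, and the URLs are placeholders.

```python
ALLOWED_KEYS = {"text", "image", "video", "multi_images"}
MAX_MULTI_IMAGES = 8  # limit stated for multi_images


def _check(key: str, value) -> None:
    """Reject unknown modality keys and oversized multi_images lists."""
    if key not in ALLOWED_KEYS:
        raise ValueError(f"unsupported modality key: {key}")
    if key == "multi_images" and len(value) > MAX_MULTI_IMAGES:
        raise ValueError(f"multi_images accepts at most {MAX_MULTI_IMAGES} URLs")


def independent_items(**modalities) -> list[dict]:
    """One modality per object, for one independent vector per input."""
    for key, value in modalities.items():
        _check(key, value)
    return [{key: value} for key, value in modalities.items()]


def fused_item(**modalities) -> dict:
    """All modalities in a single object, for one fused vector."""
    for key, value in modalities.items():
        _check(key, value)
    return dict(modalities)


separate = independent_items(
    text="Multimodal embedding model",
    image="https://example.com/image.jpg",
)
fused = fused_item(
    text="Multimodal embedding model",
    image="https://example.com/image.jpg",
)
```

Here separate is a two-element list suitable for input.contents, while fused is a single item that would be wrapped in a one-element list.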

parameters

object

Parameters for multimodal embedding.

enum<string>

default: "dense"

Output format. Only dense is supported.

Available options: dense

integer

Output vector dimension. Supported values vary by model. See the model overview table for defaults and options.

parameters.fps

number

default: 1

Video frame sampling rate. Range [0, 1]. Default: 1.0.

Required range: 0 <= x <= 1

string

Custom task instruction. English recommended. Typically yields 1-5% improvement in retrieval tasks.

Response

200-application/json

output

object

output.embeddings

object[]

List of embedding results.

output.embeddings[].index

integer

Position index in the input contents list.

output.embeddings[].embedding

number[]

Vector of floating-point numbers.

output.embeddings[].type

enum<string>

Content type of this embedding.

Available options: text, image, video
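
Since each result carries the index of the content item it came from, a client can map vectors back to inputs by that index instead of relying on list order. The sketch below does this for a response shaped like the example on this page. embeddings_by_index is a hypothetical helper name, and the three-element vectors are made-up illustrative values, far shorter than real output.

```python
def embeddings_by_index(response: dict) -> dict[int, tuple[str, list[float]]]:
    """Map each input position to its (type, embedding) pair.

    embeddings_by_index is a hypothetical helper.
    """
    return {
        item["index"]: (item["type"], item["embedding"])
        for item in response["output"]["embeddings"]
    }


# Illustrative response; the vectors are invented and truncated.
sample_response = {
    "output": {
        "embeddings": [
            {"index": 1, "embedding": [0.4, 0.5, 0.6], "type": "image"},
            {"index": 0, "embedding": [0.1, 0.2, 0.3], "type": "text"},
        ]
    },
    "request_id": "1fff9502-a6c5-9472-9ee1-73930fdd04c5",
}

by_index = embeddings_by_index(sample_response)
text_type, text_vector = by_index[0]
```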

usage

object

Token usage statistics. Fields vary by model: tongyi-embedding-vision-* models return input_tokens (combined text and image token count), input_tokens_details, output_tokens, and total_tokens; other models may return different fields, as noted in the individual field descriptions.

usage.input_tokens

integer

Number of input tokens consumed. For tongyi-embedding-vision-* models, this value includes both text and image/video tokens.

usage.input_tokens_details

object

Detailed breakdown of input tokens. Only returned by tongyi-embedding-vision-* models.

usage.input_tokens_details.image_tokens

integer

Tokens consumed by image/video content in the input.

usage.input_tokens_details.text_tokens

integer

Tokens consumed by text content in the input.

usage.output_tokens

integer

Number of output tokens. Only returned by tongyi-embedding-vision-* models.

usage.total_tokens

integer

Total token count (input_tokens + output_tokens).

usage.image_tokens

integer

Number of image or video tokens in the input. For video input, the system extracts frames (up to a system-configured limit) and calculates tokens based on the extracted frames.

request_id

string

Unique request identifier.

Example: 1fff9502-a6c5-9472-9ee1-73930fdd04c5
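
Once vectors have been extracted from the response, cross-modal retrieval usually reduces to comparing them. The sketch below ranks candidates against a query by cosine similarity, a common choice for embedding comparison. This page does not state which metric is recommended for these models, so treat that choice as an assumption. The function names and the two-dimensional vectors are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: list[float], candidates: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for a text query and two media embeddings.
ranked = rank_by_similarity(
    [1.0, 0.0],
    {"image.jpg": [0.9, 0.1], "video.mp4": [0.0, 1.0]},
)
```

All vectors compared this way should come from the same model and the same dimension setting, since they otherwise do not share a space.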
