Convert text, images, and video into numerical vectors in a unified semantic space for cross-modal retrieval, similarity search, and content classification.
Before you begin: get an API key, set it as an environment variable, and install the DashScope SDK if you use the SDK.
Endpoint
- HTTP: POST https://dashscope-intl.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding
- SDK: base_http_api_url: https://dashscope-intl.aliyuncs.com/api/v1
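A minimal sketch of calling the HTTP endpoint above with only the standard library. The request-body schema (a `model` field plus an `input.contents` list) is an assumption for illustration and is not confirmed by this page; only the URL and the model name come from this document.

```python
import json
import os
import urllib.request

# Endpoint from the section above.
URL = ("https://dashscope-intl.aliyuncs.com/api/v1/services/embeddings/"
       "multimodal-embedding/multimodal-embedding")

# Assumed body shape: model name plus a list of content items.
payload = {
    "model": "tongyi-embedding-vision-plus",
    "input": {"contents": [{"text": "a photo of a cat"}]},
}

api_key = os.environ.get("DASHSCOPE_API_KEY", "sk-placeholder")
req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# To actually send the request (requires a valid API key):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The `Authorization: Bearer` header is the standard DashScope auth scheme; see the Authorizations section below for the alternative header.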
Model overview
| Model | Modalities | Dimensions |
|---|---|---|
| tongyi-embedding-vision-plus | Text, Image, Video, Multi-images | 64, 128, 256, 512, 1024, 1152 (default) |
| tongyi-embedding-vision-flash | Text, Image, Video, Multi-images | 64, 128, 256, 512, 768 (default) |
Notes
- Image input: Public URL or Base64 data URI (data:image/{format};base64,{data}).
- Multi-images: Key multi_images. Value is a list of image URLs, max 8 images.
- Video input: Must be a public URL. Use the fps parameter in parameters to control the frame sampling rate (range [0, 1], default 1.0).
Authorizations
Authorization (string, in header, required): DashScope API Key. Create one in the Qwen Cloud console. Alternatively, you can pass the API Key via the X-DashScope-ApiKey request header.
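Both auth options can be expressed as header dicts. The Bearer scheme for the Authorization header is the standard DashScope convention; X-DashScope-ApiKey is the alternative named above.

```python
import os

# Read the key from the environment, as recommended in the setup note
# at the top of this page ("sk-placeholder" is a dummy fallback).
api_key = os.environ.get("DASHSCOPE_API_KEY", "sk-placeholder")

# Option 1: standard Authorization header with a Bearer token.
headers = {"Authorization": f"Bearer {api_key}"}

# Option 2: the alternative header named in this section.
alt_headers = {"X-DashScope-ApiKey": api_key}
```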