Convert text, images, and video into numerical vectors in a unified semantic space for cross-modal retrieval, similarity search, and content classification.
Before you begin: get an API key, set it as an environment variable, and, if you plan to call the API through the SDK, install the DashScope SDK.
Endpoint
- HTTP: POST https://dashscope-intl.aliyuncs.com/api/v1/services/embeddings/multimodal-embedding/multimodal-embedding
- SDK `base_http_api_url`: https://dashscope-intl.aliyuncs.com/api/v1
Model overview
| Model | Modalities | Dimensions | Max size per image |
|---|---|---|---|
| tongyi-embedding-vision-plus | Text, Image, Video, Multi-images | 64, 128, 256, 512, 1024, 1152 (default) | 10 MB |
| tongyi-embedding-vision-flash | Text, Image, Video, Multi-images | 64, 128, 256, 512, 768 (default) | 5 MB |
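The table above can be mirrored in client code to catch an unsupported dimension before a request is sent. The following is a minimal sketch, not part of the API: `MODEL_LIMITS` and `resolve_dimension` are hypothetical names introduced here, and the values are copied from the table.

```python
# Limits copied from the model overview table.
# MODEL_LIMITS and resolve_dimension are illustrative helpers, not SDK names.
MODEL_LIMITS = {
    "tongyi-embedding-vision-plus": {
        "dimensions": (64, 128, 256, 512, 1024, 1152),
        "default_dimension": 1152,
        "max_image_mb": 10,
    },
    "tongyi-embedding-vision-flash": {
        "dimensions": (64, 128, 256, 512, 768),
        "default_dimension": 768,
        "max_image_mb": 5,
    },
}


def resolve_dimension(model, dimension=None):
    """Return the dimension to use for `model`, or raise ValueError."""
    if model not in MODEL_LIMITS:
        raise ValueError(f"unknown model: {model}")
    limits = MODEL_LIMITS[model]
    if dimension is None:
        # No explicit choice: fall back to the documented default.
        return limits["default_dimension"]
    if dimension not in limits["dimensions"]:
        raise ValueError(
            f"{model} supports dimensions {limits['dimensions']}, got {dimension}"
        )
    return dimension
```

For example, `resolve_dimension("tongyi-embedding-vision-flash")` falls back to 768, while asking the same model for 1024 raises `ValueError`.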
Notes
- Image input: Public URL or Base64 data URI (`data:image/{format};base64,{data}`).
- Multi-images: Key `multi_images`. Value is a list of image URLs, max 8 images.
- Video input: Must be a public URL. Use the `fps` parameter in `parameters` to control frame sampling rate (range [0, 1], default 1.0).
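The input rules in these notes can be checked on the client side. The sketch below uses only the Python standard library; the helper names (`image_data_uri`, `multi_images_item`, `video_parameters`) are hypothetical, and where the resulting objects sit inside the request body is not shown here because it depends on the Body schema.

```python
import base64

MAX_MULTI_IMAGES = 8  # limit stated in the notes


def image_data_uri(raw: bytes, fmt: str) -> str:
    """Encode raw image bytes as data:image/{format};base64,{data}."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:image/{fmt};base64,{encoded}"


def multi_images_item(urls):
    """Build the multi_images entry, enforcing the 8-image limit."""
    urls = list(urls)
    if len(urls) > MAX_MULTI_IMAGES:
        raise ValueError(
            f"multi_images accepts at most {MAX_MULTI_IMAGES} images, got {len(urls)}"
        )
    return {"multi_images": urls}


def video_parameters(fps: float = 1.0) -> dict:
    """Build the parameters entry for video input; fps must lie in [0, 1]."""
    if not 0 <= fps <= 1:
        raise ValueError(f"fps must be in [0, 1], got {fps}")
    return {"fps": fps}
```

Validating locally avoids a round trip for requests that the documented limits would reject anyway.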
Authorizations
string, header, required
DashScope API Key. Create one in the Qwen Cloud console. Alternatively, you can pass the API key in the X-DashScope-ApiKey request header.
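As an illustration of the two ways to pass the key, the sketch below prepares (but does not send) a POST request to the HTTP endpoint using only the standard library. Several details are assumptions not stated in this section: the default header is taken to be `Authorization: Bearer <key>`, the `X-DashScope-ApiKey` header is taken to carry the raw key, and `DASHSCOPE_API_KEY` is an assumed environment variable name. `auth_headers` and `build_request` are hypothetical helpers, and `payload` stands for whatever JSON body the Body section defines.

```python
import json
import os
import urllib.request

ENDPOINT = (
    "https://dashscope-intl.aliyuncs.com/api/v1/services/embeddings/"
    "multimodal-embedding/multimodal-embedding"
)


def auth_headers(api_key, use_apikey_header=False):
    """Return the header that carries the API key."""
    if not api_key:
        raise ValueError("API key is missing")
    if use_apikey_header:
        # Assumption: this header carries the raw key with no scheme prefix.
        return {"X-DashScope-ApiKey": api_key}
    # Assumption: the default scheme is a bearer token.
    return {"Authorization": f"Bearer {api_key}"}


def build_request(payload, api_key, use_apikey_header=False):
    """Prepare a JSON POST request to the endpoint without sending it."""
    headers = {"Content-Type": "application/json"}
    headers.update(auth_headers(api_key, use_apikey_header))
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(ENDPOINT, data=data, headers=headers, method="POST")


# Usage (performs a network call, so it is left commented out):
# api_key = os.environ.get("DASHSCOPE_API_KEY")  # assumed variable name
# request = build_request(payload, api_key)
# with urllib.request.urlopen(request) as response:
#     body = response.read().decode("utf-8")
```

Keeping request construction separate from sending makes the headers and body easy to inspect before any traffic leaves the machine.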
Body
application/json
Response
200 - application/json
object
object: Token usage statistics. Fields vary by model: tongyi-embedding-vision-* models return input_tokens (combined text and image token count), input_tokens_details, output_tokens, and total_tokens; other models may return different fields (see individual field descriptions).
string: Unique request identifier. Example: 1fff9502-a6c5-9472-9ee1-73930fdd04c5
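Because the usage fields vary by model, client code is safer reading them defensively. The sketch below assumes the response JSON exposes the statistics under a top-level `usage` key and the identifier under `request_id`; those key names are assumptions, since only the field descriptions appear above. `summarize_response` is a hypothetical helper.

```python
import json


def summarize_response(body: str) -> dict:
    """Extract the request ID and token counts, tolerating missing fields."""
    data = json.loads(body)
    # Assumption: statistics live under "usage"; absent keys become None.
    usage = data.get("usage") or {}
    return {
        "request_id": data.get("request_id"),  # assumed key name
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


# Illustrative body: the token counts are made up for the example.
sample = json.dumps({
    "usage": {"input_tokens": 10, "output_tokens": 1, "total_tokens": 11},
    "request_id": "1fff9502-a6c5-9472-9ee1-73930fdd04c5",
})
summary = summarize_response(sample)
```

Using `.get` means a model that omits one of these fields yields `None` for it instead of raising `KeyError`.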