Make your first text generation call
Text generation models take natural language input and generate text for tasks such as question answering, writing, summarization, translation, and structured output generation.
For a complete list of model invocation parameters, see OpenAI Compatible API Reference and DashScope API Reference.
Why can't the Qwen API analyze web links?
The Qwen API cannot directly access or parse web links. To work with webpage content, use function calling, or fetch and extract the page text yourself with a web scraping tool such as Python's Beautiful Soup and pass the result to the model.
Why do Qwen web app and API responses differ?
The Qwen web app layers additional engineering on top of the Qwen API, enabling features such as webpage parsing, web search, image drawing, and PPT creation. These capabilities are not part of the core large language model API. You can replicate many of them in your own application using function calling.
Can the model directly generate Word, Excel, PDF, or PPT files?
No. Qwen Cloud text generation models output only plain text. You can convert the text to your desired format using code or third-party libraries.
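One lightweight pattern is to ask the model for a machine-readable format such as CSV, then convert the reply yourself. A minimal sketch using only the standard library; the model reply below is a hard-coded stand-in for an actual API response:

```python
import csv
import io
import os
import tempfile

# Stand-in for a model reply to a prompt such as:
# "List three planets and their diameters as CSV with a header row."
model_reply = "name,diameter_km\nMercury,4879\nVenus,12104\nEarth,12756"

# Parse the CSV text the model returned...
rows = list(csv.DictReader(io.StringIO(model_reply)))

# ...and write it to a file that spreadsheet software can open.
out_path = os.path.join(tempfile.gettempdir(), "planets.csv")
with open(out_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "diameter_km"])
    writer.writeheader()
    writer.writerows(rows)
```

The same pattern applies to Word or PDF output with third-party libraries such as python-docx or reportlab.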
Request structure
Text generation requests are typically sent as a messages array. Each message includes a role and content.
- System message: Provides high-level instructions or sets the model's behavior.
- User message: Contains the user's input or task.
- Assistant message: Contains the model's response.
A request must contain at least one user message; a system message is optional, but recommended when you want more stable, more consistent behavior. The model returns its reply as an assistant message.
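For example, a single-turn request body with both a system and a user message could look like the following sketch; the model name qwen-plus is illustrative, so substitute the model you actually use:

```python
# A minimal messages array: one optional system message to steer behavior,
# one required user message carrying the task.
request_body = {
    "model": "qwen-plus",  # illustrative model name
    "messages": [
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize photosynthesis in one sentence."},
    ],
}

# The reply comes back as an assistant message, e.g.:
# {"role": "assistant", "content": "..."}
roles = [m["role"] for m in request_body["messages"]]
print(roles)
```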
Make your first call
Before you begin, get an API key, set it as an environment variable, and, if needed, install the OpenAI or DashScope SDK.
Choose the API style that matches your stack:
- Start with OpenAI Compatible -- Responses API for new integrations.
- Use OpenAI Compatible -- Chat Completions API if you are migrating existing OpenAI-compatible code.
- Use DashScope if you prefer the native SDK.
- OpenAI Compatible -- Responses API
- OpenAI Compatible -- Chat Completions API
- DashScope
- DashScope -- Qwen3.5/3.6
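As a sketch of the Chat Completions style, assuming the OpenAI Python SDK and the DashScope-compatible endpoint (verify the base URL and model name against the API reference for your region before use):

```python
import os

def build_request(prompt: str) -> dict:
    """Assemble keyword arguments for a Chat Completions call."""
    return {
        "model": "qwen-plus",  # illustrative model name
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }

# Only attempt a real call when an API key is configured.
if os.environ.get("DASHSCOPE_API_KEY"):
    from openai import OpenAI  # pip install openai

    client = OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        # Assumed endpoint; check the region-specific value in the docs.
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    )
    completion = client.chat.completions.create(**build_request("Hello!"))
    print(completion.choices[0].message.content)
```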
For usage notes, code examples, and migration guidance, see OpenAI compatible - Responses.
Response
The response includes these main fields:
- id: The response ID.
- output: The output list, which includes reasoning and message. The reasoning field appears only when deep thinking is enabled (for example, it is enabled by default in the Qwen3.5 and Qwen3.6 series).
- usage: Token usage statistics.
Full JSON response
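The helper below pulls the reply text out of a response shaped after the fields described above; the payload is illustrative, so treat the exact schema as an assumption and check the API reference:

```python
# Illustrative Responses API payload, shaped after the documented fields.
response = {
    "id": "resp_abc123",
    "output": [
        {"type": "reasoning", "content": [{"text": "(model's thinking...)"}]},
        {"type": "message", "content": [{"type": "output_text", "text": "Hello!"}]},
    ],
    "usage": {"input_tokens": 12, "output_tokens": 5},
}

def extract_text(resp: dict) -> str:
    """Collect output_text parts from message items; skip reasoning items."""
    parts = []
    for item in resp["output"]:
        if item.get("type") == "message":
            for part in item["content"]:
                if part.get("type") == "output_text":
                    parts.append(part["text"])
    return "".join(parts)

print(extract_text(response))  # Hello!
```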
Handle requests asynchronously
Once a basic synchronous request is working, asynchronous calls can improve throughput for high-concurrency workloads.
- OpenAI Compatible -- Chat Completions API
- DashScope
Because the call is asynchronous, the order of the responses may differ from the example.
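The usual pattern is to fan out several coroutines with asyncio.gather. The sketch below stubs out the network request so the concurrency shape is visible on its own; swap fake_call for a real awaited SDK request (for example, on an AsyncOpenAI client):

```python
import asyncio

async def fake_call(prompt: str) -> str:
    # Stand-in for an awaited SDK request such as
    # await client.chat.completions.create(...) on AsyncOpenAI.
    await asyncio.sleep(0.01)
    return f"reply to: {prompt}"

async def main() -> list[str]:
    prompts = ["What is an LLM?", "Define RAG.", "Explain top_p."]
    # gather() runs the requests concurrently. Its results keep the input
    # order, even though completion order over a real network would vary.
    return await asyncio.gather(*(fake_call(p) for p in prompts))

results = asyncio.run(main())
print(results[0])  # reply to: What is an LLM?
```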
Going live
Build better context
Feeding raw data directly to a large language model increases costs and reduces quality because of context-length limits. Context engineering improves output quality and efficiency by dynamically loading precise knowledge. Core techniques:
- Prompt engineering: Design and refine text instructions (prompts) to guide the model toward the desired outputs. For more information, see Text-to-text prompt guide.
- Retrieval-augmented generation (RAG): Use this technique when the model must answer questions using external knowledge bases, such as product documentation or technical manuals.
- Tool calling: Allows the model to fetch real-time data, such as weather or traffic, or perform actions, such as calling an API or sending an email.
- Memory mechanisms: Provide the model with short-term and long-term memory to understand conversation history.
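As a toy illustration of the retrieval step in RAG, the sketch below picks the snippet that shares the most words with the question and loads only that snippet into the prompt. Production systems use embedding similarity rather than word overlap, but the flow is the same:

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and split into word tokens, dropping punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, docs: list[str]) -> str:
    """Return the document with the largest word overlap with the question."""
    q_words = tokenize(question)
    return max(docs, key=lambda d: len(q_words & tokenize(d)))

docs = [
    "The warranty covers manufacturing defects for two years.",
    "Battery life is rated at ten hours of continuous playback.",
    "The device supports Bluetooth 5.3 and USB-C charging.",
]

question = "What is the rated battery life in hours?"
context = retrieve(question, docs)

# Load only the retrieved snippet into the prompt instead of every document.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(context)
```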
Tune generation behavior
The temperature and top_p parameters control the diversity of the generated text. Higher values increase diversity, and lower values increase predictability. To assess the effects of these parameters, adjust only one at a time.
- temperature: A value in the range [0, 2) that adjusts randomness.
- top_p: A value in the range (0, 1.0] that filters candidate tokens by a cumulative probability threshold.
- High diversity (for example, temperature=0.9): Use this setting for creative writing, brainstorming, or marketing copy where novelty and imagination are important.
- High predictability (for example, temperature=0.1): Use this setting for factual Q&A, code generation, or legal text where accuracy and consistency are critical.
How it works
temperature:
- A higher temperature flattens the token probability distribution. This makes high-probability tokens less likely and low-probability tokens more likely, which causes the model to be more random when it chooses the next token.
- A lower temperature sharpens the token probability distribution. This makes high-probability tokens even more likely and low-probability tokens less likely, which causes the model to favor high-probability tokens.
top_p:
- A higher top_p value considers more candidate tokens, which increases diversity.
- A lower top_p value considers fewer candidate tokens, which increases focus and predictability.
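Both effects can be seen directly in code. Dividing logits by the temperature before the softmax is the standard formulation, and the nucleus (top_p) candidate set is the smallest set of tokens whose cumulative probability reaches the threshold; actual serving stacks may combine these with other samplers:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Scale logits by 1/temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs: list[float], top_p: float) -> list[int]:
    """Indices of the smallest set of tokens whose cumulative
    probability reaches top_p (the nucleus-sampling candidate set)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.1]  # raw scores for three candidate tokens

cold = softmax_with_temperature(logits, 0.1)  # sharpened: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flattened: probabilities closer

# The top token's probability rises as temperature falls.
print(round(cold[0], 3), round(hot[0], 3))
```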
Parameter settings for common scenarios
Parameter settings for common scenarios
Explore more text generation features
For complex scenarios:
- Multi-turn conversations: Use this feature for follow-up questions or information gathering that requires continuous dialogue.
- Streaming output: Use this feature for chatbots or real-time code generation to improve the user experience and avoid timeouts caused by long responses.
- Deep thinking: Use this feature for complex reasoning or policy analysis that requires high-quality, structured answers.
- Structured output: Use this feature when you need the model to reply in a stable JSON format for programmatic use or data parsing.
- Partial mode: Use this feature for code completion or long-form writing where the model continues from existing text.