Toloka Team
Unpacking chain-of-thought prompting: a new paradigm in AI reasoning
The release of OpenAI's o1 model, originally codenamed Strawberry by its developers, marks a pivotal moment in the evolution of AI, particularly in machine learning models' ability to solve sophisticated problems that require step-by-step logical reasoning.
This advancement is driven mainly by Chain-of-Thought (CoT) prompting, a technique designed to emulate human cognitive processes by breaking down complex problems into manageable steps. This method significantly increases the likelihood of delivering well-grounded results, though it might slow response times.
OpenAI o1’s results for specific problem-solving tasks as compared to its predecessor GPT-4o. Source: OpenAI
In this article, we’ll explore Chain-of-Thought prompting, how it powers the new and widely hyped AI models, and why it represents such a significant leap forward. From improving accuracy in symbolic reasoning tasks to reshaping everyday ML applications, CoT-based systems are setting the stage for the next generation of AI capabilities.
What is Chain of Thought Prompting?
At its core, Chain-of-Thought (CoT) prompting encourages models to generate intermediate steps before arriving at a final answer. Unlike traditional large language models (LLMs), which often provide a direct, single-step response, CoT-based models work through problems by breaking them into smaller, logical sub-tasks. In effect, the model is made to "think out loud," which tends to produce better-considered outputs.
A simple example of the Chain-of-Thought prompting technique. Source: Navigate through Enigmatic Labyrinth
For example, instead of instantly responding to a complex question, a CoT-based model analyzes and articulates each intermediate step, working toward a solution. This method has proven especially useful in tasks like arithmetic problem-solving, where step-by-step calculation is necessary, or in commonsense reasoning, which requires contextual interpretation.
Chain-of-Thought prompting vs. Standard prompting. Source: Chain-of-Thought Prompting Elicits Reasoning in LLMs
Besides boosting overall accuracy, CoT prompting enhances transparency in AI decision-making. The model's ability to explain its intermediate steps offers a clearer understanding of its reasoning process, making it easier to identify and correct errors. This approach also demystifies part of the "magic" under the model's hood, reducing the sense that it operates as a black box for users.
A key advantage of Chain-of-Thought (CoT) prompting is that it encourages more explicit reasoning steps, which helps transformer models leverage their self-attention mechanism more effectively. By guiding the model to break down tasks into intermediate steps, CoT prompting allows the self-attention layers to better capture dependencies across tokens, enhancing the model's ability to process complex relationships within a sequence. This structured reasoning process leads to more accurate decision-making, especially in tasks that require long-term dependencies to be understood and integrated over multiple steps.
Visualization of the attention mechanism following long-term dependencies. Source: Attention Is All You Need
CoT also affects training when models are fine-tuned on reasoning traces rather than on final answers alone. Because the target sequence then includes every intermediate step, the loss, and therefore the gradient used in backpropagation, reflects the whole reasoning process: the model receives feedback on each step instead of only on a single output, which helps it refine how it works through tasks over many iterations.
The CoT prompting technique was introduced in 2022 when a team from Google Research published a landmark paper titled "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." Google's researchers demonstrated that prompting models to break down complex reasoning problems into a sequence of intermediate steps allows significant improvements in arithmetic, logic puzzles, and symbolic reasoning tasks.
Examples of input, chain-of-thought, and output triples for arithmetic, commonsense, and symbolic reasoning benchmarks, from the paper Chain-of-Thought Prompting Elicits Reasoning in LLMs
The original experiments showed that even with simple prompting, models could solve much more complex tasks than was previously possible, and this concept has since been built upon by others in the field to develop even more advanced capabilities.
Further empirical results show how CoT affects concrete performance metrics. On benchmarks such as GSM8K, a dataset designed for grade-school arithmetic reasoning, CoT-enabled models solve substantially more problems than the same models with standard few-shot prompting.
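As a rough illustration of how such benchmarks are scored, the sketch below pulls the final number out of a chain-of-thought completion and compares it to the reference answer. The regex heuristic and the sample completion are illustrative assumptions, not the official GSM8K evaluation code; the example problem is the well-known first item from GSM8K.

```python
import re

def extract_final_number(completion):
    """Return the last number mentioned in a completion: a common heuristic
    for scoring chain-of-thought outputs on arithmetic benchmarks."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

# Toy example: a CoT-style completion and its reference answer.
completion = (
    "Natalia sold 48 clips in April and 48 / 2 = 24 clips in May, "
    "so she sold 48 + 24 = 72 clips altogether. The answer is 72."
)
reference = "72"

print("correct" if extract_final_number(completion) == reference else "incorrect")
```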
Crossword puzzles are especially hard for LLMs because they require iterative solving, but the o1 model does well. Source: Something New: On OpenAI's "Strawberry" and Reasoning
With OpenAI's latest release of the o1 model, which builds upon these advancements, CoT techniques have become even more integral to how models handle complex reasoning tasks.
Traditional Prompting vs. Chain-of-Thought Prompting
Traditional prompting in language models primarily relies on next-token prediction, focusing on generating the most probable next word in a sequence based solely on the previous tokens. This approach does not take into account intermediate validation or reasoning processes. Consequently, the model may produce an answer that seems coherent but lacks the necessary steps for thorough problem-solving or logical reasoning.
In contrast, CoT prompting instructs the model to generate a series of intermediate steps, each of which becomes part of the context the model conditions on for the next prediction and, ultimately, for the final output. This step-by-step reasoning allows for a more nuanced exploration of the problem.
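To make the contrast concrete, here is a minimal sketch of the two prompt styles side by side. The question is the cafeteria example popularized by the original CoT paper; the exact instruction wording is an illustrative choice, not a prescribed template.

```python
# A minimal sketch contrasting standard prompting with CoT prompting.
question = (
    "The cafeteria had 23 apples. If they used 20 to make lunch "
    "and bought 6 more, how many apples do they have?"
)

# Standard prompting: ask for the answer directly.
standard_prompt = f"Q: {question}\nA:"

# Chain-of-Thought prompting: explicitly request intermediate reasoning.
cot_prompt = (
    f"Q: {question}\n"
    "A: Let's work through this step by step before giving the final answer."
)

print(standard_prompt, cot_prompt, sep="\n\n")
```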
CoT prompting outperforms standard prompting for various annotators. Source: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Traditionally, we train AI models using input-output pairs that enable them to recognize patterns and generate responses aligned with the provided data. While effective for many tasks, this method may struggle with less straightforward problems or tasks the model hasn’t encountered during training.
Few-shot prompting, a form of in-context learning, addresses this limitation by directly incorporating a small number of task demonstrations into the prompt. These examples guide the model's understanding of the desired output, even when it lacks training on a specific task, making few-shot prompting useful for more niche problems.
A Few-shot prompting example. Source: Machine Learning Mastery
The few-shot prompting method can still fail when applied to multi-step problems. The primary issue is that it often skips over intermediate stages, offering a direct answer without revealing how the model arrived at a particular conclusion.
Chain-of-thought reasoning also changes what a model learns from when it is trained or fine-tuned on reasoning traces. With single-step targets, the loss is driven only by the final prediction, which tends to produce shallow gradient updates.
When the full reasoning chain is part of the target, backpropagation carries signal from every step along the decision-making path, so gradients flow through a deeper chain of dependencies. Weight updates are distributed more evenly across the sequence, which helps the model learn complex tasks by capturing the interdependencies between reasoning steps.
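A toy sketch of that difference in supervision is shown below. The token IDs are made up, and the convention of masking non-target positions with -100 follows common causal-LM fine-tuning setups (for example, Hugging Face Transformers); both are assumptions made purely for illustration.

```python
# Illustrative comparison of answer-only vs. CoT-style supervision targets.
prompt_tokens    = [101, 102, 103]            # e.g. "Q: What is 23 + 17? A:"
reasoning_tokens = [201, 202, 203, 204, 205]  # e.g. "20 + 10 = 30, 3 + 7 = 10, ..."
answer_tokens    = [301]                      # e.g. "40"

IGNORE = -100  # positions with this label contribute nothing to the loss

# Answer-only supervision: gradients flow only from the final-answer tokens.
labels_answer_only = (
    [IGNORE] * len(prompt_tokens)
    + [IGNORE] * len(reasoning_tokens)
    + answer_tokens
)

# CoT-style supervision: the reasoning trace is also part of the target,
# so every intermediate step contributes to the loss and to weight updates.
labels_with_cot = (
    [IGNORE] * len(prompt_tokens)
    + reasoning_tokens
    + answer_tokens
)

print("supervised positions (answer only):", sum(l != IGNORE for l in labels_answer_only))
print("supervised positions (with CoT):  ", sum(l != IGNORE for l in labels_with_cot))
```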
Chain-of-Thought Prompting Advantages
Improved accuracy
By prompting the model to work through each logical step, CoT helps reduce errors, especially in tasks that demand multiple layers of understanding.
Transparency
Users can follow the model's thought process and detect errors in its reasoning chain, improving trust in its output.
Handling ambiguity
In complex or ambiguous scenarios, the model's ability to articulate intermediate steps provides greater clarity on how it arrives at its conclusion.
Attention to detail
The step-by-step explanations encourage a detailed understanding of each component of the problem. This makes CoT prompting helpful in the education domain, where the goal is to develop critical thinking and problem-solving skills in students.
Confidence Calibration
Empirical research indicates that CoT-based models tend to have lower variance in their predictions for multi-step reasoning tasks. This means the models are better calibrated regarding their confidence levels, making their outputs more reliable.
CoT prompting significantly enhances the model's ability to tackle tasks requiring multi-step reasoning, such as arithmetic problem-solving, logic tasks, and scientific inquiry. It shifts the focus from merely providing an answer to offering a transparent, interpretable reasoning process behind the solution.
How Chain of Thought Prompting Works
At the core of CoT prompting is the principle of hierarchical reasoning, which assumes that no input task should be treated as an isolated unit. This approach reflects real-world scenarios, where complex issues are typically resolved through a series of interrelated logical steps that build upon one another.
CoT prompting guides large language models in tackling problems in a layered fashion. Each intermediate step adds context for the subsequent one and enables the model to refine its understanding of the task iteratively.
The language-based nature of CoT prompting also makes it applicable to commonsense reasoning. Source: Language Models Perform Reasoning via Chain of Thought
Without the help of Chain-of-Thought (CoT) prompting, models often default to surface-level pattern recognition or statistical inference. This can result in answers that appear correct but lack the depth and reasoning required for handling complex tasks effectively.
A theoretical parallel to CoT prompting can be drawn with memory-augmented neural networks, such as those using external memory or differentiable neural computers (DNCs). While memory networks explicitly store and retrieve information across multiple time steps, CoT prompting achieves similar benefits by encouraging models to structure their reasoning in intermediate steps. This approach helps models retain partial computations or logical structures crucial for complex reasoning, making CoT prompting resilient to task ambiguity and enabling coherent reasoning even in challenging scenarios.
Reducing cascade errors in the reasoning process may require verification and refinement mechanisms. Source: Navigate through Enigmatic Labyrinth
CoT prompting leverages attention mechanisms integral to Transformer-based architectures, such as those underpinning models like GPT and BERT. These mechanisms enable models to dynamically weigh the importance of different parts of the input sequence. In the context of CoT, attention helps the model focus on relevant intermediate steps, reducing the risk of overlooking critical elements in the problem-solving process.
Attention layers facilitate the connection of distant tokens within a sequence, allowing the reasoning process to extend across multiple layers without losing coherence. This attribute enables the model to construct a logical chain where each intermediate step contributes meaningfully to the final answer.
The distributed nature of attention across multiple heads aids the model in processing parallel information streams, which is vital for handling multi-step tasks like arithmetic and complex language reasoning.
Challenges and Limitations of Chain of Thought Prompting
Efficiency Concerns
Engaging in deeper, structured computation requires generating and processing multiple intermediate outputs. This often leads to a slower response, as the model must work through a sequence of reasoning steps instead of providing a direct output. The computational cost of these additional steps can escalate quickly, resulting in higher energy consumption and potentially limiting the practicality of CoT in real-time applications.
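As a back-of-the-envelope illustration, the snippet below compares the cost of a short direct answer with the same answer preceded by a reasoning trace. The token counts and per-token price are hypothetical placeholders, not real rates; the point is simply that cost and latency scale with the number of generated tokens.

```python
# Hypothetical comparison of direct vs. CoT completions.
PRICE_PER_1K_OUTPUT_TOKENS = 0.01  # placeholder rate, not a real price

def completion_cost(output_tokens):
    return output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS

direct_tokens = 20   # a short, direct answer
cot_tokens = 400     # the same answer preceded by a reasoning trace

print(f"direct answer: ~${completion_cost(direct_tokens):.4f}")
print(f"CoT answer:    ~${completion_cost(cot_tokens):.4f} "
      f"({cot_tokens // direct_tokens}x the output tokens)")
```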
Joanne Jang, an OpenAI product manager, warned against excessive expectations for o1: on some tasks, it currently falls short of classic LLMs. Source: X
Correctness of Reasoning
Although CoT enhances transparency, it does not guarantee that every step in the reasoning chain is accurate. Large language models may produce questionable intermediate steps, leading to incorrect final answers. This highlights the importance of self-verification mechanisms that allow models to assess the validity of each reasoning stage.
Overfitting to a Process
CoT prompting can also result in models becoming overly rigid in their approach. By adhering too closely to a multi-step reasoning framework, models might over-elaborate on simple tasks where a quick response would suffice. This overfitting to the reasoning process can hinder efficiency and user experience, particularly when speed is essential.
Risk of Confusion
If the model produces self-contradictory intermediate results or gets stuck in a faulty reasoning loop, it may continue to produce inaccurate final answers, compounding the error. This risk underscores the need for robust error-handling strategies and training protocols to guide the model away from misleading paths.
Training Requirements
CoT prompting is particularly sensitive to the quality of training data. To be effective, the model must be exposed to tasks requiring multi-step reasoning during training, which is resource-intensive.
The complexity of developing suitable datasets and training methodologies adds another layer of difficulty to implementing CoT prompting effectively. Additionally, models may need continuous fine-tuning to adapt to various contexts.
Chain-of-Thought Prompting Methods
CoT prompting comprises several unique techniques, each catering to different problem complexities or types of reasoning required.
Zero-Shot Chain-of-Thought Prompting
Zero-shot CoT prompting refers to a scenario in which the model is prompted to generate a reasoning chain without prior task-specific examples or demonstrations. Instead, it relies entirely on the prompt’s instructions, leveraging the model’s internal knowledge and capacity for step-by-step problem-solving.
The success of zero-shot CoT comes from the pre-trained knowledge embedded in the model, which guides it to produce logical steps from a general instruction alone. This is especially useful when providing task-specific examples would be infeasible or expensive.
Example
A basic prompt for zero-shot CoT might be: "Explain the reasoning step-by-step before giving the final answer." Even the simple trigger phrase "Let's think step by step," used in the original zero-shot CoT work, is enough to make the model generate intermediate reasoning before arriving at the final answer.
A pipeline using two Zero-Shot CoT prompts. Source: Large Language Models are Zero-Shot Reasoners
However, zero-shot CoT tends to be less effective on complex tasks than variants that include worked examples.
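The two-prompt pipeline from the figure above can be sketched as follows. The `generate` function is an assumed placeholder for whatever model call you use, not a real API; the trigger phrases follow the original zero-shot CoT paper.

```python
def generate(prompt):
    """Placeholder for a call to your language model of choice."""
    raise NotImplementedError

def zero_shot_cot(question):
    # Stage 1: reasoning extraction, using the canonical trigger phrase.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)

    # Stage 2: answer extraction, conditioned on the generated reasoning.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return generate(answer_prompt)
```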
Few-Shot Chain-of-Thought Prompting
Few-shot CoT prompting is a more common variant that provides several examples of reasoning steps as part of the input prompt. These examples act as templates, demonstrating how the model should break down the problems it’s trying to solve.
This approach excels in tasks where intermediate steps are crucial to derive the final answer, such as math or logic puzzles.
Example
A prompt might include examples like: "Q: What is 23 + 17? A: First, break down the numbers into tens and units. Add 20 and 10 to get 30, then add 3 and 7 to get 10. Finally, add 30 and 10 to get 40."
This structured reasoning helps guide the model to perform similar breakdowns on unseen problems.
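In practice, few-shot CoT often amounts to prepending worked demonstrations to the new question. The sketch below reuses the example above; the helper function and its formatting are illustrative assumptions rather than a fixed convention.

```python
# Worked demonstrations: each pairs a question with its full reasoning chain.
DEMONSTRATIONS = [
    (
        "What is 23 + 17?",
        "First, break down the numbers into tens and units. "
        "Add 20 and 10 to get 30, then add 3 and 7 to get 10. "
        "Finally, add 30 and 10 to get 40. The answer is 40.",
    ),
]

def build_few_shot_cot_prompt(question):
    """Prepend worked reasoning chains to a new question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in DEMONSTRATIONS)
    return f"{shots}\n\nQ: {question}\nA:"

print(build_few_shot_cot_prompt("What is 48 + 26?"))
```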
Multi-Step Chain-of-Thought Prompting
The model explicitly breaks down problems into multiple steps in multi-step CoT prompting, ensuring each intermediate step is correct before proceeding to the next. This variant leverages self-verification mechanisms to reduce the risk of errors cascading through the reasoning chain.
Example
A multi-step CoT task might involve a reasoning prompt such as: "Break this problem into three parts: (1) interpret the question, (2) apply the first reasoning step, and (3) verify the result before proceeding to the final answer."
By segmenting the process, the model can systematically address each part of the problem.
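A minimal sketch of such a pipeline might look like the following, with `generate` as an assumed placeholder for a model call and the three prompts mirroring the interpret, reason, and verify stages described above.

```python
def generate(prompt):
    """Placeholder for a call to a language model."""
    raise NotImplementedError

def multi_step_cot(question):
    # (1) Interpret the question.
    interpretation = generate(f"Restate this problem in your own words:\n{question}")

    # (2) Apply the first reasoning step.
    step = generate(f"Problem: {interpretation}\nWork out the first reasoning step only.")

    # (3) Verify the step before moving on; redo it if the check fails.
    verdict = generate(
        f"Problem: {interpretation}\nProposed step: {step}\n"
        "Is this step correct? Answer yes or no."
    )
    if "yes" not in verdict.lower():
        step = generate(f"Problem: {interpretation}\nThe previous step was flawed. Redo it carefully.")

    # Final answer, conditioned on the verified reasoning.
    return generate(f"Problem: {interpretation}\nReasoning so far: {step}\nState the final answer.")
```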
Self-Consistency Chain-of-Thought Prompting
Self-consistency CoT prompting is a more advanced variant where the model generates multiple reasoning chains for a given problem and selects the most consistent answer. This approach recognizes that sometimes the model may generate varied chains of reasoning. So, instead of relying on the first answer, it looks for agreement patterns across multiple reasoning paths.
How It Works:
The model generates several possible reasoning chains for a task.
The final answer is based on the majority consensus or the most frequent outcome across the chains.
This method improves accuracy and confidence in the final result, as it incorporates a form of ensemble reasoning where multiple "thoughts" are considered before a conclusion is drawn.
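A compact sketch of self-consistency decoding is shown below. Here `generate` and `extract_answer` are assumed placeholders for a sampled model call and an answer parser, and the temperature and number of chains are arbitrary choices for illustration.

```python
from collections import Counter

def generate(prompt, temperature=0.7):
    """Placeholder for a sampled call to a language model."""
    raise NotImplementedError

def extract_answer(chain):
    """Placeholder for parsing the final answer out of a reasoning chain."""
    raise NotImplementedError

def self_consistency(question, n_samples=10):
    prompt = f"Q: {question}\nA: Let's think step by step."
    # Sample several independent reasoning chains at a non-zero temperature.
    chains = [generate(prompt, temperature=0.7) for _ in range(n_samples)]
    answers = [extract_answer(chain) for chain in chains]
    # Majority vote: return the answer most chains agree on.
    return Counter(answers).most_common(1)[0][0]
```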
Overview of the Auto-CoT method. Source: Automatic Chain of Thought Prompting in Large Language Models
Interactive Chain-of-Thought Prompting
Interactive CoT prompting involves a more dynamic approach where the model iterates on reasoning steps interactively based on feedback. This variant typically involves a human-in-the-loop setup where a user or another model guides the reasoning process, suggesting corrections or clarifications at each step.
Use Cases:
Educational tools: Users can interact with the model to explore different paths of reasoning.
Collaborative problem-solving: Teams can use interactive CoT to guide LLMs through complex decision-making processes.
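A minimal human-in-the-loop sketch of interactive CoT might look like this, with `generate` as an assumed placeholder for a model call and the console `input` standing in for whatever feedback channel the application actually uses.

```python
def generate(prompt):
    """Placeholder for a call to a language model."""
    raise NotImplementedError

def interactive_cot(question, max_turns=5):
    transcript = f"Q: {question}\nA: Let's think step by step."
    for _ in range(max_turns):
        step = generate(transcript)
        print(step)
        feedback = input("Feedback on this step (Enter to accept): ")
        transcript += f"\n{step}"
        if feedback:
            # Fold the reviewer's correction back into the context.
            transcript += f"\nReviewer feedback: {feedback}\nRevised step:"
    return generate(transcript + "\nTherefore, the final answer is")
```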
These variants offer versatile strategies to tackle various reasoning tasks, from basic arithmetic to complex scientific problems.
The general pipeline for Chain-of-Thought prompting strategy application. Source: Towards Better Chain-of-Thought Prompting Strategies
Final Thoughts
While the recent advancements in OpenAI's o1 models are certainly exciting, they underscore both the potential and the limitations of current AI technologies. These models have demonstrated impressive capabilities in tackling complex problems that many other AI systems struggle to solve. However, they also respond more slowly and cannot yet process other data types, such as images.
The improved o1 model ranked in the 49th percentile in the 2024 International Olympiad in Informatics under competition rules. Source: OpenAI
As we continue to explore the potential of these new models, it's essential to recognize that existing large language models still have their place in the AI landscape, offering a range of functionalities. The future lies in the collaboration and integration of various models, ensuring users can leverage the strengths of each to achieve their goals effectively.
Evolution of reasoning topologies used in prompting schemes. Source: Demystifying Chains, Trees, and Graphs of Thoughts
Article written by:
Toloka Team
Updated:
Oct 8, 2024