Constitutional AI explained

by Toloka Team


Anthropic's Constitutional AI has the potential to bring significant changes to the way AI systems are designed. Any artificial intelligence should be both helpful and harmless, and that is exactly what the Constitutional AI approach seeks to achieve.

By establishing a constitution — a set of fundamental principles and values — Anthropic aims to provide a transparent and precise framework for the development and evaluation of AI models. This constitution outlines the core principles that AI models must follow to ensure harmlessness and helpfulness.


What is Constitutional AI?

Constitutional AI is an approach to training AI systems to follow a defined set of rules that make them more helpful and harmless — hence the name, as the system adheres to a defined constitution. This approach addresses the legal, ethical, and societal implications of AI deployment, ensuring that AI systems operate within the bounds of constitutional principles, such as human rights, privacy protections, due process, and equality before the law.

Undesired behavior and harmful outputs are critical concerns in the development and deployment of AI systems. Constitutional AI aligns language models with human values, resulting in a harmless AI assistant. The concept of Constitutional AI is a fascinating approach to addressing the challenges of creating AI models that are both helpful and harmless. By establishing a constitution for AI models, the process aims to provide a transparent and principled framework for guiding model behavior.

What’s the origin of Constitutional AI?

Researchers at Anthropic came up with the idea of Constitutional AI. In the paper Constitutional AI: Harmlessness from AI Feedback, they show that human labels are not needed to identify harmful outputs during Constitutional AI training: the only human oversight is provided through a set of rules or principles (the constitution).

This eliminates the need for an immense amount of hard work involved in providing feedback by humans, as in Reinforcement Learning from Human Feedback (RLHF), as the desired behavior of an AI system is achieved by specifying the rules in the constitution. The reliance on humans is reduced, as the principles established in the constitution guide model development and evaluation. This leads to more scalable and efficient model training processes.

Choosing constitutional principles is crucial in this process, as it shapes the ethical and moral foundation upon which the AI models operate. Anthropic drew inspiration from various sources, including Apple’s Terms of Service, the UN Declaration on Human Rights, and suggestions from other research labs.
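In practice, a constitution can be represented as nothing more than a list of plain-language principles, one of which is sampled at random during each training step. The principles below are illustrative placeholders, not Anthropic's actual constitution:

```python
import random

# A hypothetical constitution: a short list of plain-language principles.
# These example principles are illustrative, not Anthropic's published set.
CONSTITUTION = [
    "Choose the response that is least likely to be harmful or offensive.",
    "Choose the response that most respects privacy and human rights.",
    "Choose the response that is most helpful, honest, and harmless.",
]

def sample_principle(rng: random.Random) -> str:
    """Draw one principle at random, as done for each critique/revision step."""
    return rng.choice(CONSTITUTION)

principle = sample_principle(random.Random(0))
print(principle)
```

Sampling a different principle at each step exposes the model to the whole constitution over the course of training rather than overfitting to any single rule.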

Integrating principles from a variety of sources enables Anthropic to build a constitution that reflects a wider view of ethical AI behavior and is aligned with the values and expectations of human communities. This approach enhances transparency, accountability, and trust in AI systems by providing a clear and publicly available framework for assessing their behavior and enforcing ethical standards.

By adhering to the principles outlined in the constitution, AI models become more transparent in their decision-making processes. The rules specified in the constitution provide a clear framework for developers and users to understand and evaluate AI models' behavior. To create a constitutional AI, Anthropic included both a supervised learning and a reinforcement learning phase in the training process.

Supervised learning phase

The supervised learning stage of Anthropic's Constitutional AI approach includes the following steps:

  • Using a pre-trained helpful model. Anthropic starts with a pre-trained RLHF model that is highly helpful but lacks the training to ensure harmlessness. This model has already undergone human feedback to maximize its helpfulness;

  • Exposing the model to toxic prompts. The model is deliberately given toxic prompts designed to elicit harmful responses, challenging it to produce harmful content;

  • Few-shot learning. Before the critique and revision process, the model is shown several examples of how the process should look. This few-shot learning helps the model understand the task and improves its performance. It guides the model through the critique and revision process, ultimately allowing it to produce a harmless response to a harmful prompt;

  • Critique and revision process. Using randomly selected constitutional principles, the model is then encouraged to criticize its detrimental responses. The model is requested to recompose its original harmful response according to the chosen principle, essentially making it harmless. These responses form a harmless dataset for model fine-tuning.
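The steps above can be sketched as a simple loop. Here `generate` is a stub standing in for a call to the helpful RLHF model, and the principle texts are illustrative assumptions, so only the control flow is meaningful:

```python
import random

# Illustrative critique-style principles (not Anthropic's actual wording).
PRINCIPLES = [
    "Identify ways the response is harmful and rewrite it to be harmless.",
    "Identify ways the response is unethical and rewrite it politely.",
]

def generate(prompt: str) -> str:
    # Placeholder for the language model call; echoes a tag of the prompt.
    return f"[model output for: {prompt[:40]}]"

def critique_and_revise(toxic_prompt: str, n_revisions: int,
                        rng: random.Random) -> tuple[str, str]:
    response = generate(toxic_prompt)          # initial, possibly harmful reply
    for _ in range(n_revisions):
        principle = rng.choice(PRINCIPLES)     # random constitutional principle
        critique = generate(f"Critique using '{principle}': {response}")
        response = generate(f"Revise using '{critique}': {response}")
    return toxic_prompt, response              # (prompt, revised response) pair

pair = critique_and_revise("some toxic prompt", n_revisions=2,
                           rng=random.Random(0))
```

The collected (prompt, final revision) pairs form the harmless dataset used to fine-tune the model.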

The researchers found that the initial revision process effectively removed harmful content from the model's responses most of the time. Further revisions only improve the output, and all of the revisions are utilized to fine-tune the initial model. This ensures that the model becomes more harmless while retaining its helpfulness. Notably, the model learned to handle toxic content sensitively rather than becoming evasive.

Overall, this supervised learning phase in building Constitutional AI demonstrates a systematic approach to training AI models to respond to toxic inquiries harmlessly. Through the use of constitutional principles and iterative critique and revision, the model learns to address sensitive topics delicately and effectively.

Reinforcement learning phase

In the reinforcement learning phase of Anthropic's Constitutional AI approach, the focus is on further training the AI model using pairs of responses generated and evaluated by the model. Here's a breakdown of the process:

  • Prompting and response generation. The model fine-tuned in the supervised phase generates two outputs for the same prompt. These AI-generated responses are alternative ways of addressing the given prompt;

  • Evaluation against constitutional principles. The same model chooses the better answer according to a randomly selected principle from the constitution. This evaluation helps the model learn to prioritize responses that align with the principles outlined in the constitution. At this stage, chain-of-thought prompting is used to help the model think about the problem step by step, which guides the model to consider its responses in a structured way, potentially improving the quality of its assessments;

  • Creating a dataset. These pairs of responses form an AI-generated preference dataset;

  • Training a preference model. The preference model learns to calculate the log probability of each response being chosen as more suitable. The training of this model incorporates both AI-generated feedback and human preferences. After training, the preference model is employed for the fine-tuning of the baseline model;

  • Fine-tuning the original model. The trained preference model is then used to train the original supervised learning model. This process adjusts the parameters of the original model to optimize its responses according to the preferences learned during training. It is analogous to the RLHF approach, but it incorporates preference data from AI feedback instead of human feedback.
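A small sketch of the preference-scoring step above: assuming the preference model assigns each response a scalar score, a Bradley-Terry style sigmoid turns a pair of scores into the probability that one response is preferred over the other (the function name and scores here are illustrative):

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry style probability that response A is preferred to B.

    Higher preference-model scores translate into higher probability of
    being chosen; equal scores give exactly 0.5.
    """
    return 1.0 / (1.0 + math.exp(score_b - score_a))

# Example: the preference model scores two candidate responses.
p = preference_probability(score_a=2.0, score_b=0.5)
print(f"P(A preferred) = {p:.3f}")
```

During fine-tuning, the policy is updated (e.g., with PPO in the original RLHF recipe) to produce responses the preference model scores highly.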

Benefits of Constitutional AI

Constitutional AI offers several significant benefits contributing to the responsible and ethical development and deployment of AI systems. Here are some key advantages:

  • Transparency and accountability. By establishing a constitution that outlines the principles guiding AI behavior, Constitutional AI promotes transparency and accountability. It helps gain a clear understanding of the ethical framework underpinning AI systems;

  • Risk mitigation. By prioritizing harmlessness, Constitutional AI helps mitigate the risks associated with AI technologies, such as bias, discrimination, and unintended consequences. The constitution serves as a safeguard against harmful outcomes;

  • Scalability and reduced reliance on feedback from humans. Traditional AI model training methods rely heavily on human feedback, which can be time-consuming, labor-intensive, and subjective. By using AI feedback, Constitutional AI reduces the reliance on human annotators, making the process more scalable and efficient.


While the constitutional AI approach has many advantages, defining the constitutional principles guiding AI can be complex and subtle. Defining and expressing principles that are comprehensive, clear, and adaptable to different contexts requires careful consideration and considerable expertise.

In addition, ethical standards and societal norms are highly dynamic and are constantly evolving. AI constitutional frameworks must be flexible and adaptable to accommodate changes in ethical considerations and societal values. Despite these challenges, if approached wisely, the Constitutional AI framework can guide the development of AI systems that benefit individuals, communities, and society at large.

Currently, the most effective approach for maintaining ethical standards and aligning with societal norms involves integrating human input into AI training processes. By partnering with Toloka, which harnesses the expertise of professionals across diverse domains, you can fine-tune your model to meet your specific requirements and ensure ethical compliance.


