Is constitutional AI better than RLHF for value alignment?

Which method is more secure against attacks?

In the head-to-head study cited here, Claude (Constitutional AI) was more resistant to complex multi-turn prompt attacks than ChatGPT (RLHF), but neither was immune. Across more than 50 prompt attacks run under identical conditions, both models blocked every simple one-shot attack (0% success rate) [1]. Against sophisticated multi-turn attacks, Claude had a 17% attack success rate and ChatGPT a 22% rate [1]. That is a gap of 5 percentage points, or a relative reduction of about 23% ((22 - 17) / 22), in this scenario.
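The relative figure can be checked with a few lines of Python. This is only an arithmetic sketch: the two rates come from [1], while the function and variable names are made up for illustration.

```python
def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Fraction by which new_rate is lower than baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


# Multi-turn attack success rates reported in [1].
chatgpt_rate = 0.22  # RLHF
claude_rate = 0.17   # Constitutional AI

absolute_gap = chatgpt_rate - claude_rate
relative_gap = relative_reduction(chatgpt_rate, claude_rate)

print(f"absolute gap: {absolute_gap * 100:.0f} percentage points")
print(f"relative reduction: {relative_gap * 100:.1f}%")
```

The absolute gap (5 points) and the relative reduction (roughly 23%) describe the same result, so it helps to state which one is being quoted.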

The types of attacks each method is vulnerable to differ. ChatGPT (RLHF) was more susceptible to gradual context manipulation, where an attacker slowly shifts the conversation toward harmful topics. Claude (Constitutional AI) was more vulnerable to authority-mimicking attacks, where the attacker pretends to be a person in power [1]. This suggests the alignment method shapes the specific security weaknesses, so the "better" choice depends on the threat model you face.

Which method handles diverse human preferences better?

Standard RLHF has a fundamental problem with diverse preferences: it can collapse minority viewpoints into majority ones, effectively ignoring them. Researchers proved that a single reward model cannot adequately represent the full range of human preferences [3]. A variant called MaxMin-RLHF improved win-rates for minority groups by over 33% without hurting majority-group performance, and achieved an average 16% improvement in win-rates over conventional RLHF [3]. This shows that RLHF can be modified to be fairer, but the standard version is biased toward the majority.
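A toy example may make the difference concrete. The sketch below is not the MaxMin-RLHF algorithm from [3]; it only contrasts choosing a response by a pooled score with choosing it by the worst-off group's score. The groups, population shares, and reward values are all invented for illustration, and treating a population-weighted average as a stand-in for a single reward model is an assumption.

```python
# Hypothetical rewards that two annotator groups give to three candidate responses.
group_rewards = {
    "majority": {"A": 0.9, "B": 0.6, "C": 0.2},
    "minority": {"A": 0.1, "B": 0.6, "C": 0.9},
}
# Hypothetical share of each group in the feedback data.
group_share = {"majority": 0.8, "minority": 0.2}

candidates = ["A", "B", "C"]


def pooled_reward(candidate: str) -> float:
    """Population-weighted average reward (assumed stand-in for one reward model)."""
    return sum(group_share[g] * group_rewards[g][candidate] for g in group_rewards)


def worst_group_reward(candidate: str) -> float:
    """Reward received by whichever group likes the candidate least."""
    return min(group_rewards[g][candidate] for g in group_rewards)


pooled_choice = max(candidates, key=pooled_reward)
maxmin_choice = max(candidates, key=worst_group_reward)

print("pooled choice:", pooled_choice)
print("max-min choice:", maxmin_choice)
```

With these made-up numbers the pooled score favors response A, which the minority group rates at 0.1, while the max-min rule picks B, which both groups rate at 0.6.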

Constitutional AI also has limitations in handling diverse values. The approach relies on a fixed set of principles, which may not capture moral pluralism, the fact that different people and cultures hold different ethical frameworks [2]. Both methods struggle with the "normative dilemma" of whose values to embed, but RLHF at least has a path to incorporating diverse feedback through methods like MaxMin-RLHF, while Constitutional AI's principles are more static [2][3].

What are the deeper problems neither method solves?

Both RLHF and Constitutional AI have serious theoretical and practical shortcomings that no current alignment technique fully addresses. A multidisciplinary critique found that RLHF's goals of "helpful, harmless, and honest" contain inherent tensions: for example, being maximally helpful can conflict with being honest, and both can conflict with being harmless [4]. The paper argues that alignment through feedback methods (whether human or AI) cannot capture the full complexity of human ethics, and that true safety requires broader sociotechnical changes beyond any single alignment technique [4].

Standard RLHF also has an algorithmic bias of its own. Its KL-divergence regularization, a penalty that keeps the trained policy close to a reference model, inherently suppresses minority preferences and can lead to "preference collapse," where certain groups' values are virtually ignored [5]. A method called Preference Matching RLHF was proposed to fix this, but it is not yet widely adopted [5]. The analysis in [5] targets RLHF specifically. Whether Constitutional AI avoids the bias is less clear, because its reinforcement learning stage (RL from AI feedback) also optimizes a policy against a learned preference model. Constitutional AI additionally faces the problem of whose principles to encode and how to update them as societal values evolve [2].
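The collapse effect can be seen in a two-response toy case. The sketch below is an illustration under stated assumptions, not the analysis from [5]: it uses the standard closed-form maximizer of a KL-regularized reward objective over a finite response set, a Bradley-Terry reward fitted to a hypothetical 70/30 preference split, and a uniform reference policy. The function name and all numbers are invented.

```python
import math


def kl_regularized_policy(rewards, reference, beta):
    """Maximizer of E[r] - beta * KL(pi || reference) over a finite response set.

    The solution is proportional to reference(y) * exp(r(y) / beta).
    """
    weights = [p * math.exp(r / beta) for r, p in zip(rewards, reference)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical population: 70% of annotators prefer response 0, 30% prefer response 1.
preference = [0.7, 0.3]
# Bradley-Terry rewards that reproduce that split: P(0 beats 1) = e^r0 / (e^r0 + e^r1).
rewards = [math.log(p) for p in preference]
# Assumed uniform reference policy.
reference = [0.5, 0.5]

policies = {}
for beta in (1.0, 0.5, 0.1):
    policies[beta] = kl_regularized_policy(rewards, reference, beta)
    minority_mass = policies[beta][1]
    print(f"beta={beta}: probability of minority-preferred response = {minority_mass:.4f}")
```

In this setup the policy matches the 70/30 split only when beta equals 1. As beta shrinks, the minority-preferred response loses almost all of its probability mass, even though 30% of the hypothetical annotators prefer it.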

Sources used in this answer

[1] Prompt Injection Vulnerabilities and Data Leakage in ChatGPT and Claude: Toward Safer Conversational AI

Constitutional AI (Claude) had a 17% attack success rate on complex multi-turn attacks vs 22% for RLHF (ChatGPT), with different vulnerability patterns: ChatGPT more susceptible to gradual context manipulation, Claude more to authority-mimicking attacks.

2025 · Hyun Jung Kim, Sang Hyun Yoo · ICECET

[2] From Principle to Practice: Value Alignment in AI Ethics and Governance

Both RLHF and Constitutional AI face normative dilemmas including moral pluralism and value aggregation; the paper calls for a diversified, interdisciplinary alignment research agenda beyond either technique alone.

2025 · Jianfeng Cao · German Law Journal

[3] MaxMin-RLHF: Alignment with Diverse Human Preferences

Standard RLHF cannot represent diverse human preferences with a single reward model; MaxMin-RLHF improved minority group win-rates by over 33% and overall win-rates by 16% over conventional RLHF.

2024 · Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Dinesh Manocha, Furong Huang, A. S. Bedi, Mengdi Wang · ICML

[4] Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback

RLHF (and RLAIF) have fundamental limitations in capturing human ethics, including inherent tensions between helpfulness, harmlessness, and honesty; true safety requires broader sociotechnical changes.

2025 · Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martinez de Rituerto de Troya, Dimitri Coelho Mollo, Roel Dobbe · Ethics and Information Technology

[5] On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

RLHF has an inherent algorithmic bias from KL-divergence regularization that can cause 'preference collapse' where minority preferences are disregarded; Preference Matching RLHF is proposed to mitigate this.

2026 · Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, Weijie J Su · Journal of the American Statistical Association
