Can LLMs be fully aligned with human values and ethics?

Current research shows LLMs cannot be fully aligned with human values due to inconsistent reasoning, cultural biases, and fundamental ethical trade-offs.

Direct answer

No, large language models cannot currently be fully aligned with human values and ethics. Research shows that frontier models exhibit inconsistent moral reasoning, shifting their ethical stance under pressure [2], and that models such as GPT-4 and Claude 2 prioritize values like universalism and self-direction over achievement and security in ways that diverge from a human sample spanning 49 nations [4]. Furthermore, alignment techniques often produce superficial conformity rather than genuine ethical understanding, and the very act of defining evaluation criteria is subjective and iterative, which makes a fixed standard hard to specify in advance [1][3].

6 sources cited

This article was generated with WisPaper-powered search and paper analysis.

Why do LLMs give inconsistent ethical answers?

Even the most advanced LLMs do not hold stable moral positions. In a 2025 study using the TriEthix benchmark, researchers tested models across 30 realistic moral dilemmas and found that all models changed their ethical stance when pressured to justify or reconsider their choices [2]. The study measured a "flip-rate" (how often a model reversed its decision) and found significant differences between model families and even between versions of the same model. For example, reasoning-focused models were more consistent than non-reasoning ones, but no model was perfectly stable. This means that an LLM's answer to an ethical question can reverse once the model is challenged or asked to reconsider, which is a fundamental barrier to reliable alignment.
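
As a rough illustration of the flip-rate idea described above, the Python sketch below counts how often a model's final choice differs from its initial choice across a set of dilemmas. The record layout, the `flip_rate` function name, and the sample choices are hypothetical. They are not taken from the TriEthix paper, whose exact scoring may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DilemmaTrial:
    """One dilemma: the model's choice before and after pressure (hypothetical layout)."""
    dilemma_id: str
    initial_choice: str
    final_choice: str


def flip_rate(trials: list[DilemmaTrial]) -> float:
    """Fraction of trials in which the final choice differs from the initial one."""
    if not trials:
        raise ValueError("flip_rate needs at least one trial")
    flips = sum(1 for t in trials if t.initial_choice != t.final_choice)
    return flips / len(trials)


# Made-up example data, for illustration only.
example_trials = [
    DilemmaTrial("d1", "A", "A"),
    DilemmaTrial("d2", "A", "B"),  # reversed under pressure
    DilemmaTrial("d3", "B", "B"),
    DilemmaTrial("d4", "C", "A"),  # reversed under pressure
]
print(flip_rate(example_trials))  # 2 flips out of 4 trials -> 0.5
```

A flip rate of 0.0 would indicate a perfectly stable model on the tested dilemmas. According to the study, no model reached that level of stability [2].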

Beyond inconsistency, LLMs also show systematic value biases. A 2024 study using Schwartz's Theory of Basic Values had four leading LLMs (Bard, Claude 2, GPT-3.5, GPT-4) complete a standardized values questionnaire and compared their profiles to data from 53,472 people across 49 nations [4]. All four models prioritized universalism and self-direction far more than humans do, while de-emphasizing achievement, power, and security. These biases are not random — they reflect the values embedded during training and fine-tuning. When the same models were presented with mental health dilemmas, their biased value profiles strongly predicted their decisions, showing that these value patterns actively shape real-world outputs.
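
One simple way to picture this kind of profile comparison is to subtract the human score from the model score for each value. The Python sketch below does that under stated assumptions: the `value_gaps` helper, the 1-to-6 scale, and every number in it are placeholders chosen only to mirror the direction of the reported pattern. They are not the figures or the statistical method from the study [4].

```python
def value_gaps(model_profile: dict[str, float],
               human_profile: dict[str, float]) -> dict[str, float]:
    """Model score minus human score for every value present in both profiles."""
    shared = model_profile.keys() & human_profile.keys()
    return {value: model_profile[value] - human_profile[value]
            for value in sorted(shared)}


# Placeholder scores on an assumed 1-6 scale; NOT the data reported in [4].
model = {"universalism": 5.5, "self-direction": 5.0,
         "achievement": 3.0, "power": 2.0, "security": 3.5}
human = {"universalism": 4.5, "self-direction": 4.5,
         "achievement": 4.0, "power": 3.0, "security": 4.5}

gaps = value_gaps(model, human)
over_weighted = [v for v, gap in gaps.items() if gap > 0]
under_weighted = [v for v, gap in gaps.items() if gap < 0]
print("over-weighted by the model:", over_weighted)
print("under-weighted by the model:", under_weighted)
```

A positive gap marks a value the model rates higher than the human sample, and a negative gap marks one it rates lower.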

Can LLMs adapt to different cultural values?

Current alignment methods tend to produce superficial conformity rather than genuine cross-cultural understanding. A 2025 study proposed a five-step ethical reasoning framework that improved LLM performance on the SafeWorld benchmark, which tests regional value alignment [1]. The framework — which includes contextual fact gathering, social norm identification, and ethical impact analysis — helped models produce more culturally appropriate reasoning. However, even with this improvement, the models still struggled with the complex, context-dependent nature of human values across different regions. The study's authors note that alignment approaches often fail to address the fact that what is ethical in one culture may be unethical in another.

The problem of cultural bias is compounded by what researchers call "criteria drift." In a 2024 study of an interface called EvalGen, which helps users align LLM evaluation with their own preferences, researchers found that users needed criteria to grade outputs, but grading outputs helped users define their criteria — a circular dependency [3]. Some evaluation criteria even appeared to depend on the specific LLM outputs observed, rather than being definable in advance. This means that alignment is not a one-time fix but an iterative, subjective process that varies by user and context. The study raises serious questions for any approach that assumes evaluation criteria can be independent of the model's outputs.

Is full alignment even theoretically possible?

A 2024 theoretical analysis argues that full ethical alignment may be impossible because different ethical frameworks conflict at a fundamental level [6]. The paper distinguishes between Kantian ethics (which treats persons as ends, not means) and utilitarian or just-distribution theories (which focus on aggregate outcomes). It presents the hypothesis that as LLMs become better aligned with both Kantian and just-distribution principles, the value conflicts between them intensify, because self-attention mechanisms may statistically treat the same characters as more "person-like" or more "resource-like" depending on how prompts are phrased. This suggests that alignment is not a problem to be solved but a trade-off to be managed.

Empirical work supports this theoretical concern. A 2024 study of GPT-3.5 repeatedly prompted the model with moral stories and aggregated its responses to produce a human-AI value alignment metric [5]. The study found that the model's alignment varied across different value categories and that the model lacked consistency in its outputs. The authors conclude that understanding a model's alignment is fundamentally an explainability problem — we need to understand how these complex models behave before we can assess their alignment. Until we can reliably predict and control model behavior across diverse ethical scenarios, full alignment remains out of reach.
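
The idea of aggregating repeated responses into per-category scores can be sketched in Python as follows. This is a minimal illustration, not the metric defined in [5]: the record layout, the function names, the category labels "honesty" and "fairness", and the sample data are all hypothetical.

```python
from collections import defaultdict

# Each record is (value_category, story_id, model_agreed_with_human_label).
# Hypothetical layout; repeated prompts for a story appear as repeated records.
Record = tuple[str, str, bool]


def alignment_by_category(records: list[Record]) -> dict[str, float]:
    """Share of responses per category that agreed with the human label."""
    totals = defaultdict(lambda: [0, 0])  # category -> [agreements, responses]
    for category, _story, agreed in records:
        totals[category][0] += int(agreed)
        totals[category][1] += 1
    return {c: a / n for c, (a, n) in totals.items()}


def inconsistent_stories(records: list[Record]) -> list[str]:
    """Stories where repeated prompts produced both agreeing and disagreeing responses."""
    outcomes = defaultdict(set)
    for _category, story, agreed in records:
        outcomes[story].add(agreed)
    return sorted(s for s, seen in outcomes.items() if len(seen) > 1)


# Made-up responses, for illustration only.
records = [
    ("honesty", "s1", True), ("honesty", "s1", True),
    ("honesty", "s2", True), ("honesty", "s2", False),
    ("fairness", "s3", False), ("fairness", "s3", False),
]
print(alignment_by_category(records))
print(inconsistent_stories(records))
```

Keeping the two measures separate reflects the two findings reported above: alignment can differ between value categories, and responses to the same story can disagree with each other across repeated prompts.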

Sources used in this answer

1

Diverse Human Value Alignment for Large Language Models via Ethical Reasoning

A five-step ethical reasoning framework improved LLM alignment with diverse human values on the SafeWorld benchmark, but the study notes that current approaches often yield superficial conformity rather than genuine ethical understanding.

2

TriEthix: a Triadic Benchmark for Ethical Alignment in Foundation Models

Testing 30 moral dilemmas across frontier LLMs revealed that all models changed their ethical stance under pressure, with flip-rate consistency coefficients varying significantly between model families and scales.

3

Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

A mixed-initiative interface (EvalGen) revealed 'criteria drift' — users need criteria to grade outputs, but grading outputs helps define criteria — showing alignment is subjective and iterative.

4

Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values

Four LLMs (Bard, Claude 2, GPT-3.5, GPT-4) showed value profiles that diverged substantially from 53,472 humans across 49 nations, prioritizing universalism and self-direction while de-emphasizing achievement, power, and security.

5

Measuring Human-AI Value Alignment in Large Language Models

Repeated prompting of GPT-3.5 with moral stories showed that the model's alignment with human values varied across value categories and lacked consistency, making alignment assessment an explainability problem.

6

Learning When Not to Measure: Theorizing Ethical Alignment in LLMs

A theoretical analysis argues that value conflicts between Kantian ethics and just-distribution theories may intensify as LLMs improve alignment with both, because self-attention can treat characters as more person-like or resource-like depending on prompting.