From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

From Benchmarking to Reasoning: Unmasking the "Illusion of Competence" in Vietnamese Legal AI

Abstract

This paper introduces a dual-aspect evaluation framework to assess Large Language Models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) on Vietnamese legal text simplification. It combines a quantitative benchmark across Accuracy, Readability, and Consistency with a qualitative 9-category error typology validated by legal experts.

TL;DR

Can AI democratize justice by making complex laws understandable? This study evaluates four heavyweights—GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1—on Vietnamese legal texts. The verdict: Fluency is not fidelity. While models like Grok-1 lead in readability and consistency, even the most "accurate" models like Claude 3 Opus fall prey to subtle but catastrophic reasoning failures when applying laws to real-world examples.

Problem & Motivation: Beyond the Leaderboard

In civil law systems like Vietnam's, "legalese" acts as a barrier to justice. LLMs promise to bridge this gap through text simplification. However, current benchmarks suffer from a "What vs. Why" problem.

A model might score 90% on a summarization task, yet miss a single "except for..." clause that changes the entire legal outcome. The author argues that we need to move beyond surface-level metrics to a diagnostic approach that identifies specific failure modes in legal reasoning, especially for low-resource languages like Vietnamese.

Methodology: The Dual-Aspect Framework

The study did not stop at scores; it ran a "medical check-up" on the models' logic in two phases:

Phase 1 (The Scale): A quantitative benchmark evaluating 60 complex articles from the Penal Code, Civil Code, and Land Law across three dimensions: Accuracy, Readability, and Consistency.
Phase 2 (The Diagnostic): A deep dive using a 9-category error typology (validated by legal experts) to categorize 480 model outputs into types like "Omission of Exceptions," "Misinterpretation," and "Incorrect Example."
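To make the two phases concrete, here is a minimal sketch of how one annotated output could be recorded and aggregated. It is not the paper's tooling: the `Evaluation` record, its field names, the 1-5 score scale, and the sample rows are all assumptions made for illustration. Only the three dimensions and the error labels come from the description above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Evaluation:
    """One annotated model output. All field names are illustrative."""
    model: str
    article_id: str
    accuracy: float      # Phase 1 scores; the 1-5 scale is an assumption
    readability: float
    consistency: float
    errors: List[str] = field(default_factory=list)  # Phase 2 typology labels


def mean_scores(evals: List[Evaluation], model: str) -> Dict[str, float]:
    """Average the three Phase 1 dimensions over one model's outputs."""
    rows = [e for e in evals if e.model == model]
    if not rows:
        raise ValueError(f"no evaluations recorded for {model!r}")
    return {
        dim: sum(getattr(e, dim) for e in rows) / len(rows)
        for dim in ("accuracy", "readability", "consistency")
    }


def error_profile(evals: List[Evaluation], model: str) -> Counter:
    """Count Phase 2 error labels for one model (one bar of a stacked chart)."""
    return Counter(label for e in evals if e.model == model for label in e.errors)


# Made-up annotations for two placeholder models, not data from the study.
evals = [
    Evaluation("model_a", "art_1", 4.0, 3.0, 4.0, ["Misinterpretation"]),
    Evaluation("model_a", "art_2", 5.0, 3.0, 4.0,
               ["Incorrect Example", "Omission of Exceptions"]),
    Evaluation("model_b", "art_1", 3.0, 5.0, 5.0, []),
    Evaluation("model_b", "art_2", 4.0, 5.0, 5.0, ["Incorrect Example"]),
]

for name in ("model_a", "model_b"):
    print(name, mean_scores(evals, name), dict(error_profile(evals, name)))
```

Keeping the scores and the error labels on the same record is the point of the dual-aspect design: the averages say what a model scored, and the label counts say why.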

Fig 1. Error distribution per LLM: a stacked bar chart showing the unique "error profiles" of each model. Grok-1 shows significantly lower total error counts.

Experimental Results & Insights

The results revealed distinct "personalities" and trade-offs among the models:

Grok-1 (The Cautious Complier): Surprisingly, Grok-1 dominated in Readability and Consistency. Its secret? It adhered strictly to the source text. However, it struggled with "generative" tasks like creating relevant examples.
Claude 3 Opus (The Ambitious Lawyer): While it scored highest in Accuracy, it committed the most Misinterpretation errors (41). It attempts sophisticated analysis but often "overreaches," leading to plausible-sounding but legally wrong conclusions.
GPT-4o (The Oversimplifier): Its main flaw was Oversimplification. In its quest to be helpful and concise, it often stripped away essential legal conditions, rendering the advice dangerous.

Fig 2. Radar chart of overall performance, showing the trade-offs: Claude 3 Opus leads in Accuracy, while Grok-1 dominates in Readability and Consistency.

The "Accuracy Illusion"

The study found that Accuracy scores are often inflated because models handle simple components well, which masks failures in complex reasoning. The most prevalent failure across the board was Error 3.2 (Incorrect Example). Models can rephrase a law (summarization), but they often fail to apply that law correctly to a fictional scenario (reasoning).
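A small worked example shows how this masking can happen. The rubric below is hypothetical (the component names, the 0-to-1 scoring, and the choice of which components count as critical are assumptions, not the study's scoring scheme), but the arithmetic is the general point: an averaged score can stay high while the one component that requires reasoning fails outright.

```python
# Hypothetical per-component scores for a single output (1.0 = correct).
component_scores = {
    "restates_rule": 1.0,
    "keeps_penalty": 1.0,
    "keeps_conditions": 1.0,
    "keeps_exceptions": 1.0,
    "example_applies_rule": 0.0,  # the generated example misapplies the law
}

# Assumption: components whose failure alone makes the output unsafe to rely on.
CRITICAL = {"keeps_exceptions", "example_applies_rule"}


def averaged_accuracy(scores):
    """Plain mean over components, the kind of number a leaderboard reports."""
    return sum(scores.values()) / len(scores)


def failed_critical(scores, critical, threshold=0.5):
    """Names of critical components scoring below the threshold."""
    return sorted(name for name in critical if scores[name] < threshold)


print(f"averaged accuracy: {averaged_accuracy(component_scores):.2f}")
print("critical failures:", failed_critical(component_scores, CRITICAL))
```

Four easy components out of five are right, so the average comes out at 0.80, yet the output still contains an incorrect example. Reporting the critical failures next to the average is one way to keep that failure visible.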

Deep Insight: The Alignment Tax

One notable takeaway is the discussion of the "Alignment Tax." The author hypothesizes that models like GPT-4o, which undergo heavy safety tuning (such as RLHF), may become "too safe," leading to oversimplification. Grok-1, which appeared to have looser alignment constraints in this study, maintained higher fidelity to the source text. This is consistent with the hypothesis that aggressive safety tuning can sometimes degrade performance in specialized technical domains, although the study does not isolate alignment as the cause.

Critical Analysis & Conclusion

Takeaway: Current LLMs are linguistically competent but logically fragile in the legal domain. We are seeing an "Illusion of Competence"—where the model sounds like a lawyer but reasons like a student.

Limitations:

The study used 60 articles; the analysis is deep, but a larger corpus is needed before the findings can be generalized with confidence.
Evaluations were done in a zero-shot setting; techniques like Chain-of-Thought (CoT) or RAG (Retrieval-Augmented Generation) might improve these "lower-bound" results.

Future Outlook: The path forward isn't just "bigger models," but better Risk-Aware Human-in-the-loop systems. For instance, we should route "example generation" to human lawyers while using AI for initial structuring. This framework provides the blueprint for auditing AI reliability in any high-stakes, civil-law jurisdiction.
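The routing idea can be sketched in a few lines. This is an illustration under assumptions, not a system described in the paper: the task names, the `Route` values, and the rule that unknown task types fall back to a human are all hypothetical. The only part taken from the text is that example generation goes to human lawyers and initial structuring goes to the AI.

```python
from enum import Enum


class Route(Enum):
    AI = "ai"
    HUMAN_LAWYER = "human_lawyer"


# Hypothetical routing table. Only these two entries are grounded in the text.
ROUTING_TABLE = {
    "initial_structuring": Route.AI,
    "example_generation": Route.HUMAN_LAWYER,
}


def route(task_type: str) -> Route:
    """Pick a handler for a task; unknown task types default to a human.

    Defaulting to the human is an assumed fail-safe choice for a
    high-stakes domain, not something the study specifies.
    """
    return ROUTING_TABLE.get(task_type, Route.HUMAN_LAWYER)


for task in ("initial_structuring", "example_generation", "penalty_calculation"):
    print(f"{task} -> {route(task).value}")
```

A table like this would presumably be filled in from the per-category error counts, so that the task types where models fail most often are the ones sent to a lawyer.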
