SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation

SwanNLP: Simulating Human Nuance in Narrative Word Sense Disambiguation

Abstract

The SwanNLP paper introduces a multi-tier LLM framework for rating word sense plausibility in narrative contexts for SemEval-2026 Task 5. It combines Supervised Fine-Tuning (SFT) of low-parameter models with dynamic few-shot RAG for large-scale LLMs, achieving 10th place globally with an average score of 0.760.

TL;DR

The SwanNLP framework addresses the challenge of Narrative Word Sense Disambiguation (WSD) by moving beyond binary "correct/incorrect" labels to human-like plausibility scoring (1-5). By combining fine-tuned 4B models (Qwen/Gemma) with a dynamic few-shot GPT-4o system and meta-learning ensembles (XGBoost), the team successfully mimicked the variance and reasoning of human annotators in the AmbiStory dataset.

Problem: The Limits of Sentence-Level WSD

Most WSD benchmarks operate in "silos": isolated sentences where the context is thin. In real-world storytelling, a word's sense is often ambiguous and depends on the "pre-context" or the "ending" of a narrative.

SemEval-2026 Task 5 highlights a critical pain point: LLMs are good at identifying common senses but struggle with "human-ambiguous" scenarios where a sense is only partially plausible. Standard models often fail to capture the subtle gradations (e.g., why a sense is a '3' rather than a '5').

Methodology: A Three-Tiered Reasoning Approach

The researchers didn't just throw a larger model at the problem. Instead, they built a pipeline that respects different levels of reasoning complexity.

1. Difficulty-Aware Fine-Tuning

For low-parameter models (4B range), the authors found that simply predicting a score wasn't enough. They introduced Difficulty Analysis. As shown in the workflow below, they categorized data into "Easy" (High/Low scores) and "Ambiguous" (High variance in human scores) to teach the model when to be uncertain.

Figure 1: The logic used to classify whether a case is easy or difficult based on annotator agreement.
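
The Easy/Ambiguous split described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the thresholds (std_threshold, low, high), and the extra "intermediate" label for mid-range scores with high agreement are all assumptions. The summary only states that easy cases have high or low scores and ambiguous cases have high variance in human scores.

```python
from statistics import mean, pstdev


def classify_difficulty(scores, std_threshold=1.0, low=2.0, high=4.0):
    """Label one sample from its per-annotator plausibility scores (1-5).

    All thresholds are illustrative assumptions, not values from the paper.
    """
    avg = mean(scores)
    spread = pstdev(scores)  # population standard deviation across annotators
    if spread > std_threshold:
        return "ambiguous"  # annotators disagree strongly
    if avg <= low or avg >= high:
        return "easy"  # annotators agree on a clearly low or clearly high score
    return "intermediate"  # hypothetical label: agreement on a mid-range score


if __name__ == "__main__":
    print(classify_difficulty([5, 5, 4, 5, 5]))
    print(classify_difficulty([1, 5, 2, 4, 3]))
```

A label produced this way could be attached to each training example so that the fine-tuned model sees when human annotators were uncertain.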

2. Dynamic Few-Shot RAG

For commercial models like GPT-4o, they used a Dynamic RAG approach. Instead of static examples, the system retrieves stories from a vector store (FAISS) that share similar ambiguity profiles with the query. This "in-context learning" helps the model calibrate its 1-5 scale based on relevant precedents.
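
A rough sketch of the retrieval step, under stated assumptions: the paper summary mentions a FAISS vector store, but this example swaps in a brute-force cosine search over tiny hand-made vectors so that it runs without dependencies. The store layout, field names, and prompt wording are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_examples(query_vec, store, k=2):
    """Return the k stored examples most similar to the query embedding."""
    ranked = sorted(store, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]


def format_few_shot(examples, query_story):
    """Build a few-shot prompt: retrieved examples first, then the open query."""
    blocks = [f"Story: {ex['story']}\nPlausibility (1-5): {ex['score']}" for ex in examples]
    blocks.append(f"Story: {query_story}\nPlausibility (1-5):")
    return "\n\n".join(blocks)


# Toy store with made-up embeddings and scores (illustration only).
STORE = [
    {"story": "story A", "vec": [1.0, 0.0], "score": 5},
    {"story": "story B", "vec": [0.0, 1.0], "score": 1},
    {"story": "story C", "vec": [0.7, 0.7], "score": 3},
]

if __name__ == "__main__":
    shots = retrieve_examples([0.9, 0.1], STORE, k=2)
    print(format_few_shot(shots, "a new ambiguous story"))
```

Because the examples change with every query, the model sees scored precedents that resemble the case at hand rather than a fixed set of demonstrations.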

3. Simulating the "Five-Annotator" Consensus

Recognizing that human judgment is rarely a monolith, they treated different models as independent annotators. They tested several ensembling strategies, including Linear Regression Ensembles (LRE) and XGBoost, to weigh the predictions and minimize individual model bias.
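
To make the ensembling idea concrete, here is a minimal sketch of a linear regression ensemble: each model's score is treated as one feature, and least-squares weights are fitted against the mean human score. The function names and the toy data are assumptions for illustration. The paper's exact features, training setup, and its XGBoost variant are not reproduced here.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_ensemble(model_scores, human_means):
    """Return least-squares weights, one per model, followed by an intercept."""
    rows = [list(s) + [1.0] for s in model_scores]  # append a bias column
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(rows, human_means)) for i in range(d)]
    return solve_linear_system(xtx, xty)


def predict(weights, scores):
    """Weighted sum of model scores plus intercept, clipped to the 1-5 scale."""
    raw = sum(w * s for w, s in zip(weights, scores)) + weights[-1]
    return min(5.0, max(1.0, raw))


# Made-up scores from two hypothetical models and the matching human means.
MODEL_SCORES = [(1, 3), (5, 3), (2, 4), (4, 2), (3, 5)]
HUMAN_MEANS = [2.0, 4.0, 3.0, 3.0, 4.0]

if __name__ == "__main__":
    weights = fit_linear_ensemble(MODEL_SCORES, HUMAN_MEANS)
    print(weights, predict(weights, (4, 4)))
```

The sketch assumes the model scores are not perfectly collinear; otherwise the normal equations are singular and the solver would divide by zero.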

Key Results: Qwen's Leap and GPT-4o's Dominance

The experiments revealed a massive gap between "base" models and "reasoning-aware" models:

Small Model Gains: Fine-tuning Qwen-4B with difficulty analysis improved Spearman correlation from a near-random 0.031 to a robust 0.491.
Large Model Prowess: GPT-4o remained the strongest baseline for plausibility scoring, particularly when aided by dynamically retrieved examples.
Ensemble Success: The Linear Regression Ensemble achieved the highest test accuracy (0.797), suggesting that a "committee of models" approximates the average human score better than any single model.

Table: Performance comparison showing the drastic improvement from 'Base Inference' to 'Difficulty Analysis' fine-tuning.
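
For readers unfamiliar with the metric, the Spearman correlation reported above is the Pearson correlation computed on ranks. The sketch below is a dependency-free illustration with made-up scores; in practice a library routine such as scipy.stats.spearmanr would normally be used, and nothing here reproduces the paper's numbers.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(predicted, human):
    """Spearman rank correlation between predictions and human scores."""
    return pearson(average_ranks(predicted), average_ranks(human))


if __name__ == "__main__":
    # Made-up predictions and human means, for illustration only.
    print(spearman([4.2, 1.5, 3.0, 2.2], [5.0, 1.0, 3.5, 3.0]))
```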

Critical Insight & Conclusion

The core takeaway is that subjective NLP tasks require structured reasoning. When LLMs are prompted to consider "evidence for" and "evidence against" a sense before scoring (as seen in the Appendix A prompt), their ratings align much more closely with human judgments.
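
The evidence-before-score pattern can be illustrated with a small prompt builder and response parser. The template wording, function names, and output format below are hypothetical; they are not the Appendix A prompt, which is not reproduced in this summary.

```python
import re

# Hypothetical template: asks for evidence on both sides before the score.
PROMPT_TEMPLATE = (
    "Story: {story}\n"
    "Target word: {word}\n"
    "Candidate sense: {sense}\n\n"
    "First list the evidence for this sense, then the evidence against it.\n"
    "Finally, give a plausibility rating from 1 to 5 on a line of the form\n"
    "'Score: <number>'."
)


def build_prompt(story, word, sense):
    """Fill the template for one story, target word, and candidate sense."""
    return PROMPT_TEMPLATE.format(story=story, word=word, sense=sense)


def parse_score(response):
    """Extract the 1-5 rating from a model response, or None if absent."""
    match = re.search(r"Score:\s*([1-5])\b", response)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(build_prompt("She sat by the bank all afternoon.", "bank", "edge of a river"))
    reply = "Evidence for: she is outdoors.\nEvidence against: none.\nScore: 4"
    print(parse_score(reply))
```

Placing the score last means the rating is generated after the model has written out both sides of the argument.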

However, a lingering limitation exists: even the best models struggle when the context is inherently uncertain. Future research must look into Probabilistic WSD, where models output a distribution of plausibility rather than a single scalar, better reflecting the diversity of human interpretation.
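
As a sketch of what a distributional output might look like, the example below converts a set of annotator ratings into a probability distribution over the 1-5 scale and summarizes it by its mean and standard deviation. The function names and the toy ratings are assumptions; the paper proposes the direction but, as summarized here, does not specify an implementation.

```python
import math

SCALE = (1, 2, 3, 4, 5)


def empirical_distribution(ratings):
    """Share of annotators who chose each score on the 1-5 scale."""
    return [ratings.count(s) / len(ratings) for s in SCALE]


def summarize(probs):
    """Mean and standard deviation of a distribution over the 1-5 scale."""
    expected = sum(s * p for s, p in zip(SCALE, probs))
    variance = sum(p * (s - expected) ** 2 for s, p in zip(SCALE, probs))
    return expected, math.sqrt(variance)


if __name__ == "__main__":
    # Made-up ratings from five annotators who disagree.
    dist = empirical_distribution([1, 5, 5, 3, 3])
    print(dist, summarize(dist))
```

A model that predicted such a distribution could be compared with the human one directly, instead of being scored only on a single scalar.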

Final Achievement: 10th Place (SemEval-2026 Task 5 Leaderboard).
