This paper presents a unified systems-level review of Large Language Model (LLM) agents through the lens of externalization. It identifies three core dimensions—Memory, Skills, and Protocols—and introduces Harness Engineering as the integration layer that transforms internal cognitive burdens into reliable external structures.
TL;DR
The secret to reliable AI agents is not more parameters, but Externalization. This paper argues that modern agent design is shifting from "Model-Centric" to "Harness-Centric," moving cognitive burdens—like memory, procedural expertise, and interaction rules—out of the model’s weights and into a managed infrastructure called the Harness.
Background: The Outward Migration of Intelligence
For years, the industry assumed that "smarter weights = better agents." However, even the most powerful models struggle to stay consistent over long-running tasks. The authors trace a historical arc in where capability lives:
- Weights (2022-2023): Knowledge is frozen in parameters. Hard to update, easy to hallucinate.
- Context (2023-2024): Cognition is "staged" in the prompt through retrieval-augmented generation (RAG) and prompting techniques such as Chain-of-Thought.
- Harness (2025-2026): Intelligence is distributed across persistent memory, skill registries, and standardized protocols.
Figure 1: Just as humans moved from internal thought to writing and digital computation, LLM agents are moving from internal weights to external harnesses.
The Core Architecture: Memory, Skills, and Protocols
The paper breaks down "Agency" into three externalized modules that transform the model's task:
1. Externalized State: Memory
Instead of cramming everything into a fragile context window, the harness uses a Hierarchical Memory Architecture.
- Transformation: From Recall (hard) to Recognition/Retrieval (reliable).
- The Design: Modern systems use "OS-style" memory management, paging information between "hot" context (active work) and "cold" storage (historical episodes).
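The hot/cold idea can be illustrated with a minimal sketch. It is not the paper's implementation: the `TieredMemory` class, its capacity parameter, and the keyword-overlap retrieval are all assumptions chosen for brevity, and a real system would more likely use embeddings or a database for the cold tier.

```python
from collections import OrderedDict


class TieredMemory:
    """Hypothetical two-tier store: a small 'hot' working set plus a 'cold' archive."""

    def __init__(self, hot_capacity=3):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # active context, oldest entry first
        self.cold = {}            # historical episodes evicted from hot

    def write(self, key, text):
        """Put an entry in hot memory, evicting the oldest entries to cold."""
        self.cold.pop(key, None)
        self.hot[key] = text
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            old_key, old_text = self.hot.popitem(last=False)
            self.cold[old_key] = old_text

    def recall(self, query):
        """Return keys whose text shares a word with the query.

        Cold hits are paged back into hot memory, so the model only has
        to recognise retrieved text rather than recall it unaided.
        """
        words = set(query.lower().split())
        everything = {**self.cold, **self.hot}
        hits = [k for k, v in everything.items() if words & set(v.lower().split())]
        for key in hits:
            if key in self.cold:
                self.write(key, self.cold[key])
        return hits
```

With `hot_capacity=2`, writing a third entry pushes the oldest one to cold storage, and a later `recall` that matches it brings it back into the hot set at the expense of the least recently written entry.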
2. Externalized Expertise: Skills
A model shouldn't have to "invent" a workflow every time.
- Transformation: From Improvised Generation to Structured Composition.
- The Artifact: "Skill files" (like SKILL.md) encapsulate procedures, decision heuristics, and safety constraints. The agent simply loads the expertise required for the task.
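As a rough sketch of "loading expertise", the snippet below keeps skills in a registry and renders one into text that a harness could place in the model's context. The `Skill` fields, the `SkillRegistry` class, and the rendered layout are hypothetical; they are not the SKILL.md format itself, which this summary does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical in-memory form of a skill file."""
    name: str
    description: str
    steps: list[str]
    constraints: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Turn the skill into plain text suitable for a prompt."""
        lines = [f"# Skill: {self.name}", self.description, "Steps:"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.steps, 1)]
        if self.constraints:
            lines.append("Constraints:")
            lines += [f"- {c}" for c in self.constraints]
        return "\n".join(lines)


class SkillRegistry:
    """Hypothetical registry: the agent asks for a skill by name."""

    def __init__(self):
        self._skills = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def load(self, name: str) -> str:
        if name not in self._skills:
            raise KeyError(f"no skill registered under {name!r}")
        return self._skills[name].render()
```

The design point is that the procedure lives in a versioned artifact outside the model, so improving a workflow means editing the skill rather than retraining or re-prompting from scratch.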
3. Externalized Interaction: Protocols
Interaction with tools (APIs) or other agents is often the point of failure.
- Transformation: From Ad-hoc Language to Structured Contracts.
- The Solution: Standards like the Model Context Protocol (MCP) or Agent-to-Agent (A2A) protocols are designed to make communication typed, validated, and secure.
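To make "structured contracts" concrete, here is a minimal sketch of a harness checking a model-emitted tool call before executing it. The `TOOL_CONTRACTS` table, the `get_weather` tool, and the `{"tool": ..., "args": ...}` shape are illustrative assumptions; this is not the MCP or A2A wire format.

```python
import json

# Hypothetical contracts: tool name -> {argument name: required Python type}.
TOOL_CONTRACTS = {
    "get_weather": {"city": str, "days": int},
}


def validate_call(raw: str) -> dict:
    """Parse a JSON tool call and reject anything that breaks its contract.

    Malformed JSON raises json.JSONDecodeError, a subclass of ValueError,
    so callers can catch ValueError for every failure mode.
    """
    call = json.loads(raw)
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOL_CONTRACTS:
        raise ValueError(f"unknown tool: {name!r}")
    contract = TOOL_CONTRACTS[name]
    args = call.get("args")
    if not isinstance(args, dict):
        raise ValueError("'args' must be a JSON object")
    if set(args) != set(contract):
        raise ValueError(f"expected arguments {sorted(contract)}, got {sorted(args)}")
    for key, expected in contract.items():
        # Exact type match, so a JSON boolean is not accepted where an int is required.
        if type(args[key]) is not expected:
            raise ValueError(f"argument {key!r} must be {expected.__name__}")
    return {"tool": name, "args": args}
```

A call such as `{"tool": "get_weather", "args": {"city": "Oslo", "days": "3"}}` is rejected before any tool runs, which turns a silent downstream failure into an explicit error the agent can react to.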
Methodology: The Harness as a Cognitive Environment
The "Harness" is the runtime that hosts these modules. It isn't just "glue code"; it's a Cognitive Environment that implements:
- Agent Loops: The "Perceive-Plan-Act" cycle.
- Sandboxing: Confining the model's actions to a safe, isolated execution environment.
- Observability: Creating a "black box recorder" of every decision.
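The three responsibilities above can be sketched in one small loop. Everything here is an assumption made for illustration: `run_agent`, the `policy` callable standing in for the LLM, and the allow-list that stands in for sandboxing (a real sandbox would also isolate processes, files, and network access).

```python
def run_agent(policy, tools, goal, max_steps=5):
    """Hypothetical harness loop.

    policy: callable taking the latest observation and returning
            (action_name, argument); it stands in for the model.
    tools:  dict of permitted action names to callables (the allow-list).
    Returns (final_answer_or_None, trace).
    """
    trace = []          # observability: one record per decision
    observation = goal  # perceive: the first observation is the goal itself
    for step in range(max_steps):
        action, argument = policy(observation)  # plan
        if action == "finish":
            trace.append({"step": step, "action": action,
                          "argument": argument, "result": argument})
            return argument, trace
        if action in tools:                     # act, only if permitted
            result = tools[action](argument)
        else:
            result = f"error: action {action!r} is not permitted"
        trace.append({"step": step, "action": action,
                      "argument": argument, "result": result})
        observation = result                    # feed the result back in
    return None, trace  # step budget exhausted without a final answer
```

A disallowed action is never executed; the refusal is returned to the policy as an observation and recorded in the trace, so the run can be audited afterwards.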
Figure 2: The community focus is shifting from the 'Agent Core' (LLM) to the 'Harness' (Infrastructure).
Deep Insight: The Cerebrum vs. The Cerebellum
In a fascinating extension to Embodied AI (Robotics), the authors suggest we are seeing a "Cerebrum–Cerebellum split."
- The Cerebrum (LLM Agent): Handles high-level reasoning and task decomposition.
- The Cerebellum (VLA Models): Handles fast, reactive motor control (e.g., grasping an object).
By externalizing motor control as a "Skill," the high-level brain is freed to focus on the goal, which the authors argue makes robots significantly more robust.
Critical Analysis & Conclusion
This paper reframes what "SOTA" means. A system's power is no longer measured only by its benchmark score on a static test, but also by the quality of its externalization.
The Trade-off: Externalization adds latency and context overhead. If you externalize too much, the model spends all its time reading manuals rather than working. The future of AI research will be "Partitioning": deciding exactly which 10% of a task should stay in the model's brain and which 90% should be moved to the harness.
Final Takeaway: We are entering the era of "Distributed Agency." If you want to build a better AI assistant, stop tuning the model and start building a better environment for it to live in.
