[Memento-Skills] The Agent as Architect: Achieving Continual Learning Without Gradient Descent

Abstract

Memento-Skills is a generalist LLM agent system that functions as an "agent-designing agent," autonomously constructing and refining a library of reusable skills (code, prompts, and markdown) to solve complex tasks. Built on the Memento 2 framework, it achieves SOTA performance on GAIA and HLE benchmarks without updating any LLM parameters.

TL;DR

The Memento-Team has introduced Memento-Skills, an agentic system that does not just execute tasks: it designs the tools to solve them. By treating "skills" (modular code and prompts) as an evolving external memory, the system achieves large gains (HLE accuracy rises from 17.9% to 38.7%, a relative improvement of about 116%) while keeping the underlying LLM (e.g., Gemini 3.1) completely frozen. It effectively replaces a "static model" with a "dynamic library of expertise."

The "Frozen Brain" Problem

Most current LLM agents suffer from a fundamental paradox: they are deployed with fixed parameters ($\theta$), making them "stateless." Once they hit a task outside their training distribution or fail at a specific workflow, they have no intrinsic way to "remember" the mistake or "learn" a better route for next time.

The authors argue that we shouldn't be "rewriting neurons" (fine-tuning) for every new task. Instead, we should give the agent a writable notebook of executable skills. The challenge, however, is making this notebook more than just a graveyard of past logs: it needs to be a structured, evolving codebase.

Methodology: The Read-Write Reflective Loop

The core of Memento-Skills is the Stateful Reflective Decision Process (SRDP). It breaks down the agent's life into a continuous loop of five steps:

Observe: Take in the user query.
Read (Skill Selection): Use a behavior-aligned router to pick the best "skill" (a folder containing code, prompts, and documentation).
Act: Execute the skill's multi-step workflow.
Feedback: A "Judge" LLM evaluates the outcome.
Write (Evolution): If it failed, a failure attribution module identifies the bug and rewrites the skill's code or prompt to add guardrails.
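The five steps above can be sketched as a single function. This is a minimal illustration, not the authors' implementation: the `Skill` class, `reflective_step`, and the `route`, `judge`, and `revise` callables are hypothetical names standing in for the router, the Judge LLM, and the failure attribution module.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Skill:
    """Hypothetical stand-in for a skill folder (code + prompt + docs)."""
    name: str
    run: Callable[[str], str]                      # the skill's executable workflow
    guardrails: List[str] = field(default_factory=list)


def reflective_step(
    query: str,
    library: Dict[str, Skill],
    route: Callable[[str, Dict[str, Skill]], str],
    judge: Callable[[str, str], bool],
    revise: Callable[[Skill, str, str], Skill],
) -> Tuple[str, bool]:
    """One pass through Observe -> Read -> Act -> Feedback -> Write."""
    # Observe: the incoming query is the observation.
    # Read: the router picks a skill name from the library.
    name = route(query, library)
    skill = library[name]
    # Act: execute the selected skill.
    output = skill.run(query)
    # Feedback: the judge decides whether the outcome is acceptable.
    ok = judge(query, output)
    # Write: on failure, store a revised skill back into the library.
    if not ok:
        library[name] = revise(skill, query, output)
    return output, ok


# Toy stand-ins (assumptions): the task is "return the query in upper case".
def toy_route(query: str, library: Dict[str, Skill]) -> str:
    return "echo"


def toy_judge(query: str, output: str) -> bool:
    return output == query.upper()


def toy_revise(skill: Skill, query: str, output: str) -> Skill:
    return Skill(skill.name, str.upper, skill.guardrails + ["uppercase output"])


library = {"echo": Skill("echo", lambda q: q)}
first = reflective_step("hello", library, toy_route, toy_judge, toy_revise)
second = reflective_step("hello", library, toy_route, toy_judge, toy_revise)
```

In this toy run the first attempt fails the judge, the skill is rewritten, and the second attempt succeeds without any change to a model's parameters; all of the "learning" lives in `library`.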

1. The Skill-Centric Architecture

Unlike traditional RAG, which retrieves text chunks, Memento-Skills retrieves functional modules.

Model Architecture Figure: The system decomposes 30,000 lines of hard-coded "if-else" logic into a clean, modular Skill System managed by an Evolution Engine.
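To make "a skill is a folder" concrete, here is a small sketch of loading such folders into an in-memory library. The file names `skill.py`, `prompt.md`, and `README.md`, and the names `SkillFolder`, `load_skill`, and `load_library`, are assumptions for illustration; the source does not specify the on-disk layout.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Dict


@dataclass(frozen=True)
class SkillFolder:
    name: str    # directory name, used as the skill identifier
    code: str    # executable workflow
    prompt: str  # prompt text used by the skill
    doc: str     # human-readable documentation


def load_skill(folder: Path) -> SkillFolder:
    """Read one skill directory; the file names are illustrative assumptions."""
    def read(filename: str) -> str:
        path = folder / filename
        return path.read_text(encoding="utf-8") if path.exists() else ""

    return SkillFolder(folder.name, read("skill.py"), read("prompt.md"), read("README.md"))


def load_library(root: Path) -> Dict[str, SkillFolder]:
    """Treat every subdirectory of `root` as one retrievable skill."""
    return {p.name: load_skill(p) for p in sorted(root.iterdir()) if p.is_dir()}


# Build a throwaway library with a single skill and load it back.
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "web_search"
    skill_dir.mkdir()
    (skill_dir / "skill.py").write_text("def run(query):\n    return query\n", encoding="utf-8")
    (skill_dir / "prompt.md").write_text("Search the web for the query.", encoding="utf-8")
    demo_library = load_library(Path(tmp))
```

The unit of retrieval here is the whole folder rather than a text chunk, which is the contrast with RAG that the section draws.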

2. Behavior-Aligned Routing

One of the paper's key insights is that cosine similarity is not enough. A query that sounds like a "refund request" is not necessarily best served by the "refund skill": the user may actually need a "password reset" to access the refund page. The team trained a Memento-Qwen router using single-step offline RL. Instead of matching on surface similarity, it fits a $Q$-function that predicts execution success.
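The difference between the two routing styles can be shown on a toy example. The sketch below is not the Memento-Qwen router: it swaps in a tabular, token-level success-rate estimate as a stand-in for a learned $Q$-function, and the skill descriptions, logged episodes, and function names are all invented for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical skill descriptions and logged (query, skill, reward) episodes.
SKILL_DOCS = {
    "refund": "refund request for an order",
    "password_reset": "reset account password",
}
LOGS: List[Tuple[str, str, float]] = [
    ("refund request for damaged order", "refund", 1.0),
    ("refund request blocked by login", "refund", 0.0),
    ("refund page login fails", "refund", 0.0),
    ("refund page login fails", "password_reset", 1.0),
    ("password login fails", "password_reset", 1.0),
    ("refund request for late order", "password_reset", 0.0),
]


def tokens(text: str) -> List[str]:
    return text.lower().split()


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_route(query: str) -> str:
    """Semantic-style routing: pick the skill whose description looks most like the query."""
    q = Counter(tokens(query))
    return max(SKILL_DOCS, key=lambda s: cosine(q, Counter(tokens(SKILL_DOCS[s]))))


def fit_q(logs: List[Tuple[str, str, float]]) -> Dict[Tuple[str, str], float]:
    """Single-step (bandit-style) fit: mean logged reward per (token, skill) pair."""
    totals = defaultdict(lambda: [0.0, 0])
    for query, skill, reward in logs:
        for tok in set(tokens(query)):
            totals[(tok, skill)][0] += reward
            totals[(tok, skill)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def q_value(q: Dict[Tuple[str, str], float], query: str, skill: str) -> float:
    vals = [q[(tok, skill)] for tok in set(tokens(query)) if (tok, skill) in q]
    return sum(vals) / len(vals) if vals else 0.0


def q_route(q: Dict[Tuple[str, str], float], query: str) -> str:
    """Behavior-style routing: pick the skill with the highest predicted success."""
    return max(SKILL_DOCS, key=lambda s: q_value(q, query, s))


q_table = fit_q(LOGS)
query = "refund request blocked because password login fails on refund page"
by_similarity = similarity_route(query)
by_q = q_route(q_table, query)
```

On this query the similarity router picks `refund` because the wording overlaps most with that description, while the success-based router picks `password_reset` because the logs show that refund attempts fail whenever login is the real blocker.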

Experimental Results: From Sparsity to Density

The researchers tested this on GAIA (general tasks) and Humanity's Last Exam (HLE) (hard academic reasoning).

GAIA: Accuracy jumped from 52.3% to 66.0%.
HLE: Accuracy more than doubled, from 17.9% to 38.7%.
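As a quick check on these figures, the absolute and relative gains can be computed directly from the reported accuracies; the helper name `gains` is just for this example.

```python
from typing import Tuple


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


gaia = gains(52.3, 66.0)  # (13.7, 26.2)
hle = gains(17.9, 38.7)   # (20.8, 116.2)
```

The HLE relative gain of about 116% is where the "more than doubled" description comes from; the GAIA gain is 13.7 points, or about 26% relative.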

The "HLE" results are particularly telling. Because HLE has structured domains (Biology, Humanities, etc.), a skill learned to solve one "Biology" question was frequently reused to solve others.

Learning Curves on HLE Figure: Success rates across training rounds (R0 to R3) show clear convergence as the skill library matures.

The "Muscle Memory" of Agents

As the agent learns, it fills its "embedding space" with skills. The paper uses t-SNE projections to show how a library grows from 5 atomic seeds to a dense cloud of 235 specialized skills. This "densification" reduces the "memory coverage radius," meaning the LLM has to do less "guessing" and more "following proven patterns."

Skill Embedding Growth Figure: t-SNE projection showing the evolution from sparse seed skills (red) to a dense, domain-specific library (blue).
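One way to picture the "memory coverage radius" is as the worst-case distance from any query to its nearest skill in embedding space. That definition, the 2-D points, and the name `coverage_radius` are assumptions made for this sketch; the paper's formal definition may differ.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def coverage_radius(queries: Sequence[Point], skills: Sequence[Point]) -> float:
    """Largest distance from any query to its nearest skill embedding."""
    return max(min(math.dist(q, s) for s in skills) for q in queries)


# Toy 2-D "embeddings": four queries at the corners of a unit square.
queries = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
seed_skills = [(0.0, 0.0)]                            # sparse seed library
grown_skills = seed_skills + [(1.0, 1.0), (1.0, 0.0)]  # library after more episodes

r_seed = coverage_radius(queries, seed_skills)
r_grown = coverage_radius(queries, grown_skills)
```

Adding skills can only keep this radius the same or shrink it, which matches the intuition that a denser library leaves fewer queries far from any proven pattern.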

Critical Insight: The Three Independent Knobs

The paper concludes with a powerful theoretical framework. The gap between a "perfect" agent and a current agent can be closed by turning three independent levers:

LLM Quality: Swap for a better base model to reduce generalization error.
Memory Density: Run more "episodes" to add more skills and shrink the coverage gap.
Retrieval Accuracy: Improve the router to reduce selection error.
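Read schematically, the three levers correspond to three separate error terms. The additive form and all symbols below are assumptions made for illustration, not the paper's stated bound: $J(\pi^{\star})$ and $J(\pi)$ denote the performance of a perfect agent and the current agent.

```latex
\underbrace{J(\pi^{\star}) - J(\pi)}_{\text{gap to a perfect agent}}
\;\lesssim\;
\underbrace{\varepsilon_{\mathrm{LLM}}}_{\text{generalization error}}
\;+\;
\underbrace{\varepsilon_{\mathrm{cov}}}_{\text{coverage gap}}
\;+\;
\underbrace{\varepsilon_{\mathrm{ret}}}_{\text{selection error}}
```

Under this reading, a better base model lowers $\varepsilon_{\mathrm{LLM}}$, more episodes lower $\varepsilon_{\mathrm{cov}}$, and a better router lowers $\varepsilon_{\mathrm{ret}}$, each without touching the other two.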

By separating these, Memento-Skills provides an engineering-friendly roadmap for building agents that actually get smarter every day they spend in "production," without ever needing a GPU cluster for retraining.

Conclusion

Memento-Skills shows that a "frozen" LLM is not a "static" LLM. By moving the learning process from the model weights to an externalized skill library, we gain transparency, safety (via code unit tests), and continuous improvement. As one of the characters in the paper's witty dialogue notes: "The diminishing returns aren’t a bug; they’re a sign the system is converging."
