The paper introduces AMA-Bench, the first comprehensive benchmark designed to evaluate long-horizon memory in autonomous agents using real-world and synthetic trajectories. To address the limitations of existing systems, the authors propose AMA-Agent, which utilizes a Causality Graph and Tool-Augmented Retrieval to achieve a 57.22% average accuracy, outperforming current state-of-the-art baselines by 11.16%.
TL;DR
The industry is shifting from chatbots to autonomous agents, but our evaluation of their "brains" (memory) is stuck in the past. AMA-Bench bridges this gap by testing agents on raw, machine-generated trajectories rather than human chats. The study discovers that standard RAG and memory compression are failing agents, leading to the creation of AMA-Agent—a system that uses Causality Graphs and Tool-Augmented Retrieval to break the 32k token performance ceiling.
The "Dialogue Trap" in Agent Evaluation
Most current memory benchmarks (like LoCoMo or RealTalk) treat LLMs as conversationalists. However, a real-world agent (e.g., a software engineer agent or a web-navigation bot) doesn't just chat; it interacts with machine-generated environments.
The authors identify three fatal flaws in prior work:
- Representation Bias: Real agents deal with ASCII tables, JSON, and code snippets, not just free-form prose.
- Lack of Causality: In agents, Step A causes State B. Current benchmarks treat interactions as unconstrained linguistic flow.
- Lossy Compression: Developers often summarize history to save tokens, but in agentic tasks, "summarizing" a state transition often deletes the very "needle" needed for the final task.
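The "needle deletion" problem can be made concrete with a minimal sketch. Everything below (the trajectory shape, the `db_port` value, the summarizer) is hypothetical and only illustrates the failure mode, assuming a summarizer that keeps actions but drops concrete state values:

```python
# Hypothetical agent trajectory: one step carries the "needle" (a concrete
# state value) that the final task will ask about.
raw_trajectory = [
    {"step": 1, "action": "open_file", "state": {"path": "config.yaml"}},
    {"step": 2, "action": "read_value", "state": {"db_port": 5433}},  # the needle
    {"step": 3, "action": "close_file", "state": {"path": "config.yaml"}},
]

def naive_summarize(trajectory):
    """Lossy compression: keep the gist of the actions, drop state values."""
    return "Agent opened a config file, read a value, and closed it."

summary = naive_summarize(raw_trajectory)

# The needle survives in the raw history...
needle = next(s["state"]["db_port"] for s in raw_trajectory
              if "db_port" in s["state"])

# ...but is unrecoverable from the compressed summary.
needle_in_summary = str(needle) in summary
```

A long-context model fed `raw_trajectory` verbatim can still answer "which port?", while any system that compressed history through `naive_summarize` cannot.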

AMA-Bench: A Two-Pronged Attack
To solve this, the researchers built AMA-Bench, featuring:
- Real-World Subset: 2,496 expert-curated QA pairs across 6 domains (Web, Text2SQL, Software Engineering, Gaming, etc.).
- Synthetic Subset: Programmatically generated environments (TextWorld, BabyAI) that allow for infinite horizon scaling to test exactly where a model breaks.
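The value of the synthetic subset is that episode length is a free parameter. A rough sketch of the idea, using a stand-in generator rather than the actual TextWorld/BabyAI tooling (the step format, the hidden fact, and the question are all invented for illustration):

```python
import random

def make_trajectory(horizon, seed=0):
    """Programmatically build a `horizon`-step trajectory hiding one fact.

    Hypothetical stand-in for a synthetic episode generator: because the
    environment is generated, the horizon can be scaled without bound to
    locate the exact length at which a memory system breaks.
    """
    rng = random.Random(seed)
    needle_step = rng.randrange(horizon)
    steps = []
    for t in range(horizon):
        obs = f"step {t}: agent observes room_{rng.randrange(100)}"
        if t == needle_step:
            obs += " and finds the key under the rug"  # ground truth for final QA
        steps.append(obs)
    question = "Where was the key found?"
    return steps, question, needle_step

# Horizon sweep: same question, exponentially longer histories.
for horizon in (1_000, 4_000, 16_000):
    steps, question, needle = make_trajectory(horizon)
    assert "under the rug" in steps[needle]
```

Holding the task fixed while only the horizon grows isolates memory capacity from reasoning ability, which is exactly what a "where does the model break" sweep needs.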
Methodology: AMA-Agent & The Causality Graph
The benchmark results revealed a shocking truth: Existing memory systems often perform WORSE than just feeding the whole raw text into a long-context window. This is because "similarity-based retrieval" (standard RAG) is too blunt a tool for complex causal reasoning.
The authors propose AMA-Agent, which replaces retrieval over flat chunks of text with a Causality Graph.
- Graph Construction: Instead of just indexing chunks, it extracts objects and environment states, linking them via directed "causality edges."
- Tool-Augmented Retrieval: If a standard search fails, the agent can call a Search Tool to walk the graph nodes or run a Python Script to count/aggregate occurrences across the whole history.
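The two components above can be sketched together in a few dozen lines. This is a minimal illustration of the idea, not the paper's implementation; the class name, edge format, and example states are all assumptions:

```python
from collections import defaultdict

class CausalityGraph:
    """Minimal sketch of a causality graph over an agent trajectory.

    Nodes are object/state snapshots; a directed edge (src -> dst) records
    that the action taken at src caused state dst.
    """

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(action, successor_node)]
        self.nodes = {}                 # node id -> state payload

    def add_transition(self, src, action, dst, dst_state):
        self.nodes.setdefault(src, {})
        self.nodes[dst] = dst_state
        self.edges[src].append((action, dst))

    def walk(self, start, max_depth=10):
        """Search-tool fallback: traverse causality edges directly
        instead of relying on embedding similarity."""
        frontier, seen = [(start, 0)], set()
        while frontier:
            node, depth = frontier.pop()
            if node in seen or depth > max_depth:
                continue
            seen.add(node)
            yield node, self.nodes.get(node, {})
            for _, nxt in self.edges[node]:
                frontier.append((nxt, depth + 1))

g = CausalityGraph()
g.add_transition("s0", "open_door", "s1", {"door": "open"})
g.add_transition("s1", "take_key", "s2", {"door": "open", "key": "held"})

# Script fallback: aggregate over the *whole* history, something top-k
# similarity retrieval cannot do reliably.
times_key_held = sum(1 for _, st in g.walk("s0") if st.get("key") == "held")
```

The key design point is that "find the state where X became true" is a graph traversal or an exact aggregation, not a nearest-neighbor lookup, so the answer does not depend on an embedding happening to rank the right chunk highly.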

Key Experimental Insights
The team's findings challenge common wisdom:
- Model Size isn't the Cure: Moving from an 8B to a 32B model yields only modest gains, while changing the memory architecture creates massive performance swings (up to 45% variance).
- The 32k Token Wall: Long-context models perform well at first, but their accuracy craters once the trajectory grows past roughly 32k tokens. AMA-Agent stays stable even at 128k tokens.
- SOTA Achievement: AMA-Agent surpassed the strongest baseline (HippoRAG2) by over 11%, particularly excelling in State Updating and Recall.

Critical Perspective & Future Work
The core contribution here is the realization that semantic similarity is not enough for agents. An agent doesn't need "related" text; it needs "specific" state evidence. While AMA-Agent is a massive step forward, it still struggles with State Abstraction (condensing high-level intent), which remains a difficult frontier for the industry.
Future iterations will likely look toward Lifelong Learning, where these causality graphs aren't just for one episode, but are carried across weeks of agent operation.
Verdict: If you are building an autonomous agent in 2025, stop relying solely on Vector DBs and start looking at Causality Graphs.
