The paper introduces AMA-Bench, the first comprehensive benchmark designed to evaluate long-horizon memory in autonomous agents using real-world and synthetic trajectories. To address the limitations of existing systems, the authors propose AMA-Agent, which utilizes a Causality Graph and Tool-Augmented Retrieval to achieve a 57.22% average accuracy, outperforming current state-of-the-art baselines by 11.16%.
TL;DR
The industry is shifting from chatbots to autonomous agents, but our evaluation of their "brains" (memory) is stuck in the past. AMA-Bench bridges this gap by testing agents on raw, machine-generated trajectories rather than human chats. The study discovers that standard RAG and memory compression are failing agents, leading to the creation of AMA-Agent—a system that uses Causality Graphs and Tool-Augmented Retrieval to break the 32k token performance ceiling.
The "Dialogue Trap" in Agent Evaluation
Most current memory benchmarks (like LoCoMo or RealTalk) treat LLMs as conversationalists. However, a real-world agent (e.g., a software engineer agent or a web-navigation bot) doesn't just chat; it interacts with machine-generated environments.
The authors identify three fatal flaws in prior work:
- Representation Bias: Real agents deal with ASCII tables, JSON, and code snippets, not just free-form prose.
- Lack of Causality: In agents, Step A causes State B. Current benchmarks treat interactions as unconstrained linguistic flow.
- Lossy Compression: Developers often summarize history to save tokens, but in agentic tasks, "summarizing" a state transition often deletes the very "needle" needed for the final task.
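The "needle deletion" problem can be made concrete with a minimal sketch. Everything below (the trajectory shape, the `db_port` value, the summarizer) is hypothetical and only illustrates the failure mode, assuming a summarizer that keeps actions but drops concrete state values:

```python
# Hypothetical agent trajectory: one step carries the "needle" (a concrete
# state value) that the final task will ask about.
raw_trajectory = [
    {"step": 1, "action": "open_file", "state": {"path": "config.yaml"}},
    {"step": 2, "action": "read_value", "state": {"db_port": 5433}},  # the needle
    {"step": 3, "action": "close_file", "state": {"path": "config.yaml"}},
]

def naive_summarize(trajectory):
    """Lossy compression: keep the gist of the actions, drop state values."""
    return "Agent opened a config file, read a value, and closed it."

summary = naive_summarize(raw_trajectory)

# The needle survives in the raw history...
needle = next(s["state"]["db_port"] for s in raw_trajectory
              if "db_port" in s["state"])

# ...but is unrecoverable from the compressed summary.
needle_in_summary = str(needle) in summary
```

A long-context model fed `raw_trajectory` verbatim can still answer "which port?", while any system that compressed history through `naive_summarize` cannot.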

AMA-Bench: A Two-Pronged Attack
To solve this, the researchers built AMA-Bench, featuring:
- Real-World Subset: 2,496 expert-curated QA pairs across 6 domains (Web, Text2SQL, Software Engineering, Gaming, etc.).
- Synthetic Subset: Programmatically generated environments (TextWorld, BabyAI) that allow for infinite horizon scaling to test exactly where a model breaks.
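The value of the synthetic subset is that episode length is a free parameter. A rough sketch of the idea, using a stand-in generator rather than the actual TextWorld/BabyAI tooling (the step format, the hidden fact, and the question are all invented for illustration):

```python
import random

def make_trajectory(horizon, seed=0):
    """Programmatically build a `horizon`-step trajectory hiding one fact.

    Hypothetical stand-in for a synthetic episode generator: because the
    environment is generated, the horizon can be scaled without bound to
    locate the exact length at which a memory system breaks.
    """
    rng = random.Random(seed)
    needle_step = rng.randrange(horizon)
    steps = []
    for t in range(horizon):
        obs = f"step {t}: agent observes room_{rng.randrange(100)}"
        if t == needle_step:
            obs += " and finds the key under the rug"  # ground truth for final QA
        steps.append(obs)
    question = "Where was the key found?"
    return steps, question, needle_step

# Horizon sweep: same question, exponentially longer histories.
for horizon in (1_000, 4_000, 16_000):
    steps, question, needle = make_trajectory(horizon)
    assert "under the rug" in steps[needle]
```

Holding the task fixed while only the horizon grows isolates memory capacity from reasoning ability, which is exactly what a "where does the model break" sweep needs.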
Methodology: AMA-Agent & The Causality Graph
The benchmark results revealed a shocking truth: Existing memory systems often perform WORSE than just feeding the whole raw text into a long-context window. This is because "similarity-based retrieval" (standard RAG) is too blunt a tool for complex causal reasoning.
The authors propose AMA-Agent, which replaces retrieval over flat chunks of text with a Causality Graph.
- Graph Construction: Instead of just indexing chunks, it extracts objects and environment states, linking them via directed "causality edges."
- Tool-Augmented Retrieval: If a standard search fails, the agent can call a Search Tool to walk the graph nodes or run a Python Script to count/aggregate occurrences across the whole history.
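The two components above can be sketched together in a few dozen lines. This is a minimal illustration of the idea, not the paper's implementation; the class name, edge format, and example states are all assumptions:

```python
from collections import defaultdict

class CausalityGraph:
    """Minimal sketch of a causality graph over an agent trajectory.

    Nodes are object/state snapshots; a directed edge (src -> dst) records
    that the action taken at src caused state dst.
    """

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(action, successor_node)]
        self.nodes = {}                 # node id -> state payload

    def add_transition(self, src, action, dst, dst_state):
        self.nodes.setdefault(src, {})
        self.nodes[dst] = dst_state
        self.edges[src].append((action, dst))

    def walk(self, start, max_depth=10):
        """Search-tool fallback: traverse causality edges directly
        instead of relying on embedding similarity."""
        frontier, seen = [(start, 0)], set()
        while frontier:
            node, depth = frontier.pop()
            if node in seen or depth > max_depth:
                continue
            seen.add(node)
            yield node, self.nodes.get(node, {})
            for _, nxt in self.edges[node]:
                frontier.append((nxt, depth + 1))

g = CausalityGraph()
g.add_transition("s0", "open_door", "s1", {"door": "open"})
g.add_transition("s1", "take_key", "s2", {"door": "open", "key": "held"})

# Script fallback: aggregate over the *whole* history, something top-k
# similarity retrieval cannot do reliably.
times_key_held = sum(1 for _, st in g.walk("s0") if st.get("key") == "held")
```

The key design point is that "find the state where X became true" is a graph traversal or an exact aggregation, not a nearest-neighbor lookup, so the answer does not depend on an embedding happening to rank the right chunk highly.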

Key Experimental Insights
The team's findings challenge common wisdom:
- Model Size isn't the Cure: Moving from an 8B to a 32B model yields only modest gains, while changing the memory architecture creates massive performance swings (up to 45% variance).
- The 32k Token Wall: Long-context models perform well at first, but their accuracy craters once the trajectory grows past roughly 32k tokens. AMA-Agent stays stable even at 128k tokens.
- SOTA Achievement: AMA-Agent surpassed the strongest baseline (HippoRAG2) by over 11%, particularly excelling in State Updating and Recall.

Critical Perspective & Future Work
The core contribution here is the realization that semantic similarity is not enough for agents. An agent doesn't need "related" text; it needs "specific" state evidence. While AMA-Agent is a massive step forward, it still struggles with State Abstraction (condensing high-level intent), which remains a difficult frontier for the industry.
Future iterations will likely look toward Lifelong Learning, where these causality graphs aren't just for one episode, but are carried across weeks of agent operation.
Verdict: If you are building an autonomous agent in 2025, stop relying solely on Vector DBs and start looking at Causality Graphs.
