OmniBehavior: Challenging LLMs with Real-World, Long-Horizon Human Behavior
Abstract

The paper introduces OmniBehavior, the first large-scale benchmark for general-purpose user simulation based entirely on real-world data from the Kuaishou platform. It covers long-horizon (up to 3 months), cross-scenario (5 distinct domains), and heterogeneous (22 action types) behavioral traces to evaluate the ability of Large Language Models (LLMs) to simulate authentic human decision-making.

Is it possible for a Large Language Model (LLM) to act as a digital twin for a human user? While we've seen LLMs excel at coding, reasoning, and creative writing, a new research paper introduces OmniBehavior, a benchmark that suggests our current "frontier" models are still quite far from capturing the messy, complex, and often negative reality of human digital life.

TL;DR

The authors present OmniBehavior, the first benchmark for user simulation built entirely on real-world data from the Kuaishou platform. By tracking 200 users over three months across five different scenarios (video, e-commerce, live streaming, etc.), they reveal that even the most advanced LLMs (like GPT-4o and Claude 3.5) fail to accurately predict user actions. The study uncovers a "positivity-and-average" bias: LLMs tend to be too "nice," too active, and too similar to one another compared to the diverse, sometimes grumpy, real-world humans they are meant to simulate.

Why Current Simulators Suffer from "Tunnel Vision"

Most existing research in user simulation relies on synthetic data or focuses on a single "silo"—for example, just predicting what a user will buy on an e-commerce site. However, real life doesn't happen in a silo. A user might see a video about a new tech gadget on Tuesday, search for reviews on Wednesday, and finally make a purchase via a live stream on Saturday.

The researchers found that 80% of human decision-making paths span multiple scenarios and extend over several days. Without this "long-horizon" context, a simulator is essentially performing "causal amputation"—trying to predict an outcome without seeing the weeks of influence that led to it.
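As a rough illustration of what "spanning multiple scenarios and several days" could mean in practice, the sketch below computes the share of such paths in a toy dataset. The path layout (a list of day-index and scenario pairs) and both function names are assumptions made for this example; the paper's actual definition of a decision-making path is not given in this summary.

```python
from typing import List, Tuple

# Hypothetical layout: one (day index, scenario) pair per step in a path.
Path = List[Tuple[int, str]]


def is_cross_scenario_multi_day(path: Path) -> bool:
    """True if the path touches more than one scenario and more than one day."""
    days = {day for day, _ in path}
    scenarios = {scenario for _, scenario in path}
    return len(days) > 1 and len(scenarios) > 1


def share_cross_scenario(paths: List[Path]) -> float:
    """Fraction of paths that are both cross-scenario and multi-day."""
    if not paths:
        raise ValueError("paths must not be empty")
    return sum(is_cross_scenario_multi_day(p) for p in paths) / len(paths)


# Toy data: the gadget example (video, then search, then live stream)
# next to a same-day, single-scenario path.
toy_paths = [
    [(2, "video"), (3, "search"), (6, "live")],
    [(1, "ecommerce"), (1, "ecommerce")],
]
print(share_cross_scenario(toy_paths))
```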

Fig 1: The OmniBehavior pipeline, from raw Kuaishou logs to a unified, multi-scenario simulation benchmark.

Methodology: Mining the Digital Footprint

The creators of OmniBehavior didn't just generate "fake" users; they used real, anonymized data from Kuaishou (a major short-video and e-commerce platform).

  1. Diverse Scenarios: They integrated data from Video Browsing, Live Streaming, Advertising, E-commerce, and Search.
  2. Long-Horizon Traces: They tracked 200 users over 3 months, resulting in sequences averaging over 8,000 actions.
  3. Heterogeneous Actions: The benchmark includes 22 different action types, from "likes" and "shares" to "purchases" and "customer service inquiries."

This allows for a user-conditioned prediction task: can an LLM, given a user's profile and 3 months of history, predict what that user will do next in a specific context?
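A minimal sketch of how such a prediction example might be assembled is shown below. The `Action` fields, the prompt wording, and both helper names are hypothetical and chosen only for illustration; the benchmark's real schema and prompt format are not described in this summary.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Action:
    """One logged behavior (all field names are assumptions)."""
    timestamp: int    # e.g. seconds since epoch
    scenario: str     # e.g. "video", "live", "ads", "ecommerce", "search"
    action_type: str  # e.g. "like", "share", "purchase"
    item: str         # short description of the content acted on


def split_history_and_target(trace: List[Action]) -> Tuple[List[Action], Action]:
    """Sort the trace by time; the final action becomes the prediction target."""
    if len(trace) < 2:
        raise ValueError("need at least one history action and one target")
    ordered = sorted(trace, key=lambda a: a.timestamp)
    return ordered[:-1], ordered[-1]


def build_prompt(profile: str, history: List[Action], context: Action) -> str:
    """Render profile, history, and current context; the target action type is hidden."""
    lines = [f"User profile: {profile}", "History:"]
    lines += [f"- [{a.scenario}] {a.action_type}: {a.item}" for a in history]
    lines.append(f"Current context: [{context.scenario}] {context.item}")
    lines.append("Question: which action does the user take next?")
    return "\n".join(lines)
```

The model's answer would then be compared against `target.action_type`, which is deliberately left out of the prompt.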

The "Positivity-and-Average" Problem

The results were a wake-up call. Even the best model, Claude-Opus-4.5, only scored 44.55/100. The researchers identified three critical failures in LLM simulation:

  1. Hyper-activity: LLMs are "over-eager." They predict that users will "like" or "share" content far more often than they actually do. While real users are stingy with their engagement, LLMs think everyone is a "super-fan."
  2. Utopian Bias: Because of safety alignment (for example, RLHF), LLMs are trained to be helpful and polite. This makes them poor at simulating a frustrated user complaining to customer service. Where a real human might be blunt or angry, the LLM-simulated version remains unnervingly polite.
  3. Persona Homogenization: In the real world, people have vastly different "digital personalities." However, the LLM simulations tended to converge. When the researchers mapped user behaviors into a vector space, the real users were distinct "islands," while the LLM-simulated users were a muddled, singular "continent."
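Two of these failures lend themselves to simple diagnostics. The sketch below is one possible way to quantify them, not the paper's own metric: the gap in positive-action rate between predictions and ground truth (hyper-activity), and the mean pairwise cosine distance between per-user behavior vectors (homogenization). The function names and the toy vectors are assumptions.

```python
import math
from itertools import combinations
from typing import Sequence


def positive_rate(labels: Sequence[int]) -> float:
    """Share of 1s in a list of binary labels (1 = user engaged, 0 = did not)."""
    return sum(labels) / len(labels)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("zero vector has no direction")
    return 1.0 - dot / norm


def mean_pairwise_distance(vectors: Sequence[Sequence[float]]) -> float:
    """Average cosine distance over all user pairs; lower means more homogeneous."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy labels: real users like 1 item in 4, the simulator predicts 3 in 4.
real_likes = [0, 0, 0, 1]
predicted_likes = [1, 1, 0, 1]
activity_gap = positive_rate(predicted_likes) - positive_rate(real_likes)

# Toy behavior vectors: real users point in different directions,
# simulated users collapse onto the same one.
real_users = [[1.0, 0.0], [0.0, 1.0]]
simulated_users = [[1.0, 1.0], [1.0, 1.0]]
```

A positive `activity_gap` indicates over-predicted engagement, and a much smaller mean pairwise distance for simulated users than for real users would correspond to the "islands versus continent" picture described above.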

Table 1: Performance of various SOTA models. Note that even the "smartest" models struggle significantly with binary behavior (like/dislike) prediction.

The Limits of Long Context

One might think that giving the model more history (larger context windows) would fix the problem. Surprisingly, the study showed that performance plateaued or even slightly declined once the context exceeded 32k tokens. Simply "throwing more tokens" at the model doesn't help if it can't reason across those tokens to find the relevant causal links. Standard RAG (Retrieval-Augmented Generation) and summarization techniques also offered only marginal improvements.
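To make the context-length experiment concrete, here is a small sketch of one common way to fit a long history into a fixed token budget: keep the most recent events and drop the oldest. This is an illustrative baseline, not necessarily the procedure used in the study, and `approx_tokens` is a crude whitespace count standing in for a real tokenizer.

```python
from typing import List


def approx_tokens(text: str) -> int:
    """Very rough token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def truncate_to_budget(events: List[str], budget: int) -> List[str]:
    """Keep the most recent events whose combined size fits within the budget.

    `events` is ordered oldest to newest; the result preserves that order.
    """
    kept: List[str] = []
    used = 0
    for event in reversed(events):  # walk from newest to oldest
        cost = approx_tokens(event)
        if used + cost > budget:
            break
        kept.append(event)
        used += cost
    kept.reverse()
    return kept


history = [
    "liked cooking video",
    "searched for pan reviews",
    "bought pan in live stream",
]
print(truncate_to_budget(history, budget=9))
```

Recency-based truncation of this kind discards exactly the older events that may hold the causal link, which is one reason a larger window alone does not guarantee better predictions.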

Conclusion and Future Directions

OmniBehavior serves as a reminder that "human-like" text is not the same as "human-like" behavior. To build truly effective user simulators—which are essential for testing new apps, recommender systems, and economic models—we must find ways to:

  • Incorporate negative and "long-tail" behaviors.
  • Break through the "politeness filter" for simulation purposes.
  • Improve how models weigh old versus new information in a user's history.

For now, if you're looking for a realistic simulation of a disgruntled customer or an apathetic scroller, your best bet is still a real human.
