[arXiv 2026] PEARL: Transforming VLMs into Personalized Streaming AI Assistants
Abstract

This paper introduces Personalized Streaming Video Understanding (PSVU), a novel task requiring models to recognize and reason about user-defined concepts in real-time video streams. The authors present PEARL, a training-free, plug-and-play framework, and PEARL-Bench, a comprehensive benchmark where PEARL achieves SOTA results, boosting performance by up to 23.47%.

TL;DR

The AI community has long sought a "Siri with eyes"—an assistant that remembers who your friends are and recognizes your custom gym routines in real time. PEARL bridges this gap. By introducing the PSVU task and a training-free, dual-memory framework, PEARL lets off-the-shelf Vision-Language Models (VLMs) register new concepts on the fly and recall them across hours of streaming video with SOTA precision.

The Motivation: Why Your AI is "Memory-Blind"

Current multimodal models suffer from "Offline Bias." Even the most advanced VLMs (like LLaVA or GPT-4V) typically process video as a static, pre-recorded file. In the real world, human cognition is a streaming process: we meet someone once (Registration), and hours later, we recognize them in a crowd (Retrieval).

Existing methods fail here because:

  1. Inefficient Memory: They try to "cram" the whole video into a limited context window.
  2. Fixed Concepts: They can't learn a "unique" object or action (e.g., your specific coffee mug) without fine-tuning.

Methodology: The Dual-Grained Memory Architecture

PEARL solves the memory bottleneck by separating what a thing is from when it happened.

[Figure: PEARL framework architecture]

1. Dual-grained Memory System

  • Concept Memory: Stores user-defined identities (Frame-level) and actions (Video-level). When you say "This is my dog, Buster," PEARL generates a compact, stable textual description (e.g., "a golden retriever with a blue collar") and stores it.
  • Streaming Memory: A rolling archive of the video stream. It segments the video into clips and stores them as compact multimodal embeddings for fast retrieval (see the sketch after this list).
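
A minimal sketch of how such a dual-grained memory could be organized is shown below. The class and field names (ConceptEntry, ClipEntry, DualMemory) are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ConceptEntry:
    """A user-registered concept: a name plus a compact, stable textual description."""
    name: str         # e.g. "Buster"
    description: str  # e.g. "a golden retriever with a blue collar"
    level: str        # "frame" for identities, "video" for actions

@dataclass
class ClipEntry:
    """One segment of the incoming stream, stored as a compact embedding."""
    start_s: float
    end_s: float
    embedding: list   # multimodal embedding vector of the clip

@dataclass
class DualMemory:
    concepts: dict = field(default_factory=dict)  # Concept Memory (name -> ConceptEntry)
    clips: list = field(default_factory=list)     # Streaming Memory (rolling clip archive)

    def register_concept(self, name: str, description: str, level: str = "frame") -> None:
        self.concepts[name] = ConceptEntry(name, description, level)

    def archive_clip(self, start_s: float, end_s: float, embedding: list) -> None:
        self.clips.append(ClipEntry(start_s, end_s, embedding))
```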

2. Concept-aware Retrieval

The "Secret Sauce" of PEARL is Query Rewriting. If you ask, "What was Buster doing 10 minutes ago?", the model doesn't just search for the word "Buster" (which the embedding model doesn't know). It rewrites the query using its Concept Memory to: "What was [the golden retriever with a blue collar] doing?" This allows the system to find the exact historical clip using standard semantic search.


PEARL-Bench: A New North Star for Streaming AI

To test this, the authors created PEARL-Bench, the first benchmark for streaming, multi-turn, personalized video QA. It covers:

  • Frame-level: Tracking specific entities.
  • Video-level: Understanding personalized actions (e.g., a specific custom dance move).

[Figure: Comparison of benchmarks]


Experiments & Results: SOTA Without Training

PEARL was tested against 8 major models. The results are striking:

  • Massive Accuracy Boost: On Qwen3-VL-8B, PEARL boosted performance by 23.47% over the base offline model.
  • Efficiency: Despite the retrieval overhead, PEARL maintains a latency low enough for real-time interaction (~775ms for LLaVA-OV-7B), proving it's ready for deployment.

[Figure: Experimental results]

Key Insight from Ablation Study

The Ablation Study (Table 4 in the paper) highlights that Concept Memory is the most critical component. Without it, models hover near random guess levels for real-time QA. Adding Streaming Memory is what unlocks "Past-Time" reasoning—the ability to answer questions about things that happened minutes or hours ago.


Critical Analysis & Future Outlook

Why it works: PEARL effectively turns the "personalization" problem into a "retrieval" problem. By using a small, specialized embedding model to bridge user-defined names and visual descriptions, it bypasses the need for the large VLM to "learn" through weights.

Limitations:

  • Description Quality: The system relies on the VLM's ability to generate an accurate initial description. If the registration frame is blurry, the "Concept Memory" might be flawed.
  • Scene Segmentation: It depends on PySceneDetect. In videos with very slow transitions or "one-shot" long takes, the clip segmentation might struggle (see the sketch below).
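
For reference, content-based clip splitting with PySceneDetect looks roughly like the snippet below; this is a generic usage sketch, not PEARL's exact configuration, and the threshold value is an assumption:

```python
# pip install scenedetect[opencv]
from scenedetect import detect, ContentDetector

# Split a recorded stream into clips wherever the visual content changes sharply.
# Slow transitions or single long takes may yield few or no cut points, which is
# exactly the limitation noted above.
scenes = detect("stream.mp4", ContentDetector(threshold=27.0))  # threshold is illustrative
for start, end in scenes:
    print(f"clip: {start.get_seconds():.1f}s - {end.get_seconds():.1f}s")
```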

The Future: PEARL sets the stage for AI assistants that truly live with us. Imagine a smart home camera that doesn't just see "a person" but knows "The person who is doing the [custom exercise move] you defined yesterday."

This work moves us from General AI to Personal AI.
