This paper introduces Personalized Streaming Video Understanding (PSVU), a novel task requiring models to recognize and reason about user-defined concepts in real-time video streams. The authors present PEARL, a training-free, plug-and-play framework, and PEARL-Bench, a comprehensive benchmark on which PEARL achieves SOTA results, boosting performance by up to 23.47%.
TL;DR
The AI community has long sought a "Siri with eyes": an assistant that remembers who your friends are and recognizes your custom gym routines in real time. PEARL bridges this gap. By introducing the PSVU task and a training-free, dual-memory framework, PEARL lets off-the-shelf Vision-Language Models (VLMs) register new concepts on the fly and recall them across hours of streaming video with SOTA precision.
The Motivation: Why Your AI is "Memory-Blind"
Current multimodal models suffer from "Offline Bias." Even the most advanced VLMs (like LLaVA or GPT-4V) typically process video as a static, pre-recorded file. In the real world, human cognition is a streaming process: we meet someone once (Registration), and hours later, we recognize them in a crowd (Retrieval).
Existing methods fail here because:
- Inefficient Memory: They try to "cram" the whole video into a limited context window.
- Fixed Concepts: They can't learn a "unique" object or action (e.g., your specific coffee mug) without fine-tuning.
Methodology: The Dual-Grained Memory Architecture
PEARL solves the memory bottleneck by separating what a thing is from when it happened.

1. Dual-grained Memory System
- Concept Memory: Stores user-defined identities (Frame-level) and actions (Video-level). When you say "This is my dog, Buster," PEARL generates a compact, stable textual description (e.g., "a golden retriever with a blue collar") and stores it.
- Streaming Memory: A rolling archive of the video stream. It segments the video into clips and stores them as compact multimodal embeddings for fast retrieval.
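The two memories above can be sketched as simple data structures. This is a minimal illustration, not the paper's actual implementation: the class names, fields, and eviction policy are assumptions, and real clip embeddings would come from a multimodal encoder rather than being plain lists.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptMemory:
    """Maps user-defined names to compact, stable textual descriptions."""
    concepts: dict = field(default_factory=dict)

    def register(self, name: str, description: str) -> None:
        # e.g. register("Buster", "a golden retriever with a blue collar")
        self.concepts[name] = description

@dataclass
class StreamingMemory:
    """Rolling archive of clip-level entries from the live stream."""
    clips: list = field(default_factory=list)  # (start_s, end_s, embedding)
    max_clips: int = 1000                      # bound on memory growth (assumed)

    def append(self, start_s: float, end_s: float, embedding: list) -> None:
        self.clips.append((start_s, end_s, embedding))
        if len(self.clips) > self.max_clips:
            self.clips.pop(0)  # evict the oldest clip first

concept_mem = ConceptMemory()
concept_mem.register("Buster", "a golden retriever with a blue collar")
print(concept_mem.concepts["Buster"])
# → a golden retriever with a blue collar
```

The key design point is the split of granularity: concept entries are small and stable (text), while streaming entries are many but compact (embeddings), so neither has to fit inside the VLM's context window.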
2. Concept-aware Retrieval
The "Secret Sauce" of PEARL is Query Rewriting. If you ask, "What was Buster doing 10 minutes ago?", the model doesn't just search for the word "Buster" (which the embedding model doesn't know). It rewrites the query using its Concept Memory to: "What was [the golden retriever with a blue collar] doing?" This allows the system to find the exact historical clip using standard semantic search.
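The rewrite-then-retrieve step can be illustrated with a toy sketch. Everything here is an assumption for illustration: the dictionary, the bag-of-words embedding, and the caption-based clips stand in for PEARL's multimodal embedding model and stored clip embeddings; only the overall flow (substitute concept names, then rank by cosine similarity) mirrors the description above.

```python
import math
from collections import Counter

# Hypothetical registered concept (name -> stored description).
CONCEPTS = {"Buster": "the golden retriever with a blue collar"}

def rewrite_query(query: str) -> str:
    """Replace user-defined names with their Concept Memory descriptions."""
    for name, desc in CONCEPTS.items():
        query = query.replace(name, desc)
    return query

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (a real system uses a learned encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, clip_captions: list) -> str:
    """Return the clip whose caption best matches the rewritten query."""
    q = embed(rewrite_query(query))
    return max(clip_captions, key=lambda c: cosine(q, embed(c)))

clips = [
    "a man cooking pasta in the kitchen",
    "a golden retriever with a blue collar chasing a ball",
]
print(retrieve("What was Buster doing?", clips))
# → a golden retriever with a blue collar chasing a ball
```

Because the rewriting happens in text space before embedding, an off-the-shelf embedding model that has never seen "Buster" can still locate the right clip.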
PEARL-Bench: A New North Star for Streaming AI
To test this, the authors created PEARL-Bench, the first benchmark for streaming, multi-turn, personalized video QA. It covers:
- Frame-level: Tracking specific entities.
- Video-level: Understanding personalized actions (e.g., a specific custom dance move).

Experiments & Results: SOTA Without Training
PEARL was tested against 8 major models. The results are striking:
- Massive Accuracy Boost: On Qwen3-VL-8B, PEARL boosted performance by 23.47% over the base offline model.
- Efficiency: Despite the retrieval overhead, PEARL maintains a latency low enough for real-time interaction (~775ms for LLaVA-OV-7B), proving it's ready for deployment.

Key Insight from Ablation Study
The Ablation Study (Table 4 in the paper) highlights that Concept Memory is the most critical component. Without it, models hover near random guess levels for real-time QA. Adding Streaming Memory is what unlocks "Past-Time" reasoning—the ability to answer questions about things that happened minutes or hours ago.
Critical Analysis & Future Outlook
Why it works: PEARL effectively turns the "personalization" problem into a "retrieval" problem. By using a small, specialized embedding model to bridge user-defined names and visual descriptions, it bypasses the need for the large VLM to "learn" new concepts through weight updates.
Limitations:
- Description Quality: The system relies on the VLM's ability to generate an accurate initial description. If the registration frame is blurry, the "Concept Memory" might be flawed.
- Scene Segmentation: It depends on PySceneDetect. In videos with very slow transitions or "one-shot" long takes, the clip segmentation might struggle.
The Future: PEARL sets the stage for AI assistants that truly live with us. Imagine a smart home camera that doesn't just see "a person" but knows "The person who is doing the [custom exercise move] you defined yesterday."
This work moves us from General AI to Personal AI.
