This paper introduces IGV-RRT, a probabilistic planning framework for Object Goal Navigation (ObjectNav) in temporally changing indoor environments. It combines a 3D Scene Graph-based Information Gain Map (IGM) for global guidance with an online VLM Score Map (VLM-SM), correcting stale historical priors with real-time semantic evidence to achieve state-of-the-art search efficiency and success rates.
TL;DR
Navigating indoor environments is hard when humans move the furniture. IGV-RRT solves this by fusing "what we remember" (3D Scene Graphs) with "what we see right now" (VLM-based semantic scores). By integrating both sources into a real-time RRT planner, robots can efficiently find relocated objects, achieving a 24.7% relative improvement in success rate over the prior semantic-mapping baseline (VLFM).
Problem & Motivation: The "Static Map" Trap
Most robotic navigation systems assume the world is a museum—static and unchanging. They build a 3D Scene Graph (3DSG), map out where the "couch" and "table" are, and use these as anchors to find smaller objects.
However, in real homes, things move. If a robot's prior knowledge says the "Remote Control" is usually near the "Sofa," but the Sofa has been moved to another room, the robot gets trapped in a loop of searching an empty space. Current Vision-Language Model (VLM) approaches try to fix this by looking at every frame, but they lack global "intuition" and often waste time in redundant exploration.
Methodology: The Power of Dual-Layer Mapping
The core innovation of IGV-RRT is Prior-Real-Time Observation Fusion. It doesn't choose between memory and sight; it weighs them dynamically.
1. The Information Gain Map (IGM) - The Global Memory
The robot builds a 3DSG from YOLOv7 object detections and ConceptNet commonsense relations. This yields a "probability field" (a Gaussian Mixture Model) over where the target should be, based on commonsense co-occurrence (e.g., "Mugs are usually near Coffee Machines").
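As a concrete illustration, here is a minimal sketch of how such a GMM-based probability field could be rasterized onto a 2D grid. The anchor coordinates, co-occurrence weights, and function name below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def build_igm(anchors, grid_shape, cell_size=0.1, sigma=1.0):
    """Rasterize a Gaussian-mixture prior over a 2D grid (illustrative sketch).

    anchors: list of (x, y, weight) tuples, one per detected anchor object,
    where weight is a commonsense co-occurrence score (e.g., derived from
    ConceptNet) between the anchor ("Sofa") and the target ("Remote Control").
    """
    ys, xs = np.mgrid[0:grid_shape[0], 0:grid_shape[1]].astype(float) * cell_size
    igm = np.zeros(grid_shape)
    for ax, ay, weight in anchors:
        # Each anchor contributes one isotropic Gaussian component.
        igm += weight * np.exp(-((xs - ax) ** 2 + (ys - ay) ** 2) / (2 * sigma ** 2))
    # Normalize into [0, 1] so the field composes cleanly with the VLM map.
    return igm / igm.max() if igm.max() > 0 else igm

# Example: a sofa at (2.0 m, 3.0 m) strongly predicts the remote (weight 0.8);
# a table at (5.0 m, 1.0 m) is a weaker cue (weight 0.3).
prior = build_igm([(2.0, 3.0, 0.8), (5.0, 1.0, 0.3)], grid_shape=(80, 80))
```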
2. The VLM Score Map (VLM-SM) - The Real-Time Corroborator
As the robot moves, it uses BLIP-2 to evaluate the current view. The authors use a multi-prompt strategy (asking the VLM about the object name, its context, and its room type) to create a high-contrast semantic map. If the robot sees a high semantic score in a place the "Memory" didn't expect, the map updates to reflect the new reality.
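The post doesn't spell out the scoring call, so the sketch below stubs the BLIP-2 image-text matching step with a dummy function and focuses on the multi-prompt fusion and map update; the prompt templates, fusion rule (mean), and moving-average update are all assumptions:

```python
import numpy as np

def vlm_match_score(image, prompt):
    # Stand-in for a BLIP-2 image-text matching call; assume the real system
    # returns a similarity score in [0, 1]. Dummy value so the sketch runs.
    return 0.5

def score_view(image, target="mug", room="kitchen"):
    # Multi-prompt strategy: query the object name, its typical context, and
    # the room type, then fuse them into one high-contrast semantic score.
    prompts = [
        f"a photo of a {target}",             # object name
        f"a {target} near a coffee machine",  # object context
        f"a photo of a {room}",               # room type
    ]
    return float(np.mean([vlm_match_score(image, p) for p in prompts]))

def update_vlm_sm(vlm_sm, visible_cells, view_score, alpha=0.5):
    """Blend fresh evidence into the cells currently in the field of view,
    letting real-time observations override stale prior expectations."""
    for i, j in visible_cells:
        vlm_sm[i, j] = (1 - alpha) * vlm_sm[i, j] + alpha * view_score
    return vlm_sm
```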
Fig 1: Overview of the IGV-RRT pipeline showing the fusion of IGM (Prior) and VLM-SM (Real-time).
3. IGV-RRT Planning: Smart Tree Expansion
The planner evaluates candidate nodes $v$ using a joint utility function (sketched in code after the list): $$U_{final}(v) = \lambda_d \cdot (1 - D(v)) + \mathbb{I}(v \notin \mathcal{M}_{exp}) \cdot [ \lambda_e \cdot E(v) + \lambda_s \cdot S(v) ]$$
- $D(v)$: Normalized distance cost (prefer nodes that are cheap to reach).
- $E(v)$: Information gain from the prior (Go where we think the object is).
- $S(v)$: VLM semantic support (Go where we actually see evidence).
- $\mathbb{I}(v \notin \mathcal{M}_{exp})$: The "Explored-Region Gating", a crucial mechanism that stops the robot from revisiting the same spot.
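A minimal sketch of this node-scoring step, assuming the maps from the earlier sketches; the weights $\lambda_d, \lambda_e, \lambda_s$, the candidate representation, and the distance normalization are illustrative assumptions:

```python
def node_utility(node, igm, vlm_sm, explored_mask,
                 lam_d=0.3, lam_e=0.4, lam_s=0.3, max_dist=10.0):
    """Joint utility U_final(v) from the equation above (weights illustrative).

    D(v): distance cost, normalized to [0, 1] (assumed here).
    E(v): information gain read from the prior IGM at the node's cell.
    S(v): semantic support read from the online VLM score map.
    The indicator term strips exploration reward inside visited cells.
    """
    i, j = node["cell"]
    d = min(node["dist"] / max_dist, 1.0)        # D(v)
    gate = 0.0 if explored_mask[i, j] else 1.0   # I(v not in M_exp)
    return lam_d * (1 - d) + gate * (lam_e * igm[i, j] + lam_s * vlm_sm[i, j])

def expand_rrt(candidates, igm, vlm_sm, explored_mask):
    # The tree expands toward the candidate node with the highest joint utility.
    return max(candidates, key=lambda v: node_utility(v, igm, vlm_sm, explored_mask))
```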
Experiments: Proving the Resilience
The team tested IGV-RRT in the HM3D (Habitat-Matterport 3D) simulator and on a Wheeltec R550 physical robot.
SOTA Comparison
In environments where objects were moved after the initial map was built, IGV-RRT crushed the VLFM baseline:
- Success Rate (SR): 42.9% vs. 34.4%
- Path Efficiency (SPL): 26.3% vs. 16.7%
Fig 2: Trajectory comparison. IGV-RRT (Red) uses VLM evidence to correct pathing early, while the baseline (Green) wanders aimlessly.
Ablation Insights
The study proved that VLM-SM (Semantic Score) and Explored-Region Gating (Revisit Suppression) are synergistic. Without the gating, the robot frequently gets stuck in "semantic traps," repeatedly checking a high-scoring but empty area.
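A rough sketch of how such gating can be maintained, assuming a boolean grid stamped around each visited pose; the disk radius and stamping rule are assumptions:

```python
def mark_explored(explored_mask, pose_cell, radius=5):
    """Stamp a disk of cells around the robot's current cell as visited.

    Once stamped, the indicator in the utility function zeroes the E(v) and
    S(v) rewards there, so a high-scoring but empty region (a "semantic trap")
    cannot pull the planner back a second time.
    """
    h, w = explored_mask.shape
    ci, cj = pose_cell
    for i in range(max(0, ci - radius), min(h, ci + radius + 1)):
        for j in range(max(0, cj - radius), min(w, cj + radius + 1)):
            if (i - ci) ** 2 + (j - cj) ** 2 <= radius ** 2:
                explored_mask[i, j] = True
    return explored_mask
```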
Deep Insight & Conclusion
IGV-RRT succeeds because it treats historical data as a soft bias rather than a hard constraint. By mathematically weighing the entropy of the prior (IGM) against the confidence of the real-time VLM, it achieves a "Bayesian-like" balance in motion planning.
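The post doesn't give the exact fusion rule, but one hypothetical way to realize such an entropy-weighted balance is to trust the prior only when it is peaked:

```python
import numpy as np

def prior_trust_weight(igm, eps=1e-9):
    """Hypothetical entropy-based weight, not the paper's exact formula.

    A flat (high-entropy) IGM suggests the prior is uninformative, so the
    weight shrinks and the planner defers to live VLM evidence; a peaked
    (low-entropy) IGM earns a weight near 1.
    """
    p = igm.flatten()
    p = p / (p.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return 1.0 - entropy / np.log(p.size)  # normalized to [0, 1]

# Fused per-cell score: w * prior + (1 - w) * real-time semantics.
# fused = w * igm + (1.0 - w) * vlm_sm
```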
Limitations: The current IGM is still "frozen" once constructed. While the robot can ignore it, it can't yet "rewrite" its long-term memory to say "The Sofa is now in Room B."
Future Outlook: Transitioning this into Long-term Autonomy (LTA), where the robot maintains an evolving 3D Scene Graph over weeks or months, will be the next frontier in embodied AI.
Paper Reference: "IGV-RRT: Prior-Real-Time Observation Fusion for Active Object Search in Changing Environments"
