The paper introduces E-3DPSM, a real-time event-driven continuous pose state machine for egocentric 3D human pose estimation using a monocular head-mounted event camera. It leverages State Space Models (S5) and a learnable Kalman-style fusion module to achieve state-of-the-art accuracy, reducing MPJPE by up to 19% while improving temporal stability by 2.7x over existing methods such as EventEgo3D++.
TL;DR
Egocentric 3D human pose estimation from Head-Mounted Devices (HMDs) is notoriously difficult due to extreme self-occlusions and motion blur. E-3DPSM breaks away from traditional frame-based processing by modeling human motion as a continuous state evolution. By fusing event-driven "delta" updates with direct 3D predictions through a learnable Kalman filter, it achieves up to 19% better accuracy and 2.7x higher temporal stability than previous SOTA, running at a lightning-fast 80 Hz.
The Core Conflict: Why Events?
Standard RGB cameras fail in AR/VR contexts when the user moves quickly (motion blur) or enters low-light environments. Event cameras, which only record per-pixel brightness changes, offer microsecond resolution and high dynamic range. However, previous attempts (like EventEgo3D) treated events like "sparse images," missing the physics of the stream.
The authors identify three fatal flaws in prior work:
- Quantization Errors: Reliance on intermediate 2D heatmaps discretizes joint locations.
- Short-term Memory: Only a single previous frame is used as temporal context.
- Information Mismatch: Absolute poses are predicted from a sensor that inherently captures change.
Methodology: The State Machine Philosophy
E-3DPSM introduces a three-stage pipeline that aligns the mathematical properties of the sensor with the physics of human motion.
1. Spatiotemporal Pose Encoder (SPEM)
Instead of simple CNNs, SPEM uses State Space Models (SSMs), specifically the S5 variant, to maintain a latent state that evolves over time. This allows the model to "remember" body parts even when they are occluded for long durations. Deformable Attention is also used to handle the extreme distortions caused by the fisheye lenses common in egocentric setups.
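The paper's actual SPEM architecture is more involved, but the core mechanism is a linear state-space recurrence that carries a hidden state across event windows. The sketch below is a minimal, hypothetical PyTorch layer in the spirit of a diagonal S5 block; all names, shapes, and the discretization scheme are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class DiagonalSSM(nn.Module):
    """Minimal diagonal linear state-space layer (S5-style recurrence), for intuition only."""

    def __init__(self, d_model: int, d_state: int = 64):
        super().__init__()
        self.log_decay = nn.Parameter(torch.rand(d_state))            # controls how fast the state forgets
        self.B = nn.Parameter(torch.randn(d_state, d_model) * 0.02)   # input -> state projection
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.02)   # state -> output projection
        self.log_dt = nn.Parameter(torch.zeros(d_state))               # per-channel step size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) sequence of per-window event features.
        dt = torch.exp(self.log_dt)                                    # (d_state,)
        a_bar = torch.exp(-torch.exp(self.log_decay) * dt)             # discretized decay in (0, 1)
        h = x.new_zeros(x.size(0), a_bar.size(0))                      # latent state, persists across steps
        outputs = []
        for t in range(x.size(1)):
            h = a_bar * h + (x[:, t] @ self.B.T) * dt                  # h_t = A_bar * h_{t-1} + B_bar * x_t
            outputs.append(h @ self.C.T)                               # y_t = C * h_t
        return torch.stack(outputs, dim=1)                             # (batch, time, d_model)
```

Because the state is never reset between windows, joints that temporarily produce no events can still be tracked from memory, which is the property the paper exploits for long occlusions.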
Figure 1: The E-3DPSM pipeline converts events into Locally Normalized Event Surfaces (LNES), encodes them via SSMs, and regresses poses.
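For reference, LNES (introduced in the EventEgo3D line of work) rasterizes each event window into a two-channel image in which every pixel stores the timestamp of the most recent event of each polarity, normalized to the window length. The snippet below is a hedged re-implementation from that description; the function name and argument layout are assumptions.

```python
import numpy as np

def events_to_lnes(xs, ys, ts, ps, height, width, t_start, t_end):
    """Build a Locally Normalized Event Surface from one event window.

    xs, ys: pixel coordinates, ts: timestamps, ps: polarities in {0, 1}.
    Returns a (2, H, W) array: 0 means no event, values near 1 mean a recent event.
    """
    lnes = np.zeros((2, height, width), dtype=np.float32)
    t_norm = (ts - t_start) / max(t_end - t_start, 1e-9)
    for x, y, t, p in zip(xs, ys, t_norm, ps):
        lnes[p, y, x] = max(lnes[p, y, x], t)   # keep the newest event at each pixel
    return lnes
```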
2. Learnable Neural Kalman Fusion
This is the "secret sauce." The model predicts two things:
- Direct Pose: an absolute estimate that anchors where the body is in 3D space.
- Delta Pose: how much each joint has moved since the previous update.
A differentiable Kalman-style filter then fuses these. The "Delta" provides smooth, jitter-free motion, while the "Direct" pose prevents the integration drift that plagues pure dead-reckoning systems.
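How the filter is parameterized is the paper's contribution and is not reproduced here; the sketch below only illustrates the general shape of a Kalman-style update with a learned gain. The module name, the gain network, and the (joints, 3) pose layout are assumptions.

```python
import torch
import torch.nn as nn

class NeuralKalmanFusion(nn.Module):
    """Illustrative fusion of an integrated delta pose with a direct pose estimate."""

    def __init__(self, num_joints: int = 16, feat_dim: int = 128):
        super().__init__()
        # Small MLP that predicts a per-joint, per-axis gain in (0, 1) from current features.
        self.gain_net = nn.Sequential(
            nn.Linear(feat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_joints * 3),
            nn.Sigmoid(),
        )

    def forward(self, prev_pose, delta_pose, direct_pose, features):
        # Prediction step: dead-reckon from the previous fused pose using the event-driven delta.
        predicted = prev_pose + delta_pose                    # (batch, joints, 3)
        # The learned gain plays the role of the Kalman gain K.
        gain = self.gain_net(features).view_as(predicted)     # (batch, joints, 3)
        # Update step: pull the prediction toward the absolute ("direct") measurement.
        return predicted + gain * (direct_pose - predicted)
```

A gain near 0 trusts the smooth integrated motion (suppressing jitter), while a gain near 1 snaps the joint back to the absolute estimate (suppressing drift).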
Experiments: Crushing the Jitter
The results on the EE3D-R (Real) and EE3D-W (Wild) benchmarks are striking. Most notably, the "eSmooth" metric (which measures temporal jitter) plummeted.
| Method | MPJPE ↓ | eSmooth ↓ |
| :--- | :--- | :--- |
| EventEgo3D++ (Previous SOTA) | 103.28 | 22.93 |
| E-3DPSM (Ours) | 84.45 | 8.40 |
The improvements are most visible in extreme poses. While previous models "lose" the legs during crawling or crouching, E-3DPSM maintains anatomical plausibility.
Figure 2: Qualitative comparison showing E-3DPSM's stability (Red) vs Ground Truth (Green) in complex "In-the-Wild" scenarios.
Critical Insight: The "Delta" Advantage
The paper argues that a sensor which measures change is naturally suited to predicting velocity (delta pose). By supervising the model with a dedicated delta pose loss, the network learns to interpret the "flash" of events as a direct measurement of joint displacement. This reduces the problem from "Where is the hand in 3D space?" to "How much did the hand move given these 5,000 events?"
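Concretely, a delta-pose supervision term compares the predicted per-step displacement against the ground-truth displacement between consecutive poses; the formulation below is an assumed L1 version, not necessarily the paper's exact loss.

```python
import torch

def delta_pose_loss(pred_delta, gt_pose, gt_prev_pose):
    """L1 penalty between predicted and ground-truth joint displacements (assumed form)."""
    gt_delta = gt_pose - gt_prev_pose              # (batch, joints, 3) true displacement
    return torch.mean(torch.abs(pred_delta - gt_delta))
```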
Conclusion & Future Impact
E-3DPSM represents a shift toward event-native architectures. By avoiding the intermediate 2D heatmap proxy and embracing continuous-time state modeling, it provides a blueprint for future AR/VR interfaces. However, the authors note sensitivity to extremely dense environments where background noise might overwhelm the human signal—a remaining frontier for event-based vision.
Key Values:
- Real-time Efficiency: 80 Hz (A6000) / 52 Hz (mobile 3050 Ti).
- Zero Synthetic Pre-training: Unlike predecessors, it trains directly on real data.
- Robustness: Significant gains in distal joints (wrists/ankles) under occlusion.
