This paper presents a large-scale systematic study (13,000+ real-world rollouts) on action space design for imitation-based robotic manipulation. It identifies "Chunk-wise Delta" in "Joint Space" as the superior configuration for standard policy learning, while highlighting that "Task Space" remains better for cross-embodiment generalization.
In the current era of "Scaling Laws" for robotics, the community has fixated on data volume and model parameters. However, a critical bottleneck has been hiding in plain sight: the Action Space. The choice between Joint angles and End-Effector (EEF) poses is usually treated as a trivial implementation detail, but this paper provides large-scale evidence that it is a decisive factor in policy learnability and deployment stability.
TL;DR
Based on an empirical study of 13,000+ real-world rollouts, the authors find that how a robot predicts its next move (Delta vs. Absolute) and where it predicts it (Joint vs. Task space) produce a large performance gap. The recommended defaults: Chunk-wise Delta actions in Joint Space for single-platform performance, and Task Space for cross-embodiment generalization.
The Hidden Complexity: Why "Ad-Hoc" Fails
Most researchers pick an action space based on the codebase they inherited. But the action space is the interface between neural predictions and physical hardware, and it defines the supervision signal the policy is trained on.
The authors identify two core axes of choice:
- Spatial Abstraction: Joint Space (motor angles) vs. Task Space (end-effector pose in Cartesian coordinates). Joint space is robust but non-linear; Task space is intuitive but relies on fragile Inverse Kinematics (IK) solvers.
- Temporal Abstraction: Absolute (predicting the target configuration directly) vs. Delta (predicting an increment relative to a reference state).
Prior works have been fragmented. For instance, Diffusion Policy typically uses Task-space Delta, while ACT often uses Joint-space Absolute. This paper finally puts them into a controlled Arena.
Methodology: Dissecting the Mapping
The authors formalize the action generation pipeline into a two-stage process: Temporal Decoding followed by Spatial Projection.
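To make the two stages concrete, here is a minimal sketch under stated assumptions, not the authors' implementation. It uses a hypothetical planar 2-link arm with made-up link lengths, treats the delta as chunk-wise, and uses a closed-form IK for that toy arm. The function names `temporal_decode` and `spatial_project` are illustrative only.

```python
import numpy as np

L1, L2 = 0.30, 0.25  # hypothetical link lengths (m) of a planar 2-link arm


def fk(q):
    """Forward kinematics: joint angles -> end-effector (x, y)."""
    return np.array([
        L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
        L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1]),
    ])


def ik(xy):
    """Closed-form IK for the toy arm (one elbow branch, reachable targets only)."""
    x, y = xy
    c2 = (x**2 + y**2 - L1**2 - L2**2) / (2 * L1 * L2)
    q2 = np.arccos(np.clip(c2, -1.0, 1.0))
    q1 = np.arctan2(y, x) - np.arctan2(L2 * np.sin(q2), L1 + L2 * np.cos(q2))
    return np.array([q1, q2])


def temporal_decode(pred, ref, mode):
    """Stage 1: turn a predicted chunk into absolute targets."""
    return pred if mode == "absolute" else ref + pred  # assumed chunk-wise delta


def spatial_project(targets, space):
    """Stage 2: map absolute targets to joint commands."""
    if space == "joint":
        return targets  # joint targets are sent to the motors directly
    return np.array([ik(p) for p in targets])  # task targets go through IK


q_now = np.array([0.4, 0.8])
pred = np.array([[0.01, 0.00], [0.02, 0.01]])  # task-space deltas for 2 steps
targets = temporal_decode(pred, fk(q_now), "delta")
q_cmd = spatial_project(targets, "task")
```

In this sketch the joint-space branch skips IK entirely, which is one way to read the claim that joint-space control avoids a fragile solver in the loop.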
1. The "Chunk-wise Delta" Breakthrough
Action chunking (predicting a sequence of future actions) is standard. However, the authors found a massive difference between "Step-wise Delta" (relative to the previous step in the chunk) and "Chunk-wise Delta" (all steps relative to the state at the start of the chunk).
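The difference between the two encodings is easiest to see in code. The sketch below is an illustration based on the definitions above, with hypothetical function names and a made-up two-joint chunk; it is not taken from the paper's codebase.

```python
import numpy as np


def encode_stepwise(q0, chunk):
    """Each target is expressed relative to the previous target (the first relative to q0)."""
    prev = np.vstack([q0, chunk[:-1]])
    return chunk - prev


def decode_stepwise(q0, deltas):
    """Recover absolute targets by integrating the deltas along the chunk."""
    return q0 + np.cumsum(deltas, axis=0)


def encode_chunkwise(q0, chunk):
    """Every target is expressed relative to the state at the start of the chunk."""
    return chunk - q0


def decode_chunkwise(q0, deltas):
    """Recover absolute targets by adding each delta to the same start state."""
    return q0 + deltas


q0 = np.array([0.0, 0.5])                                # state at chunk start
chunk = np.array([[0.1, 0.5], [0.2, 0.6], [0.4, 0.6]])   # absolute targets

step_d = encode_stepwise(q0, chunk)
chunk_d = encode_chunkwise(q0, chunk)
```

Both encodings reconstruct the same targets when the deltas are exact. They differ in how prediction errors propagate: step-wise decoding passes every delta through a cumulative sum, while chunk-wise decoding does not.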
Mathematically, they show that Step-wise Delta amplifies noise linearly with the horizon, while Chunk-wise Delta maintains a constant error bound.
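A simplified worst-case illustration of this claim follows. It assumes every predicted delta carries an error of magnitude at most `eps`, and that this bound is the same for both encodings, which is a simplification rather than the paper's derivation. The values of `H` and `eps` are arbitrary.

```python
import numpy as np


def stepwise_error(noise):
    """Step-wise decoding integrates deltas, so per-step errors add up."""
    return np.abs(np.cumsum(noise))


def chunkwise_error(noise):
    """Chunk-wise decoding adds each delta to the same start state,
    so step k carries only its own error."""
    return np.abs(noise)


H, eps = 16, 0.01  # hypothetical horizon and per-step error bound

# Adversarial case: every error has the same sign and maximal magnitude.
worst = np.full(H, eps)
step_err = stepwise_error(worst)    # grows as k * eps along the chunk
chunk_err = chunkwise_error(worst)  # stays at eps for every step

# Random bounded noise never exceeds the same bounds.
rng = np.random.default_rng(0)
noise = rng.uniform(-eps, eps, size=H)
k = np.arange(1, H + 1)
within_bounds = bool(
    np.all(stepwise_error(noise) <= k * eps + 1e-12)
    and np.all(chunkwise_error(noise) <= eps)
)
```

With zero-mean independent noise the accumulated step-wise error typically grows more slowly than this worst case, but it still grows with the horizon, whereas the chunk-wise error does not.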
Figure: The hierarchy of action abstraction taxonomy.
2. The Horizon-Abstraction Coupling
A key insight is that the execution horizon should not be a constant.
- Absolute Control thrives with longer horizons because it allows for global spatial grounding.
- Delta Control requires shorter horizons to facilitate rapid error correction and prevent execution drift.
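The execution horizon is the number of actions taken from each predicted chunk before the policy is queried again. The loop below is a minimal sketch with toy 1-D stand-ins for the policy and the environment; all names, the goal at 1.0, and the chunk length are hypothetical.

```python
def run_episode(policy, step, state, total_steps, exec_horizon):
    """Execute `exec_horizon` actions from each predicted chunk, then replan."""
    if exec_horizon < 1:
        raise ValueError("exec_horizon must be at least 1")
    t, replans = 0, 0
    while t < total_steps:
        chunk = policy(state)  # predict a full chunk of future actions
        replans += 1
        for action in chunk[:exec_horizon]:
            if t == total_steps:
                break
            state = step(state, action)
            t += 1
    return state, replans


def toy_policy(x, chunk_len=8):
    """Toy policy: propose equal increments toward a goal at 1.0."""
    return [(1.0 - x) / chunk_len] * chunk_len


def toy_step(x, dx):
    """Toy environment: the state moves by the commanded increment."""
    return x + dx


_, replans_short = run_episode(toy_policy, toy_step, 0.0, 16, exec_horizon=2)
_, replans_long = run_episode(toy_policy, toy_step, 0.0, 16, exec_horizon=8)
print(replans_short, replans_long)  # 8 2
```

A short execution horizon means more frequent replanning from fresh observations, which is the error-correction mechanism the Delta bullet refers to. A long horizon commits to more of each chunk before looking again.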
Experimental Results: The Hierarchy of Performance
The team conducted experiments across AgileX and AIRBOT platforms, and the RoboTwin 2.0 simulation.
Key Finding 1: Delta is King
Across almost all tasks (Pick, Place, Cube Transfer), Delta actions consistently outperformed Absolute actions. Why? Local displacements provide a more tractable inductive bias for neural networks than mapping raw pixels to global coordinates.
Key Finding 2: Joint Space + Generative Modeling = SOTA
If you use powerful generative backbones (like Flow Matching or Diffusion), the "complexity" of the Joint-space manifold disappears, and the inherent stability of direct motor control takes over.
Table: Quantitative comparison across different embodiments.
Scaling and Generalization: The Plot Twist
When the researchers scaled the data and compute, the superiority of Joint-space became even more pronounced. However, in Advanced Learning Regimes (cross-embodiment training and transfer learning from pretrained foundation models), the trend flipped: Task-space became the winner.
The Logic: Task-space is "robot-agnostic." A hand moving to a cup is a hand moving to a cup, regardless of whether the robot arm has 6 or 7 joints. This makes it the ideal interface for Generalist Foundation Models.
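A toy illustration of this point, assuming two hypothetical planar arms with different joint counts and made-up link lengths: their joint vectors do not even share a dimension, yet both describe the same end-effector position.

```python
import numpy as np


def planar_fk(q, links):
    """End-effector (x, y) of a planar serial arm with relative joint angles q."""
    angles = np.cumsum(q)  # absolute orientation of each link
    return np.array([
        np.sum(links * np.cos(angles)),
        np.sum(links * np.sin(angles)),
    ])


# Arm A: 2 joints, link lengths 1.0 and 1.0.
q_a, links_a = np.array([0.0, np.pi / 2]), np.array([1.0, 1.0])
# Arm B: 3 joints, link lengths 1.0, 0.5 and 0.5.
q_b, links_b = np.array([0.0, np.pi / 2, 0.0]), np.array([1.0, 0.5, 0.5])

eef_a = planar_fk(q_a, links_a)
eef_b = planar_fk(q_b, links_b)
# Joint-space labels differ in shape (2 vs. 3); task-space labels coincide at (1, 1).
```

A policy trained on task-space targets can therefore share one output head across both arms, while joint-space targets would need a separate output dimension per embodiment.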
Critical Insights & Practical Guidelines
For practitioners, the paper provides a clear roadmap:
- Single-Platform Optimization: Use Joint Space + Chunk-wise Delta.
- Foundation Model Pre-training: Use Task Space (EEF) to maximize transferability.
- Stability First: Avoid Step-wise Delta integration, which accumulates error along the chunk.
Limitations: The study focuses on rigid taxonomies. The authors suggest that the "Holy Grail" might be Hybrid Representations: dynamic switching between task-space for reaching and joint-space for contact-rich manipulation.
Conclusion
This work moves action space design from "art" toward "science." By showing that the control interface is a decisive design choice rather than a minor detail, it sets a new standard for how we should build and evaluate robot learning policies.
