ManiSplat: Manipulation Trajectory Synthesis from Monocular Video via Decoupled 3D Gaussian Splatting

Wenhao Hu1,2, Haonan Zhou1, Liu Liu2, Yun Du1, Xingjie Wang1, Ziang Li1, Zhizhong Su2, Gaoang Wang1
1Zhejiang University
2Horizon Robotics

Trajectory Data Augmentation

Original Demonstration

Overview

Teaser Figure

High-Fidelity Real2Sim Alignment and Trajectory Augmentation. (Left) Our framework decomposes a monocular ego-view video into aligned assets (URDF robot, objects, and background). (Right) This decomposition enables a trajectory augmentation pipeline that synthesizes diverse variations from a single demonstration, scaling up the data available for policy learning.

Abstract

Reconstructing dynamic and interactive 3D scenes from real-world observations remains a fundamental challenge in computer vision and robotics. While recent advances in 3D Gaussian Splatting have enabled high-fidelity static reconstruction, extending it to interactive environments with articulated robots and manipulable objects remains difficult due to complex contact interactions and abrupt pose changes. To address these challenges, we introduce ManiSplat, a unified framework that reconstructs controllable and decoupled Gaussian digital twins directly from monocular ego-view robotic videos. Our method introduces a Graph-Structured Disentangled Representation that separates the robot, objects, and background into independently optimizable Gaussian subfields organized within a scene graph. To ensure stability, we propose a Task-Oriented Spatio-Temporal Alignment module that leverages the inherent logic of manipulation tasks—alternating between Motion and Skill phases—to construct accurate pseudo-ground-truth trajectories. Finally, a joint photometric-geometric optimization ensures the reconstructed scenes are temporally coherent, physically consistent, and simulation-ready. Extensive experiments demonstrate that our approach reconstructs interaction-driven dynamic scenes with high fidelity and controllability, effectively supporting downstream robotic tasks and policy learning.
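To make the idea of independently optimizable subfields concrete, the following is a minimal sketch of a scene graph whose nodes each own their Gaussians and their pose. It is an illustration under simplifying assumptions, not the ManiSplat implementation: the names `GaussianNode` and `GaussianSceneGraph` are hypothetical, each Gaussian is reduced to its center, and a node pose is a pure translation (a real system would also carry rotations, covariances, opacities, and colors).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class GaussianNode:
    """Hypothetical semantic node: Gaussian centers in a local frame plus a pose."""
    name: str
    centers: List[Vec3]
    # Assumption: translation-only pose, to keep the sketch short.
    translation: Vec3 = (0.0, 0.0, 0.0)

    def world_centers(self) -> List[Vec3]:
        """Map the local centers into the world frame using this node's pose."""
        t = self.translation
        return [(p[0] + t[0], p[1] + t[1], p[2] + t[2]) for p in self.centers]


@dataclass
class GaussianSceneGraph:
    """Hypothetical container keeping robot, object, and background separate."""
    nodes: Dict[str, GaussianNode] = field(default_factory=dict)

    def add(self, node: GaussianNode) -> None:
        self.nodes[node.name] = node

    def set_pose(self, name: str, translation: Vec3) -> None:
        """Update one node's pose; every other node is left untouched."""
        self.nodes[name].translation = translation

    def compose(self) -> List[Vec3]:
        """Concatenate all nodes' world-frame centers into one renderable set."""
        merged: List[Vec3] = []
        for node in self.nodes.values():
            merged.extend(node.world_centers())
        return merged


if __name__ == "__main__":
    graph = GaussianSceneGraph()
    graph.add(GaussianNode("background", [(1.0, 1.0, 0.0)]))
    graph.add(GaussianNode("object", [(0.0, 0.0, 0.0)]))
    graph.set_pose("object", (0.1, 0.0, 0.2))  # move only the object
    print(graph.compose())
```

Because each node stores its own pose, moving the object does not modify the background or robot Gaussians, which is the property a decoupled representation relies on.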

Method

Method Overview

Overview of the Proposed Framework. (Left) We construct a Gaussian Scene Graph ($\mathcal{G}$) to disentangle the scene into independent semantic nodes: the robot $\mathcal{G}_{robot}$, the object $\mathcal{G}_{obj}$, and the static background $\mathcal{G}_{bg}$. Simultaneously, the manipulation task is segmented into a robot-centric Motion phase (orange, e.g., transport) and an object-centric Skill phase (blue, e.g., insertion). (Middle) To build a high-fidelity digital twin, we perform joint optimization via Pose Alignment (tracking kinematic and visual trajectories) and Appearance Alignment (minimizing the photometric loss between the rendered Gaussian splats and real-world observations). (Right) Leveraging the decoupled structure, we perform Trajectory Synthesis by generating diverse augmented approach paths (dashed orange lines) that seamlessly converge into the preserved skill execution, significantly scaling up the demonstration data.
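The trajectory synthesis step described in the caption can be sketched as follows. This is a hedged illustration, not the authors' algorithm: it assumes a demonstration is a list of 3D end-effector waypoints, that the index where the Skill phase begins (`skill_start`) is already known, and that an augmented approach path can be produced by adding a start offset that fades out with a smoothstep weight. The function name `augment_approach` and the blending scheme are assumptions introduced for this example.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def smoothstep(t: float) -> float:
    """Cubic ease from 0 to 1 with zero slope at both ends."""
    return t * t * (3.0 - 2.0 * t)


def augment_approach(demo: List[Point], skill_start: int, new_start: Point) -> List[Point]:
    """Return a trajectory that starts at new_start and rejoins the demonstration.

    Waypoints before skill_start (Motion phase) are shifted by an offset that
    decays to zero; waypoints from skill_start onward (Skill phase) are kept
    exactly as demonstrated.
    """
    if not 1 <= skill_start < len(demo):
        raise ValueError("skill_start must index a waypoint after the first one")
    # Displacement between the new start and the demonstrated start.
    offset = tuple(n - d for n, d in zip(new_start, demo[0]))
    augmented: List[Point] = []
    for i, point in enumerate(demo):
        # Weight is 1 at the first waypoint and 0 once the Skill phase begins.
        weight = 1.0 - smoothstep(i / skill_start) if i < skill_start else 0.0
        augmented.append(tuple(c + weight * o for c, o in zip(point, offset)))
    return augmented


if __name__ == "__main__":
    demo = [
        (0.0, 0.0, 0.5), (0.1, 0.0, 0.5), (0.2, 0.0, 0.5),  # Motion phase
        (0.3, 0.0, 0.4), (0.3, 0.0, 0.3),                    # Skill phase
    ]
    for waypoint in augment_approach(demo, skill_start=3, new_start=(0.0, 0.2, 0.5)):
        print(waypoint)
```

Calling the function with several different `new_start` values yields multiple approach paths that all converge to the same skill segment. A practical pipeline would additionally need orientation handling, collision checks, and re-rendering of the robot along each new path, none of which are covered here.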

Results

Figure 3. Qualitative comparison of dynamic reconstruction. 4DGS and DeformableGS exhibit artifacts and blurring during rapid manipulator movements due to the lack of decoupling. HUGS produces blurry object details by over-smoothing the abrupt pose changes during grasping. In contrast, our method maintains sharp boundaries and stable appearance by explicitly modeling the interaction stages.

BibTeX

@misc{hu2026manisplatmanipulationtrajectorysynthesis,
  title={ManiSplat: Manipulation Trajectory Synthesis from Monocular Video via Decoupled 3D Gaussian Splatting},
  author={Wenhao Hu and Haonan Zhou and Liu Liu and Yun Du and Xinjie Wang and Ziang Li and Zhizhong Su and Gaoang Wang},
  year={2026},
  eprint={2606.10645},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.10645},
}