High-level depiction of our method. Left: Our method's inputs, which consist of the segmented object point cloud and the scene point cloud. Right: Our method's outputs, which are hierarchically generated by Global Placement Initialization followed by Local Configuration Refinement.
Recent advances in robotic manipulation have highlighted the effectiveness of learning from demonstration. However, while end-to-end policies excel in expressivity and flexibility, they struggle both in generalizing to novel object geometries and in attaining a high degree of precision. An alternative, object-centric approach frames the task as predicting the placement pose of the target object, providing a modular decomposition of the problem. Building on this goal-prediction paradigm, we propose a hierarchical, disentangled point diffusion framework that achieves state-of-the-art performance in placement precision, multi-modal coverage, and generalization to variations in object geometries and scene configurations. Specifically, we model global scene-level placements through a novel feed-forward Dense Gaussian Mixture Model (GMM) that yields a spatially dense prior over global placements; we then model the local object-level configuration through a novel disentangled point cloud diffusion module that separately diffuses the object geometry and the placement frame, enabling precise local geometric reasoning. Interestingly, we demonstrate that our point cloud diffusion achieves substantially higher accuracy than prior approaches based on SE(3) diffusion, even in the context of rigid object placement. We validate our approach on a suite of challenging tasks in simulation and on high-precision industrial insertion tasks in the real world. Furthermore, we present results on a cloth-hanging task in simulation, indicating that our method can further relax assumptions on object rigidity.
Left: Our Global Placement Initialization samples a rough global position using a novel dense GMM-based prediction module, a framework that models highly multi-modal placement distributions at the scene level. Right: Our Local Configuration Refinement then proceeds with a novel disentangled object geometry and placement frame diffusion that together enable precise and dense goal predictions.
We provide visual aids in this section to illustrate how our hierarchical goal prediction framework solves the relative placement tasks on our benchmark, the RPDiff task suite. Specifically, given the scene and object point clouds, we first visualize how our dense GMM estimates per-point mixing weights and residual vectors that propose scene-grounded multi-modal placements in a feed-forward fashion; then, we visualize how our disentangled point diffusion separately diffuses the object geometry and placement frame to generate precise and dense goal predictions.
Given the inputs of the object point cloud and the scene point cloud, our Global Placement Initialization module learns a feedforward network that outputs a spatially-grounded Dense Gaussian Mixture Model (GMM). For each scene point, the network predicts both a mixing weight and a residual vector, defining the center of a Gaussian distribution anchored at that scene point. At inference time, we sample from this mixture to obtain a candidate placement location, similar in spirit to standard GMMs but tailored to the structure of each placement scene. In the following figures, we visualize the high-probability (top 90%) residual vectors and the corresponding placement proposals that are most likely to be sampled given the scene and object point clouds.
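For concreteness, a minimal sketch of this sampling step is shown below. It assumes a hypothetical network `dense_gmm_net` that maps the scene and object point clouds to per-point mixing logits and residual vectors, and a fixed isotropic standard deviation `sigma`; this illustrates the Dense GMM sampling idea rather than our exact implementation.

```python
# Minimal sketch of sampling a global placement from the Dense GMM.
# `dense_gmm_net` and `sigma` are illustrative assumptions, not the released code.
import torch

def sample_global_placement(scene_pts, obj_pts, dense_gmm_net, sigma=0.02):
    """scene_pts: (N, 3), obj_pts: (M, 3) -> one sampled placement center (3,)."""
    # Per-scene-point mixing logits (N,) and residual vectors (N, 3).
    logits, residuals = dense_gmm_net(scene_pts, obj_pts)

    # Each scene point anchors one Gaussian component whose mean is that
    # scene point shifted by its predicted residual vector.
    means = scene_pts + residuals                       # (N, 3)
    weights = torch.softmax(logits, dim=0)              # (N,)

    # Sample a component, then sample a placement from its isotropic Gaussian.
    idx = torch.multinomial(weights, num_samples=1)     # (1,)
    placement = means[idx] + sigma * torch.randn(1, 3)  # assumed fixed std `sigma`
    return placement.squeeze(0)
```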
Given the inputs of the object point cloud, the scene point cloud, and additionally the local placement frame proposed by sampling from the predicted Dense GMM, our Local Configuration Refinement module outputs the precise object placement via a novel disentangled point diffusion process that simultaneously estimates, in the initialized local frame, (i) exactly where the object will be placed, i.e. the translation, and (ii) the configuration of the object in the placement pose, i.e. the rotation for a rigid object, or the shape deformation for a deformable object. Specifically, we separately diffuse the object geometry in the goal configuration (Shape Diffusion) and the object frame within the initialized local frame (Frame Diffusion), with the past frame trajectories of the diffusion shown as well. By composing the shape diffusion and frame diffusion across diffusion timesteps, we obtain the updated and final placement configuration prediction (Composed Reconstruction).
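The sketch below illustrates how such a disentangled reverse process could be composed. The denoisers `shape_denoiser` and `frame_denoiser`, the DDPM-style loop, and the frame parameterization are all illustrative assumptions made for clarity, not our exact update rule.

```python
# Minimal sketch of disentangled shape + frame reverse diffusion,
# with hypothetical denoisers; for illustration only.
import torch

def refine_local_configuration(obj_pts, scene_pts, init_frame,
                               shape_denoiser, frame_denoiser, T=50):
    """obj_pts: (M, 3); init_frame: dict with rotation 'R' (3, 3) and translation
    't' (3,) from Global Placement Initialization; returns goal points (M, 3)."""
    # Initialize noisy object geometry and placement frame in the local frame.
    shape = torch.randn_like(obj_pts)   # noisy goal geometry
    t_vec = torch.zeros(3)              # frame translation residual
    R_mat = torch.eye(3)                # frame rotation

    for step in reversed(range(T)):
        # (i) Shape diffusion: denoise the object geometry in the local frame.
        shape = shape_denoiser(shape, scene_pts, step)

        # (ii) Frame diffusion: denoise the placement frame (rotation + translation).
        R_mat, t_vec = frame_denoiser(R_mat, t_vec, shape, scene_pts, step)

    # Composed reconstruction: map the denoised geometry through the denoised
    # frame, then back into the scene via the initialized local frame.
    goal_local = shape @ R_mat.T + t_vec
    goal_scene = goal_local @ init_frame["R"].T + init_frame["t"]
    return goal_scene
```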
We show that our method enables reliable human-robot collaborative manipulation by performing unimodal and multi-modal connector insertions after human handoffs. Specifically, a human hands a connector (Waterproof, DSUB-25, or SSD) in an arbitrary pose directly to a robot arm, which then autonomously performs precise insertion into a designated socket on the NIST Assembly Task Board, based on the goal configurations predicted by our method. This demonstrates that our goal prediction framework exhibits strong generalization to unseen object configurations with random occlusions due to varying viewpoints, while maintaining the millimeter-scale precision required for successful real-world insertions.
We further evaluate our method both qualitatively and quantitatively on unimodal and multi-modal insertions across three connectors (Waterproof, DSUB-25, and SSD), each with 20 insertion trials involving randomly sampled rotations and translations (see Appendix for details). To enable quantitative evaluation, the robot autonomously grasps the plug with a random angle offset between -10° and 10° and a random translation offset between -5 mm and 5 mm along the feasible axes, removes it from the socket, and moves to a default pose, discarding the grasp orientation. A point cloud of the plug and socket is then captured, and our method and the baseline predict the appropriate insertion configuration without access to the original ground truth. As indicated by the results below, our real-world experiments demonstrate that our method excels in both unimodal and multi-modal insertions, while maintaining the high precision required to reliably accomplish industrial tasks.
| Metric | Waterproof (Uni.) TAX-Pose | Waterproof (Uni.) Ours | DSUB-25 (Uni.) TAX-Pose | DSUB-25 (Uni.) Ours | SSD (Uni.) TAX-Pose | SSD (Uni.) Ours | Waterproof (Multi.) Ours |
|---|---|---|---|---|---|---|---|
| Success Rate | 80% (16/20) | 100% (20/20) | 80% (16/20) | 80% (16/20) | 0% (0/20) | 85% (17/20) | 90% |
| Trans. Err. (mm) | 1.04 | 0.72 | 0.93 | 1.00 | 16.18 | 2.75 | – |
| Rot. Err. (°) | 1.64 | 1.18 | 3.16 | 1.36 | 13.81 | 2.77 | – |
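To make the evaluation protocol above concrete, the following sketch shows how one randomized trial could be generated, using the same ranges described earlier (grasp angle in [-10°, 10°], translation in [-5 mm, 5 mm] along the feasible axes). The function and axis names are hypothetical, introduced only for illustration.

```python
# Minimal sketch of sampling one randomized grasp perturbation per trial;
# `feasible_axes` and the function name are illustrative assumptions.
import numpy as np

def sample_grasp_perturbation(feasible_axes=(0, 1), rng=np.random.default_rng()):
    angle_deg = rng.uniform(-10.0, 10.0)   # random grasp angle offset (degrees)
    trans_mm = np.zeros(3)
    trans_mm[list(feasible_axes)] = rng.uniform(-5.0, 5.0, size=len(feasible_axes))
    return angle_deg, trans_mm             # applied to the grasp before re-insertion
```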
We present additional visualizations of our hierarchical goal prediction framework on the RPDiff task suite. Specifically, we demonstrate the framework's ability to generalize across variations in object geometry and scene configuration, as well as its capacity to produce multi-modal predictions for diverse rigid placement scenarios. For each figure, the green cube visualizes the global placement initialization by our dense GMM module. The colored moving spheres visualize local configuration refinement by our disentangled diffusion module.
Since our point-cloud-based formulation for goal prediction does not assume object rigidity, our method can be naturally applied to deformable objects without requiring any architecture modifications. We validate this capability on a cloth-hanging task from Dynamic Environments with Deformable Objects (DEDO), where we again achieve superior performance in generalization to object geometry and scene configuration, multi-modal coverage, and placement precision compared to a baseline method designed for deformable object placement tasks.