High-level depiction of our method. Left: Our method's inputs, which consist of the segmented object point cloud and the scene point cloud. Right: Our method's outputs, which are hierarchically generated by Global Placement Initialization followed by Local Configuration Refinement.
Recent advances in robotic manipulation have highlighted the effectiveness of learning from demonstration. However, while end-to-end policies excel in expressivity and flexibility, they struggle both in generalizing to novel object geometries and in attaining a high degree of precision. An alternative, object-centric approach frames the task as predicting the placement pose of the target object, providing a modular decomposition of the problem. Building on this goal-prediction paradigm, we propose a hierarchical, disentangled point diffusion framework that achieves state-of-the-art performance in placement precision, multi-modal coverage, and generalization to variations in object geometries and scene configurations. Specifically, we model global scene-level placements through a novel feed-forward Dense Gaussian Mixture Model (GMM) that yields a spatially dense prior over global placements; we then model the local object-level configuration through a novel disentangled point cloud diffusion module that separately diffuses the object geometry and the placement frame, enabling precise local geometric reasoning. Interestingly, we demonstrate that our point cloud diffusion achieves substantially higher accuracy than prior approaches based on SE(3) diffusion, even in the context of rigid object placement. We validate our approach on a suite of challenging tasks in simulation and on high-precision industrial insertion tasks in the real world. Furthermore, we present results on a cloth-hanging task in simulation, indicating that our method can further relax assumptions on object rigidity.
Left: Our Global Placement Initialization samples a rough global position using a novel dense GMM-based prediction module, a framework that models highly multi-modal placement distributions at the scene level. Right: Our Local Configuration Refinement then proceeds with a novel disentangled object geometry and placement frame diffusion that together enable precise and dense goal predictions.
We provide visual aids in this section to illustrate how our hierarchical goal prediction framework solves the relative placement tasks on our benchmark, the RPDiff task suite. Specifically, given the scene and object point clouds, we first visualize how our dense GMM estimates per-point mixing weights and residual vectors that propose scene-grounded multi-modal placements in a feed-forward fashion; then, we visualize how our disentangled point diffusion separately diffuses the object geometry and placement frame to generate precise and dense goal predictions.
Given the inputs of the object point cloud and the scene point cloud, our Global Placement Initialization module learns a feedforward network that outputs a spatially-grounded Dense Gaussian Mixture Model (GMM). For each scene point, the network predicts both a mixing weight and a residual vector, defining the center of a Gaussian distribution anchored at that scene point. At inference time, we sample from this mixture to obtain a candidate placement location, similar in spirit to standard GMMs but tailored to the structure of each placement scene. In the following figures, we visualize the high-probability (top 90%) residual vectors and the corresponding placement proposals that are most likely to be sampled given the scene and object point clouds.
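For concreteness, a minimal sketch of this sampling step is shown below. It assumes a hypothetical network `dense_gmm_net` that maps the scene and object point clouds to per-point mixing logits and residual vectors, and a fixed isotropic standard deviation `sigma`; this illustrates the Dense GMM sampling idea rather than our exact implementation.

```python
# Minimal sketch of sampling a global placement from the Dense GMM.
# `dense_gmm_net` and `sigma` are illustrative assumptions, not the released code.
import torch

def sample_global_placement(scene_pts, obj_pts, dense_gmm_net, sigma=0.02):
    """scene_pts: (N, 3), obj_pts: (M, 3) -> one sampled placement center (3,)."""
    # Per-scene-point mixing logits (N,) and residual vectors (N, 3).
    logits, residuals = dense_gmm_net(scene_pts, obj_pts)

    # Each scene point anchors one Gaussian component whose mean is that
    # scene point shifted by its predicted residual vector.
    means = scene_pts + residuals                       # (N, 3)
    weights = torch.softmax(logits, dim=0)              # (N,)

    # Sample a component, then sample a placement from its isotropic Gaussian.
    idx = torch.multinomial(weights, num_samples=1)     # (1,)
    placement = means[idx] + sigma * torch.randn(1, 3)  # assumed fixed std `sigma`
    return placement.squeeze(0)
```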
Given the inputs of the object point cloud, the scene point cloud, and additionally the local placement frame proposed by sampling from the predicted Dense GMM, our Local Configuration Refinement module outputs the precise object placement via a novel disentangled point diffusion process that simultaneously estimates, in the initialized local frame, (i) exactly where the object will be placed, i.e. the translation, and (ii) the configuration of the object in the placement pose, i.e. the rotation for a rigid object, or the shape deformation for a deformable object. Specifically, we separately diffuse the object geometry in the goal configuration (Shape Diffusion) and the object frame within the initialized local frame (Frame Diffusion), with the past frame trajectories of the diffusion shown as well. By composing the shape diffusion and frame diffusion across diffusion timesteps, we obtain the updated and final placement configuration prediction (Composed Reconstruction).
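The sketch below illustrates how such a disentangled reverse process could be composed. The denoisers `shape_denoiser` and `frame_denoiser`, the DDPM-style loop, and the frame parameterization are all illustrative assumptions made for clarity, not our exact update rule.

```python
# Minimal sketch of disentangled shape + frame reverse diffusion,
# with hypothetical denoisers; for illustration only.
import torch

def refine_local_configuration(obj_pts, scene_pts, init_frame,
                               shape_denoiser, frame_denoiser, T=50):
    """obj_pts: (M, 3); init_frame: dict with rotation 'R' (3, 3) and translation
    't' (3,) from Global Placement Initialization; returns goal points (M, 3)."""
    # Initialize noisy object geometry and placement frame in the local frame.
    shape = torch.randn_like(obj_pts)   # noisy goal geometry
    t_vec = torch.zeros(3)              # frame translation residual
    R_mat = torch.eye(3)                # frame rotation

    for step in reversed(range(T)):
        # (i) Shape diffusion: denoise the object geometry in the local frame.
        shape = shape_denoiser(shape, scene_pts, step)

        # (ii) Frame diffusion: denoise the placement frame (rotation + translation).
        R_mat, t_vec = frame_denoiser(R_mat, t_vec, shape, scene_pts, step)

    # Composed reconstruction: map the denoised geometry through the denoised
    # frame, then back into the scene via the initialized local frame.
    goal_local = shape @ R_mat.T + t_vec
    goal_scene = goal_local @ init_frame["R"].T + init_frame["t"]
    return goal_scene
```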
We show that our method enables reliable human-robot collaborative manipulation by performing unimodal and multi-modal connector insertions after human handoffs. Specifically, a human hands a connector (Waterproof, DSUB-25, or SSD) in an arbitrary pose directly to a robot arm, which then autonomously performs precise insertion into a designated socket on the NIST Assembly Task Board, based on the goal configurations predicted by our method. This demonstrates that our goal prediction framework exhibits strong generalization to unseen object configurations with random occlusions due to varying viewpoints, while maintaining the millimeter-scale precision required for successful real-world insertions.
We further evaluate our method both qualitatively and quantitatively on unimodal and multi-modal insertions across three connectors (Waterproof, DSUB-25, and SSD), each with 20 insertion trials involving randomly sampled rotations and translations (see Appendix for details). To enable quantitative evaluation, the robot autonomously grasps the plug with a random angle offset between -10° and 10° and a random translation offset between -5 mm and 5 mm along the feasible axes, removes it from the socket, and moves to a default pose, discarding the grasp orientation. A point cloud of the plug and socket is then captured, and our method and the baseline predict the appropriate insertion configuration without access to the original ground truth. As indicated by the results below, our real-world experiments demonstrate that our method excels in both unimodal and multi-modal insertions, while maintaining the high precision required to reliably accomplish industrial tasks.
| Metric | Waterproof (Uni.) TAX-Pose | Waterproof (Uni.) Ours | DSUB-25 (Uni.) TAX-Pose | DSUB-25 (Uni.) Ours | SSD (Uni.) TAX-Pose | SSD (Uni.) Ours | Waterproof (Multi.) Ours |
|---|---|---|---|---|---|---|---|
| Success Rate | 80% (16/20) | 100% (20/20) | 80% (16/20) | 80% (16/20) | 0% (0/20) | 85% (17/20) | 90% |
| Trans. Err. (mm) | 1.04 | 0.72 | 0.93 | 1.00 | 16.18 | 2.75 | – |
| Rot. Err. (°) | 1.64 | 1.18 | 3.16 | 1.36 | 13.81 | 2.77 | – |
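To make the evaluation protocol above concrete, the following sketch shows how one randomized trial could be generated, using the same ranges described earlier (grasp angle in [-10°, 10°], translation in [-5 mm, 5 mm] along the feasible axes). The function and axis names are hypothetical, introduced only for illustration.

```python
# Minimal sketch of sampling one randomized grasp perturbation per trial;
# `feasible_axes` and the function name are illustrative assumptions.
import numpy as np

def sample_grasp_perturbation(feasible_axes=(0, 1), rng=np.random.default_rng()):
    angle_deg = rng.uniform(-10.0, 10.0)   # random grasp angle offset (degrees)
    trans_mm = np.zeros(3)
    trans_mm[list(feasible_axes)] = rng.uniform(-5.0, 5.0, size=len(feasible_axes))
    return angle_deg, trans_mm             # applied to the grasp before re-insertion
```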
We present additional visualizations of our hierarchical goal prediction framework on the RPDiff task suite. Specifically, we demonstrate the framework's ability to generalize across variations in object geometry and scene configuration, as well as its capacity to produce multi-modal predictions for diverse rigid placement scenarios. For each figure, the green cube visualizes the global placement initialization by our dense GMM module. The colored moving spheres visualize local configuration refinement by our disentangled diffusion module.
Since our point-cloud-based formulation for goal prediction does not assume object rigidity, our method can be naturally applied to deformable objects without requiring any architecture modifications. We validate this capability on a cloth-hanging task from Dynamic Environments with Deformable Objects (DEDO), where we again achieve superior performance in generalization to object geometry and scene configuration, multi-modal coverage, and placement precision compared to a baseline method designed for deformable object placement tasks.