Pick-and-place Manipulation Across Grippers Without Retraining: A Learning-optimization Diffusion Policy Approach

1School of Computation, Information and Technology, Technical University of Munich, Germany
2Hong Kong University of Science and Technology, China
3State Key Laboratory for Novel Software Technology, Nanjing University, China
4Sun Yat-sen University, Guangzhou, China
5Tsinghua University, China
Corresponding author: zhenshan.bing@tum.de

Abstract

Current robotic pick-and-place policies typically require consistent gripper configurations across training and inference. This constraint imposes high retraining or fine-tuning costs, especially for imitation learning-based approaches, when adapting to new end-effectors. To mitigate this issue, we present a diffusion-based policy with a hybrid learning-optimization framework, enabling zero-shot adaptation to novel grippers without additional data collection for policy retraining. During training, the policy learns manipulation primitives from demonstrations collected using a base gripper. At inference, a diffusion-based optimization strategy dynamically enforces kinematic and safety constraints, ensuring that generated trajectories align with the physical properties of unseen grippers. This is achieved through a constrained denoising procedure that adapts trajectories to gripper-specific parameters (e.g., tool-center-point offsets, jaw widths) while preserving collision avoidance and task feasibility. We validate our method on a Franka Panda robot across six gripper configurations, including 3D-printed fingertips, a flexible silicone gripper, and a Robotiq 2F-85 gripper. Our approach achieves a 93.3% average task success rate across grippers (vs. 23.3-26.7% for diffusion policy baselines), supporting tool-center-point variations of 16-23.5 cm and jaw widths of 7.5-11.5 cm. The results demonstrate that constrained diffusion enables robust cross-gripper manipulation while maintaining the sample efficiency of imitation learning, eliminating the need for gripper-specific retraining.

GADP

Framework Overview

Pick-and-place manipulation across grippers without retraining is realized by a diffusion-based policy that transfers pick-and-place knowledge across different grippers. This knowledge transfer requires neither retraining nor fine-tuning the policy for the new gripper's configuration. Instead, the configuration is introduced only at the policy inference phase, and the generated trajectories are made to satisfy safety constraints, ensuring successful completion of pick-and-place tasks.
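The idea above can be sketched as a reverse-diffusion sampling loop with a projection step after every denoising iteration. This is an illustrative sketch, not the paper's exact implementation: the denoiser interface, the waypoint count, and the single constraint used here (a minimum commanded height that accounts for the mounted gripper's TCP offset) are assumptions.

```python
import numpy as np

def project_to_constraints(traj, tcp_offset, z_min=0.02):
    """Keep each waypoint's commanded height feasible for the mounted gripper.
    tcp_offset (m) and z_min (table clearance, m) are illustrative parameters."""
    traj = traj.copy()
    traj[:, 2] = np.maximum(traj[:, 2], z_min + tcp_offset)
    return traj

def constrained_denoise(model, obs, tcp_offset, steps=50, horizon=16):
    """Reverse-diffusion sampling with a projection after every denoising step,
    so the final trajectory satisfies the gripper-specific constraint."""
    traj = np.random.randn(horizon, 3)      # xyz waypoints, initialized as noise
    for t in reversed(range(steps)):
        traj = model(traj, obs, t)          # one (hypothetical) denoising step
        traj = project_to_constraints(traj, tcp_offset)
    return traj
```

Here `model` stands in for the trained policy's single-step denoiser; the key point is that the projection runs inside the sampling loop, so changing `tcp_offset` at inference requires no retraining.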


Fig. 1: Framework of GADP. (a) The multi-modal observation consists of the robot pose Srot', scene point clouds Ssce, and the grasping probability map Gprob*. Gripper morphological variations are encoded into Srot' via gripper mapping. (b) Safety-constrained trajectory projection via an online optimization process, which forces the executed trajectory to satisfy task and safety constraints.
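As a concrete illustration of the gripper mapping in (a), the end-effector pose can be re-expressed relative to the base gripper's tool-center-point before it is fed to the policy, so observations stay consistent with the training data. The numeric specs come from the TCP ranges reported in the abstract, but the function name, pose layout, and sign convention of the compensation are assumptions for illustration.

```python
import numpy as np

# Hypothetical gripper specs (m); TCP offsets follow the 16-23.5 cm range above
BASE_GRIPPER = {"tcp_offset": 0.16, "jaw_width": 0.075}   # training-time gripper

def map_pose(ee_pose, gripper):
    """Shift the commanded pose along the tool (z) axis by the TCP-offset
    difference, so the policy 'sees' the base gripper it was trained with."""
    pose = np.asarray(ee_pose, dtype=float).copy()
    pose[2] += gripper["tcp_offset"] - BASE_GRIPPER["tcp_offset"]
    return pose

new_gripper = {"tcp_offset": 0.235, "jaw_width": 0.115}   # e.g. a longer gripper
mapped = map_pose([0.4, 0.0, 0.30, 0.0, 0.0, 0.0], new_gripper)
```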

We demonstrate the robustness of our policy by completing pick-and-place tasks safely on a seen object (block) and an unseen object (banana), using a Franka Panda robot with six different grippers, including 3D-printed fingertips and a Robotiq 2F-85 gripper.

Our policy also completes pick-and-place tasks on various kinds of unseen objects across varied placement positions, even when the point cloud quality is imprecise (captured by an Intel® RealSense™ Depth Camera D455).

Grasping Probability Map

Visuomotor policies, such as Diffusion Policy and 3D Diffusion Policy, depend on visual observations to generate robot trajectories. However, swapping grippers during online execution alters these observations (both RGB and point cloud inputs). To address this issue, we propose a grasping probability map Gprob* that encodes the probability of successful grasping at each pixel of the image. We adopt the Generative Grasping CNN (GG-CNN) to synthesize Gprob* from depth images. To mitigate variations in grasping probability caused by robot motion and lighting changes, we modify GG-CNN with spatial filtering that preserves grasp affordance information while eliminating outlier predictions caused by sensor noise, as shown in Fig. 2.
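A minimal stand-in for that spatial filtering step: a median filter suppresses isolated high-probability spikes (typical of sensor noise), while contiguous graspable regions survive. The kernel size and the 0.4 threshold are illustrative assumptions; the actual modification to GG-CNN may differ.

```python
import numpy as np

def stabilize_grasp_map(g_prob, k=3, thresh=0.4):
    """Median-filter a grasp-probability map, then keep only confident regions.
    A simplified sketch of outlier suppression, not the paper's exact filter."""
    h, w = g_prob.shape
    pad = k // 2
    padded = np.pad(g_prob, pad, mode="edge")
    out = np.empty_like(g_prob)
    for i in range(h):
        for j in range(w):
            # median over the k x k neighborhood removes isolated spikes
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return np.where(out > thresh, out, 0.0)
```

A single noisy pixel is wiped out by the median, whereas a 3×3 patch of high probability passes through unchanged.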

Fig. 2: The gripper-agnostic grasping knowledge. Dynamic changes in robot pose and gripper variants alter visual observations, including RGB-D images and object grasp probabilities; Gprob* provides stable visual information.
Fig. 3: Grasping probability maps under different conditions.

We evaluate the stability of grasping probability maps under three perturbation scenarios: graspable object variation, gripper morphology, and viewpoint changes (e.g., end-effector heights). Fig. 3 compares the original GG-CNN grasping probability map Gprob with our proposed grasping probability map Gprob* under these scenarios using normalized probability histograms; only the high-probability parts (> 0.4) are shown.
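The histogram comparison can be reproduced along these lines. The 0.4 threshold matches the figure, while the bin count and the overlap-based stability score are assumptions added for illustration.

```python
import numpy as np

def high_prob_histogram(g_map, thresh=0.4, bins=6):
    """Normalized histogram over the high-probability part of a grasp map,
    mirroring the Fig. 3 comparison (bin count is an assumption)."""
    vals = g_map[g_map > thresh]
    if vals.size == 0:
        return np.zeros(bins)
    hist, _ = np.histogram(vals, bins=bins, range=(thresh, 1.0))
    return hist / hist.sum()

def stability(h1, h2):
    """Histogram overlap in [0, 1]: 1.0 means identical distributions."""
    return 1.0 - 0.5 * np.abs(h1 - h2).sum()
```

Comparing the histogram of a map before and after a perturbation then gives one scalar per scenario; a stable map scores close to 1.0.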

Real-robot experiments

To evaluate our proposed method, we designed experiments to (i) evaluate the stability of grasping probability maps under different object and robot pose conditions, (ii) validate each policy's generalization on pick-and-place tasks across diverse two-finger grippers without retraining, and (iii) assess the safety and task completion of the generated trajectories across different policies. A trial is counted as successful if the object is picked from the tabletop and placed on a box; all other outcomes, including failing to grasp the object or dropping it during manipulation, are counted as failures. The real-robot experiments are conducted with six different grippers.


Fig. 4: Gripper variants of different morphologies. (Unit: cm)

We compare our method's average success rate with two baselines: (i) Diffusion Policy and (ii) Diffusion Policy 3D. We also conduct ablation studies to evaluate the effectiveness of our method via (i) Ours w/o Projection and (ii) Ours w/o Grasping Probability Map Gprob*:


Average success rates (%) for the pick-and-place task on a seen object (block)

Method                 G0    G1    G2    G3    G4    G5    Average
Diffusion Policy       20     0    60    40     0    40    26.7
Diffusion Policy 3D    20     0    60    60     0     0    23.3
DP + Projection       100     -     -     -     -     -    -
DP3 + Projection       80     -     -     -     -     -    -
Ours w/o Projection   100     0    60    40     0     0    33.3
Ours w/o Gprob*        80    20    40    80   100     0    53.3
Our method            100    80   100    80   100   100    93.3

Average success rates (%) for the pick-and-place task on an unseen object (banana)

Method                 G0    G1    G2    G3    G4    G5    Average
Diffusion Policy       20     0    40    40    40    20    30
Diffusion Policy 3D    20     0    40    40    20     0    20
Ours w/o Projection    80     0    40    40     0    20    30
Ours w/o Gprob*        60     0    40    60    40     0    33.3
Our method             80    60   100    60    60    60    70
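As a sanity check on the seen-object table above, the reported averages are the plain means of the six per-gripper rates:

```python
# Per-gripper success rates (%) on the seen object (block), grippers G0-G5,
# transcribed from the table above.
seen = {
    "Diffusion Policy":    [20, 0, 60, 40, 0, 40],
    "Diffusion Policy 3D": [20, 0, 60, 60, 0, 0],
    "Ours w/o Projection": [100, 0, 60, 40, 0, 0],
    "Ours w/o Gprob*":     [80, 20, 40, 80, 100, 0],
    "Our method":          [100, 80, 100, 80, 100, 100],
}
averages = {m: round(sum(v) / len(v), 1) for m, v in seen.items()}
# averages["Our method"] → 93.3
```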

Baseline Methods

We also analyze how gripper mapping and safety-constrained trajectory projection ensure safe trajectory generation across different grippers. The baseline policies, Diffusion Policy and Diffusion Policy 3D, often result in collisions.


After introducing the safety-constrained trajectory projection to DP and DP3 at the inference phase, the generated motions are projected into the defined safety constraints, enabling both policies to complete the task without collisions.
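One way such a post-hoc projection can look: clip each waypoint of the baseline's generated trajectory into an axis-aligned safe workspace box and cap per-step motion. This is a minimal sketch under assumed constraints; the paper's projection also encodes gripper-specific limits, and the bounds and step limit below are illustrative.

```python
import numpy as np

def project_safe(traj, lo, hi, max_step=0.05):
    """Post-hoc projection of a generated trajectory into a safe set:
    (i) clip waypoints into the workspace box [lo, hi],
    (ii) limit the displacement between consecutive waypoints."""
    traj = np.clip(traj, lo, hi)
    for i in range(1, len(traj)):
        delta = traj[i] - traj[i - 1]
        norm = np.linalg.norm(delta)
        if norm > max_step:
            # shrink the step toward the previous waypoint; the box is convex,
            # so the shortened step stays inside it
            traj[i] = traj[i - 1] + delta * (max_step / norm)
    return traj
```

Because the projection only touches the policy's output, it can be wrapped around DP or DP3 at inference without modifying either network.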


Citation

If you find this work useful, please consider citing:

@misc{yao2025pickandplacemanipulationgrippersretraining,
    title={Pick-and-place Manipulation Across Grippers Without Retraining: A Learning-optimization Diffusion Policy Approach}, 
    author={Xiangtong Yao and Yirui Zhou and Yuan Meng and Liangyu Dong and Lin Hong and Zitao Zhang and Zhenshan Bing and Kai Huang and Fuchun Sun and Alois Knoll},
    year={2025},
    eprint={2502.15613},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2502.15613},
}