Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models

Carnegie Mellon University
Corresponding author: pkatara@andrew.cmu.edu

Abstract

Generalist robot manipulators need to learn a wide variety of manipulation skills across diverse environments. Current robot training pipelines rely on humans to provide kinesthetic demonstrations or to program simulation environments and code up reward functions for reinforcement learning. Such human involvement is a major bottleneck to scaling up robot learning across diverse tasks and environments. We propose Generation to Simulation (Gen2Sim), a method for scaling up robot skill learning in simulation by automating the generation of 3D assets, task descriptions, task decompositions and reward functions using large pre-trained generative models of language and vision. We generate 3D assets for simulation by lifting open-world 2D object-centric images to 3D using image diffusion models and querying LLMs to determine plausible physics parameters. Given URDF files of generated and human-developed assets, we chain-of-thought prompt LLMs to map these to relevant task descriptions, temporal decompositions, and corresponding Python reward functions for reinforcement learning. We show Gen2Sim succeeds in learning policies for diverse long-horizon tasks, where reinforcement learning with non-temporally-decomposed reward functions fails. Gen2Sim provides a viable path for scaling up reinforcement learning for robot manipulators in simulation, both by diversifying and expanding task and environment development, and by facilitating the discovery of reinforcement-learned behaviors through temporal task decomposition. Our work contributes hundreds of simulated assets, tasks and demonstrations, taking a step towards fully autonomous robotic manipulation skill acquisition in simulation.

Gen2Sim Skill Acquisition Demo

In this demo, we visualize the environments, task descriptions, task decompositions and reward functions generated by Gen2Sim. We show manipulation skills acquired while interacting with assets of varying part affordances and part structures, as well as skill chaining to achieve long-horizon tasks in complex environments.


Manipulation Skills

[Video gallery: manipulation skills on dishwasher, microwave, oven, and safe assets with varying part affordances, on assets with varying part structures, and on a long-horizon tennis-ball task. Selecting a video displays the corresponding generated task and reward in a code block.]

Gen2Sim Components

Overview: We generate 3D assets for simulation by lifting open-world 2D object-centric images to 3D using image diffusion models and querying LLMs to determine plausible physics parameters. Given URDF files of generated and human-developed assets, we chain-of-thought prompt LLMs to map these to relevant task descriptions, temporal decompositions, and corresponding Python reward functions for reinforcement learning.
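As a rough illustration of the prompting step, the minimal sketch below assembles a chain-of-thought prompt from a URDF and queries an LLM for a task description, temporal decomposition, and reward code. The OpenAI chat client, model name, and prompt wording are assumptions rather than the exact setup used in Gen2Sim; any LLM backend could be substituted.

# Minimal sketch of chain-of-thought prompting an LLM with URDF context.
# The OpenAI client, model name, and prompt wording are assumptions, not the
# exact setup used in Gen2Sim.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def propose_task_and_reward(urdf_path: str) -> str:
    """Ask the LLM for a task description, temporal decomposition, and reward code."""
    urdf_context = Path(urdf_path).read_text()  # links, joints, and limits of the asset
    prompt = (
        "You are given a simulation asset described by the URDF below.\n"
        + urdf_context + "\n"
        "Think step by step:\n"
        "1. Describe a manipulation task that is meaningful for this asset.\n"
        "2. Decompose the task into an ordered list of sub-tasks.\n"
        "3. For each sub-task, write a Python reward function over the simulator state.\n"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example call on a hypothetical asset path:
# print(propose_task_and_reward("assets/microwave/mobility.urdf"))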
Asset Generation
We generate the visual, collision and physics parameters of a URDF by converting images to 3D meshes with a NeRF distillation pipeline, followed by a differentiable mesh fine-tuning step. We then query LLMs for physics properties, such as mass and dimensions, based on the asset's category.
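To make the asset representation concrete, here is a minimal sketch of writing a single-link URDF for a generated mesh. The template and the box-inertia approximation are simplifications, and the mass and size values stand in for the physics parameters queried from the LLM; all names and numbers below are illustrative, not the exact schema Gen2Sim emits.

# Minimal sketch of assembling a URDF for a generated asset.
URDF_TEMPLATE = """<?xml version="1.0"?>
<robot name="{name}">
  <link name="base_link">
    <visual>
      <geometry><mesh filename="{mesh}"/></geometry>
    </visual>
    <collision>
      <geometry><mesh filename="{mesh}"/></geometry>
    </collision>
    <inertial>
      <mass value="{mass}"/>
      <inertia ixx="{i}" ixy="0" ixz="0" iyy="{i}" iyz="0" izz="{i}"/>
    </inertial>
  </link>
</robot>
"""

def write_asset_urdf(name: str, mesh_path: str, mass_kg: float, size_m: float) -> str:
    """Write a single-link URDF; mass and size would come from the LLM query."""
    inertia = mass_kg * size_m ** 2 / 6.0  # uniform-box inertia approximation
    urdf = URDF_TEMPLATE.format(name=name, mesh=mesh_path, mass=mass_kg, i=inertia)
    path = name + ".urdf"
    with open(path, "w") as f:
        f.write(urdf)
    return path

# Example with hypothetical LLM-suggested values for a cup:
# write_asset_urdf("cup", "meshes/cup.obj", mass_kg=0.3, size_m=0.1)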



Generating 3D Assets from in-the-wild images: We generate hundreds of assets with diverse appearance, geometry and physics.



Task and Reward Generation
We extract relevant context from URDF files of generated and human-developed assets, and chain-of-thought prompt LLMs to map these to relevant task descriptions, temporal decompositions, and corresponding Python reward functions. Finally, we use PPO, a model-free reinforcement learning algorithm, to acquire skills in simulation.
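For concreteness, below is a hypothetical example of the kind of temporally decomposed reward an LLM might generate for a task such as "open the microwave door". The state accessors (gripper position, handle position, door joint angle) are placeholder names standing in for whatever the simulator's state API actually exposes.

# Hypothetical example of a temporally decomposed reward for "open the microwave door".
import numpy as np

def reward_reach_handle(gripper_pos: np.ndarray, handle_pos: np.ndarray) -> float:
    """Sub-task 1: move the gripper close to the door handle."""
    return -float(np.linalg.norm(gripper_pos - handle_pos))  # closer is better

def reward_open_door(door_angle: float, target_angle: float = np.pi / 2) -> float:
    """Sub-task 2: rotate the door joint toward the target opening angle."""
    return -abs(target_angle - door_angle)

def task_reward(gripper_pos, handle_pos, door_angle, reach_threshold: float = 0.05) -> float:
    """One possible way to chain the sub-task rewards for a single policy."""
    if np.linalg.norm(gripper_pos - handle_pos) > reach_threshold:
        return reward_reach_handle(gripper_pos, handle_pos)
    return 1.0 + reward_open_door(door_angle)  # bonus once the handle is reached

Each sub-task reward is optimized with PPO and the resulting skills can be chained for long-horizon tasks; the combined task_reward above is only one illustrative way of composing them.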

BibTeX

@misc{katara2023gen2sim,
      title={Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models},
      author={Pushkal Katara and Zhou Xian and Katerina Fragkiadaki},
      year={2023},
      eprint={2310.18308},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}