Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models

Abstract

A central challenge in developing robots that can relate human language to their perception and actions is the scarcity of natural language annotations in diverse robot datasets. Moreover, robot policies that follow natural language instructions are typically trained on either templated language or expensive human-labeled instructions, which hinders their scalability. To this end, we introduce NILS: Natural language Instruction Labeling for Scalability. NILS automatically labels uncurated, long-horizon robot data at scale in a zero-shot manner without any human intervention. NILS combines pre-trained vision-language foundation models to detect objects in a scene, detect object-centric changes, segment tasks from large datasets of unlabeled interaction data, and ultimately label behavior datasets. Evaluations on BridgeV2 and a kitchen play dataset show that NILS is able to autonomously annotate diverse robot demonstrations from unlabeled and unstructured datasets, while alleviating several shortcomings of crowdsourced human annotations.

NILS

Technical Summary

Architecture

Overview of the proposed NILS framework for labeling long-horizon robot play sequences in a zero-shot manner using an ensemble of pretrained expert models. NILS consists of three stages:

1. All relevant objects in the video are detected.
2. Object-centric changes are detected and collected.
3. Object change information is used to detect keystates, and an LLM is prompted to generate a language label for the task.
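
The data flow between the three stages can be illustrated with a small, self-contained sketch. This is not the NILS implementation: the object tracks stand in for the output of the object detectors, the movement threshold `min_shift`, the `ObjectChange` record, and the helper names `detect_changes`, `detect_keystates`, and `build_prompt` are all hypothetical, and the prompt wording is an assumption. The sketch only shows how per-object changes could be grouped, turned into keystates, and summarized in a text prompt for an LLM.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

Position = Tuple[float, float]  # (x, y) centre of a detected object, in pixels


@dataclass
class ObjectChange:
    """Hypothetical record of one object moving between two frames."""
    name: str
    start_frame: int
    end_frame: int
    displacement: Tuple[float, float]


def _make_change(name: str, positions: List[Position], start: int, end: int) -> ObjectChange:
    dx = positions[end][0] - positions[start][0]
    dy = positions[end][1] - positions[start][1]
    return ObjectChange(name, start, end, (dx, dy))


def detect_changes(tracks: Dict[str, List[Position]], min_shift: float = 5.0) -> List[ObjectChange]:
    """Group consecutive frames in which an object moves into a single change.

    `min_shift` (pixels per frame) is an assumed threshold, not a value from NILS.
    """
    changes: List[ObjectChange] = []
    for name, positions in tracks.items():
        start = None  # frame index at which the current movement began
        for t in range(1, len(positions)):
            dx = positions[t][0] - positions[t - 1][0]
            dy = positions[t][1] - positions[t - 1][1]
            moving = (dx * dx + dy * dy) ** 0.5 >= min_shift
            if moving and start is None:
                start = t - 1
            elif not moving and start is not None:
                changes.append(_make_change(name, positions, start, t - 1))
                start = None
        if start is not None:  # object was still moving in the last frame
            changes.append(_make_change(name, positions, start, len(positions) - 1))
    return changes


def detect_keystates(changes: List[ObjectChange]) -> List[int]:
    """Treat the frame at which each change ends as a keystate."""
    return sorted({c.end_frame for c in changes})


def build_prompt(objects: Iterable[str], change: ObjectChange) -> str:
    """Summarize one change as text that could be sent to an LLM."""
    dx, dy = change.displacement
    return (
        f"Objects in the scene: {', '.join(sorted(objects))}.\n"
        f"Between frames {change.start_frame} and {change.end_frame}, "
        f"'{change.name}' moved by ({dx:.1f}, {dy:.1f}) pixels.\n"
        "Describe the task the robot performed as one short instruction."
    )


# Toy input standing in for Stage 1 output: the fork moves, the pan stays put.
tracks = {
    "fork": [(0.0, 0.0), (0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (20.0, 0.0)],
    "pan": [(50.0, 50.0)] * 5,
}
changes = detect_changes(tracks)
keystates = detect_keystates(changes)
prompt = build_prompt(tracks.keys(), changes[0])
print(keystates)
print(prompt)
```

In this toy example only the fork produces a change, so a single keystate is found and one prompt is built. A real pipeline would replace the hand-written tracks with detector output and send the prompt to an LLM.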

Examples

These examples showcase annotations generated by our framework together with the respective scene annotations. Each example shows the video of the long-horizon trajectory, the last keystate, the scene annotations, and the generated labels.

Example Labeling Videos

These videos showcase the annotations generated by NILS on BridgeV2 and Fractal. The bounding boxes shown are those obtained after Stage 2 and NILS' filtering steps.

Annotations for BridgeV2

Annotations for Fractal 2022

Policy Rollouts

These examples showcase some tasks performed by a policy trained on our real-kitchen dataset that is annotated by NILS. The policy is evaluated on the same toy kitchen.

The following examples are rollouts of an Octo policy trained on the BridgeV2 dataset using the labels generated by NILS. Rollouts were performed both in the real world and in simulation (using SimplerEnv).

Place the green spoon on top of the rag

Place the sushi inside the wooden bowl

Place the sushi inside the green bowl

Place the yellow spoon on top the blue cloth

Relocate the yellow spoon from the table to inside the blue cloth

Failure Cases

Move the fork forward
Move the fork to the left
Move the fork away from the round object
Move the fork to the left

Move the pan to the left of the stovetop
Clean the pan with the kitchen towel
Place the pan on top of the kitchen towel, next to the chicken wing and spoon
Move the pan 29.5 pixels to the left and 79.5 pixels forward

Wipe the table
Dust the lamp
Polish the silverware
Wipe up the spill

Move the toy corn to the left of the blue cup
Place the toy corn in the center of the table, next to the blue cup
Shift the toy corn 130.5 pixels to the right
Relocate the toy corn from the left side of the table to the center

Move the soda can to the right and place it next to the toy mouse
Pick up the soda can and place it to the left of the toy mouse
Relocate the soda can from its initial position to the left of the toy mouse
Place the soda can next to the toy mouse on its left side

Remove the pot lid from the sausage toy
Lift the pot lid off the sausage toy
Take the pot lid off the sausage toy
Uncover the sausage toy by removing the pot lid

Citation

@inproceedings{
blank2024scaling,
title={Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models},
author={Nils Blank and Moritz Reuss and Marcel R{\"u}hle and {\"O}mer Erdin{\c{c}} Ya{\u{g}}murlu and Fabian Wenzel and Oier Mees and Rudolf Lioutikov},
booktitle={8th Annual Conference on Robot Learning},
year={2024},
url={https://openreview.net/forum?id=EdVNB2kHv1}
}