RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches

Jiayuan Gu1,2, Sean Kirmani1, Paul Wohlhart1, Yao Lu1, Montserrat Gonzalez Arenas1, Kanishka Rao1,
Wenhao Yu1, Chuyuan Fu1, Keerthana Gopalakrishnan1, Zhuo Xu1, Priya Sundaresan3,4, Peng Xu1,
Hao Su2, Karol Hausman1, Chelsea Finn1,3, Quan Vuong1, Ted Xiao1
1Google DeepMind, 2University of California San Diego, 3Stanford University, 4Intrinsic
ICLR 2024 (Spotlight)

Language-conditioned policies like RT-1 struggle to generalize to new scenarios that require extrapolation of language specifications. To address this, we propose RT-Trajectory, a robotic control policy conditioned on trajectory sketches: a novel conditioning method that is practical, easy to specify, and enables effective generalization to novel tasks beyond the training data.

Abstract

Generalization remains one of the most important desiderata for robust robot learning systems. While recently proposed approaches show promise in generalization to novel objects, semantic concepts, or visual distribution shifts, generalization to new tasks remains challenging. For example, a language-conditioned policy trained on pick-and-place tasks will not be able to generalize to a folding task, even if the arm trajectory of folding is similar to pick-and-place. Our key insight is that this kind of generalization becomes feasible if we represent the task through rough trajectory sketches. We propose a policy conditioning method using such rough trajectory sketches, which we call RT-Trajectory, that is practical, easy to specify, and allows the policy to effectively perform new tasks that would otherwise be challenging to perform. We find that trajectory sketches strike a balance between being detailed enough to express low-level motion-centric guidance while being coarse enough to allow the learned policy to interpret the trajectory sketch in the context of situational visual observations. In addition, we show how trajectory sketches can provide a useful interface to communicate with robotic policies -- they can be specified through simple human inputs like drawings or videos, or through automated methods such as modern image-generating or waypoint-generating models, for example VLMs or LLMs. We evaluate RT-Trajectory at scale on a variety of real-world robotic tasks, and find that RT-Trajectory is able to perform a wider range of tasks compared to language-conditioned and goal-conditioned policies, when provided the same training data.

Overview

We propose RT-Trajectory, which utilizes coarse trajectory sketches for policy conditioning. We train on hindsight trajectory sketches (top left) and evaluate on inference trajectories (bottom left) produced via Trajectory Drawings, Human Videos, or Foundation Models. These trajectory sketches are used as task specification for an RT-1 policy backbone (right).

For Training: Hindsight Trajectory Labels

2D Trajectory + Interaction Markers

For each episode in the dataset of demonstrations, we extract a 2D trajectory of robot end-effector center points. Concretely, given the proprioceptive information recorded in the episode, we obtain the 3D position of the robot end-effector center, defined in the robot base frame, at each time step, and project it into camera space using the known camera extrinsic and intrinsic parameters. Given a 2D trajectory (a sequence of pixel positions), we draw a curve on a blank image by connecting the 2D end-effector center points of adjacent time steps with straight line segments. We draw green (or blue) circles at the 2D tool center points of all key time steps where the gripper closes (or opens).

The red curve represents the projected 2D trajectory. The green marker indicates where the gripper closes and the blue one indicates where the gripper opens.
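
As a rough illustration of how such a sketch can be rendered from proprioception, here is a minimal Python example (the array shapes, camera conventions, and use of OpenCV for drawing are assumptions; this is not the released implementation):

import cv2
import numpy as np

def draw_trajectory_sketch(ee_positions, gripper_closed, K, T_cam_base, image_hw=(256, 320)):
    # ee_positions: (T, 3) end-effector centers in the robot base frame.
    # gripper_closed: (T,) booleans; K: 3x3 intrinsics; T_cam_base: 4x4 base-to-camera extrinsics.
    ee_positions = np.asarray(ee_positions, dtype=float)
    h, w = image_hw
    canvas = np.zeros((h, w, 3), dtype=np.uint8)  # blank image

    # Project 3D end-effector centers into pixel space.
    pts_h = np.concatenate([ee_positions, np.ones((len(ee_positions), 1))], axis=1)
    pts_cam = (T_cam_base @ pts_h.T).T[:, :3]
    uv = (K @ pts_cam.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)

    # Connect adjacent time steps with straight line segments (red curve; BGR color order).
    for p0, p1 in zip(uv[:-1], uv[1:]):
        cv2.line(canvas, tuple(map(int, p0)), tuple(map(int, p1)), color=(0, 0, 255), thickness=3)

    # Mark key time steps: green circles where the gripper closes, blue circles where it opens.
    for t in range(1, len(gripper_closed)):
        if gripper_closed[t] and not gripper_closed[t - 1]:
            cv2.circle(canvas, tuple(map(int, uv[t])), 6, color=(0, 255, 0), thickness=-1)
        elif gripper_closed[t - 1] and not gripper_closed[t]:
            cv2.circle(canvas, tuple(map(int, uv[t])), 6, color=(255, 0, 0), thickness=-1)
    return canvas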

Color Grading

To express relative temporal motion, which encodes properties such as velocity and direction, we also explore using the red channel of the RGB trajectory image to specify the normalized time step.
Additionally, we propose incorporating height information into the trajectory representation by utilizing the green channel of the RGB trajectory image to encode normalized height.
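
As a sketch of this color grading (assumed shapes and conventions, not the released code; the 2D variant simply leaves the green channel unused), each line segment can be colored by its normalized time step and, for the 2.5D variant, by normalized end-effector height:

import cv2
import numpy as np

def draw_graded_trajectory(uv, heights, image_hw=(256, 320)):
    # uv: (T, 2) projected end-effector pixels; heights: (T,) end-effector z in the base frame.
    h, w = image_hw
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    heights = np.asarray(heights, dtype=float)
    z_norm = (heights - heights.min()) / max(heights.max() - heights.min(), 1e-6)
    T = len(uv)
    for t in range(T - 1):
        r = int(255 * t / max(T - 1, 1))   # red channel: normalized time step
        g = int(255 * z_norm[t])           # green channel: normalized height (2.5D only)
        p0, p1 = tuple(map(int, uv[t])), tuple(map(int, uv[t + 1]))
        cv2.line(canvas, p0, p1, color=(0, g, r), thickness=3)  # BGR order
    return canvas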

Trajectory Representations

We propose two forms of trajectory representation which are composed of the basic elements described above: RT-Trajectory (2D) and RT-Trajectory (2.5D). Here are two example trajectory sketches:
RT-Trajectory(2D)
RT-Trajectory (2D): 2D trajectory with temporal information and interaction markers.
RT-Trajectory(2.5D)
RT-Trajectory (2.5D): RT-Trajectory (2D) + height information.

Seen Skills

We use the RT-1 demonstration dataset for training, which contains 551 instructions inspired by an office kitchen setting. The language instructions cover 8 different manipulation skills operating on a set of 17 household kitchen items; in total, the dataset consists of 73,334 real robot demonstrations across these 551 training tasks, collected by manual teleoperation.
Here are example rollouts of these skills. The corresponding language instructions are shown below the videos.

For Inference: Human Drawings

Human-drawn sketches are an intuitive and practical way to generate trajectory sketches at inference time. To scalably produce these sketches during evaluations, we design a simple GUI for users to draw trajectory sketches given the robot's initial camera image.

human drawing UI
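
A minimal stand-in for such a GUI (an assumed workflow using an OpenCV window, not the actual tool) lets a user drag the mouse to draw a curve over the initial camera image and returns the drawn sketch:

import cv2
import numpy as np

def collect_human_sketch(initial_image):
    # Drag with the left mouse button to draw the curve; press 'q' to finish.
    canvas = initial_image.copy()
    sketch = np.zeros_like(initial_image)
    drawing = {"active": False, "last": None}

    def on_mouse(event, x, y, flags, param):
        if event == cv2.EVENT_LBUTTONDOWN:
            drawing["active"], drawing["last"] = True, (x, y)
        elif event == cv2.EVENT_MOUSEMOVE and drawing["active"]:
            cv2.line(canvas, drawing["last"], (x, y), (0, 0, 255), 3)
            cv2.line(sketch, drawing["last"], (x, y), (0, 0, 255), 3)
            drawing["last"] = (x, y)
        elif event == cv2.EVENT_LBUTTONUP:
            drawing["active"] = False

    cv2.namedWindow("draw trajectory")
    cv2.setMouseCallback("draw trajectory", on_mouse)
    while True:
        cv2.imshow("draw trajectory", canvas)
        if cv2.waitKey(10) & 0xFF == ord("q"):
            break
    cv2.destroyAllWindows()
    return sketch  # drawn curve, to be used as the trajectory-sketch conditioning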

Unseen Skills

We propose 7 new skills for evaluation, involving unseen objects, manipulation workspaces, and novel motions, to study whether RT-Trajectory can generalize to tasks beyond those contained in the training dataset. The skill Fold Towel is shown in the Human Demonstration Videos section.

Place Fruit

Place Fruit examines whether the policy can place objects into unseen containers.

Upright and Move

Upright and Move examines whether the policy can combine distinct seen skills (Place Upright and Move Near) to form a new skill.

Move within Drawer

Move within Drawer studies whether the policy can move objects within a drawer, whereas the seen skill Move Near covers such motions only at a fixed tabletop height.

Restock Drawer

Restock Drawer requires the robot to place snacks into a specific empty slot in the drawer. It studies whether the policy can place objects precisely at target positions.

Pick from Chair

Pick from Chair investigates whether the policy can pick objects at an unseen height in an unseen manipulation workspace.

Swivel Chair

Swivel Chair showcases the capability to interact with an underactuated system at a novel height with a novel motion.

Quantitative Results

We compare RT-Trajectory with other learning-based baselines on generalization to unseen task scenarios.
  • RT-1: language-conditioned policy trained on the same training data;
  • RT-2: language-conditioned policy trained on a mixture of our training data and internet-scale VQA data;
  • RT-1-goal: goal-conditioned policy trained on the same training data.
Quantitative results for unseen skills
Caption: Success rates for unseen tasks when conditioning with human-drawn trajectory sketches. Scenarios contain a variety of difficult settings which require combining seen motions in novel ways or generalizing to new motions. Each policy is evaluated for a total of 64 trials across 7 different scenarios.

Language-conditioned policies struggle to generalize to new tasks with semantically unseen language instructions, even if the motions needed to achieve these tasks were seen during training. RT-1-goal shows better generalization than its language-conditioned counterparts. However, goal images are much harder to acquire than trajectory sketches at inference time in new scenes, and goal conditioning is sensitive to task-irrelevant factors (e.g., backgrounds). RT-Trajectory (2.5D) outperforms RT-Trajectory (2D) on tasks where height information helps reduce ambiguity. For example, with 2D trajectories only, it is difficult for RT-Trajectory (2D) to infer correct picking heights, which is critical for the Pick from Chair evaluations.

For Inference: Human Demonstration Videos with Hand-object Interaction

We also study first-person human single-hand demonstration videos. We estimate the trajectory of human hand poses from the video and convert it into a trajectory of robot tool poses, which can be used to generate a trajectory sketch. We collect 18 and 4 first-person human demonstration videos with hand-object interaction for Pick and Fold Towel, respectively.
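
As a simplified illustration of the idea (the actual pipeline estimates hand poses and maps them to robot tool poses; the heuristic below only assumes 2D fingertip keypoints from an off-the-shelf hand tracker, and the pinch threshold is an assumption), per-frame hand keypoints can be converted into the same inputs used to draw training sketches:

import numpy as np

def hand_video_to_sketch_points(hand_keypoints_px, pinch_threshold_px=20.0):
    # hand_keypoints_px: list of (thumb_tip, index_tip) pixel coordinates, one pair per frame.
    centers, closed = [], []
    for thumb_tip, index_tip in hand_keypoints_px:
        thumb_tip, index_tip = np.asarray(thumb_tip, float), np.asarray(index_tip, float)
        centers.append((thumb_tip + index_tip) / 2.0)            # proxy for the tool center pixel
        closed.append(np.linalg.norm(thumb_tip - index_tip) < pinch_threshold_px)  # pinch = "gripper closed"
    # Returns a pixel trajectory and per-frame gripper states, which can be rendered
    # with the same drawing routine as the hindsight training sketches.
    return np.stack(centers), np.asarray(closed)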

Fold Towel

Caption: Each visualization shows the human video demonstration (left), the RT-Trajectory policy rollout (middle), and the trajectory sketch overlaid on the rollout (right).

Pick

Caption: Each visualization shows the human demonstration (left), the RT-Trajectory policy rollout (middle), and the trajectory sketch overlaid on the rollout (right).

For Inference: Prompting LLMs with Code as Policies

We prompt an LLM to write code to generate trajectories given the task instructions and object labels for Pick and Open Drawer. After executing the code written by the LLM, we get a sequence of target robot waypoints which can then be processed into a trajectory sketch.

For instance, consider the Open Drawer task. We first detect the objects of interest (i.e., the drawer handle) with a vision-language model (VLM). Then, we prompt GPT-4 to write code that generates trajectories given the task instruction ("open the top drawer") along with object poses and sizes.

Detection results
open_top_drawer_with_handle
LLM output
# There are two drawer handles, and the top-most should be compared along the z-axis.
# First handle has z-value of 0.63, second has z-value of 0.12.
# The positive z direction corresponds to up, so the first handle is the top-most.
top_drawer_handle_position =  objects['drawer handle'][0]['centroid_pose']['position']
top_drawer_handle_orientation =  objects['drawer handle'][0]['centroid_pose']['orientation']
top_drawer_handle_size = objects['drawer handle'][0]['size']
# The handle has a bounding box size of [0.00 , 0.12 , 0.06] in meters and is located on the front x-y plane of the cabinet.
# The gripper has a max span of 10 cm, the size of the handle along the y-axis is 0.12 so it can only grasp the handle with fingers aligned along the x axis and z axis.
# A bottom grasp is ruled out since the robot would collide with the cabinet.
# A back grasp is unfeasible since the robot is in front of the cabinet and cannot go around it to make a back grasp.
# A top grasp with fingers aligned with the x-axis or z-axis is feasible.
# A front grasp with fingers aligned with the z-axis and a side grasp with fingers aligned with the x-axis are feasible too.
# We choose the front grasp with fingers aligned with the z-axis for simplicity.
# Get quaternion corresponding to [-90, 0, -90] roll,pitch and yaw for a front grasp with fingers aligned with the z-axis.
grasp_orientation_quaternion = robot_api.orientation_quaternion_from_euler(-90, 0, -90)
# Calculate grasp position so object ends within gripper fingers.
grasp_pose = {'position': top_drawer_handle_position, 'orientation': grasp_orientation_quaternion}
# The pregrasp pose is the pose right before the grasp.
# Since this is a front grasp, this means the gripper is pointing towards the positive x axis, so the pregrasp_pose has a negative x delta over the grasp pose.
# Calculate pregrasp pose accounting for object size and gripper size (0.1 m).
pregrasp_pose = {'position': grasp_pose['position'] + [-top_drawer_handle_size[0]/2 - 0.1, 0, 0], 'orientation': grasp_orientation_quaternion}
# Open the gripper according to the z axis size of the handle plus a buffer of 2 cm.
robot_api.gripper_open((top_drawer_handle_size[2] + 0.03)/0.1)
robot_api.follow_arm_trajectory([pregrasp_pose, grasp_pose], allow_base_moves=True)
robot_api.gripper_close()
# handle pose is not valid anymore since we might have moved the base so we use the current arm pose.
current_arm_pose = robot_api.get_arm_pose()
# The handle is at the front of the cabinet. Opening means moving the object (handle) away from their reference (cabinet) along the x axis.
# The cabinet is at x_cabinet = 1.12 and the handle is at x_handle = 0.69.
# When the object coordinate is lower than its reference, to increase distance you need to substract a delta and to decrease distance  you need to add a delta.
# When the object coordinate is greater than its reference, to increase distance you need to add a delta and to decrease distance  you need to substract a delta.
# Since x_handle is lower than x_cabinet, it means the object coordinate is lower than its reference, so to increase the distance between the two we substract a positive delta to x_handle.
open_drawer_pose = {'position': current_arm_pose['position'] + [-0.25, 0, 0], 'orientation': current_arm_pose['orientation']}
# Allow for base moves for after grasp moves since arm could be in a difficult position to execute the open.
robot_api.follow_arm_trajectory([open_drawer_pose], allow_base_moves=True)

Open Drawer

Caption: Each visualization shows the RT-Trajectory policy rollout (left), and the trajectory sketch overlaid on the rollout (right).

Pick

Caption: Each visualization shows the RT-Trajectory policy rollout (left), and the trajectory sketch overlaid on the rollout (right).

For Inference: Image Generation Models

In our work, we use a PaLM-E-style model that generates vector-quantized ViT-VQGAN tokens representing the trajectory image. Once detokenized, the resulting image can be used to condition RT-Trajectory.

Here we showcase some qualitative results. Each visualization shows the RT-Trajectory policy rollout (left) and the trajectory sketch overlaid on the rollout (right). The corresponding language instructions are shown below the videos.

Case Studies

Retry Behaviors

Compared to non-learning methods, RT-Trajectory is able to recover from execution failures. Retry behavior emerged when RT-Trajectory was opening a drawer given a trajectory sketch generated via Code as Policies: after a failed attempt to open the drawer by its handle, the robot retried by grasping the edge of the drawer and managed to pull it open.

Below we show two example rollouts that illustrate retry behaviors. Each visualization shows the RT-Trajectory policy rollout (left), and the trajectory sketch overlaid on the rollout (right).

Height-aware Behaviors

2D trajectories (without depth information) are visually ambiguous for distinguishing whether the robot should move its arm to a deeper or higher location. We find that the height-aware color grading used in RT-Trajectory (2.5D) can effectively reduce such ambiguity.

RT-Trajectory (2D) incorrectly moves the object to a deeper position due to the ambiguity of a 2D trajectory.
RT-Trajectory (2.5D) correctly lifts the object.

More Qualitative Results

As a qualitative case study, we evaluate RT-Trajectory in 4 realistic novel rooms across 2 new buildings, which contain entirely new backgrounds, lighting conditions, objects, layouts, and furniture geometries. With little to moderate engineering of the trajectory prompts, we find that RT-Trajectory is able to successfully perform a variety of tasks requiring novel motion generalization and robustness to out-of-distribution visual shifts.

Citation

@misc{gu2023rttrajectory,
      title={RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches}, 
      author={Jiayuan Gu and Sean Kirmani and Paul Wohlhart and Yao Lu and Montserrat Gonzalez Arenas and Kanishka Rao and Wenhao Yu and Chuyuan Fu and Keerthana Gopalakrishnan and Zhuo Xu and Priya Sundaresan and Peng Xu and Hao Su and Karol Hausman and Chelsea Finn and Quan Vuong and Ted Xiao},
      year={2023},
      eprint={2311.01977},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}