Language-conditioned policies like RT-1 struggle to generalize to new scenarios that require extrapolation of language specifications. To this end, we propose RT-Trajectory, a robotic control policy conditioned on trajectory sketches: a novel conditioning method which is practical, easy to specify, and allows effective generalization to novel tasks beyond the training data.
Generalization remains one of the most important desiderata for robust robot learning systems. While recently proposed approaches show promise in generalization to novel objects, semantic concepts, or visual distribution shifts, generalization to new tasks remains challenging. For example, a language-conditioned policy trained on pick-and-place tasks will not be able to generalize to a folding task, even if the arm trajectory of folding is similar to that of pick-and-place. Our key insight is that this kind of generalization becomes feasible if we represent the task through rough trajectory sketches. We propose a policy conditioning method using such rough trajectory sketches, which we call RT-Trajectory, that is practical, easy to specify, and allows the policy to effectively perform novel tasks that would otherwise be challenging. We find that trajectory sketches strike a balance between being detailed enough to express low-level, motion-centric guidance and coarse enough to allow the learned policy to interpret the sketch in the context of situational visual observations. In addition, we show how trajectory sketches provide a useful interface for communicating with robotic policies: they can be specified through simple human inputs like drawings or videos, or through automated methods such as modern image- or waypoint-generating models, e.g., VLMs or LLMs. We evaluate RT-Trajectory at scale on a variety of real-world robotic tasks, and find that RT-Trajectory is able to perform a wider range of tasks compared to language-conditioned and goal-conditioned policies when provided with the same training data.
We propose RT-Trajectory, which utilizes coarse trajectory sketches for policy conditioning. We train on hindsight trajectory sketches (top left) and evaluate on inference trajectories (bottom left) produced via Trajectory Drawings, Human Videos, or Foundation Models. These trajectory sketches are used as task specification for an RT-1 policy backbone (right).
For each episode in the dataset of demonstrations, we extract a 2D trajectory of robot end-effector center points. Concretely, given the proprioceptive information recorded in the episode, we obtain the 3D position of the robot end-effector center, defined in the robot base frame, at each time step, and project it into camera space using the known camera extrinsic and intrinsic parameters. Given a 2D trajectory (a sequence of pixel positions), we draw a curve on a blank image by connecting the 2D end-effector center points of adjacent time steps with straight lines. We draw green (or blue) circles at the 2D tool center points of the key time steps at which the gripper closes (or opens).
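As a rough illustration of this rendering step, a minimal sketch might look like the following (the function name, the curve color, and the gripper-event extraction are our own assumptions, not the exact implementation used in the paper):

import cv2
import numpy as np

def render_trajectory_sketch(ee_positions_3d, gripper_closed, K, T_cam_base, image_hw):
    # ee_positions_3d: (T, 3) end-effector centers in the robot base frame.
    # gripper_closed:  (T,) booleans whose transitions mark gripper close/open events.
    # K:               (3, 3) camera intrinsics; T_cam_base: (4, 4) base-to-camera extrinsics.
    h, w = image_hw
    sketch = np.zeros((h, w, 3), dtype=np.uint8)  # blank canvas
    # Project the 3D end-effector centers into pixel coordinates.
    pts_h = np.concatenate([ee_positions_3d, np.ones((len(ee_positions_3d), 1))], axis=1)
    pts_cam = (T_cam_base @ pts_h.T).T[:, :3]
    uv = (K @ pts_cam.T).T
    uv = np.round(uv[:, :2] / uv[:, 2:3]).astype(int)
    # Connect end-effector centers of adjacent time steps with straight lines.
    for p0, p1 in zip(uv[:-1], uv[1:]):
        cv2.line(sketch, (int(p0[0]), int(p0[1])), (int(p1[0]), int(p1[1])), (255, 255, 255), 2)
    # Mark key time steps: green circles where the gripper closes, blue where it opens (BGR colors).
    for t in range(1, len(gripper_closed)):
        if gripper_closed[t] and not gripper_closed[t - 1]:
            cv2.circle(sketch, (int(uv[t][0]), int(uv[t][1])), 6, (0, 255, 0), -1)
        elif not gripper_closed[t] and gripper_closed[t - 1]:
            cv2.circle(sketch, (int(uv[t][0]), int(uv[t][1])), 6, (255, 0, 0), -1)
    return sketch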
Human-drawn sketches are an intuitive and practical way to generate trajectory sketches at inference time. To scalably produce these sketches during evaluations, we design a simple GUI for users to draw trajectory sketches given the robot's initial camera image.
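A minimal version of such an interface can be built with off-the-shelf tooling; the sketch below (the image path is illustrative, and the paper's actual GUI may differ) lets a user click a sequence of waypoints over the initial camera image and connects them into a trajectory sketch:

import cv2
import matplotlib.pyplot as plt
import numpy as np

# Load and display the robot's initial camera image (path is illustrative).
image = cv2.cvtColor(cv2.imread("initial_camera_image.png"), cv2.COLOR_BGR2RGB)
plt.imshow(image)
plt.title("Click waypoints in order; press Enter when done")
clicks = plt.ginput(n=-1, timeout=0)  # list of (x, y) pixel coordinates
plt.close()

# Connect the clicked waypoints into a sketch on a blank canvas.
sketch = np.zeros_like(image)
points = np.round(np.array(clicks)).astype(int)
for p0, p1 in zip(points[:-1], points[1:]):
    cv2.line(sketch, (int(p0[0]), int(p0[1])), (int(p1[0]), int(p1[1])), (255, 255, 255), 2)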
We propose 7 new skills for evaluation, involving unseen objects, unseen manipulation workspaces, and novel motions, to study whether RT-Trajectory can generalize to tasks beyond those contained in the training dataset. The skill Fold Towel is shown in the Human Demonstration section.
Language-conditioned policies struggle to generalize to new tasks with semantically unseen language instructions, even if the motions required to achieve these tasks were seen during training. RT-1-goal shows better generalization than its language-conditioned counterparts. However, goal conditioning is much harder to acquire than trajectory sketches at inference time in new scenes, and it is sensitive to task-irrelevant factors (e.g., backgrounds). RT-Trajectory (2.5D) outperforms RT-Trajectory (2D) on tasks where height information helps reduce ambiguity. For example, with 2D trajectories alone, it is difficult for RT-Trajectory (2D) to infer correct picking heights, which is critical for the Pick from Chair evaluations.
We also study first-person human single-hand demonstration videos. We estimate the trajectory of human hand poses from the video and convert it into a trajectory of robot tool poses, which can then be used to generate a trajectory sketch. We collect 18 and 4 first-person human demonstration videos with hand-object interaction for Pick and Fold Towel, respectively.
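As an illustrative sketch only (the actual hand-pose pipeline used in the paper may differ), one could approximate the tool center from per-frame 2D hand keypoints produced by an off-the-shelf hand pose estimator, and treat a small thumb-index fingertip distance as a gripper-closing event:

import numpy as np

def hand_keypoints_to_tool_trajectory(keypoints_2d, close_threshold_px=20):
    # keypoints_2d: (T, 21, 2) per-frame 2D hand keypoints, assuming the MediaPipe
    # landmark ordering where index 4 is the thumb tip and index 8 is the index fingertip.
    thumb_tip = keypoints_2d[:, 4]
    index_tip = keypoints_2d[:, 8]
    # Use the midpoint of the two fingertips as a proxy for the robot tool center.
    tool_centers = 0.5 * (thumb_tip + index_tip)
    # Treat a small fingertip distance (a pinch) as the gripper being closed.
    pinch_dist = np.linalg.norm(thumb_tip - index_tip, axis=-1)
    gripper_closed = pinch_dist < close_threshold_px
    return tool_centers, gripper_closed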
Caption: Each visualization shows the human video demonstration (left), the RT-Trajectory policy rollout (middle), and the trajectory sketch overlaid on the rollout (right).
Caption: Each visualization shows the human demonstration (left), the RT-Trajectory policy rollout (middle), and the trajectory sketch overlaid on the rollout (right).
We prompt an LLM to write code to generate trajectories given the task instructions and object labels for Pick and Open Drawer. After executing the code written by the LLM, we get a sequence of target robot waypoints which can then be processed into a trajectory sketch.
For instance, consider the Open Drawer task. We first detect the objects of interest (i.e., a drawer handle) with a vision-language model (VLM). Then, we prompt GPT-4 to write code to generate trajectories given the task instruction ("open the top drawer") as well as the detected object poses and sizes.
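For context, the generated program below assumes that the VLM detections are exposed as a dictionary keyed by object label; the exact schema shown here is illustrative, but the field names mirror those referenced in the generated code (the x and z values match the numbers quoted in its comments, while the remaining values are placeholders):

objects = {
    'drawer handle': [
        {'centroid_pose': {'position': [0.69, 0.0, 0.63],   # top handle
                           'orientation': [0.0, 0.0, 0.0, 1.0]},
         'size': [0.00, 0.12, 0.06]},
        {'centroid_pose': {'position': [0.69, 0.0, 0.12],   # bottom handle
                           'orientation': [0.0, 0.0, 0.0, 1.0]},
         'size': [0.00, 0.12, 0.06]},
    ],
}

GPT-4 then writes the following program against this structure: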
# There are two drawer handles, and the top-most should be compared along the z-axis.
# First handle has z-value of 0.63, second has z-value of 0.12.
# The positive z direction corresponds to up, so the first handle is the top-most.
top_drawer_handle_position = objects['drawer handle'][0]['centroid_pose']['position']
top_drawer_handle_orientation = objects['drawer handle'][0]['centroid_pose']['orientation']
top_drawer_handle_size = objects['drawer handle'][0]['size']
# The handle has a bounding box size of [0.00, 0.12, 0.06] in meters and is located on the front x-y plane of the cabinet.
# The gripper has a max span of 10 cm, and the size of the handle along the y-axis is 0.12 m, so it can only grasp the handle with fingers aligned along the x-axis or the z-axis.
# A bottom grasp is ruled out since the robot would collide with the cabinet.
# A back grasp is unfeasible since the robot is in front of the cabinet and cannot go around it to make a back grasp.
# A top grasp with fingers aligned with the x-axis or z-axis is feasible.
# A front grasp with fingers aligned with the z-axis and a side grasp with fingers aligned with the x-axis are feasible too.
# We choose the front grasp with fingers aligned with the z-axis for simplicity.
# Get the quaternion corresponding to [-90, 0, -90] roll, pitch, and yaw for a front grasp with fingers aligned with the z-axis.
grasp_orientation_quaternion = robot_api.orientation_quaternion_from_euler(-90, 0, -90)
# Calculate grasp position so object ends within gripper fingers.
grasp_pose = {'position': top_drawer_handle_position, 'orientation': grasp_orientation_quaternion}
# The pregrasp pose is the pose right before the grasp.
# Since this is a front grasp, this means the gripper is pointing towards the positive x axis, so the pregrasp_pose has a negative x delta over the grasp pose.
# Calculate pregrasp pose accounting for object size and gripper size (0.1 m).
pregrasp_pose = {'position': grasp_pose['position'] + [-top_drawer_handle_size[0]/2 - 0.1, 0, 0], 'orientation': grasp_orientation_quaternion}
# Open the gripper according to the z-axis size of the handle plus a buffer of 3 cm.
robot_api.gripper_open((top_drawer_handle_size[2] + 0.03)/0.1)
robot_api.follow_arm_trajectory([pregrasp_pose, grasp_pose], allow_base_moves=True)
robot_api.gripper_close()
# The handle pose is no longer valid since we might have moved the base, so we use the current arm pose.
current_arm_pose = robot_api.get_arm_pose()
# The handle is at the front of the cabinet. Opening means moving the object (handle) away from their reference (cabinet) along the x axis.
# The cabinet is at x_cabinet = 1.12 and the handle is at x_handle = 0.69.
# When the object coordinate is lower than its reference, to increase distance you need to subtract a delta and to decrease distance you need to add a delta.
# When the object coordinate is greater than its reference, to increase distance you need to add a delta and to decrease distance you need to subtract a delta.
# Since x_handle is lower than x_cabinet, the object coordinate is lower than its reference, so to increase the distance between the two we subtract a positive delta from x_handle.
open_drawer_pose = {'position': current_arm_pose['position'] + [-0.25, 0, 0], 'orientation': current_arm_pose['orientation']}
# Allow base moves for the post-grasp motion since the arm could be in a difficult position to execute the opening.
robot_api.follow_arm_trajectory([open_drawer_pose], allow_base_moves=True)
Caption: Each visualization shows the RT-Trajectory policy rollout (left), and the trajectory sketch overlaid on the rollout (right).
Caption: Each visualization shows the RT-Trajectory policy rollout (left), and the trajectory sketch overlaid on the rollout (right).
In our work, we use a PaLM-E-style model that generates vector-quantized tokens from a ViT-VQGAN to represent the trajectory image. Once detokenized, the resulting image can be used to condition RT-Trajectory.
Compared to non-learning methods, RT-Trajectory is able to recover from execution failures. This retry behavior emerged when RT-Trajectory was opening the drawer given a trajectory sketch generated by Code as Policies: after a failed attempt to open the drawer by its handle, the robot retried by grasping the edge of the drawer and managed to pull it open.
2D trajectories (without depth information) are visually ambiguous about whether the robot should move its arm to a farther or a higher location. We find that the height-aware color grading used in RT-Trajectory (2.5D) effectively reduces such ambiguity.
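One way to realize such height-aware color grading is sketched below; the specific red-to-blue mapping is an illustrative choice rather than the exact color scheme used in the paper:

import cv2
import numpy as np

def draw_height_colored_trajectory(sketch, uv, heights_z):
    # sketch:    (H, W, 3) uint8 canvas to draw on.
    # uv:        (T, 2) pixel coordinates of the projected end-effector centers.
    # heights_z: (T,) end-effector heights in the robot base frame (meters).
    z = np.asarray(heights_z, dtype=float)
    z_norm = (z - z.min()) / max(z.max() - z.min(), 1e-6)  # normalize heights to [0, 1]
    for t in range(len(uv) - 1):
        # Higher points are drawn redder, lower points bluer (BGR color order).
        color = (int(255 * (1.0 - z_norm[t])), 0, int(255 * z_norm[t]))
        p0 = (int(uv[t][0]), int(uv[t][1]))
        p1 = (int(uv[t + 1][0]), int(uv[t + 1][1]))
        cv2.line(sketch, p0, p1, color, 2)
    return sketch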
As a qualitative case study, we evaluate RT-Trajectory in 4 realistic novel rooms across 2 new buildings, which contain entirely new backgrounds, lighting conditions, objects, layouts, and furniture geometries. With little to moderate trajectory prompt engineering, we find that RT-Trajectory is able to successfully perform a variety of tasks requiring novel motion generalization and robustness to out-of-distribution visual conditions.
@misc{gu2023rttrajectory,
title={RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches},
author={Jiayuan Gu and Sean Kirmani and Paul Wohlhart and Yao Lu and Montserrat Gonzalez Arenas and Kanishka Rao and Wenhao Yu and Chuyuan Fu and Keerthana Gopalakrishnan and Zhuo Xu and Priya Sundaresan and Peng Xu and Hao Su and Karol Hausman and Chelsea Finn and Quan Vuong and Ted Xiao},
year={2023},
eprint={2311.01977},
archivePrefix={arXiv},
primaryClass={cs.RO}
}