6.2. Prerequisites

6.2.1. Robot Platform and Controller

We have run our system on several humanoid robots over the years, including the Boston Dynamics DARPA Robotics Challenge Finals-era Atlas, NASA’s Valkyrie, IHMC and Boardwalk Robotics’ Nadia, Unitree’s H1-2, and IHMC’s Alex. Our system is centered on the humanoid form factor, with footstep, walk, arm, leg, spine, and neck actions. It also assumes the presence of a whole body controller that can achieve those actions. Ideally, the whole body controller can triage and queue asynchronous commands for those actions, so that motions of different parts of the body can be tracked concurrently even when they start and end at different times. For communication, we have been using a ROS 2 compatible DDS implementation, but other communication protocols could be substituted with some engineering work. The controller could also be operated through a programmatic API, which would avoid some of the interprocess communication.
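As a rough illustration of what queueing asynchronous commands per body part can look like, here is a minimal Python sketch. It is not our controller's implementation: the `Command` and `CommandQueues` names, the body part strings, and the time-based scheduling are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Command:
    """A hypothetical motion command for one body part."""
    body_part: str     # e.g. "left_arm", "neck" (illustrative names)
    name: str          # label for the motion
    start_time: float  # seconds
    duration: float    # seconds

    @property
    def end_time(self) -> float:
        return self.start_time + self.duration


class CommandQueues:
    """Keeps one FIFO queue per body part so parts are tracked independently."""

    def __init__(self):
        self._queues = {}

    def submit(self, command: Command) -> None:
        # Commands for the same body part run in submission order.
        self._queues.setdefault(command.body_part, deque()).append(command)

    def active(self, now: float) -> dict:
        """Return {body_part: command name} for motions being tracked at `now`."""
        result = {}
        for part, queue in self._queues.items():
            # Discard commands that have already finished.
            while queue and queue[0].end_time <= now:
                queue.popleft()
            if queue and queue[0].start_time <= now:
                result[part] = queue[0].name
        return result


queues = CommandQueues()
queues.submit(Command("left_arm", "reach", 0.0, 2.0))
queues.submit(Command("neck", "look", 1.0, 0.5))
queues.submit(Command("left_arm", "retract", 2.0, 1.0))
```

In this sketch the neck motion starts and ends while the arm reach is still in progress, and the arm retract begins only after the reach has finished.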

For the best performance, the robot’s controller should provide push-tolerant standing and walking. It is even better if the robot can walk with its arms held up in front of its body. That capability is sufficient for basic multi-station loco-manipulation tasks and door traversals.

6.2.2. Motion and Reachability

We think longer arms generally work better than shorter ones. A human-proportioned robot arm yields significantly less reachability than a human arm because humanoid robots typically lack clavicle joints, which extend a human’s arm reach. We tend to think that robot arms long enough for the hands to touch the knees when hanging are a good balance. DRC Atlas and Nadia had these longer-arm proportions, but Alex did not, and we felt that this limited reachability and increased task complexity for the pull door behaviors and when reaching across tables.

Similarly, a high-degree-of-freedom spine can help simplify behaviors. For example, being able to yaw 90 degrees to the left and right allows for faster scanning behaviors, increases arm reachability, and avoids unnecessary footsteps in tight spaces.

6.2.3. Perception and Teleoperation

On the perception side, the most significant requirements are a color image, a high-accuracy depth image, and the availability of high-quality YOLO models for any objects you want to interact with. Many sensors on the market satisfy the imaging requirements. We have tried the ZED X Mini and the RealSense D457 [1], and we think either would work.

Our preference, and the one we currently use, is the ZED X Mini because its stereo RGB camera pair is useful for direct VR teleoperation. To author complicated behaviors, it helps to have an idea ahead of time of how the robot will accomplish the task. With humanoid robots, our natural intuition helps because the robot shares our form factor, but the ways in which the robot’s capabilities differ from our own can be unintuitive. This is why teleoperating a task can be a good way to quickly experiment with task strategies that work within the bounds of the robot’s capabilities, and why we chose the ZED X Mini. The ZED X Mini also has a human-like inter-pupillary distance, which makes for a natural robot embodiment by a VR teleoperator.

The other major thing you will need is a library of YOLO models for everything you need to interact with. We have been using YOLOv8 because it provides both detection and segmentation for objects of interest. Arghya Chatterjee crafted the models we used, but it took him many weeks of hand labeling and a year or so of experimentation to achieve high-confidence detections. It might be worth finding a vendor that provides professionally trained models. Typically, the data needed is around five videos of the task, each recorded on a different day and at a different time, in both light and dark conditions, in which the camera orbits the object and moves close to and far from it.

6.2.4. Hands

For robot hands, we recommend a 5-finger anthropomorphic hand. We had a great experience with the PSYONIC Ability Hand [23]. One major advantage of a 5-finger anthropomorphic hand is that, like the rest of the robot’s body, it mirrors the human form, which allows for an increased level of intuition in task planning. For example, a good start for planning how to get the robot to do a task is to do it yourself and observe how you naturally do it.

Furthermore, this extends to teleoperation, where you can constrain the set of possible solutions to what is achievable by the robot while still leaning on natural intuition. We use the Valve Index Controller’s [24] API to teleoperate the Ability Hands, and it works well. In fact, the OpenVR software maps the Valve Index Controller’s finger estimation to six degrees of freedom, which happens to match the six degrees of freedom of the Ability Hand.
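To make the six-to-six mapping concrete, here is a minimal Python sketch of how six normalized finger values could be turned into six joint targets. Everything in it is an assumption for illustration: the joint names, their ordering, the angle limits, and the idea that the inputs arrive as values in [0, 1]. It does not reproduce the OpenVR or Ability Hand APIs.

```python
# Hypothetical joint names and angle limits in degrees. These are
# placeholders for illustration, not values from any hand's datasheet.
JOINT_LIMITS_DEG = {
    "index_flexion": (0.0, 90.0),
    "middle_flexion": (0.0, 90.0),
    "ring_flexion": (0.0, 90.0),
    "pinky_flexion": (0.0, 90.0),
    "thumb_flexion": (0.0, 90.0),
    "thumb_rotation": (0.0, 100.0),
}


def finger_values_to_joint_targets(values):
    """Map six normalized values in [0, 1] to joint angles in degrees.

    `values` is assumed to be ordered like JOINT_LIMITS_DEG. Inputs
    outside [0, 1] are clamped so targets stay within the limits.
    """
    if len(values) != len(JOINT_LIMITS_DEG):
        raise ValueError("expected one value per degree of freedom")
    targets = {}
    for (joint, (low, high)), value in zip(JOINT_LIMITS_DEG.items(), values):
        clamped = min(1.0, max(0.0, value))
        # Linear interpolation between the lower and upper joint limit.
        targets[joint] = low + clamped * (high - low)
    return targets


half_closed = finger_values_to_joint_targets([0.5] * 6)
```

Because the input and output both have six degrees of freedom, the mapping is one-to-one and needs no retargeting beyond scaling and clamping.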

6.2.5. Compute

You will also need NVIDIA GPUs if you want to use the ZED camera or CUDA kernels. The ZED SDK requires an NVIDIA GPU for its most essential features, such as its neural-assisted stereo depth estimation. Additionally, we use CUDA heavily throughout the system, including for counting points and averaging color inside virtual shapes in the 3D point cloud, faster OpenCV functions, processing the output tensor of YOLO, non-maximum suppression of YOLO bounding boxes, and extraction of 3D points from the YOLO segmentation masks.
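For readers unfamiliar with two of the operations listed above, the following is a plain Python, CPU-only sketch of counting points and averaging color inside an axis-aligned box, and of greedy non-maximum suppression. It is meant only to show the logic; it is not our CUDA code, and the function names, the box-shaped virtual volume, the `(x1, y1, x2, y2)` box format, and the 0.5 IoU threshold are assumptions.

```python
def count_and_average_color(points, colors, box_min, box_max):
    """Count points inside an axis-aligned box and average their RGB color."""
    count = 0
    sums = [0.0, 0.0, 0.0]
    for point, color in zip(points, colors):
        inside = all(lo <= v <= hi for v, lo, hi in zip(point, box_min, box_max))
        if inside:
            count += 1
            for i in range(3):
                sums[i] += color[i]
    mean = tuple(s / count for s in sums) if count else None
    return count, mean


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = width * height
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept  # indices into `boxes`, in descending score order


points = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (2.0, 2.0, 2.0)]
colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
count, mean = count_and_average_color(points, colors, (-1, -1, -1), (1, 1, 1))

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = non_max_suppression(boxes, [0.9, 0.8, 0.7])
```

Both loops are independent per point or per box pair, which is what makes this kind of work a natural fit for GPU kernels.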

We use a Jetson AGX Orin onboard the robot and a desktop computer with an NVIDIA Turing GPU as the operator computer. We run Linux on both except when doing VR teleoperation, when we use Windows. We have had success with VR on Linux in simulation tests, but not on the real robot. For VR headsets, we have been using the Valve Index and the HTC Vive Focus 3.

In the next several sections, we will walk through some token behavior authoring examples, explaining how system components work as we encounter them. This is a guide to how the system works, but it is meant for reading and understanding more than for being followed literally as a tutorial. We will first use simulation examples to illustrate and explain some basic mechanisms without noise. Then, we will go through a real-robot authoring session: our 32-minute authoring session to get the Unitree H1-2 opening a door repeatedly.

References cited on this page

[1] Intel Corporation, “Intel RealSense Depth Camera D457,” 2022. [Online]. Available: https://www.intelrealsense.com/depth-camera-d457/

[23] PSYONIC, “Ability hand.” [Online]. Available: https://www.psyonic.io/ability-hand

[24] Valve, “Valve index controllers.” [Online]. Available: https://store.steampowered.com/app/1059550/Valve_Index_Controllers/