Abstract
A System for Fast, Resilient, and Adaptable Loco-Manipulation Behaviors on Humanoid Robots
Author: Duncan William Calvert
Department: Department of Intelligent Systems and Robotics
College: College of Science and Engineering
Humanoid robots could take on physically demanding, hazardous, and repetitive work in spaces built for humans, including shipyards, energy infrastructure, construction sites, and factories. However, a useful robot for these spaces must coordinate locomotion, whole-body motion, perception, contact, and operator supervision. In this thesis we present a behavior authoring and runtime system that provides this coordination. Furthermore, we argue that behavior coordination architecture is itself a measurable design variable affecting the ability to readily achieve these tasks in the real world. Specifically, we argue that architecture affects speed, robustness across task variation, and adaptation time when modifying a behavior for a new task.
We present a robot-local editable behavior architecture in which trees of action primitives execute with layered concurrency on the robot, while an operator interface remains continuously synchronized for runtime authoring, monitoring, and repair. Action primitives are built on top of a whole-body controller that supports manipulation while walking. Perception actions are authored inside the behavior so that detection and scene reasoning are scheduled with the rest of the task rather than configured globally. The architecture supports rapid authoring of new behaviors and reuse of existing subtrees across related tasks.
Door traversal is used as the main benchmark task because it exposes the full coordination problem in a compact and repeatable setting, requiring approach walking, body placement, mechanism perception, grasp selection, handle actuation, interaction with a moving obstacle, and a transition back to locomotion. The behavior library developed during this work covers more than twenty real-robot task variants, including push and pull doors with knob, push-bar, and lever-handle mechanisms, multi-step exploration sequences, obstacle clearing and breaching, and reactive table-to-table manipulation tasks. Versions of the same behavior system were run on Nadia, Boston Dynamics DRC Atlas, IHMC’s fully electric Alex humanoid robot, and Unitree H1-2, which shows that the architecture is not tied to one platform.
We evaluate the system on speed, reliability, variation coverage, and authoring effort. A push-bar door traversal in 14 seconds and a reactive pick-and-place demonstration of 6 balls in 42 seconds indicate the speeds the system can reach. For reliability, we recorded 11 consecutive successful push approach-and-opening repetitions and 12 consecutive successful pull approach-and-opening repetitions on Alex, and 32 consecutive successful right pull openings on Unitree H1-2. Measured authoring sessions show that new door behaviors can be created from an empty tree to the start of autonomous repeated-run testing in 31 minutes, and that a mirrored real-door adaptation reaches first success in under two hours, both without code recompilation or redeployment. We compare against prior IHMC baselines and selected door systems from the literature across speed, reliability, variation coverage, and authoring effort, with execution boundary and sensing assumptions reported alongside each comparison. Videos of the work presented in this dissertation are available online at https://www.youtube.com/playlist?list=PLJK5CTyotYqsfgfnXb-09YNFeBose6uEY.
Metrics
In this chapter, we’ll discuss some of our dream requirements for a behavior architecture, in order to set up our goals and the desired properties of a particular implementation. In this thesis we focus on behavior systems that require human expertise to dream up, create, adapt, and modify. It is conceivable that a generally intelligent AI could take over this role in the future, but we do not consider that here. This list of characteristics therefore applies to Operator-Robot teams made up of any number of humans and robots. We leave the Operator-Robot ratio open-ended. Potentially, many humans could be managing one robot or one human could be managing a fleet of robots.
Capability
The goal of a behavior system should be to support as many tasks as possible so that the robot achieves maximum utility. The whole point of a robot behavior system, as we are concerned with it in this thesis, is to fill in for the dull, dirty, and dangerous work humans do. We define capability as how many different tasks and their variations can be performed successfully. For example, a system that only supports door traversals is not as capable as one that supports exploring buildings.
Feasibility
Any implementation of a behavior system must be feasible given real-world constraints. It is desirable to not require overly expensive computers or ones that are not readily available. If the robot needs to function autonomously in comms-degraded scenarios, the behavior should not rely on external comms or compute to operate. The behaviors cannot require robot actuation hardware that is not reliable, readily available, or that does not exist. There should be no jetpack flying requirement for behaviors if robots with jetpacks and controllers for them do not exist or only exist as prototypes. The behaviors should not rely on control software that does not exist. For example, modern whole-body controllers do not do much in the way of planning out how to achieve complicated bracing positions and techniques to avoid falling in dynamic scenarios, so the behavior system should support getting the robot into positions and authoring at a primitive action level that allows the operator to reason about what the controller can handle while authoring the behavior actions.
Speed
Behaviors should be watchable at 1x speed. Computational components of the system need to run within their allotted time boundaries, not causing any pauses. The robot hardware and the whole-body controller that the behavior system relies on should be capable of decently fast motions. We are not talking about super fast speeds here, just approaching casual human speed in performing day-to-day chore-like tasks. We want robots to be a drop-in for human work without an immediately huge tradeoff or question mark on speed.
Parallelizability
The system should support moving multiple parts of the body at once and the ability to walk while doing that. This is particularly useful when doing manipulation. Before performing many manipulations, the robot will need to prepare the grasping arm in a pre-grasp-ready pose while putting the other arm in a collision-avoidance pose. Having to get into these ready poses sequentially would cause an unnecessary delay. This contributes to the speed metric, but it is also important for capability. For example, for traversing spring-loaded doors, the robot must walk while keeping its arm out in the correct locations to prevent the door from closing on the robot.
Reliability
You should be able to execute tasks repeatably without failures. More formally, for a given task, given similar environmental conditions, the robot should consistently perform the task. If it succeeds, it should consistently succeed (and if it fails, it should consistently fail). In other words, you should be able to count on the robot to perform a task repeatedly without random failures that have little to do with environmental variance.
Robustness
The robot and behavior system should be robust to environmental disturbances. These can be both physical and visual. For example, slight pushes to the robot, unmodeled friction and inertias in manipulated items, and changes in lighting due to time of day should not cause task failures. Robustness mechanisms should be present via the whole-body controller or in the behavior system to address these. Any vision models should be trained using data from varied times of day and lighting conditions to prepare them for round-the-clock work.
Resilience
The behavior system should support resilience to changing task conditions and attempt to recover or work around gaps in reliability and robustness. We define resilience to mean being responsive and creative when facing task failures. For example, when the robot is trying to turn a door handle, perhaps a human is present and trying to test the reactivity of the system. In this case, the behavior system should be able to identify that the task is not proceeding nominally and enact some retry strategy. Retry strategies can include simply retrying the action sequence, mutating the pose-grasp sequence, or aborting the mission entirely and doing something else. Resilience ultimately means surviving day-to-day unexpected events and failures. Attaining resilience is a long-tail robotics problem, but the prior examples are good places to start.
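The three retry strategies named above can be sketched as a small escalation loop. This is only an illustration under stated assumptions, not the implementation described in this thesis: the function `run_with_recovery`, the callbacks `attempt` and `mutate`, and the retry limits are hypothetical names and values chosen for the example.

```python
from typing import Callable, Optional, TypeVar

Plan = TypeVar("Plan")


def run_with_recovery(
    plan: Plan,
    attempt: Callable[[Plan], bool],
    mutate: Callable[[Plan], Optional[Plan]],
    retries_per_plan: int = 2,
    max_mutations: int = 3,
) -> str:
    """Escalate from plain retries, to mutated plans, to aborting.

    All names are hypothetical. `attempt` returns True on success and
    `mutate` returns a varied plan, or None when no variations remain.
    """
    current: Optional[Plan] = plan
    for mutation_count in range(max_mutations + 1):
        if current is None:
            break
        # Strategy 1: simply retry the same action sequence.
        for _ in range(retries_per_plan):
            if attempt(current):
                if mutation_count == 0:
                    return "succeeded"
                return "succeeded_after_mutation"
        # Strategy 2: mutate the plan (for example, a different grasp pose).
        current = mutate(current)
    # Strategy 3: abort this task so the caller can do something else.
    return "aborted"
```

The returned label lets the surrounding behavior decide what to do next, for example logging that a mutated grasp was needed or switching to a different task after an abort.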
Independence from External Systems
Ideally, the robot can execute behaviors without any connection to the outside world, given the behaviors have been authored and set up ahead of time. This mirrors animals in nature and supports a level of robustness by removing an unintuitive dependence on network communication. It also allows the robot to operate in more environments, including inside buildings with thick concrete walls and rural areas.
Some behavior systems rely on external perception. It is desirable to perceive the world only via the robot and not be dependent on motion capture systems or fiducial markers, which are implements of a laboratory setup. We want the robots using our behavior system to thrive beyond the lab environment and provide useful service in the real world. It is also desirable to have humanoid robots be a drop-in replacement for human workers without having to make robot-specific adjustments to the environment, such as placing fiducial markers. It would be better if robots were to read the same signage and maps as humans.
Dependence on Only Passive Color Vision
By using only passive color vision, the robot mirrors human nature and is more robust to varied lighting conditions. For example, structured light projection sensors can have degraded performance outdoors and in the presence of certain frequencies of light. Additionally, it can be more intuitive and understandable when the robot’s vision modality is similar to human vision. Since we are building humanoid robots to fill in for humans, it could be a more surefire drop-in replacement by matching the mode of vision.
Adaptability of the Operator-Robot Team
Given the near infinite world of tasks robots could help us with, we want a behavior system that can support creating new and adapting existing behaviors to tackle them. Adaptability means being able to survive in a changing environment and, for robots, this means you must be able to readily adapt robot behaviors to changing needs. One of the selling points of humanoid robots is their generality and similarity to humans, which means one application of them is to fill in for human work. Given the adaptiveness of humans and the existence and competitiveness of purpose-built machinery, the value-add for robots must exist in the realm of being an adaptable generalized form. Therefore, it is desirable to be able to create and modify behaviors to tackle various dull, dirty, and dangerous work tasks in a quick time frame.
The following three characteristics, as defined by Coactive Design, are required to build adaptability into the system.
Observability
Observability means knowing the current state of the system in order to understand what is going on. For a robot, there is a lot of information to take in at any particular moment and there are different levels of granularity in doing so. Knowing the current state is required for a human operator to reason about the behavior to modify or adapt it. It is also required to monitor what the robot is accomplishing and determine if it needs help. Some of the most important forms of observability are:
- Seeing the current configuration of the robot’s body and hands, visual elements that indicate current forces on the environment, robot hardware status, motor temperatures, and joint faults.
- Seeing what the robot sees such as the current robot view video stream(s) and current semantic object identification.
- Getting a feel for the robot’s immediate environmental surroundings and how the robot is situated in it. For example, this can be done via a colored depth point cloud in the 3D view with the robot configuration. If the robot doesn’t have 360-degree vision, mapping may be required or the robot could move the head around to rescan.
- Knowing the current state of the behavior system and whole-body controller. For example, what state is the behavior in? Is anything currently executing? Has anything failed? What have we done in the recent past and was it successful?
- Knowing the robot’s current model of the environment. Which objects does it know about? Where does it think they are in 3D? Does the robot know where it is on a map? Is it aware of the major obstacles nearby?
Predictability
Predictability is about having a sense of what is going to happen next, both with the robot and the environment. This is required for a human operator to create, adapt, and diagnose behaviors. For example, when authoring the next action(s), it is desirable to see a preview of the motion of the robot and environment as a way of verifying that action. The preview can be inspected for collisions or bad inverse kinematics solutions to avoid failures before executing it on the real robot. Predictability also goes hand-in-hand with authoring at runtime and in mission-critical field scenarios. For example, in the DARPA Robotics Challenge, many tasks, such as getting out of the car, were a step-by-step sequence of predefined robot motions. The robot was being teleoperated live and if the robot fell, the competition would be lost. When doing this task, a preview of the next motions allowed the team to inspect the plan before executing it, increasing confidence and the reliability of the operator-robot team.
On a technical level, it is possible to provide predictability of whole-body motions by playing back a planned animation of future motion, as a transparent colored robot. Primitive graphics like footsteps can be shown to convey where the robot will step next. Color can be used to convey feasibility. For example, a blue transparent graphic of an arm can represent a feasible solution and it can turn red to notify the operator of an infeasible or hard-to-reach configuration. On a longer horizon, a browsable list of actions could be shown which lists all future actions and sub-sequences in the behavior.
Directability
The last of Johnson’s three characteristics in Coactive Design is directability. It is a measure of how expressive the operator can be in commanding the robot to do things. For a humanoid robot, at the basic level, it means being able to command the robot to take steps, walk, move its hands, look around with its head, and generally pose the whole body. At a higher level, the availability of planners increases expressiveness. Good examples of high expressiveness would be the ability to ask the robot to clean a room or to fetch a particular object. We also extend the scope of directability to include non-direct ways of commanding the robot, such as tuning parameters of primitive actions, behavior logic, perception, or scene management. In this way, we want to not only directly command the robot’s physical actions, but also its cognitive model of the world and its plan.
Learnability (Operator Learning Curve)
The operator interface should be designed in a way that helps a novice operator learn how to use it. The behavior operator interface should be interactive and guide the user with cues to point them in the right direction and give them confidence that what they did is what they wanted to do. Nielsen’s 10 Usability Heuristics for User Interface Design is a good reference point for designing a user interface. The 10 heuristics are: visibility of system status, match between the system and the real world, user control and freedom, consistency and standards, error prevention, recognition rather than recall, flexibility and efficiency of use, aesthetic and minimalist design, help users recognize, diagnose, and recover from errors, and help and documentation.
Understandability (of the Implementation)
It would be nice if it were easy to learn how the behavior system works by reviewing the code and observing behaviors in operation. Viewing a behavior’s composition in the user interface should give a good idea of what the behavior does by being organized and supporting abstractions. The use of hierarchical abstractions, for example, can allow the reader to understand the high level at a glance and dive deeper where they want to learn more.
Usability
The user interface should be easy to use, even for an expert operator. Functionality should be organized in meaningful ways, such as grouping like functionalities, scene objects by category, and organizing primitive actions by part of the body. When behaviors get large, there should be mechanisms to abstract their contents into high-level parts. One way to do this, for example, is to structure behaviors hierarchically, such that the higher level layers are more generic, like “navigate to room C”, and lower level layers are more specific like “move hand forward 5 centimeters”. Functionalities should be organized into menus. Buttons and checkboxes should be easy to click. Text and widgets should be easy to read and size-adjustable.
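As a minimal sketch of the hierarchical structuring described above, the following shows how a tree of named nodes lets an interface display only the generic upper layers until the operator asks for more detail. The class `BehaviorNode`, its `outline` method, and the intermediate node names ("open door", "turn handle", "walk through doorway") are hypothetical and chosen only for illustration; the two quoted node names come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BehaviorNode:
    """One node of a behavior hierarchy (hypothetical structure)."""

    name: str
    children: List["BehaviorNode"] = field(default_factory=list)

    def outline(self, max_depth: int, depth: int = 0) -> List[str]:
        """Return indented names down to max_depth, hiding deeper detail."""
        lines = ["  " * depth + self.name]
        if depth < max_depth:
            for child in self.children:
                lines.extend(child.outline(max_depth, depth + 1))
        return lines


# Generic intent at the top, specific motions at the leaves.
task = BehaviorNode("navigate to room C", [
    BehaviorNode("open door", [
        BehaviorNode("move hand forward 5 centimeters"),
        BehaviorNode("turn handle"),
    ]),
    BehaviorNode("walk through doorway"),
])

# Show only the top two layers, as a collapsed view might.
print("\n".join(task.outline(max_depth=1)))
```

Calling `task.outline(max_depth=2)` instead would expand the leaf actions, which corresponds to the operator drilling into a subtree.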
Ability to Analyze in Post
A lot can happen in the course of a robot run and sometimes it can happen very fast. When there are failures or potential improvements, it is often useful to do a post-mortem analysis of what happened. This characteristic is desirable especially because running and supervising robots is stressful and requires attention. We want the ability to log all the data for a robot run and dive into that data later, without the cognitive overhead of the live run. This also gives the operator or behavior engineer the opportunity to view the system in non-live ways. They do not necessarily need the same observability data; they can choose to deep dive into control or logic data, using screen real estate for that instead of live monitoring. The logged data should include (ideally lossless) recordings of the robot’s sensors, behavior state and parameter data over time, controller variables over time, robot configuration state over time, and more. This also implies the availability of post-mortem analysis software, which would allow the logged data to be explored in a rich and interactive way. Examples of this include a slider to scrub data over time, a 3D reconstruction of robot configuration state and 3D depth data, and time plots of controller and behavior logic variables.
Debuggability
The system should provide outputs that assist in debugging when things go wrong or while bringing up new functionalities. Examples of this include good print statements in the robot processes and logging them, sending log messages from the robot to the user interface at runtime, and coloring the log messages by severity and importance. Dynamic user interface elements can also be helpful, for example, when an action fails, making it blink red to draw the operator’s attention. Another way to support debugging is to carefully select a representative set of state variables to log in a time-dependent buffer. These buffers can be streamed live or stored to disk and viewed as scrubbable plots.
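The time-dependent buffer mentioned above could look roughly like the following. This is a sketch under assumptions, not our logging implementation: the class `VariableBuffer`, its methods, and the variable name used in the usage note are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class VariableBuffer:
    """Fixed-capacity, time-stamped log of named state variables (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        # Oldest samples are dropped automatically once capacity is reached.
        self._samples: Deque[Tuple[float, Dict[str, float]]] = deque(maxlen=capacity)

    def record(self, time_s: float, values: Dict[str, float]) -> None:
        """Store a snapshot; callers are assumed to record in time order."""
        self._samples.append((time_s, dict(values)))

    def at(self, time_s: float) -> Dict[str, float]:
        """Scrub: return the latest snapshot at or before time_s."""
        result: Dict[str, float] = {}
        for stamp, values in self._samples:
            if stamp > time_s:
                break
            result = values
        return result

    def series(self, name: str) -> List[Tuple[float, float]]:
        """Return (time, value) pairs for one variable, ready to plot."""
        return [(t, v[name]) for t, v in self._samples if name in v]
```

A scrubbable plot would call `series("elbow_torque")` once to draw the curve and `at(t)` as the operator drags the time slider. The same buffer could be streamed live or written to disk.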
Testability
It is desirable to be able to test the system in an automated way. This could be with the real robot, virtually with real data, or using fully simulated data. For example, having test fixtures available for code continuous integration tools to perform simulated behaviors and inspect the results for success and performance characteristics would be helpful to ensure quality and prevent regressions. Testing often requires significant resources as in the case of real robot automated testing. To support these cases, the behavior system should be able to be operated in an automated way and not just by a human operator.
Another case that requires significant resources is fully simulated testing. It can be very difficult to reduce the sim-to-real gap for loco-manipulation behaviors that need realistic vision and physics. Tasks often need to be rigged as articulated simulation assets as in the case of doors, which is a manual process that requires expertise. However, there is also a middle ground in which tradeoffs can be made and components replaced with dummies. For example, poses of objects in the scene could be given via ground truth knowledge, bypassing the vision system entirely or partially.
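One way to realize this middle ground is to have the behavior depend on a pose source rather than on the vision system directly, so a test can substitute ground truth. The sketch below assumes a simplified planar pose and uses hypothetical names throughout (`PoseSource`, `ground_truth_source`, `plan_approach`, and the 0.5 m standoff); it is not the interface used in our system.

```python
import math
from typing import Callable, Dict, Optional, Tuple

Pose = Tuple[float, float, float]  # x (m), y (m), yaw (rad); simplified
PoseSource = Callable[[str], Optional[Pose]]


def ground_truth_source(known_poses: Dict[str, Pose]) -> PoseSource:
    """Test dummy that replaces the vision system with known poses."""
    return lambda object_name: known_poses.get(object_name)


def plan_approach(
    object_name: str, pose_source: PoseSource, standoff_m: float = 0.5
) -> Optional[Pose]:
    """Compute a stance pose standoff_m behind the object along its yaw.

    Returns None when the source does not know the object, which is how a
    missed detection would appear to the behavior.
    """
    pose = pose_source(object_name)
    if pose is None:
        return None
    x, y, yaw = pose
    return (x - standoff_m * math.cos(yaw), y - standoff_m * math.sin(yaw), yaw)
```

In an automated test, `plan_approach("door", ground_truth_source({"door": (2.0, 1.0, 0.0)}))` exercises the behavior logic without any camera data, while the real system would pass a source backed by detection.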
Extendability
We’re still firmly in the early stages of humanoid robots starting to work well. Any system for running behaviors on humanoid robots should be easy to extend, functionality-wise, to keep pace with the state of the art and maintain competitive usefulness. For example, given the availability of a new footstep planner, it should be a straightforward process to include it as an option. Likewise, if a new comms protocol is adopted, it should not require a complete redesign of the architecture to switch over. Some ways that could help in achieving extendability are keeping the code well tested and maintaining separation of concerns in the design and implementation.
There are a lot of different ways to achieve these characteristics, and in some sense there are tradeoffs depending on your specific requirements. The tradeoffs could be in engineering time or they could be theoretical. For example, if you know your system will only be used by trained expert operators, you can invest more engineering time in the functional and utilitarian aspects like reducing the number of clicks or relying more heavily on keyboard shortcuts. However, if your system needs to be usable by a more general audience, more engineering time needs to be spent on Nielsen’s 10 heuristics and you may even want to conduct a user study.
An example of a more theoretical tradeoff would be what you show to the operator at any given time. There is only so much screen real estate and operator attention that can go around, so hard decisions need to be made about the value of information. We think this will vary from system to system and ultimately is based on the confidence levels of the particular subsystems. For example, if you have high trust in your controller’s ability to walk and balance, you may not show balance information to the operator. Conversely, if you depend on a semantic object detection subsystem that is always failing to detect objects, you will likely want that visible in high detail at all times so you can monitor and learn how to either exploit its properties or make informed improvements.
Now that we have defined some desirable characteristics of a good behavior architecture, we’ll tell the story of our journey in navigating the tradeoffs and building one from near-scratch that met our requirements.
Building our Behavior Architecture
In this chapter we’ll tell the story of how our behavior system was inspired, designed, and built over roughly a 10-year time period. Ultimately, we’ll reveal the current design as a snapshot in time in the year 2026, as it is still in active development. The story arc of the behavior system spans from roughly 2014 in the run-up to the DARPA Robotics Challenge (DRC) finals, in which our DRC Atlas robot could perform 8 tasks in 1 hour under scrutinous direct teleoperation, to 2026 when our IHMC Alex humanoid robot can perform door traversals and multi-station loco-manipulation sorting tasks automatically in the tens of seconds regime and with high levels of operator neglect.
2014-2015 DARPA Robotics Challenge Era
Teleoperation and the DRC UI
The autonomous behavior system wasn’t developed heavily until around 2016, after Team IHMC won 2nd place in the DARPA Robotics Challenge Finals. In the DARPA Robotics Challenge, teams were given 1 hour to teleoperate humanoid robots in accomplishing 8 tasks in a mock-urban setting. The tasks were designed such that a humanoid robot might be the most suitable form factor for completing them. They included driving a car, getting out of the car, walking on uneven terrain, using hand tools, going through a door, up a flight of stairs, and operating various machinery. Comparatively, a human would be able to accomplish this set of tasks in under 5 minutes. Most teams were unsuccessful in completing all 8 tasks within the hour and those that did took the greater portion of the hour to do it. Participation in this challenge taught us a lot about the problems involved in getting loco-manipulation behaviors to work.
Though we ended up using teleoperation, we had tried adding degrees of autonomy to increase speed and reliability. The challenge featured low bandwidth communication with the robot at all times and high bandwidth communication only periodically. Teams that had autonomous functionalities on-board the robot would have had the advantage of being able to proceed without waiting through the challenge’s communication blackouts. Teams that did not rely on human-guided perceptual registration would have been able to teleoperate more quickly, with less risk of needing to wait through multiple communications blackouts.
Even through teleoperation, the challenge taught us a lot about the effectiveness of a particular style of user interface for humanoid robot-operator teaming. In the challenge, Team IHMC used a hybrid operator interface composed of two 3D views: a third-person view with a posable viewport camera and a first-person view that looks at the world from the perspective of the robot’s head camera, such that virtual graphics could be situated within the scene to provide an augmented reality experience, as shown in Figures 1 and 2.
The DRC UI introduced a lot of highly useful user interface mechanisms for managing a humanoid robot in accomplishing tasks. It was based on the Java Swing widget library and jMonkeyEngine for the 3D graphics engine. The DRC UI had a collection of competition-hardened interaction features that worked very well for operating humanoid robots. These included a unique 3D scene camera control and orbit system, quick access buttons with icons, 3D interactable pose gizmos, virtual modifiable plan footsteps, a walk path control ring, sliders for head and spine joint-space commands, virtual object placement, and object action templates.
A video is available at https://youtu.be/TstdKAvPfEs.
Gizmo has become a colloquial term in robotics to describe a virtual interactable 3D set of widgets that assist a user in defining or modifying a 3D position, orientation, or pose. There are a few implementations out there, such as in RViz and ImGuizmo, with a representative MoveIt pose-gizmo example shown in Figure 3. The DRC UI featured a custom one built by John Carff in jMonkeyEngine (JME) using colored tori for controlling orientation. The DRC UI gizmo was mainly controlled by keyboard shortcuts, with the F keys being used to select the axis and the arrow keys being used to move the gizmo along or about that axis. It featured orientation and position control and was used to specify poses of the virtual interactable objects such as hands, footsteps, torso motions, and scene objects.
Virtually Placeable Objects
Virtually placeable objects included a door, a hand drill, and variable-sized valves. These virtually placed objects had hand-coded, steppable action scripts in code that were executable with respect to the object. This was very similar, unknowingly, to the Affordance Template framework being designed at TracLabs at around the same time by Steve Hart. Hart’s use of the word affordance was to establish a moniker for the concept of reusable behaviors for common things you might encounter in the environment. The word affordance comes from Gibson’s book, The Ecological Approach to Visual Perception. It is used in the sense that an environmental feature affords an action. For example, the ground is an affordance for walking and a handle is an affordance for turning.
Scripted Behaviors
The DRC-era behaviors were scripts executed from the user interface, as seen in Figure 2. The operator would step through the actions one by one. These behaviors also did not close the loop on perception. Tasks were aligned to the scene manually by the operator using the point cloud and images.
2016-2019 Atlas, Hard-Coded, JavaFX Era
State Machines and Pipelines
The next generation of behaviors at IHMC started around 2016, when we implemented code-defined behaviors as state machines and pipelines. These behaviors used the whole-body controller API directly, which is largely the same today. It allows asynchronous commands for the various body parts to be submitted in real time. The controller maintains a list of active commands that it is controlling the robot to satisfy. For example, you can send a list of footsteps for the robot to take, a 3D hand pose for the robot to reach to, or a list of joint angles for the spine, a leg, an arm, or the neck. Pipelines allowed a level of hard-coded action concurrency by branching out into parallel states before rejoining.
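To illustrate the idea of per-body-part commands and pipeline stages that branch out and rejoin, here is a toy sketch. It is not the IHMC whole-body controller API: `Command`, `MockController`, `run_pipeline`, the body part names, and the millisecond durations are all hypothetical, and the controller is reduced to a countdown per body part.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Command:
    body_part: str     # e.g. "left_arm" or "feet"; names are illustrative
    duration_ms: int   # time the mock controller needs to satisfy it


class MockController:
    """Toy stand-in for a controller's list of active commands."""

    def __init__(self) -> None:
        self.active: Dict[str, int] = {}  # body part -> remaining milliseconds

    def submit(self, command: Command) -> None:
        # A new command for a body part replaces that part's current one.
        self.active[command.body_part] = command.duration_ms

    def tick(self, dt_ms: int) -> None:
        # Advance every active command at once, then drop finished ones.
        self.active = {
            part: remaining - dt_ms
            for part, remaining in self.active.items()
            if remaining - dt_ms > 0
        }


def run_pipeline(
    controller: MockController, stages: List[List[Command]], dt_ms: int = 100
) -> int:
    """Run stages in order; commands within one stage run in parallel."""
    elapsed_ms = 0
    for stage in stages:
        for command in stage:        # branch out into parallel commands
            controller.submit(command)
        while controller.active:     # rejoin once all have finished
            controller.tick(dt_ms)
            elapsed_ms += dt_ms
    return elapsed_ms
```

With a first stage that moves both arms (1000 ms and 600 ms) and a second stage that walks (2000 ms), the pipeline takes 3000 ms rather than the 3600 ms a fully sequential script would need, which is the benefit of concurrency within a stage.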
This generation of behaviors started to use perception in the form of basic OpenCV color filters, blob detectors, fiducial detectors, and simple lidar processing. This enabled us to perform behaviors such as pull-door traversals (Figure 1) and autonomous ball pickup. These behaviors were slow: several minutes for a door traversal and roughly 5 to 10 minutes for picking up colored balls off the floor. They were performed with the Boston Dynamics DRC Atlas.
Perception Limits
Perception was a big limitation in these behaviors. We relied on a MultiSense SL lidar scanner which took some time to spin around and gather sufficient data. We also didn’t have good software for buffering and querying the point cloud quickly. This made it difficult to achieve fast behavior based on perception.
Another key limitation was the architecture. The behaviors were coded by hand, with magic numbers in the code defining the poses of the hands and footsteps relative to objects. This meant that iterating on behaviors required code changes for every tweak. In practice, this was done by running the behavior off-board on the operator computer and using a code-reloading tool called JRebel. The iteration loop required magic number tweaking, code commenting, reordering, and restructuring. This authoring process was an unguided expert code tweaking exercise.
Autonomous Locomotion with JavaFX UIs
In 2017 and 2018, we started to develop more planning and perception capability, but using a separate UI framework from the DRC UI. We developed an A* footstep planner and a planar region segmentation perception algorithm that worked together. Using the MultiSense SL lidar scanner, we could wait for 20 seconds or so, pool an octomap terrain representation, segment it into planar regions, and plan footsteps over it. This facilitated a leap in robot autonomy. The robot was now able to traverse sections of rough terrain autonomously, as shown in Figures 2 and 3.
These planning and perception features departed from the DRC UI, opting to use JavaFX for the algorithm visualization. This meant that there was now a separate UI framework for algorithm development and robot operation. A complication was that JavaFX could not render point clouds efficiently like the DRC UI could, causing somewhat of a rift when deciding how to visualize robot data.
Navigation with Planar Regions
Around 2019, we started to develop room-to-room navigation tools in the JavaFX ecosystem (Figure 4), while also starting the development of a rough terrain traversal behavior system that used active perception for each footstep. It was in 2019 that the work on this thesis started directly. One aspiration was to unify the tools used for robot operation with those used for planning and perception algorithm development. This was important to simplify the decision of what tools to use when developing a feature and also to reduce code duplication. For example, the JavaFX UI could not render point clouds efficiently and it was cumbersome to create UI widgets. Conversely, the DRC UI code was aging and it was time to evaluate a more modern set of building blocks while cleaning up the code.
Another issue with the DRC UI was the friction in creating standalone applications. When developing planners and perception algorithms, it is desirable to create a test or demo app for just that component, while keeping its parts, such as algorithm visualization elements, usable in larger apps.
Virtual Reality Teleoperation
Around this time, we had also developed some virtual reality (VR) applications based on the DRC UI, which allowed for teleoperating the robot. This virtual reality interface made authoring one-time-use footstep plans fast. We ran a demonstration where the VR operator manually placed footsteps over a rough terrain cinder block field in a few minutes. One caveat to this implementation was that the operator had to launch either the 2D DRC UI or the VR one and could not switch between VR and mouse and keyboard modalities without restarting the user interface. Restarting the user interface is frustrating because any persistent state built up while setting up robot commands can be lost. Additionally, relaunching the user interface can take several minutes, which adds up quickly.
Simulation Construction Set
We also had a simulation and data visualization engine called Simulation Construction Set (SCS), which allows users to programmatically create rigid-body dynamics simulations and robot controllers. On top of that, it featured a 3D view of the robot and scrubbable plots of buffered variable data, as seen in 5. SCS was implemented using an older version of jMonkeyEngine and features were not interchangeable with our DRC UI. For UI widgets, SCS used Java Swing, but using a very different code path than the Java Swing used in the DRC UI. The takeaway here is that while SCS and the DRC UI used the same underlying libraries, functionality could not be shared between them.
2020-2021 Atlas, RDX, Behavior Tree Era
Evaluating Next-Generation Tooling
A Slack message from December 18, 2020 read: “[…] because of the graphics adapters for 2D and 3D you can’t easily use the underlying engine features and you can’t share graphics code with even our other JME apps like the DRC UI. […] The JavaFX UIs, the (DRC) Operator UI, and SCS are basically not even close to being compatible with each other.”
In late 2020, we did a survey of software libraries in an attempt to build a framework that could act as a sandbox for robotic data exploration and planner development but also be extendable to state-of-the-art robot teleoperation. Additionally, since the promise of VR was now clear, we wanted to include VR support as a base function in all applications, such that switching between mouse and keyboard and VR was as simple as enabling VR and putting on the headset. We wanted the 3D scene to be shared between VR and the mouse and keyboard interface and treat VR as just another interaction method in the same class as monitor, mouse, and keyboard.
Another desirable characteristic of this new framework was ease of creating user interface widgets. Often for designing, tuning, and interacting with experimental planners, controllers, and perception algorithms, the engineer wants to tweak variables quickly and easily to see the effects on the process and the data. This is in stark contrast to considering user experience for every UI element. In our survey, we found a library called ImGui which fit this bill perfectly, allowing user interface elements to be rendered in application code, in the same place as the interaction logic is handled, and in very few lines of code. It turned out to be a great way to provide base functionalities quickly while still allowing for custom rendered widgets that consider user experience.
Another goal of the new framework was to stay within the current software ecosystem so lab developers could seamlessly use it and contribute to it and so that existing lab tools could be used and not rewritten. Our existing code was in Java so this meant we wanted to stick with Java. Our thought was also to build on the most popular open-source tools in the community in order to avoid any rug pulls and also ride the wave of community maintenance and development.
For 3D graphics engines, there were two major competitors: jMonkeyEngine and libGDX. The DRC UI was implemented using jMonkeyEngine but was on an old version. libGDX had more stars on GitHub, a vibrant community, and an API that was closer to plain OpenGL, the underlying library. Because we needed to maintain the DRC UI usability as a legacy app and it was somewhat tangled in an aging codebase, it was unclear if bringing it directly forward to modern libraries would work.
When designing the new framework, a few things were clear:
- We wanted to adopt the pattern used by Eclipse IDE, where the application was a set of dockable panels. These panels should be showable and hideable, and the docking configurations should be easily savable and loadable. This system of UI design was shown to be very adaptable to different workflows and tasks. It is also similar to how other popular and versatile apps work, such as Blender.
- We wanted to build in virtual reality (VR) support at the base level. If you didn’t have a VR headset, the app would operate normally. However, if you did, there was a checkbox to enable VR and you were able to jump into the main 3D view and interact with the same elements you could with mouse and keyboard. An element of the VR support was that the 3D view on the monitor would remain visible while in VR, and others could view the poses of the VR user’s headset and controllers in 3D by standing next to the computer.
- We wanted to implement the main features of the DRC UI such as the gizmos, the walk path control ring, modifiable footsteps, and affordance templates.
ImGui included support for the dockspaces of panels, multi-window support, and even saving those entire layouts to .ini files. OpenVR was available for Java and supported any VR headset that would work in SteamVR, which is nearly all of them. We decided to rewrite the new framework from scratch and use libGDX, ImGui, and OpenVR.
Robot Data eXplorer
This effort ultimately culminated in the robotics sandbox application suite later named Robot Data eXplorer (RDX), shown in 1. We wanted this set of tools to be useful for teleoperation, simulation, and perception and control algorithm development. It was designed in a similar spirit to SCS in that the user would first create a main entry point for their project and implement what they needed programmatically. For example, the application could be a test of a point cloud renderer, displaying it in 3D while also providing tuning widgets for point size, color, etc. We also wanted RDX to be versatile enough such that our primary applications could use it, such as our teleoperation interface (like the DRC UI). This meant that it needed to support not just utilitarian development widgets, but also custom rendered icons and interactive elements. ImGui and libGDX supported all of this and we now use RDX in 2026 as our primary teleoperation interface.
The DRC UI mainly supported keyboard controls for the gizmo. In RDX, we wanted to make it more intuitive to use by implementing click-and-drag controls for both orientation and translation. We implemented a mouse ray projection into the 3D scene in RDX which was generally available to any component in RDX. For the gizmo, this was used to intersect with the gizmo orientation control tori and translation control cylinders and cones. The gizmo’s axes are colored as Red, Green, and Blue, to match X, Y, and Z, and Roll, Pitch, and Yaw. The way to remember it is “RGB -> XYZ”. We also re-implemented the walk path control ring and virtual interactable footsteps, as shown in a 2022 screenshot in [fig:nadia_interactables].
The VR mode in RDX also featured ImGui widgets. The same panels available in the monitor, mouse, and keyboard modalities were also available as panels mounted on the wrist in VR, and they were operable via point and click with the VR controller. This continued the RDX design goals of reusability and versatility into the VR modality. We wanted VR to be simply another interaction method for the same data, not a separate framework. Additionally, we wanted switching between the monitor, keyboard, and mouse and virtual reality to be as simple as putting on and taking off the headset and picking up and setting down the controllers.
Perceptive Locomotion
In 2020, we started using the RealSense L515 depth sensor for perceptive locomotion. Bhavyansh Mishra developed a planar region extractor using OpenCL which enabled the development of a perceptive locomotion behavior with active perception while walking. This led to one of the first autonomous behaviors that used RDX as the user interface, as shown in 2. This work is covered in the 2022 paper.
Behavior Trees
After the initial development of the perceptive locomotion behavior, “look and step”, we sought to build out multi-task behaviors as part of our building exploration demo. To assemble a hybrid teleoperated and autonomous multi-stage behavior, we looked to behavior trees. 3 shows the very beginning of that effort in May of 2021. We pulled in an ImGui library called ImNodes and started rendering a tree of behaviors in a 2D canvas.
In June 2021, we worked on a building exploration demo which consisted of seven tasks with a hybrid autonomy model: an autonomous rough terrain traversal using perceptive locomotion, an automatic pull door behavior, VR teleoperated debris clearing, an automatic push door traversal, obtaining a mock pipe bomb from the top of a flight of stairs, walking back down, and placing the mock pipe bomb in a trash can. The tasks are shown in 4. We successfully performed the demo on June 23, 2021, in 22 minutes.
This demo tested our ability to incorporate autonomous behaviors with operator supervision and teleoperation. Major innovations in the demo with respect to the DRC were VR kinematics streaming, continuous perceptive locomotion, and automatic push and pull door behaviors. However, the behavior structure for the locomotion, the door traversals, and the building exploration assembly were still hard-coded and not runtime editable, and perception of semantic objects was still reliant on fiducial markers. Push and pull door and stairs behaviors were selected by the fiducial IDs. The UI for this demo can be seen in 5.
After the building exploration demo, we switched from the MultiSense SL to the Ouster OS0-128 lidar sensor in combination with a stereo pair of Blackfly cameras with fisheye lenses. A paper about this sensor suite is available at . With the new sensor configuration, we had instantaneous dense lidar scans instead of needing to wait for the MultiSense SL to spin. This enabled us to develop the person following behavior shown in 6.
2022-2023 Nadia, Runtime-Editable Sequences Era
Runtime-Editable Sequences
In 2022, inspired by the Affordance Template framework, we started developing an interactive and runtime editable behavior authoring pipeline. This was the beginning of the architecture we still use in 2026. The initial goal was to replicate the door behaviors we had before, but in a runtime editable way. To do this, the behavior state was implemented as a synchronized Conflict-Free Replicated Data Type (CRDT). This allows the operator and robot to share and co-modify the behavior action definitions and run modes. A screenshot from an early demo can be seen in 1. This version of the behavior system had 4 action types: Walk, Hand Pose, Hand Configuration, and Chest Orientation. The Walk and Hand Pose actions were defined in task frames. It saved and loaded behaviors to and from JSON. However, it only supported a linear sequence of actions and they were only executable one at a time.
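To illustrate how a conflict-free replicated value lets the operator and the robot edit the same field and still converge, here is a minimal sketch of one simple CRDT, a last-writer-wins register. The text does not say which CRDT variant the behavior system uses, so the choice of a last-writer-wins register, the class name `LwwRegister`, and the replica names are all assumptions made for illustration.

```java
/**
 * Hypothetical last-writer-wins (LWW) register for one editable field.
 * The CRDT variant is an assumption; the text only says the state is a CRDT.
 */
final class LwwRegister<T> {
    private T value;
    private long version;    // logical clock of the accepted write
    private String writerId; // replica that made the write; breaks ties

    LwwRegister(T initialValue) {
        this.value = initialValue;
        this.version = 0L;
        this.writerId = "";
    }

    /** Local edit made by one replica, for example "operator" or "robot". */
    void set(T newValue, String replicaId) {
        this.value = newValue;
        this.version = this.version + 1L;
        this.writerId = replicaId;
    }

    /** Merge a copy received from another replica, keeping the newer write. */
    void merge(LwwRegister<T> other) {
        boolean otherIsNewer = other.version > this.version
                || (other.version == this.version
                    && other.writerId.compareTo(this.writerId) > 0);
        if (otherIsNewer) {
            this.value = other.value;
            this.version = other.version;
            this.writerId = other.writerId;
        }
    }

    T get() {
        return value;
    }

    public static void main(String[] args) {
        LwwRegister<Double> onOperator = new LwwRegister<>(0.0);
        LwwRegister<Double> onRobot = new LwwRegister<>(0.0);
        onOperator.set(0.5, "operator"); // concurrent edits to the same field
        onRobot.set(0.7, "robot");
        onOperator.merge(onRobot);       // merges may arrive in either order
        onRobot.merge(onOperator);
        System.out.println(onOperator.get() + " " + onRobot.get());
    }
}
```

Because the merge keeps the write with the higher version and breaks ties deterministically by replica name, both copies end with the same value regardless of the order in which the merges are applied.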
Milestone 1
Also in 2022, we switched to IHMC and Boardwalk Robotics’s Nadia humanoid robot and set some milestones for the next few years. Milestone 1 was a teleoperation-only multi-task demo. Milestone 1 was performed on the robot on September 9, 2022 using the RDX teleoperation UI with an upgraded set of teleoperation features shown in 2. In that demo, Nadia traversed rough terrain, traversed a push door, cleared debris, and traversed another push door in 7 minutes. The tasks for Milestone 1 can be seen in 3. During this demo and with this set of functionality we found that some methods of teleoperation were faster in VR and some were faster with mouse and keyboard. RDX supported switching between them quickly. We used mouse and keyboard to place individual footsteps over a cinder block field and then switched to VR for debris clearing and the push doors, where we used the VR controllers to continuously stream the hand poses and the VR controller trigger buttons to open and close the hands. VR was also used to drive the robot with the joysticks.
This user interface also supported multiple operators as a side effect. Often, the VR operator would have another operator observing the monitor and sometimes clicking buttons to put the robot in pre-defined arm configurations, such as the home and door avoidance configurations. This helped us overcome gaps in the VR interface quickly and allowed for “copilot” supervision and monitoring, which doesn’t usually work well for a VR system.
Behavior Nodes
A primary goal of the Milestone 1 demo was to get base functionalities in place with the plan to start adding autonomous components in later milestones. To that end, in 2023, we developed the next generation of behaviors, which were entirely authorable at runtime. We started with a can pick and place behavior and the push and pull door traversals.
On June 20, 2023, we executed a can of soup pick and place behavior in 1 minute and 50 seconds as shown in 4. This behavior was slow, not autonomous, and used an ArUco marker instead of detecting the can directly. Our control of the SAKE Robotics EZGripper in Nadia at this time was unreliable and there was no fallback node yet so the operator had to keep trying the gripper open and close actions. The operator had to visually check the state of the gripper fingers and re-execute the action if necessary. There was no way to get the state of the gripper from the user interface at this time. The operator also had to wait for each action to complete before starting the next because action completion conditions were not finished. Regardless, it was a good real-robot test of a savable/loadable and runtime editable affordance template behavior sequence.
Our first autonomous push door traversal on Nadia using this system was on June 27, 2023, as shown in 5. UI improvements were made to condense each action node’s settings into collapsible areas so we could display larger sequences. Several new action types were added: hand wrench, pelvis height, arm joint angles, manual footsteps, and wait. We were able to autonomously traverse the push door in 36 seconds with this version of the system. The behavior was based on the pose of an ArUco marker detection. In this generation of behaviors we used the Ouster point cloud colored using the left Blackfly fisheye camera image for situational awareness in the operator UI.
The hand wrench action was used for a box pickup behavior which didn’t end up working very well. The wrenches were used for squeezing the box from the sides to hold it. However, the box wasn’t fully constrained and would wobble. Later, we decided to change the box picking strategy to using the grippers to pinch and lift from the top edges of a tote.
Behavior Tree Structure
6 shows an improved version of the push door behavior executing on August 22, 2023. In this version, there are several enhancements to the behavior user interface. The first person view was included in the interface, allowing the operator to monitor the robot’s view of the door and the hands without needing line of sight. Other enhancements included 3D force and torque indicators on the hands. The idea behind this was to experiment with force based control by first taking readings on the current measurements. The force and torque data was noisy and we had not yet determined how to handle it. Another enhancement was the action progress bar indicators. These allow the operator to monitor the execution progress of the actions and give the operator confidence that action nominal duration and termination conditions are working correctly. The speed of the behavior was the same as the June run at 36 seconds.
In September 2023 we made enhancements to the walk action as seen in 7, allowing the behavior author to interactively edit a multi-step footstep plan. This was especially useful for authoring door traversal walk-throughs, as a very specific set of footsteps are required to satisfy controller stability limits while avoiding collisions between the robot’s shoulders and the door frame.
Also in September of 2023, we established the code structure for the behavior nodes as seen in 8. This structure separated the concerns of a behavior node into four layers: the definition, the state, the executor, and the user interface (RDX) implementation. Each of the four layers was implemented in a separate code file to keep similar code in similar files. The definition layer was responsible for defining and serializing the data that defines the node to and from JSON files. The state layer contains the data that only exists once the node is instantiated and is common between the UI and executor implementations. It can also contain any data helper functions needed in the UI and the executor. The executor layer is the instantiated type of the node on the robot. It is the only layer that commands the robot directly through its ROS 2 API. It is also the only layer that has access to full resolution and frequency perception data, as it is co-located with the perception sensors. Finally, the UI layer is for rendering the authoring interface elements in RDX. The UI layer is also contained in a UI-specific folder that has access to the graphics engine and widget libraries.
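The four-layer separation described above can be sketched for a single node type. This is a simplified illustration rather than the thesis code: the class names (`WaitDefinition`, `WaitState`, `WaitExecutor`, `WaitUi`), the hand-written JSON string, and the text rendering that stands in for ImGui widgets are all assumptions, and the real executor would talk to the robot through its ROS 2 API rather than only advancing a timer.

```java
import java.util.Locale;

/** Small demo that wires the four hypothetical layers together. */
final class FourLayerDemo {
    public static void main(String[] args) {
        WaitDefinition definition = new WaitDefinition();
        definition.durationSeconds = 2.0;
        WaitState state = new WaitState(definition);
        WaitExecutor executor = new WaitExecutor(state);
        executor.trigger();
        executor.update(0.5);
        System.out.println(WaitUi.render(state));
        System.out.println(definition.toJson());
    }
}

/** Definition layer: the parameters that are saved to and loaded from JSON. */
final class WaitDefinition {
    String name = "Wait";
    double durationSeconds = 1.0;

    String toJson() {
        return String.format(Locale.ROOT,
                "{\"name\":\"%s\",\"durationSeconds\":%.3f}", name, durationSeconds);
    }
}

/** State layer: runtime data shared by the UI and the executor. */
final class WaitState {
    final WaitDefinition definition;
    boolean isExecuting = false;
    double elapsedSeconds = 0.0;

    WaitState(WaitDefinition definition) {
        this.definition = definition;
    }

    /** Helper usable from both sides: fraction complete in [0, 1]. */
    double progress() {
        if (definition.durationSeconds <= 0.0) {
            return 1.0;
        }
        return Math.min(1.0, elapsedSeconds / definition.durationSeconds);
    }
}

/** Executor layer: the only layer that would command the robot. */
final class WaitExecutor {
    final WaitState state;

    WaitExecutor(WaitState state) {
        this.state = state;
    }

    void trigger() {
        state.elapsedSeconds = 0.0;
        state.isExecuting = true;
    }

    /** Called once per update tick on the robot side. */
    void update(double dtSeconds) {
        if (!state.isExecuting) {
            return;
        }
        state.elapsedSeconds += dtSeconds;
        if (state.elapsedSeconds >= state.definition.durationSeconds) {
            state.isExecuting = false;
        }
    }
}

/** UI layer: renders the shared state; a string stands in for widgets. */
final class WaitUi {
    static String render(WaitState state) {
        return String.format(Locale.ROOT, "%s: %.0f%%",
                state.definition.name, 100.0 * state.progress());
    }
}
```

Keeping the progress helper in the state layer means the UI and the executor compute it the same way, which is the kind of sharing the state layer is described as enabling.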
In November 2023, we started to convert the sequence into a tree, as seen in 9. Here, the root node “WalkingTest” can be seen at the top with an expanded arrow and the root node control widgets. Two child walk action nodes can be seen below it.
Further enhancements to the behavior system were made in November of 2023. As seen in 10, a green arrow was introduced to highlight the currently selected next node to execute. This created a uniquely identifiable marker for the operator to focus on that was distinct from the default ImGui widgets. Clicking the green arrow is used to set the index. Later the arrow would blink blue to signal the action was currently executing.
11 shows a context menu feature used to reorder nodes and place them under different parents. This was especially useful if a node needed to be moved far away. Since switching to the tree-based view, this was the first time the nodes could be reordered or moved under new parents.
A feature developed on November 16, 2023, can be seen in 12, which shows how a node can be converted into a nested JSON file via the context menu. The JSON behavior loader now supported recursively loading behavior trees with nested JSON files. This supported the separation of skills into their own JSON files and kept larger tree files from getting too large.
In December of 2023, we developed a screw primitive action to open doors, which can be seen in 13. This parameterized helical screw trajectory option was inspired by . It allows the operator to define revolving motions useful for turning handles and swinging door panels. It is defined by a 3D axis (dashed white line), an angle of revolution, and a translation amount to move along the axis. The trajectory is also grasp-invariant, as it generates the trajectory online from the hand’s pose just before execution.
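As a rough illustration of the geometry, the sketch below generates hand position waypoints along a helical screw motion from an axis, an angle of revolution, and a translation along the axis, starting from wherever the hand currently is. It uses the standard Rodrigues rotation formula and handles positions only; hand orientation, timing, and the actual trajectory representation used on the robot are omitted. The class and method names are hypothetical.

```java
/** Hypothetical sketch of a screw (helical) motion for a hand position. */
final class ScrewPrimitiveSketch {
    static double dot(double[] a, double[] b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    static double[] cross(double[] a, double[] b) {
        return new double[] {
            a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]
        };
    }

    /**
     * Position after a fraction alpha in [0, 1] of the screw motion.
     * The axis passes through axisPoint along axisDirection (any nonzero length).
     */
    static double[] pointAlongScrew(double[] start, double[] axisPoint, double[] axisDirection,
                                    double angleRadians, double translation, double alpha) {
        double norm = Math.sqrt(dot(axisDirection, axisDirection));
        double[] k = {axisDirection[0] / norm, axisDirection[1] / norm, axisDirection[2] / norm};
        double[] v = {start[0] - axisPoint[0], start[1] - axisPoint[1], start[2] - axisPoint[2]};
        double theta = alpha * angleRadians;
        double c = Math.cos(theta);
        double s = Math.sin(theta);
        double[] kCrossV = cross(k, v);
        double kDotV = dot(k, v);
        double[] result = new double[3];
        for (int i = 0; i < 3; i++) {
            // Rodrigues rotation of v about k, then slide along the axis.
            double rotated = v[i] * c + kCrossV[i] * s + k[i] * kDotV * (1.0 - c);
            result[i] = axisPoint[i] + rotated + k[i] * alpha * translation;
        }
        return result;
    }

    /** Evenly spaced waypoints, generated from the hand position at execution time. */
    static double[][] waypoints(double[] handStart, double[] axisPoint, double[] axisDirection,
                                double angleRadians, double translation, int segments) {
        double[][] path = new double[segments + 1][];
        for (int i = 0; i <= segments; i++) {
            double alpha = (double) i / segments;
            path[i] = pointAlongScrew(handStart, axisPoint, axisDirection,
                                      angleRadians, translation, alpha);
        }
        return path;
    }

    public static void main(String[] args) {
        double[] hand = {1.0, 0.0, 0.0};
        double[] axisPoint = {0.0, 0.0, 0.0};
        double[] axisDirection = {0.0, 0.0, 1.0};
        double[][] path = waypoints(hand, axisPoint, axisDirection, Math.PI / 2.0, 0.2, 4);
        for (double[] p : path) {
            System.out.printf("%.3f %.3f %.3f%n", p[0], p[1], p[2]);
        }
    }
}
```

Since the start point is an input rather than a stored parameter, the same axis, angle, and translation produce a valid arc for any grasp, which is one way to read the grasp-invariance property described above.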
2024 Nadia, Semantic Perception, Fast Behaviors Era
Can Pick and Place
In January 2024, in an effort to demonstrate advancements in manipulation capability, the behavior system was used in a demo that picks up a shoe off a table, sidesteps, places the shoe in a banker’s box, picks up the box, and backs away with it as shown in 1. This behavior was authored by Dexton Anderson and Dhruv Thanki and was a good example of the behavior system being used by someone other than Duncan Calvert or Luigi Penco, who had been the primary users until this point. The behavior was executed using manual stepping as opposed to the automatic mode and lasted 3 minutes and 9 seconds. This was also the first integration of a CenterPose model for the shoe and the first behavior that did not use an ArUco marker for manipulation of an object. The behavior was split into three separate JSON files which were executed separately: “LeftHandShoePickUp.json”, “ShoeDropOff.json”, and “PickUpBankersResizableBox.json”.
This behavior was very difficult to get to work due to the unreliability of the components. The Nadia T-Motor arms and lower body EtherCAT bus would often fault, causing the robot to fall. CenterPose was very unreliable in detecting the shoe. There was no onboard compute so a laptop was tethered to the robot to run the perception and behavior processes. On top of those, the behavior system was still new and buggy. We also had a new implementation of a scene graph that worked with the behavior system but was separate from it. This scene graph had synchronization bugs.
Behavior System User Interface
Also in January 2024, a major refresh of the behavior system user interface was completed as shown in 2. The general visual improvements in this version are still present in 2026 at the time of writing. The major changes included the following.
- Icons were added for the action types to aid in visual navigation.
- Actions could now be renamed by double clicking them, typing the new name, and hitting enter.
- The settings for each node were moved to a modal area at the bottom of the Behavior Tree panel instead of being inlined next to each node. This was an important change that simplified the user experience, as only one node should be modified at a time.
- The root node selection area was refreshed visually and used selectable underlined text buttons.
- Action progress was now plotted at the top as an option instead of the progress bars. The user could toggle between these two types of progress information.
On February 4, 2024, we achieved the fastest push door behavior yet on Nadia in 17 seconds, shown in 3. By this time, Nadia’s upper arms had been upgraded to a cycloid drive based design. It also used an “execute with next action” boolean option for action nodes which was added back in September 2023. The “execute with next action” feature supported executing multiple nodes at once, contributing to the speed of this push door behavior. This action sequence node would not wait for the currently executing action to complete if this field was true for the next action. Instead, it would immediately execute it too, such that multiple actions could be running at once.
Also in February 2024, we added a feature that shows an asterisk next to the behavior node names if they have been modified, as shown in 4. This was an important usability feature in behavior authoring, because accidental behavior modifications were common at this stage in the implementation. It was, therefore, important to understand when a modification was made so the operator could undo it if necessary. This feature involved some extra boilerplate in each node’s definition implementation but was well worth it for the usability improvement.
Execute After
In March of 2024, the “execute with next action” concurrency feature was replaced with an “execute after” node pointer. This allowed each action to declare a dependency on which action to wait for. By setting the “execute after” field to something earlier than the previous action, actions could start along with prior actions. This also allowed for action scheduling using the Wait action in combination with the “execute after” field. For example, arm motions could be scheduled during walking. This feature enabled our fastest ever door traversal, performed on March 15, 2024, as shown in 5. In this behavior, Nadia traversed a spring-loaded push door without stopping walking in 14 seconds, using nubs for hands and an ArUco marker for detecting the pose of the door.
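To make the scheduling effect of an “execute after” pointer concrete, the following sketch computes when each action in a sequence would start if every action began as soon as the action it points to finished. It is a simplified model under stated assumptions: durations are taken as known in advance, whereas the real system would wait on completion conditions, and the class name, the use of array indices as pointers, and the value `-1` for "start of the sequence" are hypothetical.

```java
/** Hypothetical model of "execute after" scheduling with known durations. */
final class ExecuteAfterSchedule {
    /**
     * @param durations    nominal duration of each action, in seconds
     * @param executeAfter index of the action each one waits for, or -1 for none
     * @return start time of each action, in seconds from the start of the sequence
     */
    static double[] startTimes(double[] durations, int[] executeAfter) {
        double[] start = new double[durations.length];
        for (int i = 0; i < durations.length; i++) {
            int dependency = executeAfter[i];
            if (dependency >= i) {
                throw new IllegalArgumentException(
                        "Action " + i + " must wait on an earlier action");
            }
            // An action starts when the action it depends on has finished.
            start[i] = dependency < 0 ? 0.0 : start[dependency] + durations[dependency];
        }
        return start;
    }

    public static void main(String[] args) {
        String[] names = {"Approach", "Walk", "Wait", "Arm motion", "Grasp"};
        double[] durations = {2.0, 4.0, 1.0, 2.0, 1.0};
        // Wait points back past Walk to Approach, so it starts together with Walk.
        // Arm motion follows Wait, so it begins one second into the walk.
        int[] executeAfter = {-1, 0, 0, 2, 1};
        double[] start = startTimes(durations, executeAfter);
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " starts at " + start[i] + " s");
        }
    }
}
```

In this example the arm motion overlaps the walk instead of waiting for it, which is the pattern described above for scheduling arm motions during walking with a Wait action.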
Composite Frame
On April 9, 2024, we demonstrated a bimanual box pick and place behavior as shown in 6. With some padding attached to the nub forearms, Nadia executed a behavior that picked a box up from a pallet on the ground, carried it to a table, set it down, and backed away.
On April 12, 2024, we demonstrated a pull door behavior in 19 seconds, shown in 7. This behavior used a new YOLO model for door opening mechanisms and did not use an ArUco marker.
Earlier, in late March 2024, we had started implementing a heuristic door traversal node as a place to hard-code door-specific logic while continuing to use the runtime editable actions. The purpose of this was to implement reactive elements like retries on failure before figuring out how to generalize it with fallback nodes. Pseudocode for the initial pull door lever retry is presented in [alg:pull_door_screw_primitive_retry]. This algorithm didn’t work as reliably as the one we will present next.
On April 12, 2024, we used this strategy to demonstrate a reactive version of the pull door behavior, as shown in 8. A human repeatedly holds the door closed, pulls the door closed, and pushes the robot’s hand back as the robot tries to open the door. The door traversal node detects the action failure, rewinds to the grasp approach action, and tries again until successful.
An improved hard-coded heuristic, presented in [alg:door_traversal_node_reactive], was used to detect three types of failures specific to opening a door. At the end of the grasp action, if the hand did not reach the handle, we assume the grasp failed. At the end of the pull door open action, we check the pose of the door handle and the pose of the robot’s hand. In the case the handle did not move far from its original position, we assume that the unlatch failed. In the case the handle pose is far from the hand pose, we assume the hand slipped off. On any type of failure detected by this node, we rewind the action sequence to the grasp approach action, where the robot immediately begins to retry the door opening. This approach worked well for the human holding the door shut, pulling the door out of the robot’s hand, and pushing the robot’s hand away from the door handle. In all three cases, the robot was able to recover and successfully execute the full door traversal without operator intervention.
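The three checks described above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of [alg:door_traversal_node_reactive]: the class and method names, the use of positions rather than full poses, and both distance thresholds are assumptions.

```java
/** Hypothetical sketch of the three door-opening failure checks. */
final class DoorFailureCheck {
    enum Failure { NONE, GRASP_FAILED, UNLATCH_FAILED, HAND_SLIPPED }

    // Illustrative thresholds in meters; the actual values are not given in the text.
    static final double REACH_TOLERANCE = 0.05;
    static final double MIN_HANDLE_TRAVEL = 0.10;

    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** Checked at the end of the grasp action. */
    static Failure afterGrasp(double[] hand, double[] handle) {
        return distance(hand, handle) > REACH_TOLERANCE ? Failure.GRASP_FAILED : Failure.NONE;
    }

    /** Checked at the end of the pull door open action. */
    static Failure afterPull(double[] hand, double[] handle, double[] handleAtGrasp) {
        if (distance(handle, handleAtGrasp) < MIN_HANDLE_TRAVEL) {
            return Failure.UNLATCH_FAILED; // handle barely moved from where it started
        }
        if (distance(hand, handle) > REACH_TOLERANCE) {
            return Failure.HAND_SLIPPED;   // handle moved, but the hand is no longer on it
        }
        return Failure.NONE;
    }

    /** On any failure, rewind to the grasp approach action; otherwise continue. */
    static int nextActionIndex(Failure failure, int currentIndex, int graspApproachIndex) {
        return failure == Failure.NONE ? currentIndex + 1 : graspApproachIndex;
    }

    public static void main(String[] args) {
        double[] handleAtGrasp = {0.0, 0.0, 1.0};
        double[] handleNow = {0.3, 0.0, 1.0};
        double[] hand = {0.0, 0.0, 1.0}; // the hand stayed behind: it slipped off
        Failure failure = afterPull(hand, handleNow, handleAtGrasp);
        System.out.println(failure + " -> next index " + nextActionIndex(failure, 5, 2));
    }
}
```

All three failure types lead to the same recovery, a rewind to the grasp approach, so distinguishing them mainly helps with logging and tuning in this sketch.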
During this demo, we noted the friction with hard-coded reactive logic. Iterating from [alg:pull_door_screw_primitive_retry] to [alg:door_traversal_node_reactive] took a lot of time and guesswork. Our metric for evaluation was the reliability of the reactive retry mechanism. Since this test required the full system to be online and two people to operate, one at the operator computer and one to disturb the robot, experiment time was scarce. Having to change code, implement debugging output, redeploy, and restart the behavior to modify the reactive logic consumed even more of that time. It was clear that a fallback node, proximity condition, and improved action success criteria would streamline this process in the future.
On July 3, 2024, we wanted to show reliability and behavior composition in a building exploration type environment. We put together a behavior to traverse three doors of different types consecutively in one continuous automatic execution. We built three lab doors for this: a door with a spring closer, a push bar on one side, and a pull handle on the other; a knob handle door with a spring closer; and a lever handle door without a spring closer. We traversed the first two doors in 48 seconds but fell after opening the third door. This run is shown in 9.
There were a couple of failure modes that made achieving this behavior difficult. Firstly, the EZGripper, having only two fingers and limited grip, slipped off the pull lever handle very often. Additionally, having only two fingers made getting a solid grasp on the knob handle difficult. For example, if the two fingers were not positioned at diametrically opposite points on the circular knob handle, the gripper would slip off to the side. This was a key moment in realizing that having at least three fingers would make door behaviors more reliable.
Another failure mode, and the one that ultimately prevented us from getting the third door completed, was the result of a combination of hardware and control issues. The Nadia robot was tethered to a hydraulic pump across the room and the hydraulic cabling was very stiff. Also, Nadia’s hip pitch actuators in their configuration were barely strong enough to lift the leg when the arms were also out in front of the robot, as required to avoid colliding with the door frame during door traversals. When the robot was positioned to traverse the third door, the hydraulic cables put enough extra torque on the robot to make walking through the door unachievable and it repeatedly fell.
Later that month, we attempted an extended version of that behavior in which we turned the behavior into a multi-room search. Dubbed the “ONR Demo”, this behavior was demonstrated on July 19, 2024. A test run is shown in 10 which is a partial version. At the time of writing, we have been unable to recover video of the full run, but we recall having searched all three rooms, including moving the couch to check under it and performing a salute behavior in the third room.
We have printed the structure of the behavior, generated from the actual JSON in 11 and 12.
On the day of the ONR Demo demonstration, we were able to accomplish each task without a failure. However, in the run-up to the demo, we noted an important property of behavior composition. Our full demo was composed of seven main loco-manipulation behaviors:
- Traverse a push bar door.
- Move the recycling bin out of the way.
- Traverse a push knob handle door.
- Traverse a pull knob handle door.
- Traverse a push lever handle door.
- Traverse a pull lever handle door.
- Move the couch.
Our observation was that even if each behavior had a success rate of 90% (which they didn’t), and assuming the behaviors succeed or fail independently, the full demo reliability is calculated as $$ 0.9^7 \approx 0.478 = 47.8\% $$ which is an alarmingly low (less than 50%) full-demo success rate even when the sub-behaviors have achieved very good reliability. On the day of the final demo, where we were ultimately successful, our success likely hinged on sufficiently configuring and testing the initial conditions such that the chance of success was much higher.
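The same arithmetic can be turned around to ask how reliable each behavior would need to be. As a simplification that the text does not establish, suppose the $n$ behaviors succeed or fail independently with a common per-behavior success rate $p$; the symbols $p$, $n$, and $P$ are introduced here only for illustration. Then the full-demo success rate $P$ and the per-behavior rate are related by $$ P = p^{n} \quad\Longrightarrow\quad p = P^{1/n} $$ Under that assumption, a full-demo success rate of $P = 0.9$ with $n = 7$ behaviors would require $$ p = 0.9^{1/7} \approx 0.985 $$ or roughly 98.5% reliability for every individual behavior.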
2024-2026 H1-2, 5 Fingers, Manipulation Era
AI2R and Fallback
In late 2024, we started looking into integrating our behavior tree system with external AI systems, such as Large Language Models (LLMs) and Vision Language Models (VLMs). For this, we created another heuristic control node called the AI to Robot (AI2R) node. In a similar spirit to our door traversal node, this node would contain hard-coded interfacing to external AI processes. The goal was to use LLMs and VLMs to generate behavior and make decisions. By June 2025 we had a working demo in simulation, as shown in [fig:2025_ai2r]. The goal of this demo was to receive an object from a person, stick it to the door, and retreat.
In this demo, the robot scans the area and reports the detected scene objects to the LLM, including people. The LLM is given a list of objects and poses, works out which person has the object, then tells the robot to go to the person with the object. There is also failure handling. If the robot goes to the wrong person, the one who is not holding the object, the LLM will detect this and an operator can, in natural language, correct the behavior to go to the other person. The system also supports collision avoidance. In the case a person is blocking the robot’s path, the behavior will pause and the operator can, in natural language, ask the robot to go behind the last person.
This hybrid behavior system shows that it is possible for LLMs to work in place of some classical components. However, it has not yet been shown to be faster, more reliable, or more capable than what is possible with classical methods.
Behavior Tree and Scene Graph
In late 2024 and early 2025 we had a series of design meetings focused on two major areas. The first was how to extract the reactive logic from the door traversal node into a runtime-editable version that would extend to other loco-manipulation behaviors. The second was how to resolve reliability and code architecture issues we were facing between the behavior tree and the scene graph. Whiteboards from these meetings are shown in 1.
We had been looking to Behavior Trees (BTs) in the literature for some time, and had used an implementation of them in the 2021 building exploration demo on Atlas, as we showed in [fig:2021_building_exploration_demo]. However, that 2021 implementation did not include runtime-editable behaviors. We wanted to figure out how to support behavior tree control nodes with our novel implementation of runtime-editable concurrent-capable action sequences. We decided to look for small, incremental modifications that made the biggest impact on reactivity.
In the above design meetings, we decided that adding a fallback node, a condition node, and a goto node would take us a long way without having to do a more significant redesign. Our idea was that our existing design of “everything is a single sequence” would still work and that the sequence implementation would monitor the fallback node, condition node, and goto node children and manage the index of the next action to execute accordingly.
In December 2024, we added a fallback node that supported concurrent actions. It has two sections: a “try” and a “catch”. It is implemented such that the first concurrent sequence in the ordered children represents the “try” and the rest represent the “catch”. The “try” can be one or more actions and the fallback node waits for them to finish. If one or more “try” actions fail, the entire “catch” sequence executes. There is no limit on the “catch” sequence length. Otherwise, if the entire “try” is successful, the “catch” is skipped and the node after the fallback node is executed next.
In January 2025, we added a goto node. This node holds a settable reference to the next node to go to. It stores this reference as a node name in the JSON. When executed, it sets the “next execution index” accordingly.
For a while, actions were simply placed as the “try”. In this way, if an action failed, a goto could be placed in the catch, pointing back to the try, to retry the action. For example, if a hand pose action does not achieve its goal pose within the tolerance setting, the node would fail. This already effectively made a certain amount of reactivity possible. A push door opening action could react to an unlatching failure through the “push door open” action failing to achieve the pushed open pose.
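The retry pattern described above can be sketched as a flat sequence whose executor manages the next execution index. This is a simplified illustration, not the implementation used in this work: the class names, the `execute` function, and the `max_steps` guard are hypothetical. The “try” actions are run one after another instead of concurrently, the goto targets the fallback node by name so that its “try” runs again, and a failure outside a fallback is assumed to stop automatic execution.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class Action:
    name: str
    run: Callable[[], bool]  # returns True on success, False on failure


@dataclass
class Goto:
    name: str
    target: str  # name of the node to jump to, stored as a plain string


@dataclass
class Fallback:
    name: str
    try_actions: List[Action]
    catch: List[Union[Action, Goto]] = field(default_factory=list)


Node = Union[Action, Goto, Fallback]


def execute(sequence: List[Node], max_steps: int = 100) -> List[str]:
    """Run a flat sequence, letting fallback and goto nodes set the next index."""
    index_of: Dict[str, int] = {node.name: i for i, node in enumerate(sequence)}
    log: List[str] = []
    index = 0
    for _ in range(max_steps):  # hypothetical guard against endless retry loops
        if index >= len(sequence):
            break
        node = sequence[index]
        next_index = index + 1
        if isinstance(node, Action):
            ok = node.run()
            log.append(f"{node.name}: {'ok' if ok else 'failed'}")
            if not ok:
                break  # assumption: a failure outside a fallback stops execution
        elif isinstance(node, Goto):
            next_index = index_of[node.target]
        else:
            # On the robot the "try" actions may be concurrent; here they all run in order.
            results = [action.run() for action in node.try_actions]
            for action, ok in zip(node.try_actions, results):
                log.append(f"{action.name}: {'ok' if ok else 'failed'}")
            if not all(results):
                for child in node.catch:
                    if isinstance(child, Goto):
                        next_index = index_of[child.target]
                        break
                    child_ok = child.run()
                    log.append(f"{child.name}: {'ok' if child_ok else 'failed'}")
        index = next_index
    return log


# Usage: a push that fails twice (e.g. the door did not unlatch) and then succeeds.
attempts = {"count": 0}


def push_door_open() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


behavior: List[Node] = [
    Action("Approach door", lambda: True),
    Fallback(
        "Open door",
        try_actions=[Action("Push door open", push_door_open)],
        catch=[Action("Retract hand", lambda: True), Goto("Retry", target="Open door")],
    ),
    Action("Walk through", lambda: True),
]
log = execute(behavior)
print("\n".join(log))
```

In this sketch the sequence stays flat, and the only control-flow state is the index of the next node, which mirrors the “everything is a single sequence” idea.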
However, it would be better to perform perceptual checks than to attempt and fail physical actions. For example, when pulling a door open, a slipped grasp would not cause the hand to fail to reach its goal pose. Instead, it would be better to detect that the door did not open with the hand by observing the point cloud.
It took longer to implement a condition node that had a meaningful impact on behavior capability. In February 2025, a simple counter condition node was implemented to execute loops with iteration limits. This was ultimately not very useful.
In March 2025, we implemented an LLM condition node. It was based on the Llama 3.2 3B Instruct model with 8-bit quantization. As shown in 2, we replicated the counter condition node type with the LLM type. After experimentation, we found that the LLM’s response format varied. To create a simple binary pass/fail as required by the condition node, we structured the node as three parts: the system prompt, a repeating prompt, and a response matcher. The system prompt would be input to the model once and the repeating prompt would be input to the model each time the condition node is executed. The response matcher is implemented with a regular expression that either matches or does not match the output from the LLM. The behavior author is provided with an option for whether a match means a success or a failure of the condition node.
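The response matcher stage can be illustrated in isolation. The sketch below does not call any model; it only shows how free-form LLM output could be reduced to a pass/fail result with a regular expression and an author-selected polarity. The class name, the example patterns, and the case-insensitive matching are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class ResponseMatcher:
    pattern: str                      # regular expression chosen by the behavior author
    match_means_success: bool = True  # author option: does a match mean success or failure?

    def evaluate(self, llm_response: str) -> bool:
        """Reduce a free-form LLM response to the binary result a condition node needs."""
        matched = re.search(self.pattern, llm_response, flags=re.IGNORECASE) is not None
        return matched if self.match_means_success else not matched


# Usage with hand-written stand-ins for model output.
continue_loop = ResponseMatcher(r"\byes\b")
print(continue_loop.evaluate("Yes, the limit has not been reached."))  # True
print(continue_loop.evaluate("No."))                                   # False

no_failure_reported = ResponseMatcher(r"\bfail", match_means_success=False)
print(no_failure_reported.evaluate("The grasp failed."))               # False
```

Keeping the matcher separate from the prompts means a varying response format only requires adjusting the pattern, not the condition node logic.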
This worked okay, but such a small LLM had trouble generating meaningful results. This development did not have any meaningful impact on the capabilities of loco-manipulation behaviors. Our focus quickly drifted to approaches that queried paid online VLMs via the AI2R node.
For a seven-month period from March to October 2025, we tried to get Vision Language Action Models (VLAs) working for task sub-components like the door opening, but ultimately failed and gave up. Our conclusion was that at this time, VLAs were not an “off the shelf” solution and still required expertise and an “artistic” talent to get working well. However, it may have also been that we had a bug somewhere.
In October of 2025, we did some restructuring of the internals of the behavior system, making the root node reference final and accessible directly to all nodes. This allowed more direct access to behavior-wide components. For example, for all executor nodes, the controller ROS 2 node, the behavior scene, and the root node were all made final and accessible directly without needing to navigate boilerplate. This made the code a little nicer to deal with but didn’t have any meaningful effects on operator usability or robot performance.
In late October of 2025, we finally acted on the March 3rd, 2025 design meeting, shown in 1, and had another meeting shown in 3. At this time we introduced a behavior scene that is owned by the behavior system alongside the tree, as opposed to relying on an external scene graph system. The October 21, 2025 meeting is where we solidified the concept of a scene action node. We pulled the instant and persistent detection components from the scene graph and left everything else. We referenced the detection manager that was part of the scene graph and created a condensed and revised version.
Instant detections are immutable objects created from each run of the YOLO and FoundationPose models. They contain all the information about the detection such as the semantic object class name and the timestamp. Instant detections are extended by the type of the originating detector. For example, the YOLO instant detection contains the YOLO specific information, like the region-of-interest coordinates, the confidence value, and the segmentation image mask. The FoundationPose instant detection contains the pose of the object and the 3D bounding box corners.
The YOLO instant detections also pull in and store the corresponding depth points from the ZED X Mini depth image. This feature required impressive technical wizardry from Tomasz Bialek and Dexton Anderson to maintain performance and avoid memory leaks. However, this capability of associating 3D depth data with semantic detection masks is essential to making perceptive behaviors work. The association gives a high frequency 3D position of the centroid of the detected object, which is usable to act on constrained objects like door handles and symmetrical objects like tennis balls.
Behavior Scene
Persistent detections are simply a collection of instant detections over time. An early demo is shown in 4. The new behavior scene implementation carried over a similar logical structure to the detection manager that was in the scene graph. This algorithm sorts incoming instant detections into existing persistent detections or new ones based on location. If an incoming instant detection is the same semantic class as an existing persistent detection and is close by, it is sorted into that one. If not, a new persistent detection is created. This system implements persistent object tracking which can be used to track objects for use in our frame-relative behaviors/affordance templates.
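The sorting rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the behavior scene code: the class and function names are hypothetical, the 0.5 m association distance is an arbitrary placeholder, and the position of a persistent detection is taken to be that of its most recent instant detection.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)  # instant detections are immutable
class InstantDetection:
    class_name: str
    position: Point
    timestamp: float


@dataclass
class PersistentDetection:
    class_name: str
    history: List[InstantDetection] = field(default_factory=list)

    @property
    def position(self) -> Point:
        # Assumption: the latest instant detection defines the current position.
        return self.history[-1].position


def sort_into_persistent(
    instant: InstantDetection,
    persistent: List[PersistentDetection],
    max_distance: float = 0.5,  # hypothetical association threshold in meters
) -> PersistentDetection:
    """Add the instant detection to a nearby same-class persistent detection, or create one."""
    candidates = [
        p for p in persistent
        if p.class_name == instant.class_name
        and math.dist(p.position, instant.position) <= max_distance
    ]
    if candidates:
        match = min(candidates, key=lambda p: math.dist(p.position, instant.position))
    else:
        match = PersistentDetection(instant.class_name)
        persistent.append(match)
    match.history.append(instant)
    return match


# Usage: two nearby lever detections merge, a distant one starts a new track.
tracks: List[PersistentDetection] = []
sort_into_persistent(InstantDetection("door_lever", (1.00, 0.0, 1.0), 0.0), tracks)
sort_into_persistent(InstantDetection("door_lever", (1.05, 0.0, 1.0), 0.1), tracks)
sort_into_persistent(InstantDetection("door_lever", (3.00, 0.0, 1.0), 0.2), tracks)
print(len(tracks), [len(t.history) for t in tracks])  # 2 [2, 1]
```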
Another innovation in November of 2025 was the introduction of the scene action. Now that we had an active list of persistent detections, we needed a mechanism for the behaviors to select one and use it. We had a list of issues we didn’t know how to fix from the prior scene graph implementation. One was that persistent detections would get distorted once they got occluded, such as what happens when grasping an object. The hand starts to cover the object and the centroid of the segmentation points moves away from the hand, corrupting the prior position estimate. Another issue was in object selection. When there are multiple instances of the same class of object, which one do you choose?
Both of these issues require careful coordination with the behavior in order to solve and are context dependent. So, we decided that maybe it was best if the responsibility to handle these was just handed off to the expert operator. With that thought in mind, we decided to make the tools that made it possible for the behavior author to handle these issues. This led to the idea of an authorable scene action, which would have distinct types to do different scene related things.
To solve the object selection issue, we provide a “setup object” scene action type which, when executed, chooses the closest persistent detection of the defined semantic class and puts that persistent detection in a privileged area called the list of “objects”. This privileged list of objects is where attachments to the behavior actions are made. Any object’s pose in this list is immediately available to all behavior actions as a reference frame that can be used to define actions relative to. This implementation limits the list to contain one instance of a particular semantic class at a time so behavior actions can be defined relative to the class of object – not the specific instance of one. Another property of the “objects” list is that the persistent detection does not automatically expire. If instant detections stop coming in for it, it merely becomes stale, indicated as greyed out in the UI, but the frame is still available for the behavior actions. This is actually an important and intentional feature, as this can be used to “dead reckon” actions with respect to something you perceived before but can no longer see. For example, for door behaviors, the traversal walk through is actually defined with respect to the stale opening mechanism reference frame. We planned to have selection heuristics for the “setup object” scene action type, but haven’t yet ended up with a burning need for anything other than closest.
To solve the hand-object occlusion issue, we gave the operator a “freeze object” scene action type. This allowed a “freeze” to be performed on a scene object just before grasping it. For example, a scene object can be set up, the hand moved towards it, but not occluding it, while still tracking the object, and a freeze object action can be executed just before finally occluding the object. The freeze operation has an effect equivalent to the object becoming “stale” as described earlier. The object’s frame, similarly, can still be used for physical action definitions.
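The two scene action types can be summarized in a short sketch. The names below are hypothetical and the sketch makes two assumptions that the text does not state explicitly: “closest” is measured from a robot position passed in by the caller, and a frozen object simply ignores later position updates.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class TrackedObject:
    class_name: str
    position: Point
    frozen: bool = False


class BehaviorSceneSketch:
    def __init__(self) -> None:
        self.persistent: List[TrackedObject] = []
        # Privileged list of objects: at most one instance per semantic class.
        self.objects: Dict[str, TrackedObject] = {}

    def setup_object(self, class_name: str, robot_position: Point) -> Optional[TrackedObject]:
        """Select the closest persistent detection of a class for use by behavior actions."""
        candidates = [d for d in self.persistent if d.class_name == class_name]
        if not candidates:
            return None
        closest = min(candidates, key=lambda d: math.dist(d.position, robot_position))
        self.objects[class_name] = closest  # replaces any earlier instance of this class
        return closest

    def freeze_object(self, class_name: str) -> None:
        """Stop tracking updates so that hand occlusion cannot corrupt the frame."""
        self.objects[class_name].frozen = True

    def update_position(self, obj: TrackedObject, new_position: Point) -> None:
        if not obj.frozen:
            obj.position = new_position


# Usage: pick the nearer of two mustard bottles, then freeze it just before grasping.
scene = BehaviorSceneSketch()
near = TrackedObject("mustard", (0.6, 0.1, 0.9))
far = TrackedObject("mustard", (2.0, 0.5, 0.9))
scene.persistent += [far, near]
selected = scene.setup_object("mustard", robot_position=(0.0, 0.0, 0.9))
scene.freeze_object("mustard")
scene.update_position(selected, (0.7, 0.2, 0.9))  # ignored: the centroid drifted under the hand
print(selected.position)  # (0.6, 0.1, 0.9)
```

Because actions reference the object by class name, the same behavior can be reused on a different instance without being re-authored.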
Ability Hand and Reliability
A demonstration on November 10th, 2025, showed the utility of the new behavior scene and scene action node. This was the first time we picked up an object from a table in the fully automatic mode, as shown in 5. It was also the first manipulation behavior we executed on the Unitree H1-2 robot we got that year. There were still lots of little bugs and issues in this demo, but nothing super theoretical. For one, our whole body controller was not great at walking with the H1-2 so the approach stance was not accurate, resulting in a low quality IK solution for reaching the mustard due to not approaching the table closely enough. A low quality IK solution with our system has always meant a high error bar on the resulting hand pose. Another of the biggest issues was that our finger control was unreliable. We ended up fixing that on the hand controller side in the coming months.
On November 19, 2025, we generalized an initial implementation of the proximity condition type, as seen in 6. In the bottom left of the screenshot, the options can be seen. The proximity condition type compares the positions of two reference frames. In the case shown, frame A is selected as “afterPelvis” (the pelvis frame) and frame B is selected as the mustard object, which we’ve instantiated and posed virtually in simulation. The distance type field selects the comparison criterion: XYZ, which compares the pure Euclidean distance, XY, which compares the distance as projected onto the XY plane, and Z, which simply compares the height of the frames. A min and max distance are specified to define the success criteria. When the condition executes, it continues executing until the criteria are met (success) or the timeout is reached (failure). The current distance is also presented in the operator interface to help with authoring. Observing the current values can help the operator reason about which min and max values are best.
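A minimal sketch of the proximity check is given below. The function names, the polling period, and the use of callables to read frame positions are assumptions made for illustration, and the Z type is assumed to mean the absolute height difference between the two frames.

```python
import math
import time
from typing import Callable, Tuple

Point = Tuple[float, float, float]


def frame_distance(a: Point, b: Point, distance_type: str) -> float:
    """Distance between two frame positions for the XYZ, XY, or Z distance type."""
    dx, dy, dz = a[0] - b[0], a[1] - b[1], a[2] - b[2]
    if distance_type == "XYZ":
        return math.sqrt(dx * dx + dy * dy + dz * dz)
    if distance_type == "XY":
        return math.hypot(dx, dy)
    if distance_type == "Z":
        return abs(dz)
    raise ValueError(f"Unknown distance type: {distance_type}")


def proximity_condition(
    get_frame_a: Callable[[], Point],
    get_frame_b: Callable[[], Point],
    distance_type: str,
    min_distance: float,
    max_distance: float,
    timeout_s: float,
    poll_s: float = 0.02,  # hypothetical polling period
) -> bool:
    """Keep checking until the distance is within [min, max] (success) or the timeout (failure)."""
    deadline = time.monotonic() + timeout_s
    while True:
        distance = frame_distance(get_frame_a(), get_frame_b(), distance_type)
        if min_distance <= distance <= max_distance:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Usage with fixed positions standing in for the pelvis and mustard frames.
pelvis = (0.0, 0.0, 1.0)
mustard = (0.6, 0.8, 0.9)
print(frame_distance(pelvis, mustard, "XY"))  # horizontal distance, about 1.0 m
print(proximity_condition(lambda: pelvis, lambda: mustard, "XY", 0.5, 1.2, timeout_s=0.1))
```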
Later in November, we added a visualization element for the “execute after” field for action nodes, as shown in 7. Prior to this feature, to understand the concurrency of sequences you would select each action in the list and check the “execute after” field. You could also click the little green arrows and see if the ones after it were also green, indicating that they would execute together. However, we wanted to reduce uncertainty and increase understandability of the behavior at a glance. We came up with the arrow design, seen in the figure, which starts at the defining action and extends upwards, pointing to the action specified by the “execute after” setting.
On December 1, 2025, we conducted our first behavior shakeout on IHMC’s new fully electric humanoid robot, Alex. We would soon transition behavior development fully to our new robot. The behavior shakeout consists of a sequence of open-loop walking and arm motions to test the system’s basic mechanics.
Later in December, we implemented behavior data logging in a way that is synchronized and viewable alongside whole body controller data. In this system, the behavior system process sends a periodic ROS 2 message containing a registry of behavior YoVariables, the buffered data types supported in SCS. A subscriber is set up in the whole body controller process to subscribe to these variables and inject them into the whole body controller’s registry, which then gets logged in the normal way. A screenshot of a loaded log in SCS is displayed in 8, showing the behavior variables embedded in and synchronized with the whole body controller log.
On December 11, 2025, we integrated FoundationPose into the behavior scene. We used this to create an improved version of the mustard pickup task, as shown in 9. This added the capability to grasp asymmetric objects as FoundationPose detects an object’s orientation and bounding box in addition to the position. The red box in the figure shows the FoundationPose bounding box. In the bottom right, the FoundationPose scene object for the mustard can be seen. At this point we had two types of scene objects: FoundationPose and YOLOv8. Our scene was now designed to handle different types of objects abstractly and be extended easily to support more types.
This demo was still somewhat unreliable due to the low level control code of our new 5-finger Psyonic Ability Hand grippers. The hands would often not respond to behavior commands and, worse, retrying the command did not work. By December 15, after trying at least five different control strategies, we fixed this issue and finger control was very smooth and reliable. The on-board Ability Hand controller wants an alpha-filtered position setpoint input that is generated from an ideal velocity-limited trajectory, as presented in [alg:ability_hand_pos_vel_limit]. This helps the hand reliably move at the desired velocity to the desired setpoint and hold the position without jittering.
\(f_c \gets 1.0\)
\(\tau \gets \frac{1}{2 \pi f_c}\)
\(\alpha \gets \frac{\Delta t}{\tau + \Delta t}\)
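A sketch of how such a setpoint could be generated for a single finger joint is shown below. Only the filter coefficient follows the assignments above. The function names, the 100 Hz update rate, the velocity limit, and the interpretation of \(f_c\) as a cutoff frequency in Hz are assumptions, and the sketch is not the hand controller code.

```python
import math
from typing import Tuple


def alpha_from_cutoff(cutoff_hz: float, dt: float) -> float:
    """First-order low-pass coefficient: tau = 1 / (2 pi f_c), alpha = dt / (tau + dt)."""
    tau = 1.0 / (2.0 * math.pi * cutoff_hz)
    return dt / (tau + dt)


def step_setpoint(
    ideal: float, filtered: float, goal: float, max_velocity: float, dt: float, alpha: float
) -> Tuple[float, float]:
    """Advance the ideal velocity-limited trajectory one tick, then alpha-filter it."""
    max_step = max_velocity * dt
    ideal += max(-max_step, min(max_step, goal - ideal))  # move toward the goal, velocity limited
    filtered += alpha * (ideal - filtered)                # setpoint that would be sent to the hand
    return ideal, filtered


# Usage: drive one finger joint from 0.0 to 1.0 rad (all values are illustrative).
dt = 0.01                             # assumed 100 Hz update rate
alpha = alpha_from_cutoff(1.0, dt)    # f_c = 1.0, assumed to be in Hz
ideal, filtered = 0.0, 0.0
for _ in range(2000):
    ideal, filtered = step_setpoint(ideal, filtered, goal=1.0, max_velocity=2.0, dt=dt, alpha=alpha)
print(f"alpha={alpha:.4f}, ideal={ideal:.3f}, filtered={filtered:.3f}")
```

The velocity limit bounds how fast the target moves, while the filter smooths the corners of the trajectory at its start and end.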
Later in December, we added hand and finger previewing in the behavior editor, as shown in 10. This feature was designed to help the operator understand where the fingers will be during authoring. When a hand pose action is selected, the finger previews show up at the end of that arm pose. Since the pose of the arm and the configuration of the fingers are specified in separate actions and can be interwoven with other actions, this feature looks backwards in the sequence to find the most recent Ability Hand action. This action is used for the finger preview. If there is no prior Ability Hand action, the finger preview will not be shown.
This feature could probably be improved in a few ways. First, we could also show the preview when an Ability Hand action is selected, utilizing prior arm actions for the arm and hand locations. Secondly, if an arm action is selected and there are no prior Ability Hand actions, we could use the current configuration of the hand for the preview.
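The backward search used by the current preview can be sketched in a few lines. The node representation (dictionaries with a `type` field) and the function name are hypothetical.

```python
from typing import Dict, List, Optional

Node = Dict[str, str]


def find_finger_preview_action(sequence: List[Node], selected_index: int) -> Optional[Node]:
    """Look backwards from the selected action for the most recent Ability Hand action."""
    for node in reversed(sequence[:selected_index]):
        if node["type"] == "ability_hand":
            return node
    return None  # no prior hand action: no finger preview is shown


# Usage: the selected arm pose at index 3 is previewed with the "Open hand" configuration.
sequence = [
    {"type": "ability_hand", "name": "Close hand"},
    {"type": "ability_hand", "name": "Open hand"},
    {"type": "walk", "name": "Approach table"},
    {"type": "hand_pose", "name": "Pre-grasp"},
]
preview = find_finger_preview_action(sequence, selected_index=3)
print(preview["name"] if preview else "no preview")  # Open hand
```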
On December 23, 2025, using the pinch grasp shown in 11, we demonstrated a pull knob door handle opening task automatically 23 times in a row, as shown in 12. On the 24th attempt, the knob was not turned enough, and the hand slipped off on the pull. This was our first return to doors in over a year. With higher-confidence, more robust YOLO models for the door mechanisms, in combination with the scene actions and the 5-finger Ability Hand gripper, door handle manipulation reliability increased drastically.
On January 2, 2026, a similar demonstration with a door lever handle achieved 32 pull door openings in a row, as shown in 13. This time, the behavior was not run until failure. It was just run until we got bored. Similarly to the pull knob behavior, we think the Ability Hand being grippy and in an anthropomorphic 5-finger configuration contributed to the reliability along with the scene action and the maturation and stabilization of our behavior system as a whole.
However, this demo was not issue-free. There was some kind of bug in the hand action unrelated to hand control that was causing the action to fail. In this case, simply ignoring the failure would work to get the behavior to succeed, because this finger curling motion only slightly increased reliability, but was not necessary. We were able to use a fallback node to keep the behavior going regardless of failure, as shown in 14, marking our first use of the fallback node in a real robot behavior. The use of the fallback node helped in this case not to overcome an environmental failure, but to overcome a bug in the code, which was not a use case we expected. We learned that the fallback node was also useful for working around bugs in the system, reducing the chance that we would need to stop the robot experiment and go fix code bugs.
On January 8, 2026, the scene action was extended to fully support our full library of FoundationPose and YOLO models as shown in 15. The operator was now able to select any semantic object type from either model architecture via drop down menus. We also extended the scene object library to include heuristic objects, such as the door panel, shown here.
On January 20, 2026, we implemented that door panel heuristic scene object, as shown in 16. It uses two YOLO persistent detections, one for the handle and one for the panel, and draws a line between the centroids to identify the orientation of the panel. This is possible with the assumption that door panels swing on a vertical hinge.
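The geometric idea can be sketched as follows. With a vertical hinge, the panel is a vertical plane, so the horizontal direction of the line between the two centroids gives the heading of the panel. The function name and the z-up world frame are assumptions, and the returned angle is the heading of the in-plane direction, not of the panel normal.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def door_panel_yaw(handle_centroid: Point, panel_centroid: Point) -> float:
    """Yaw (rad) of the horizontal direction from the handle centroid to the panel centroid.

    Assumes a z-up world frame and a door panel that swings on a vertical hinge,
    so only the x and y components of the line matter.
    """
    dx = panel_centroid[0] - handle_centroid[0]
    dy = panel_centroid[1] - handle_centroid[1]
    return math.atan2(dy, dx)


# Usage with made-up centroids: the panel extends along +y from the handle.
yaw = door_panel_yaw(handle_centroid=(1.0, 0.0, 1.0), panel_centroid=(1.0, 0.4, 1.1))
print(math.degrees(yaw))  # 90.0
```

The height difference between the centroids is ignored, which is what the vertical hinge assumption permits.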
2026 Alex, Resilience, Adaptability Era
Alex Door Opening
On the same day, in a 33 minute authoring session, we demonstrated a stand-in-place door opening behavior on the new IHMC Alex robot, as seen in 1. This was the first autonomous manipulation behavior to run on the Alex platform. The process for getting this to work was essentially the same as the prior door opening behaviors on the Unitree H1-2, with the exception that we needed to point the head down, which wasn’t supported as a behavior action yet. Alex features a neck with pitch and yaw and the default neck pitch does not provide sufficient visibility of the manipulation zone in front of the robot. To perform manipulation, we need to pitch the head down by 30 degrees.
As presented in 1, the authoring of the first pull door behavior on Alex was a five day process spanning 11 hours and 10 minutes of authoring time.
On January 23, 2026, three days after the squared-up stance opening behavior, we achieved the approach and opening portion of the pull door behavior, as shown in 2. This took a while to get working because Alex’s arms are shorter than Nadia’s were and the spine range of motion was more limited, meaning the same strategy did not work. Nadia’s spine yaw range of motion was greater than Alex’s +/- 30 degrees. Also on Nadia, we took a double support stance for opening that was further from the door, using the spine yaw and the longer arms. This allowed Nadia’s pull door opening motions to be simpler and faster, due to the robot being farther from the door and having more space to work with. Part of Alex’s pull door behavior involved “sneaking” the left arm in to hold and pull the door open, which was complicated not only by the limited space but also by trying to reduce the risk of damage to the Ability Hands. These items contributed to Alex’s door behaviors being significantly slower than Nadia’s. However, we also think that, given some more time, we could speed Alex’s behaviors up significantly.
The next day, as shown in 3, we authored the sequence for walking through the door and ran the whole behavior. This one was not fully automatic, but it was the first successful door traversal on Alex. Some difficulty was encountered in the preparation steps for the final traversal steps. While taking these preparation steps, Alex had to hold the door open with the left arm, as this door had a spring closer.
When we first tried, we had falls for at least two reasons. One was that the whole body controller was not well tuned on Alex for walking with the arms out, as was required for holding the door open. Another occurred when the robot took a step toward the door while holding it open with the holding arm kept still. As the robot put its weight on the stance foot, the upper body and arm would shift toward the robot and the swing foot would get caught on the bottom of the door, causing a fall. The solution was to put the arm farther out before or during the traversal preparation steps. This solution can be seen in the tree in 4 as the “Push door way open for foot clearance” and “Retry push door way open” nodes.
Figure: SHAPE_CONTAINS type selected. The sphere radius, min points, and current number of points contained are displayed. In the center, the 3D view shows the sphere, intersecting the door, with a red tint that indicates a high number of contained points. A video is available at https://youtu.be/tbTrKuGGmqk.
The “shape contains” condition was implemented on January 27, 2026 and is shown in 5. This new behavior condition returns success if either a reference frame or some minimum number of points from the point cloud lies within a 3D shape. Initially, we supported only spheres, which can be sized and placed with respect to any behavior frame, just like the taskspace actions.
Shape Contains and Freeze
For example, the shape contains condition can be used to check whether the whole body controller actually achieved a commanded hand goal pose. Without a fallback node, this condition can be used to stop automatic execution by failing in a sequence. When combined with a fallback node, it can be used in the “try” to branch the sequence of execution. We use it in this way for the door opening retry mechanism, by placing the virtual sphere where the door panel should be with respect to the robot’s chest after opening. If there are no or few points in the sphere at that point, the fallback catch executes a goto node, returning the behavior to the door handle pre-grasp. Otherwise, the behavior skips the fallback catch and continues opening and eventually traversing the door.
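The sphere check itself is simple, as the following sketch shows. The function names are hypothetical, the point cloud is a plain list of coordinates, and the sphere center is assumed to be already expressed in the same frame as the points.

```python
import math
from typing import Iterable, Sequence

Point = Sequence[float]


def count_points_in_sphere(center: Point, radius: float, points: Iterable[Point]) -> int:
    """Number of point cloud points that lie inside the sphere."""
    return sum(1 for p in points if math.dist(center, p) <= radius)


def shape_contains_points(center: Point, radius: float, points: Iterable[Point], min_points: int) -> bool:
    """Succeed if at least min_points points lie inside the sphere."""
    return count_points_in_sphere(center, radius, points) >= min_points


def shape_contains_frame(center: Point, radius: float, frame_position: Point) -> bool:
    """Succeed if a reference frame origin (e.g. the hand) lies inside the sphere."""
    return math.dist(center, frame_position) <= radius


# Usage: a sphere placed where the opened door panel is expected (values are made up).
sphere_center = (0.6, -0.4, 1.0)
door_opened_cloud = [(0.6 + 0.01 * i, -0.4, 1.0) for i in range(10)]
door_closed_cloud = [(1.5, 0.0, 1.0 + 0.01 * i) for i in range(10)]
print(shape_contains_points(sphere_center, 0.15, door_opened_cloud, min_points=5))  # True
print(shape_contains_points(sphere_center, 0.15, door_closed_cloud, min_points=5))  # False
```

In the retry mechanism described above, a failed check of this kind in the “try” is what would trigger the fallback catch and its goto.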
Figure: FREEZE_OBJECT scene action type. On the left, a scene action can be seen in the tree named "Freeze door lever handle". In the bottom left, the scene action settings area is shown with the FREEZE_OBJECT type selected. In the center, the 3D scene has text overlaid reading "Freezing object: door_lever". On the right, in the Scene panel, the door_lever object indicates "FROZEN", meaning its pose will no longer be updated by active tracking. A video is available at https://youtu.be/dTM4Rw_912Q.
On January 29, 2026, we developed the planned freeze object scene action type, as shown in 6. As mentioned previously, the freeze action helps with manipulation when the hand partially occludes an object that is being actively tracked. By freezing the object frame just before occlusion by the hand, we prevent a corrupted object frame during the grasp. Another use case for the freeze action is to dead reckon from the frame of an object we saw in the past, as we do for door handles and the door traversal footsteps.
On February 12, 2026, we had our first fully automatic pull door traversal on Alex, that is, there were no gaps in autonomous execution from start to finish. The first run traversed the door in about 45 seconds. In about 10 minutes of speed-focused behavior tuning, we were able to bring that time down to around 30 seconds. This was done by reducing action trajectory durations, reducing wait times, and increasing the concurrency of robot motions.
As an anecdotal statement on the robustness of our Alex pull door behavior, we ran the pull door behavior from these runs again on February 17, without modification, and it worked on the first try. We ran it again on February 18, and it had only one slight arm tolerance issue, which didn’t cause a fall, only a brief gap in autonomy, before completing the traversal successfully. This was a good indication that our behaviors worked independently of slight variations in environmental conditions, such as the exact placement of the lab door and the natural lighting coming through the windows, which varies based on time of day.
On February 22, 2026, we authored our first push door traversal behavior on Alex in under 2 hours. A screenshot from the session is shown in 7 and the timeline of authoring is documented in 2. The push door traversal executed in 21 seconds.
Around this time, we started developing the Quick Footstep Planner to more reliably get footsteps for task approaches on flat ground. This footstep planner uses a procedural geometric heuristic instead of a search algorithm. This helped us by providing an alternative to the existing A* and turn-walk-turn planners. Since the author is able to choose the planning type for each walk action, alternative planning options increase the chance of finding a workable solution.
On February 28, 2026, we introduced a duplication option for all behavior tree nodes. This allows the operator to right-click any node and select “Duplicate” to create a copy. This is useful to speed up authoring, as the author can mutate an existing node if a similar one is needed. Copying and mutating an existing node is often faster than creating a new one from scratch, especially for nearby nodes in a similar context.
On March 3, 2026, we introduced some new icons, shown in 8. At this point in time, the most common actions had an icon. Icons are useful when working with the behavior tree, because they help you locate things much more easily and also give at-a-glance verification of node type. This takes the burden off the behavior author to include the action type in the name, which also helps reduce the amount of text in the view.
In March, we made a modification that allows the “execute after” field of actions and the “node to goto” field of the goto node to point to non-leaf nodes. This helps avoid needing to create checkpoint nodes just for the purpose of creating a concurrent sequence. It also preps the goto node to eventually become a “gosub”, where it can point to a sequence with the guarantee that control will be returned when that sequence is complete.
On March 3, 2026, we added a little theta indicator to arm actions when they are defined as joint angles rather than a taskspace pose. This was important because we would often tune the arm configurations using taskspace and forget to switch them back to jointspace. For arm configurations that are frame invariant, such as door frame avoidance or table surface avoidance configurations, the joint angle definition is important. However, after tuning these configurations, if the arm action definition was accidentally left in taskspace in world frame, when executed later, the arm would go crazy trying to reach whatever point in world we were at during authoring. We made the default frame of arm actions chest frame to mitigate this issue, but the visual indicator helps the operator to verify this important action option at a glance. Since jointspace actions are generally safer to execute, it also gives the operator some confidence in executing the action.
On the same day, we cleaned up the default frame names available to behavior actions. They were now more human readable like “Left Hand” instead of “afterGripperZLeft” and “Pelvis” instead of “afterPelvisLink”. This could help new operators climb the learning curve faster and help reduce cognitive burden for expert operators.
On March 9, 2026, we conducted a reactivity test of the pull door behavior with Alex. In addition to the fallback node for retrying the door opening, we added one that waits for the doorway to be clear of obstacles before walking through. The test was successful. The robot was disturbed while opening the door three times and retried each time, succeeding on the fourth try. Then, the robot waited for the human to move out of the doorway before starting the walk-through. Unfortunately, the robot fell during the walk-through due to loss of balance, but the reactivity test was a success.
Earlier in March, we implemented a behavior “Preview Mode”, which allows the operator to execute the full behavior against a kinematics simulation robot instead of the real one.
On March 12, 2026, we developed a new element to the scene action “setup object” type node called the “nominal frame”. This nominal frame field specifies a pre-defined pose of the object for use in preview mode. At this point in time we do not have a photo-realistic or physics-accurate simulation of the robot, but we wanted to preview the robot’s motions when doing a full loco-manipulation behavior. When the behavior is running in preview mode,
On March 20, we introduced a navigational mode for the Quick Footstep Planner which uses RRT-Connect to avoid YOLO detected obstacles. It plans from the start to the goal but avoids any YOLO objects by maintaining a radius of avoidance. We didn’t end up using it in an actual behavior yet. We think it might be better to use a capsule point check with the point cloud to make it more general. However, RRT-Connect may be too slow when the collision checks have to query the point cloud. It may be a good application of an occupancy map.
We had talked over the years about how to do a video-editor-like horizontal bars implementation where the behaviors can be viewed by concurrency, start time, and end time. On March 22, we took a stab at an initial behavior timeline implementation as seen in 10. One immediate issue with rendering such a view is that action timings are not always known ahead of time. They are often dependent on real-time events, such as the scene node obtaining a stable detection. When the behavior is run in preview mode or regular mode, the actual action durations are stored in the bars, to provide a retrospective on the behavior’s actions over time.
In trying to build out our general door traversal behavior, we needed mirrored versions of our right pull and left push door behaviors. This became a priority on March 25, 2026 because we were about to attempt our first real-world door traversal – our break room door. Our break room door was a left pull door. Having added the duplicate node feature back in February, we decided to implement a “Mirror” operation as a similarly general feature. Some of our actions were able to be mirrored between left and right invariantly: jointspace arm actions, neck yaw actions, and spine yaw actions.
Other actions, such as the door-relative approach footsteps, required additional information to mirror. To address our immediate needs, we simply hard-coded a door-specific “Mirror (door)” option for five of our action types, in addition to the invariant ones. This door-specific mirroring option applies to condition nodes, scene action nodes (for the nominal object poses), arm action nodes, screw primitive action nodes, and walk action nodes.
We also added options to mirror subtrees (“Mirror Subtree” and “Mirror Subtree (door)”). These recursively apply the per-node mirroring just described to each node in the subtree and leave any node type that is not covered unmirrored. This worked well for creating a mirrored version of the pull door behavior, which we then took to the break room door to try for the first time. We were able to preview the mirrored behavior in simulation to verify the general motions before deploying on the real robot.
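The recursive mirroring described above can be sketched as follows. This is a minimal illustration and not the system's implementation: the `Node` class, the node kind names, and the handler table are hypothetical, and we assume that invariant mirroring amounts to swapping the side of a jointspace arm action and negating the yaw of a neck or spine action.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    kind: str                    # hypothetical kind name, e.g. "arm_jointspace"
    side: Optional[str] = None   # "LEFT", "RIGHT", or None for unsided nodes
    yaw: float = 0.0             # only meaningful for yaw actions
    children: List["Node"] = field(default_factory=list)


def swap_side(node: Node) -> Node:
    """Return a copy of the node with LEFT and RIGHT exchanged."""
    flipped = {"LEFT": "RIGHT", "RIGHT": "LEFT"}.get(node.side, node.side)
    return replace(node, side=flipped)


def negate_yaw(node: Node) -> Node:
    """Return a copy of the node with its yaw reflected left to right."""
    return replace(node, yaw=-node.yaw)


# Assumed handler table for the actions that mirror without extra information.
INVARIANT_HANDLERS: Dict[str, Callable[[Node], Node]] = {
    "arm_jointspace": swap_side,
    "neck_yaw": negate_yaw,
    "spine_yaw": negate_yaw,
}


def mirror_subtree(node: Node, handlers: Dict[str, Callable[[Node], Node]]) -> Node:
    """Mirror a subtree recursively; node kinds without a handler are copied as-is."""
    handler = handlers.get(node.kind)
    mirrored = handler(node) if handler else replace(node)
    mirrored.children = [mirror_subtree(child, handlers) for child in node.children]
    return mirrored
```

A door-specific variant would pass a larger handler table that also covers the condition, scene, screw primitive, and walk nodes. Returning copies rather than mutating in place keeps the original subtree available alongside its mirrored version.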
On March 26, we attempted to traverse the break room door, which would be the first time an IHMC robot traversed a real-world door as opposed to a lab-constructed one. After less than two hours of tweaking and attempting the behavior, we got a full door traversal to execute successfully in about 33 seconds. A frame from this run is shown in Figure 10.
In late March 2026, we decided to focus on demonstrating the versatility of our approach by manipulating objects on tables. Since FoundationPose tracking was not working well at this time, we picked balls as the object to manipulate, because the grasp of a ball does not depend on the ball’s orientation. Since none of our in-house trained YOLO models covered an orientation-invariant graspable object, we had to find a generally available model to use. We picked the default “yolov8n-seg” model that comes out of the box with YOLOv8. It has a “sports ball” object class that just barely worked for our use case, detecting our colored tennis balls and baseballs with widely varying confidence levels. At best, our balls were detected with confidence in the 70-80% range, but the confidence was very often so low that we could only rely on it being above 2% or so.
As seen in Figure 11, on March 31, 2026 we had our first ball pick and place behavior running. We used our “sphere contains” condition check to only pick up balls that were in a reachable region on the table, to avoid catastrophic unreachable grasp attempts. This version would often miss the grasps and had some unnatural looking arm configurations and slow trajectories. We also had some issues with the reliability of the YOLO persistent detections. Another problem was that we could not detect the storage containers and the balls at the same time because they were supported by two different YOLO models.
By April 3, 2026, just three days later, we had worked through many of these issues and achieved a much more resilient behavior. First, we corrected the unnatural arm configurations and sped up the motions, so the behavior was faster. Second, we added the ability to run more than one YOLO model at a time, round-robin style, so we could perceive the balls and the container at the same time. Third, we added management of YOLO model and persistent detection settings to the scene action node. This meant that the behavior author could decide which YOLO models were running, which object classes were enabled within them, and the confidence thresholds for specific object classes, for both the YOLO model and the persistent detections. As a result, a behavior could configure the sports ball class to allow persistent detections of very low confidence, such as the 2% setting we used, to get much higher reliability.
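To make the round-robin scheduling and per-class thresholds concrete, here is a small sketch. The settings layout, the `containers` model name, its `container` class, and the 50% threshold are hypothetical; only the “yolov8n-seg” model, its “sports ball” class, and the 2% threshold come from the text above.

```python
from itertools import cycle, islice

# Hypothetical scene action settings: model name -> {enabled class: min confidence}.
SCENE_SETTINGS = {
    "yolov8n-seg": {"sports ball": 0.02},
    "containers": {"container": 0.50},  # assumed name for the in-house model
}


def model_schedule(settings):
    """Yield model names forever, one per frame, in round-robin order."""
    return cycle(settings.keys())


def schedule_preview(settings, frames):
    """Return the model that would run on each of the first `frames` frames."""
    return list(islice(model_schedule(settings), frames))


def filter_detections(model_name, detections, settings):
    """Keep detections whose class is enabled and meets its class threshold."""
    thresholds = settings[model_name]
    return [
        d for d in detections
        if d["label"] in thresholds and d["confidence"] >= thresholds[d["label"]]
    ]
```

With two models, each one sees every other frame, which halves the detection rate per model in exchange for perceiving both object sets in the same behavior.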
In order to demonstrate online decision-making and reactivity, we then sought to sort the balls by color into separate containers. We also planned, for our annual robotics lab open house where we share our work with the public, to trigger specific behaviors based on ball color. For example, picking up a green ball would cause the robot to go and deliver that green ball to a box on a table beyond a door. On April 4, 2026, as seen in Figure 12, we achieved a demonstration in which the robot successfully sorts five balls by color into two different containers, while successfully handling a disturbance where a ball was removed from the table just before the grasp. In this demo, the behavior was authored to pick up the balls and inspect them while still in hand for grasp success and color. If there were no points within the hand, we assumed the grasp failed and returned to pre-grasp. If there were points in the hand, we checked whether they were yellow; if so, the ball was placed in container A, and otherwise it was placed in container B.
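The in-hand inspection logic amounts to a two-level decision, sketched below. The function name, the step labels, and the `min_points` parameter are hypothetical; in the real behavior this branching is authored with condition nodes rather than written as code.

```python
def next_step_after_grasp(points_in_hand, is_yellow, min_points=1):
    """Pick the next step from the point count and color check inside the hand."""
    if points_in_hand < min_points:
        # No points in the hand: assume the grasp failed and try again.
        return "return_to_pre_grasp"
    if is_yellow:
        return "place_in_container_A"
    return "place_in_container_B"
```

Checking grasp success before color means a ball removed just before the grasp is handled by the same branch as a missed grasp.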
Around this time we were also exploring options to increase the reliability of our door traversal walk-throughs, as the robot would often lose balance during them. Since we had been working with RL mimic policies for boxing and martial arts at the time, we decided to try to incorporate the door walk-throughs as a robustified RL mimic policy trained on the whole-body motion preview available in the behavior system. We trained push and pull door policies and integrated a behavior node called the “Mimic action” which would transition from our model-based controller to the mimic controller, perform the mimicked motion, and transition back to the model-based controller. We haven’t yet run the mimic policy for the door traversals at the time of writing, but we did train and demonstrate several dance moves as behaviors triggered by picking up different colors of balls, such as an “I love you!” dance, a “Dab” dance, and a stretching dance.
Around this time we ran into a theoretical limitation with approaching tasks from a distance. Given that perceptual capability degrades with distance to a task and we first just need to navigate and approach it, we need a reference frame to define the approach in. Since the pose estimation of the object has the object’s orientation, it does not work as a frame to specify the approach in. Nominally, we would want the robot to approach the task by taking the shortest path from the robot’s current location to the task. Since we don’t want to run into the task, we need to “back off” the approach stance from the object toward the robot. To do this, on April 4, 2026, we decided to add a type of behavior scene object called a composite frame. This composite frame would exist as a privileged object in the behavior scene such that it would be usable in the same way to define actions.
There are currently two types of composite frames: an approach frame and a hybrid frame. Both are named frames computed from two pre-existing behavior frames. This feature is shown in Figure 13. The approach frame’s orientation is defined to face from frame A to frame B and its position is defined to be on the line segment from frame B to frame A at some tunable distance from frame B. This makes the approach frame suitable for walking directly towards an object, but stopping before getting too close.
The hybrid frame is similar. It takes the position of frame A and the orientation of frame B. The hybrid frame is useful for approaching ajar door panels, where you want to approach with the orientation of the door frame but the position of the door opening mechanism.
Composite frames also can be layered. For example, the ajar door hybrid frame could be used in a subsequent approach frame to approach an ajar door’s handle from a distance. This composite frame mechanism is designed to be extended and generalized further based on encountered real world applications.
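The two composite frame definitions, and their layering, can be written out in the ground plane as a short sketch. The `Pose2D` type and function names are hypothetical, poses are reduced to (x, y, yaw) for brevity, and clamping the back-off distance so the result stays on the segment is our assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose2D:
    x: float
    y: float
    yaw: float  # radians, about the world Z axis


def approach_frame(a: Pose2D, b: Pose2D, distance: float) -> Pose2D:
    """Face from A toward B, positioned `distance` back from B toward A."""
    dx, dy = b.x - a.x, b.y - a.y
    length = math.hypot(dx, dy)
    if length == 0.0:
        raise ValueError("frames A and B coincide, so no facing direction exists")
    ux, uy = dx / length, dy / length
    d = min(distance, length)  # assumption: stay on the segment from B to A
    return Pose2D(b.x - d * ux, b.y - d * uy, math.atan2(dy, dx))


def hybrid_frame(a: Pose2D, b: Pose2D) -> Pose2D:
    """Take the position of A and the orientation of B."""
    return Pose2D(a.x, a.y, b.yaw)
```

Layering then composes naturally: `approach_frame(robot, hybrid_frame(handle, door_frame), 1.0)` backs off one meter from an ajar door's handle along the line toward the robot.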
On April 7, 2026, we fixed a major race condition between the scene actions and the behavior scene. We had been putting wait durations after scene actions because, in automatic mode, the bug caused subsequent physical actions to use outdated scene object frames. The behavior tree and scene are managed on the same thread, but it took multiple update ticks for the scene object’s pose to reflect the updated persistent detection’s pose. We fixed this through safer and more thorough scene management and synchronization. We mention this bug fix because it marks a leap in trust of the system, prevents user frustration, and avoids robot damage from actions reaching toward stale, unreachable object poses.
In the same week we developed a novel algorithm for humanoid robot table approach. Using our point-in-shape counting CUDA kernel, we designed a special heuristic scene object called the “Approach Table” object. This is similar in spirit to the heuristic door panel object discussed previously, but differs in that it does not use any semantic detections. Instead, we sweep two vertical capsules forward from the robot’s hips with the intention of colliding with the table’s edge, as shown in Figure 14. A tunable threshold for the number of points considered to constitute the table edge is defined in the settings of this type of scene node. A value of 300-400 points typically worked well. The capsules start around knee height and end just below chest height with the intention of handling tables of various heights. When a capsule collides with the table, it stops its forward sweep. The two capsules sweep independently. The result is that a line segment in the X-Y plane is identified as the table edge. This line segment and the current stance height are used to form a reference frame on the ground with the orientation of the table edge. This reference frame can then be used by a subsequent walk action to perform a squared-up approach to the table edge.
Anecdotally, we found this technique to be very reliable and able to approach our tables within a few centimeters of accuracy. This was also an important capability milestone for our behavior system. The ability to approach tables with this degree of accuracy is a necessary part of obtaining the reachability of the items on the table and avoiding failures caused by running into the table.
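The capsule sweep can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the CUDA kernel itself: the capsule is approximated as a vertical cylinder, the sweep runs along the robot's +X axis with +Y to the left, the step size and range are made-up defaults, and the convention that the resulting frame's X axis points into the table is ours.

```python
import math


def count_points_in_capsule(points, cx, cy, radius, z_min, z_max):
    """Count points inside a vertical capsule, approximated here as a cylinder."""
    r2 = radius * radius
    return sum(
        1 for (px, py, pz) in points
        if z_min <= pz <= z_max and (px - cx) ** 2 + (py - cy) ** 2 <= r2
    )


def sweep_capsule(points, start_x, y, radius, z_min, z_max,
                  threshold=300, step=0.01, max_range=2.0):
    """Sweep forward along +X and return the X where the point count first
    reaches the threshold, or None if nothing is hit within max_range."""
    for i in range(int(round(max_range / step)) + 1):
        x = start_x + i * step
        if count_points_in_capsule(points, x, y, radius, z_min, z_max) >= threshold:
            return x
    return None


def table_edge_frame(left_hit, right_hit, ground_z):
    """Build (x, y, z, yaw) on the ground from the two capsule stop points.

    The frame sits at the midpoint of the detected edge segment, and its yaw
    is perpendicular to the edge so that a robot standing in it is squared up.
    """
    (lx, ly), (rx, ry) = left_hit, right_hit
    edge_yaw = math.atan2(ly - ry, lx - rx)  # direction from right hit to left hit
    facing_yaw = edge_yaw - math.pi / 2.0    # rotate to point into the table
    return ((lx + rx) / 2.0, (ly + ry) / 2.0, ground_z, facing_yaw)
```

Because the two capsules stop independently, a table that is rotated relative to the robot yields different stop distances on the left and right, and that difference is what produces a non-zero yaw in the resulting frame.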
Open House and Multi-Station Sorting
On April 9 and 10, 2026, we designed and rehearsed a demo for our open house where the robot would have a table station where balls of different colors would be fed to the robot through a tube system. The robot would be tasked with picking up the balls, determining their color, and executing a specific behavior for each color, in an infinite loop. Yellow balls would simply be placed in a chute to deliver the ball back to the visitor’s side. Green balls would trigger a delivery behavior where the robot would traverse a door and deliver the green ball to a box on a table beyond that door, as presented in Figure 15. The rest of the colors would trigger different RL mimic dances. For example, red balls would trigger the “I love you!” RL mimic dance.
We did some partially successful rehearsals of this demo, but on April 10, 2026, just before the demo, the robot’s legs fell off its body in a dramatic fall. We were able to repair the robot, but avoided doing any more walking that day. During the demo, we picked and placed 199 balls at the table station.
Our last day of testing before the time of writing this thesis yielded some important results. On April 14, 2026, we extended our reactive and robust ball sorting behavior to a multi-station ball sorting behavior. The intention behind this demonstration was to show a compelling loco-manipulation task beyond door traversals. The robot and behavior author were tasked with sorting colored balls between two containers on two different tables, requiring walking between the tables. A still frame from this behavior is shown in Figure 16. In 1 hour and 50 minutes, we were able to extend the stationary sorting behavior to multi-station sorting. In the final demonstration run, the robot sorted 9 balls correctly, with three table approaches including two table-to-table transitions. However, the robot did not perceive the containers and there was one pause where the human experimenter had to shift one of the balls to get YOLO to detect it.
On that same night, we conducted two reliability tests for door approach and opening, one on the push door variant and one on the pull door variant. We achieved 11 consecutive approaches and openings of the push door and 12 consecutive approaches and openings of the pull door. This final demonstration in our story marked a reliability milestone in loco-manipulation.
Reflections on a Decade of Development
Over the course of nearly a decade, we think we made dramatic progress in robot autonomy and capability using our behavior system. In the DARPA Robotics Challenge Finals in 2015, the most successful teams had expert roboticists crowded around operator interfaces, meticulously managing every detail of an hour-long 8-task run. In 2026, we are able to somewhat casually author and run neglect-tolerant loco-manipulation behaviors with just a few expert roboticists. In addition, any video you will see of the DARPA Robotics Challenge is sped up by 5, 10, or 30 times realtime speed. Our automatic loco-manipulation behaviors are watchable and interesting when played at 1x speed.
In the 2016-2021 era with Atlas, behaviors were hard-coded, developed as features in themselves, and only used perception for locomotion. In 2021, we introduced Robot Data eXplorer (RDX), a novel development tool suite for data visualization, algorithm development, and robot operation. In 2022, we invented the runtime-editable behavior sequence for humanoid robots, which allowed for an explosion in the size of our behavior library, since behaviors could be authored in an operator interface instead of being hand-engineered in code. In 2024, with Nadia, we made the behaviors fast and started using perception for manipulation, executing some of the fastest door traversals ever done by humanoid robots and weaning ourselves off ArUco markers to direct environmental perception. In 2025, with H1-2 and the Psyonic Ability Hand, we upped our game with 5-finger anthropomorphic manipulation. Finally, in late 2025 and early 2026, with Alex, we made perceptual reactivity runtime-editable and demonstrated the ability to author novel loco-manipulation behaviors in hours.
A few items in the timeline are worth noting. In March 2025, we started exploring vision-language-action models. By October, we hadn’t gotten them working, so we decided to abandon the effort. What’s interesting is the impressive pace of development that followed. The October 2025 - April 2026 period yielded some of the most significant advancements in the entire system, such as the points-in-shape counting capability. Also in that time period, manipulation capability skyrocketed, from the mustard picking with H1-2, to the 199-ball sorting open house demo, to the 9/9 multi-station colored ball sorting demo.
The pace of development is currently very strong. We are extremely proud of this work, and the very large number of team members who contributed to this system over the last decade did an incredible job. At the same time, we are excited for the future. We want to see this system extended further and deployed on our new tetherless version of Alex for real-world use cases, not just tech demos.
In the next chapter, we’ll provide a guide on the behavior authoring process in 2026.
Usage Guide
Overview
The behavior system presented in this thesis can be used to author fast, resilient, and adaptable loco-manipulation behaviors for humanoid robots. Examples include door traversals and sorting objects on tables. This is in the pursuit of automating useful work and providing robotic drop-in replacements for humans for dull, dirty, or dangerous jobs. In this chapter, we’ll cover the system’s limitations, prerequisites, and how to use it.
Limitations
Perception
We’ve made a lot of progress and performed some compelling demonstrations, but there is still a way to go. In 2026, we’ve had an uptick in the expressiveness of our behaviors demonstrated on Alex through authorable perceptive reactivity and an expanded set of YOLO models. However, we still do not have reliable object pose perception, which is why we have done so many demos with colored balls. We have integrated FoundationPose for this, but it still needs a bit of work. It is still possible to manipulate asymmetric objects if assumptions can be made about their initial state.
Platform Bringup
Another main limitation is that, to get this working on a new robot platform, there is significant bringup work. Computer software and hardware engineering is required to bring up support for different sensors, hands, or controllers that your robot may have.
Another major limitation of the current implementation is the lack of navigational components. It is possible to create room searching behaviors by turning in place and walking through doors or walking to objects if you see them, but there is no functionality in the scene for storing semantic topological maps or high-fidelity detailed map models.
Behavior Tree Limits
There are some smaller limitations and user frustrations as well. One of the bigger ones is the lack of “GOSUB” functionality, where a sequence can be put on the stack and execution returns to the caller when it completes. All we have at the moment are GOTOs. Another is that when a nested JSON-backed behavior is included in multiple places in the tree, modifying one copy does not modify the others, and they will overwrite each other when saved. We would like to fix these soon.
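To illustrate the difference, the following sketch runs a flat, indexed list of instructions with both GOTO and GOSUB. It is entirely hypothetical and does not reflect how the behavior tree executor is implemented; the opcode names and the example program are made up.

```python
def run(program, max_steps=1000):
    """Execute (opcode, argument) pairs and return the names of actions run."""
    trace, return_stack, index = [], [], 0
    for _ in range(max_steps):
        if index >= len(program):
            break
        op, arg = program[index]
        if op == "ACTION":
            trace.append(arg)
            index += 1
        elif op == "GOTO":
            index = arg                     # jump with no memory of the caller
        elif op == "GOSUB":
            return_stack.append(index + 1)  # remember where to resume
            index = arg
        elif op == "RETURN":
            if not return_stack:
                break
            index = return_stack.pop()
        elif op == "END":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
    return trace


PROGRAM = [
    ("GOSUB", 4),                   # 0: run the shared pick routine
    ("ACTION", "walk to table B"),  # 1
    ("GOSUB", 4),                   # 2: reuse the same routine
    ("END", None),                  # 3
    ("ACTION", "scan table"),       # 4: start of the shared routine
    ("ACTION", "grasp ball"),       # 5
    ("RETURN", None),               # 6
]
```

With only GOTO, the shared routine at index 4 would need a fixed jump target at its end, so it could not be reused from two call sites without duplicating it.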
Prerequisites
Robot Platform and Controller
We have run our system with several humanoid robots over the years, including the Boston Dynamics DARPA Robotics Challenge Atlas, NASA’s Valkyrie, IHMC and Boardwalk Robotics’ Nadia, Unitree’s H1-2, and IHMC’s Alex. Our system is centered on the humanoid form factor with footstep, walk, arm, leg, spine, and neck actions. It is also centered on the presence of a whole body controller that can achieve those actions. Ideally, the whole body controller can triage and queue asynchronous commands of those actions, to support concurrently tracked motions of different parts of the body that may start and end at different times. For communication, we have been using a ROS 2 compatible DDS implementation, but other communication protocols could be substituted with some engineering work. The controller could also be operated through a programmatic API, avoiding some of the interprocess communication.
To get the best performance, the robot’s controller should have push-tolerant standing and walking. It is even better if the robot can walk with the arms held up in front of the body. A controller with these capabilities would be sufficient for basic multi-station loco-manipulation tasks and door traversals.
Motion and Reachability
We think longer arms generally work better than shorter ones. A human-proportioned robot arm yields significantly less reachability than a human arm since humanoid robots typically don’t have clavicle joints which extend a human’s arm reach. We tend to think that robot arms that are long enough such that the hands touch the knees when they are hanging are a good balance. DRC Atlas and Nadia had these longer-arm proportions, but Alex didn’t and we felt that it limited reachability and increased task complexity for the pull door behaviors and when reaching across tables.
Similarly, a high degree of freedom spine can help simplify behaviors. For example, being able to yaw 90 degrees to the left and right can allow for faster scanning behaviors, increase arm reachability, and avoid unnecessary footsteps in tight spaces.
Perception and Teleoperation
On the perception side, the most significant requirements are a color image, a high-accuracy depth image, and the availability of high quality YOLO models for any objects you want to interact with. There are many sensors available on the market that satisfy this requirement. We have tried the ZED X Mini and the RealSense D457. We think both of those would work.
Our preference, and the one we currently use, is the ZED X Mini because it has a stereo RGB camera pair which is useful for direct VR teleoperation. To author complicated behaviors, it helps to have an idea ahead of time of how the robot will accomplish the task. With humanoid robots, this is helped by our natural intuition since the robot shares our form factor, but how the robot’s capabilities differ from our own can be unintuitive. This is why teleoperating a task can be a good way to quickly experiment with task strategies that work within the bounds of the robot’s capabilities, and why we chose the ZED X Mini. The ZED X Mini also has a human-like inter-pupillary distance, which makes for a natural robot embodiment by a VR teleoperator.
The other major thing you will need is a library of YOLO models for everything you need to interact with. We have been using YOLOv8 because it does detections and segmentation for objects of interest. Argyha Chatterjee crafted the models we used, but it took him weeks and weeks of hand labeling and a year or so of experimentation to get them to be high confidence. It might be good to find a vendor that provides professionally trained models. Typically the data needed is around five videos of the task, each on a different day and time, light and dark, where the videos orbit around the object and get close and far from it.
Hands
For robot hands, we recommend a 5-finger anthropomorphic hand. We had a great experience with the Psyonic Ability Hand. One major advantage of a 5-finger anthropomorphic hand is that, similar to the rest of the robot’s body, it mirrors the human form, which allows for an increased level of intuition in task planning. For example, a good start for planning how to get the robot to do a task is to just do it yourself and look at how you naturally do it.
Furthermore, this extends to teleoperation, where you can constrain the set of possible solutions to what is achievable by the robot while still leaning on natural intuition. We use the Valve Index Controller’s API to teleoperate the Ability Hands and it works great. In fact, the OpenVR software maps the Valve Index Controller’s finger estimation to six degrees of freedom which happens to match perfectly to the six degrees of freedom on the Ability Hand.
Compute
You’ll also need NVIDIA GPUs if you want to use the ZED camera or CUDA kernels. The ZED SDK requires NVIDIA GPUs to use its most essential features such as its neural-assisted stereo depth estimation. Additionally, we use CUDA heavily throughout the system, including for counting points and averaging color inside virtual shapes in the 3D point cloud, faster OpenCV functions, processing the output tensor of YOLO, non-maximum suppression of YOLO bounding boxes, and extraction of 3D points from the YOLO segmentation masks.
We use a Jetson AGX Orin on-board the robot and a desktop computer with an NVIDIA Turing GPU for the operator computer. We run Linux on both except when doing VR teleoperation, when we use Windows. We have had success with VR on Linux in simulation tests, but not on the real robot. For VR headsets we’ve been using the Valve Index and the HTC Vive Focus 3.
In the next several sections we will walk through some token behavior authoring examples while explaining how system components work as we encounter them. This is a guide to how the system works, but it’s meant for reading and understanding more than it’s meant to be literally followed as a tutorial. We’ll first use simulation examples to illustrate and explain some basic mechanisms without noise. Then, we’ll go through a real-robot authoring session: our 32 minute authoring session to get Unitree H1-2 opening a door repeatedly.
Basic Example: Move the Arms and Walk
Simulation Setup
To start, we’ll be working in a simulation environment that we call the “behavior test facilitator”. It allows an operator to exercise system functionality without a vision- and physics-accurate simulator and without addressing any sim-to-real gaps. We use a kinematics-only simulation of the robot that plays back the nominal motions without feedback from a physical environment. In effect, this means that for free motions the actual joint values are simply set to the desired ones, and a special heuristic is used to hold the feet in place when ground contact is expected.
To open the program, we will run the Java program AlexRDXBehaviorTestFacilitator.
The initial view is shown in Figure 1.
The behavior tree panel is on the left and the 3D view is on the right.
The behavior tree panel is where an interactive model of the currently loaded behavior tree is rendered.
The 3D view renders the current, live robot state as an opaque, realistically colored robot model alongside virtual interactable elements that model planned actions and virtual scene elements.
3D Viewport Camera
The 3D view camera is controlled using the scheme developed and used in the DARPA Robotics Challenge era user interface. We use this “focus based” camera view control algorithm because it simplifies getting the camera where you need it so it can be done quickly. Since robots typically exist in 2D spaces under gravity, the most common translation camera movements are in the X-Y plane. Additionally, our tasks most often focus on controlling a robot or inspecting specific things positioned in the environment. For this reason, our “focus based” camera is based on a movable focus point in 3D space, meant to be located at the thing you are inspecting or monitoring. The camera always faces this focal point and in the user interface it is represented as a small red sphere which is resized dynamically in 3D to be a constant small size in screen space, so it doesn’t get huge or too small to see. The keyboard controls, W, A, S, and D, translate the camera on the current X-Y plane that the focus point resides on. The Q and Z keys move the focal point up and down along the world frame Z axis. The camera orbits the focal point using a longitude and latitude. The longitude freely loops 360 degrees around the focal point about its world Z axis. The latitude, however, is bounded by +/- 90 degrees above and below the focal sphere’s “equator”. That is, the limits are top down and bottom up, which prevents the view from going upside down on the other side. The full set of 3D view camera controls is presented in Table 1.
| Focus based camera | Key |
|---|---|
| Drag to orient camera | Left mouse |
| Drag to pan camera | Middle mouse |
| Fine adjustment | Shift |
| Move camera back | S |
| Move camera down | Z |
| Move camera forward | W |
| Move camera left | A |
| Move camera right | D |
| Move camera up | Q |
| Zoom camera in | C |
| Zoom camera in / out | Mouse scroll |
| Zoom camera out | E |
Focus based camera keyboard shortcuts
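The orbit geometry of the focus based camera can be summarized in a few lines. This is a sketch of the longitude and latitude parameterization described above, with hypothetical function names; it assumes a world frame with Z up and is not taken from the user interface code.

```python
import math


def clamp_latitude(latitude_deg):
    """Bound latitude to +/- 90 degrees so the view never flips upside down."""
    return max(-90.0, min(90.0, latitude_deg))


def wrap_longitude(longitude_deg):
    """Let longitude loop freely around the focal point."""
    return longitude_deg % 360.0


def camera_position(focus, distance, longitude_deg, latitude_deg):
    """Place the camera on a sphere around the focal point, looking at it."""
    fx, fy, fz = focus
    lat = math.radians(clamp_latitude(latitude_deg))
    lon = math.radians(wrap_longitude(longitude_deg))
    x = fx + distance * math.cos(lat) * math.cos(lon)
    y = fy + distance * math.cos(lat) * math.sin(lon)
    z = fz + distance * math.sin(lat)
    return (x, y, z)
```

In this parameterization, translating the focal point moves the whole camera with it, and zooming changes only `distance`.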
Loading the Tree
On startup, no behavior tree is actively loaded, so the “Load existing tree from file menu” is shown. There are two options from here: starting from scratch with a new root node or loading an existing behavior from file. In this example, we’ll start from scratch and click “Root Node”.
Behavior Operation
At this point, the view has transitioned into the primary working area for a loaded tree, as shown in Figure 2. At the top of the behavior tree panel, Conflict-Free Replicated Data Type (CRDT) statistics are shown. These help keep the operator informed about whether data is synchronizing correctly between the user interface and the robot-side autonomy process. Next to the file and view menus, the node counts for the operator side and robot side are shown. These two numbers should be the same. The frequency rendered on that line should be approximately 30 Hz, which is the desired rate of synchronization. On the next line, a count of updates for the UI side (local) and robot side are shown. These should be monotonically increasing but don’t need to be the same. There is also an out of order message counter, which should ideally be 0. Incrementing a little is likely okay, but going up quickly is indicative of a problem.
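The checks an operator performs by eye on these statistics can be written down as a small sketch. The `SyncStats` fields, the tolerance values, and the warning strings are hypothetical; only the 30 Hz nominal rate and the qualitative rules come from the description above.

```python
from dataclasses import dataclass


@dataclass
class SyncStats:
    operator_node_count: int
    robot_node_count: int
    frequency_hz: float
    local_updates: int
    robot_updates: int
    out_of_order: int


def sync_warnings(previous, current, nominal_hz=30.0, hz_tolerance=5.0,
                  max_new_out_of_order=2):
    """Compare two consecutive samples and list anything that looks unhealthy."""
    warnings = []
    if current.operator_node_count != current.robot_node_count:
        warnings.append("node counts differ")
    if abs(current.frequency_hz - nominal_hz) > hz_tolerance:
        warnings.append("sync frequency off nominal")
    if (current.local_updates < previous.local_updates
            or current.robot_updates < previous.robot_updates):
        warnings.append("update count decreased")
    if current.out_of_order - previous.out_of_order > max_new_out_of_order:
        warnings.append("out of order messages rising quickly")
    return warnings
```

Comparing against the previous sample rather than an absolute value reflects the rule that a small number of out of order messages is tolerable while a fast rise is not.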
The next lines regard execution operation. The left and right pointing arrows decrement and increment the next node index to execute, which is printed here as “Index: 000” for our new tree. On that line is a checkbox that toggles automatic execution and a button that manually executes the next concurrent action set. This checkbox and button will cause the robot to immediately begin executing the behavior and possibly move! In other words, these are the “go” buttons. Automatic execution is started by checking the box and stopped by unchecking the box, so keep your mouse near it. Automatic execution will also stop if any action fails without being handled by a fallback catch, or if the end of the tree is reached.
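The stopping rules for automatic execution can be sketched as a loop over an indexed list of actions. This is a simplification under stated assumptions: the fallback is modeled as a set of indices that are simply retried on failure, concurrency is ignored, and the function and parameter names are hypothetical.

```python
def run_automatic(actions, retry_on_failure, max_steps=100):
    """Run (name, callable) actions in order and report why execution stopped.

    An action's callable returns True on success. A failed action whose index
    is in `retry_on_failure` is attempted again; any other failure stops the run.
    """
    log, index = [], 0
    for _ in range(max_steps):
        if index >= len(actions):
            return log, "end of tree"
        name, action = actions[index]
        succeeded = action()
        log.append((name, succeeded))
        if succeeded:
            index += 1
        elif index not in retry_on_failure:
            return log, f"unhandled failure: {name}"
    return log, "step limit"
```

As a usage example, an “open door” action that is disturbed three times and succeeds on the fourth attempt would appear in the log as three failures followed by a success, after which execution continues to the next action.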
On the next line, leaf node failures can be reset with the “Reset failures” button. It is not normally necessary to do this, but can be nice if you want the blinking red to stop. Also on this line, concurrency can be enabled and disabled with a checkbox. When concurrency is disabled, every action is run sequentially. This is sometimes useful during authoring if it is desirable to run only one action that is part of a concurrent sequence. The third widget on this line toggles the preview mode on and off, which retargets the executed behavior to a kinematics-only simulation robot. When preview mode is enabled, a transparent preview robot will appear in the scene and execute the behavior in place of the real robot. The preview mode is useful to verify compound motions in themselves and with respect to the scene.
The area that currently displays “Nothing executing” will display live information about the currently executing action or actions. The final element in the view is the root node, with a representative icon with three circles connected by two lines.
Building the First Behavior
Right clicking the root node prompts a context menu which offers the option to add our first child node, as shown in Figure 3. In this first example, we’ll build a small, simple behavior that moves the arms and walks.
Figure 4 shows the node creation menu, which allows the operator to create a new node of any available type or load an existing tree from file. The available node types are sorted into control nodes and action nodes. For this behavior, we will select the action sequence type, which serves as an organization element and a mechanism by which we save the behavior to file.
When the action sequence is created we double click the default name “Action sequence”, type in “Demo Behavior.json”, and hit the Enter key. This is shown in Figure 5. Adding “.json” to the end of a node name is what makes it saveable to file. Saving is done by right clicking the node and selecting “Save to File”, pressing “Ctrl + S” while the mouse is hovering in the behavior tree panel, or using the file menu at the top of the panel. An asterisk symbol (*) is shown next to the node name when changes are present that have not been saved. When the tree is saved, the asterisk symbol should disappear. The behavior can safely be saved at any time and we do it often to avoid losing work.
Arm Actions
We then add an arm action by right clicking the sequence node and clicking “Add Child Node…” as before, then click “Right” on the Arm row which instantiates a new arm action node with side set to right. For sided actions, we currently don’t allow changing the side after creation, however, that isn’t an entirely purposeful design choice. Now that the arm action has been created, the node can be seen in the tree, beneath our sequence node and indented to the right, signifying that it is a child of the sequence node.
When we single-click select the arm action node (anywhere except on the sideways arrow icon), the node’s line is highlighted, as shown in 6. In the lower portion of the behavior tree panel, there is a node settings area. This area can be closed using the “X” in the top right of its area, but will open whenever a node is selected, as we did above. This setting area renders the settings of whichever node is selected, but only for one node at a time.
To make things go faster, we will reduce the arm action’s trajectory duration to 1 second by double clicking the current value, typing 1, and hitting the Enter key. There are two main ways to define this arm action: by adjusting the hand goal pose with a 3D pose gizmo and inverse kinematics solver, or by using sliders to define the arm’s joint angles directly. This is decided using the “Use Predefined Joint Angles” checkbox.
3D Pose Gizmo
As shown in 7, to adjust the arm action via hand pose, check the “Adjust Goal Pose” box. A 3D pose gizmo appears in the 3D view. The gizmo’s axes are colored as Red, Green, and Blue, to match X, Y, and Z, and Roll, Pitch, and Yaw. A way to remember it is “RGB -> XYZ”. The tori can be dragged with the mouse to adjust the orientation. The arrow heads and tails can be dragged with the mouse to adjust the translation. The gizmo can also be adjusted via the keyboard. A key for gizmo keyboard controls is presented in 2. Right clicking the gizmo will display a context menu that allows for numerical pose adjustment, fine and coarse increments, resetting to zero, and changing the modification frame. The gizmo is, by default, modified in camera Z up frame so it translates laterally on the world X-Y plane and vertically on the world Z axis, similar to the focus based camera.
| Action | Key |
|---|---|
| Fine adjustment modifier | Shift |
| Manipulate axes | Left mouse drag |
| Open context menu | Right mouse click |
| Pitch adjustment + | Alt + Up arrow |
| Pitch adjustment - | Alt + Down arrow |
| Roll adjustment + | Alt + Right arrow |
| Roll adjustment - | Alt + Left arrow |
| Translation adjustment X+ | Up arrow |
| Translation adjustment X- | Down arrow |
| Translation adjustment Y+ | Left arrow |
| Translation adjustment Y- | Right arrow |
| Translation adjustment Z+ | Ctrl + Up arrow |
| Translation adjustment Z- | Ctrl + Down arrow |
| Yaw adjustment + | Ctrl + Left arrow |
| Yaw adjustment + | Ctrl + Mouse scroll down |
| Yaw adjustment - | Ctrl + Mouse scroll up |
| Yaw adjustment - | Ctrl + Right arrow |
Pose 3D gizmo keyboard shortcuts
Frame-Relative Action
When an arm action is defined by a pose, it is specified in a selectable parent frame. By default, this frame is the chest frame. Changing this frame to an object class such as “Door Lever” allows for a pose that will be relative to that object, wherever that object is in the scene. The “Hybrid” and “Jointspace Only” options and the weights can be ignored here; they are specific to how our model based whole body controller tracks the arm command. The important point is that any whole body controller specific options you may have can be included in this arm action’s definition. It is meant to be extensible and flexible rather than a rigid specification. The inverse kinematics solution quality is displayed; values from 0 to 1 are good and values above 1 are bad. A bad solution means the pose is unreachable and will not be achieved consistently with our implementation. The transparent arm graphic turns red to signify this. A “Set Pose to Synced Hand” option is available to reset the pose to where the hand currently is on the real robot.
As seen in the bottom left of 7, position and orientation error tolerance settings are available. If the hand’s pose is not within the error tolerance of the goal pose by the end of the trajectory duration, the action will fail. This setting can be adjusted based on the required precision of the motion. This functionality is very context and controller dependent. For example, a controller may continuously try to achieve the goal pose or it may stop trying once the trajectory duration is over. As another example, when applying a force on something is desired, a position setpoint may be placed beyond an immovable object, resulting in a desired pose error. In these cases the error tolerances may not make sense or may need to be large. It may also be that we should add a timeout for achieving the motion in addition to the nominal trajectory duration.
Jointspace Mode
The other way to define arm actions is by specifying the joint angles directly. To do this, check the “Use Predefined Joint Angles” box. This will dynamically change the available settings in the panel, now showing sliders for all the joints. As shown in 8, these sliders can be used to set the joint angles. As the sliders are dragged, the full arm 3D preview is updated to provide an interactive experience. The sliders are bounded by the joint limits from the robot model. A joint angle can also be input as a number manually in the input box next to the slider. This mode has a “Set Configuration to Synced Arm” button to reset the values to match the real robot’s current configuration. It also has a position error tolerance setting which is the maximum allowable sum of joint angle errors throughout the arm. In our example behavior, we roll the arm out from the body and pull the forearm up.
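The jointspace tolerance described above can be sketched as a small check. This is a minimal illustration rather than the system's implementation: the function name is hypothetical, and it assumes the "sum of joint angle errors" means the sum of absolute per-joint errors, with the goal, measurement, and tolerance all in the same angular unit.

```python
from typing import Sequence


def within_jointspace_tolerance(goal_angles: Sequence[float],
                                actual_angles: Sequence[float],
                                tolerance: float) -> bool:
    """Return True when the summed absolute joint error is within tolerance.

    Assumption: errors are accumulated as absolute values over every arm joint.
    """
    if len(goal_angles) != len(actual_angles):
        raise ValueError("goal and actual must have the same number of joints")
    total_error = sum(abs(g - a) for g, a in zip(goal_angles, actual_angles))
    return total_error <= tolerance


# Hypothetical four-joint example (values are illustrative only).
goal = [0.70, -0.40, 0.40, -1.90]
measured = [0.72, -0.41, 0.45, -1.85]
print(within_jointspace_tolerance(goal, measured, tolerance=0.3))
print(within_jointspace_tolerance(goal, measured, tolerance=0.1))
```

Because the errors are summed, a tolerance that suits a short chain may need to be larger for an arm with more joints.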
Arm Action Execution
To execute this arm action, we first ensure the hollow arrow icon next to the action is green, meaning it is selected as the next action to execute. We then click the (Execute) “Manually” button at the top of the panel. The simulated robot performs a 1 second trajectory to the goal configuration.
We would like to extend our arm action to support N-length trajectories, as we do with our screw primitive covered later. One use case would be to record teleoperated motions, store them in a file (for example, as CSV), and execute them with the arm action. Another option would be to allow the behavior author to add and remove waypoints tuned using gizmos. Using multi-waypoint trajectories would allow the arm to keep moving through poses instead of stopping at each one, as is currently enforced.
Mirroring an Arm Action
Next, we will mirror this action for the left arm using the “Mirror Node” context menu entry on the arm node, as shown in 9. A second arm action appears, already in joint angle mode and mirroring the other arm action. We manually execute this action.
Walk Action
As the final action in this first example behavior, we will walk forward a little bit. We right-click the second arm action, click “Insert Node After…”, and select the walk action. The walk action now appears in the tree. We click it to access its settings, which are shown in 10. For this example, we’ll keep the frame set to “Walking” frame, which is a frame on the ground underneath and facing the direction of the pelvis.
The walk action supports setting controller specific settings such as the walking speed via the foot swing and double support transition durations. An “execution mode” setting specifies whether the robot should finish any steps it may have queued versus overriding those and taking the first step of this walk action after the current step is completed. This setting is also controller specific.
The walk action is currently implemented to dispatch a list of footsteps as 3D sole poses to a controller for execution. The steps can either be specified manually, by adding and tuning them with gizmos, or planned automatically. A footstep plan from a planner can be converted to a manually defined plan and kept in the action definition; otherwise, planning happens on action execution.
Walk Goal Specification
To define the planning goal, we use a mid-stance point and a focal point. The robot walks to the stance location and ends facing the point. We project the goal Z to the robot’s current walking frame Z, which lies on the ground between the feet. This format of goal specification works well to keep the robot from stepping in the air when on flat ground. It also makes the robot’s goal facing orientation easier to tune by placing the focal point farther from the stance point. This separation acts as a “lever arm” for goal orientation precision.
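The stance point and focal point goal can be illustrated with a short sketch. It is a simplified planar version under stated assumptions: the function names are hypothetical, the goal yaw is taken as the direction from the stance point to the focal point, and the goal Z is simply replaced by the walking frame Z.

```python
import math
from typing import Tuple


def goal_pose_from_points(stance_xy: Tuple[float, float],
                          focal_xy: Tuple[float, float],
                          walking_frame_z: float = 0.0) -> Tuple[float, float, float, float]:
    """Return (x, y, z, yaw): stand at the stance point, face the focal point."""
    dx = focal_xy[0] - stance_xy[0]
    dy = focal_xy[1] - stance_xy[1]
    if math.hypot(dx, dy) < 1e-9:
        raise ValueError("focal point must differ from the stance point")
    yaw = math.atan2(dy, dx)
    # The goal height is projected to the walking frame (assumption: flat ground).
    return (stance_xy[0], stance_xy[1], walking_frame_z, yaw)


def yaw_error_from_focal_offset(separation: float, lateral_offset: float) -> float:
    """Yaw error caused by misplacing the focal point sideways by lateral_offset."""
    return math.atan2(lateral_offset, separation)


pose = goal_pose_from_points((1.292, -0.013), (2.2745, -0.002))
print("goal yaw [deg]:", math.degrees(pose[3]))

# The "lever arm": the same 1 cm placement error matters less at a larger separation.
for separation in (0.5, 2.0):
    error_deg = math.degrees(yaw_error_from_focal_offset(separation, 0.01))
    print(f"separation {separation} m gives yaw error {error_deg:.3f} deg")
```

The loop shows why a distant focal point makes the facing direction easier to tune: the angular effect of a fixed placement error shrinks roughly in proportion to the separation.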
The goal footsteps are also each tunable in this mid-stance goal frame. By default they are even at the controller’s default stance width for a squared up goal stance. However, for many task approaches, such as pull doors, we require a staggered stance. The “Left Foot to Goal” and “Right Foot to Goal” checkboxes toggle gizmos to tune the goal footsteps. In 10, the stance point, focus point, and the right footstep are all being tuned.
Footstep Planning
We currently have three footstep planners available to the walk action: the quick footstep planner, the turn-walk-turn planner, and the A* planner. The quick footstep planner is the newest option. It is a procedural geometry-heuristic based planner that plans quickly and reliably. It is designed to have as few failure modes as possible and to reduce unnecessary steps. Though the heuristics are general, we focused on ensuring that the plan for approaching a pull door and getting into the staggered stance was high quality. It supports walking to waypoints without specifying goal footsteps. The option to not specify goal footsteps can speed up behaviors by removing an unnecessary square-up step. It currently only supports flat ground, but we think it could be extended to plan over terrain maps.
When the quick footstep planner is selected, an option for RRT-Connect path planning is available. The current implementation simply maintains a tunable distance from objects in the behavior scene while taking the shortest path. This is an experimental mode that we would like to make more general, for example by using an occupancy map.
The turn-walk-turn planner is another procedural heuristic planner for flat ground, but its plans often contain many unnecessary steps, reducing overall behavior speed. We don’t use this one much. It is mainly intended as a backup option in case the other two fail in a particular situation.
The A* planner, as presented in , is a search based planner for flat ground and rough terrain. It uses a large set of parameters that define the ideal and boundary step criteria and then searches over an SE(2) (X, Y, and Yaw) lattice to find the lowest-cost set of footsteps to the goal feet. The planner can snap footsteps to planar regions and wiggle them as part of the search in pursuit of a stable foothold. It can also plan over a height map. In this way, the A* planner gives the behavior system a rough terrain capability. For flat ground, we don’t use it as much, to avoid the extra planning time and increased number of failure modes.
Manual Footsteps
The walk action also allows the operator to define a manual footstep plan by checking the “Manually place steps” box. Then, footsteps can be added, removed, and tuned with gizmos. A “Select All Footsteps” button is available to select all footstep gizmos in order to move the whole plan using the keyboard arrow keys. A “Reset footstep height” option is available to reset all the footstep Z heights to the current robot height, which is useful for flat ground plans.
The manual footstep plan option is especially useful when the robot walks through the door frame, mainly because a planner has not been written for that case. Door traversal footstep plans have to straddle the door frame in order to not hit the shoulder on the door frame. It also helps to make the footstep plan narrower, to reduce side-to-side sway, which can reduce the severity of collisions with the door panel.
Executing a Walk Action
In 11, we execute the walk action to complete our first example. A progress bar can be seen that is tracking the action. The 7.58 second total is calculated by adding up all the transfer and swing times of the planned footsteps. Virtual, numbered footsteps are displayed in the 3D view to show the robot controller’s current queue.
Simple Behavior JSON File
A simplified version of the saved JSON for this simple example is presented in 12 and 13.
    {
      "type" : "ActionSequenceDefinition",
      "name" : "Demo Behavior.json",
      "notes" : "",
      "children" : [ {
        "type" : "ArmActionDefinition",
        "name" : "Move Right Arm",
        "notes" : "",
        "children" : [ ],
        "executeAfterAction" : "Previous",
        "side" : "right",
        "trajectoryDuration" : 1.0,
        "usePredefinedJointAngles" : true,
        "preset" : "CUSTOM_ANGLES",
        "j0Degrees" : 40.1,
        "j1Degrees" : -22.92,
        "j2Degrees" : 22.92,
        "j3Degrees" : -108.86,
        "j4Degrees" : 0.0,
        "j5Degrees" : 0.0,
        "j6Degrees" : 0.0,
        "positionErrorTolerance" : 0.3,
        "jointspaceWeight" : -1.0
      }, {
        "type" : "ArmActionDefinition",
        "name" : "Move Left Arm",
        "notes" : "",
        "children" : [ ],
        "executeAfterAction" : "Previous",
        "side" : "left",
        "trajectoryDuration" : 1.0,
        "usePredefinedJointAngles" : true,
        "preset" : "CUSTOM_ANGLES",
        "j0Degrees" : 40.1,
        "j1Degrees" : 22.92,
        "j2Degrees" : -22.92,
        "j3Degrees" : -108.86,
        "j4Degrees" : 0.0,
        "j5Degrees" : 0.0,
        "j6Degrees" : 0.0,
        "positionErrorTolerance" : 0.3,
        "jointspaceWeight" : -1.0
      },
      {
        "type" : "WalkActionDefinition",
        "name" : "Walk action",
        "notes" : "",
        "children" : [ ],
        "executeAfterAction" : "Previous",
        "swingDuration" : 0.8,
        "transferDuration" : 0.5,
        "executionMode" : "OVERRIDE",
        "parentFrame" : "Walking",
        "goalStancePoint" : {
          "x" : 1.292,
          "y" : -0.013,
          "z" : 0.0
        },
        "goalFocalPoint" : {
          "x" : 2.2745,
          "y" : -0.002,
          "z" : 0.0
        },
        "leftGoalFootToGoal" : {
          "x" : 0.0,
          "y" : 0.11,
          "yawInDegrees" : 0.0
        },
        "rightGoalFootToGoal" : {
          "x" : 0.0,
          "y" : -0.11,
          "yawInDegrees" : 0.0
        },
        "planner" : "QUICK",
        "plannerParameters" : { }
      } ]
    }
Concurrent Actions Example
Behavior Timeline
In this example, we’ll show how we can schedule arm motions while walking using our concurrent action layering system. 1 shows a behavior we have already built and run as the non-concurrent starting place. In this tree, we have added the pre-existing “Go Home” skill from a file, which allows us to reset the robot to the home configuration. This home configuration has the arms down by the sides in a natural position. Then there is a checkpoint node and a savable concurrency demo behavior, with a walk action followed by two arm actions and an always failing “STOP” condition.
Our behavior timeline feature is presented in the bottom right of 1. It is a rendering of an action sequence over time meant to help with understanding the timing and concurrency of behavior actions. Here, it shows the tree as executed with each node running sequentially.
“Execute After” Layering Mechanism
Concurrent action layering is implemented simply through an “Execute after” field in each action. It is selectable via a drop down menu in the node settings, which offers every prior node as an option. By default this field is set to “Previous”, which is a dynamic reference to whatever node is immediately before. A dynamic “Beginning” reference is also available, which points to before the root node, though it may be removed because it is functionally the same as pointing to the ever-present root node.
When the behavior is running in autonomous mode, it triggers action execution in order. When deciding whether to trigger execution of the next one, it first checks the referenced node to execute after. If the node to execute after is currently executing, the next node is not triggered. Importantly, action execution does not wait for any nodes that may be in between the “execute after” node and itself. This is important for scheduling concurrent actions freely.
The full algorithm is more complicated because the behavior can also be run step-by-step by clicking the “Manually” button. There is also a checkbox for disabling concurrency in the user interface, which effectively treats all “Execute after” fields as “Previous”. Further complexity is added to handle the fallback node’s try and catch mechanism. We present pseudocode in [alg:simplified_execution] without the fallback part.
Algorithm: Simplified Action Execution Trigger Logic
    Function TickExecution(orderedLeaves, state):
        # Halt if not autonomous and no manual step was requested
        if not state.isAutonomous() and not state.isManualStepRequested():
            return
        nextIndex = state.getExecutionNextIndex()
        for i in range(nextIndex, orderedLeaves.length):
            node = orderedLeaves[i]
            if node.isExecuting():
                continue
            # Determine the required dependency
            afterIndex = node.getExecuteAfterIndex()
            if not state.isConcurrencyEnabled():
                afterIndex = i - 1  # Force sequential execution
            # Break if the action to execute after is still executing
            if afterIndex >= 0 and orderedLeaves[afterIndex].isExecuting():
                break
            # Trigger execution and advance the index tracker
            node.triggerExecution()
            state.setExecutionNextIndex(i + 1)
            # If running manually step-by-step, consume the step and halt
            if not state.isAutonomous():
                state.consumeManualStep()
                break
Wait Nodes and Dependencies
Our example behavior in 1 starts out as a non-concurrent sequence of a short walk and two arm motions. Now we’ll change it so the arm motions are scheduled during the walk. In 2, we add two wait actions, one for 1 second and one for 2.5 seconds. These waits are used to schedule the arm motions. We’ll move the right arm 1 second into the walk and the left arm 2.5 seconds into the walk.
The wait actions need to execute with the walk action, not after it. To accomplish this, we set the execute after fields of the wait actions to the “ConcurrencyDemo.json” sequence node. The sequence node is never executing, but pointing to it achieves the effect that the wait actions always start with the walk action. This method also keeps the sequence logic self contained and reusable, as opposed to referencing a specific node earlier than the sequence node. We then set the execute after field of the right arm action to the first wait action and the execute after field of the left arm action to the second wait action. In 3, we show setting the execute after field of the second arm action to the second wait action.
The figure also shows the execute after pointers as arrows on the right side of the tree view that point up from the defining action to the dependency action. You can see four such arrows that overlap. When the mouse hovers over a node, the corresponding arrow boldens to make it easy to verify dependencies when they are partially overlapping. At this point in time, the timeline view is updated to show the general structure of the new behavior configuration. In this prototype implementation of the timeline view, all wait actions are rendered as 1 second long. When we run the behavior, the timeline is redrawn with the actual timings. We will do that now.
Concurrent Result
In 4, we show the result of executing our now concurrent behavior. The timeline has been updated to reflect the actual action start and stop times of the simulated behavior execution. As seen by comparing this timeline to the original, the behavior duration went from 11.7 seconds to 7.6 seconds, illustrating how this method can be used to speed up behaviors.
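To make the effect of the “Execute after” field concrete, the following sketch simulates the simplified trigger logic in autonomous mode only. It rests on several assumptions: the class and function names are hypothetical, the durations are invented round numbers rather than the measured ones, the leaves are ordered walk, wait, wait, arm, arm, an index of -1 stands in for a reference to the never-executing sequence node, and manual stepping and fallback handling are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Action:
    name: str
    duration_ms: int
    execute_after: int  # index of the leaf to wait on; -1 means nothing to wait on
    start_ms: Optional[int] = None
    end_ms: Optional[int] = None

    def is_executing(self, now_ms: int) -> bool:
        return self.start_ms is not None and now_ms < self.end_ms

    def trigger(self, now_ms: int) -> None:
        self.start_ms = now_ms
        self.end_ms = now_ms + self.duration_ms


def simulate(actions: List[Action], concurrency_enabled: bool = True,
             tick_ms: int = 100) -> float:
    """Tick the trigger logic until every leaf has started; return finish time in seconds."""
    now_ms = 0
    next_index = 0
    while next_index < len(actions):
        for i in range(next_index, len(actions)):
            # With concurrency disabled, every leaf waits on its predecessor.
            after = actions[i].execute_after if concurrency_enabled else i - 1
            if after >= 0 and actions[after].is_executing(now_ms):
                break  # dependency still running; try again on the next tick
            actions[i].trigger(now_ms)
            next_index = i + 1
        now_ms += tick_ms
    return max(a.end_ms for a in actions) / 1000.0


def build_demo() -> List[Action]:
    # Durations are multiples of the tick so start times land exactly on ticks.
    return [
        Action("walk", 6000, -1),
        Action("wait 1.0 s", 1000, -1),
        Action("wait 2.5 s", 2500, -1),
        Action("right arm", 1000, 1),  # starts when the first wait ends
        Action("left arm", 1000, 2),   # starts when the second wait ends
    ]


print("concurrent:", simulate(build_demo(), concurrency_enabled=True), "s")
print("sequential:", simulate(build_demo(), concurrency_enabled=False), "s")
```

In this toy setup both arm motions finish inside the walk, so the concurrent run takes only as long as the walk itself, while disabling concurrency makes the same tree run as the sum of all durations.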
Screw Primitives
Motion Model
1 shows a screw primitive action being authored. Our screw primitive action is inspired by Pettinger’s 2022 paper but differs in implementation.
Our screw primitive action is defined relative to a user-defined screw reference frame, where the axis of the screw is strictly aligned with the local x-axis (\(\hat{\mathbf{x}}\)). The motion is parameterized by a total axial translation \(\Delta x\) and a total rotation \(\Delta \theta\). 1 illustrates one parameterization for reference and the reference video shows varied parameterizations.
Let the initial position of the end-effector in the screw frame be \(\mathbf{p}_0 = [x_0, y_0, z_0]^T\). The radial distance \(r\) from the screw axis to the end-effector is calculated by projecting the position onto the y-z plane: $$ r = \sqrt{y_0^2 + z_0^2} $$
To compute motion bounds, the total Euclidean distance traversed by the end-effector, \(D\), combines both the axial displacement and the tangential arc length: $$ D = \sqrt{(r \Delta \theta)^2 + (\Delta x)^2} $$
The trajectory duration is determined by enforcing maximum velocity limits. Given a maximum linear velocity \(v_{max}\) and maximum angular velocity \(\omega_{max}\), the base duration \(T_{base}\) required to complete the motion is dominated by the most restrictive bound: $$ T_{base} = \max\left( \frac{|\Delta \theta|}{\omega_{max}}, \frac{D}{v_{max}} \right) $$
The trajectory is discretized into \(N\) waypoints (where \(N\) is limited by the system’s trajectory buffer size), yielding \(N-1\) segments. The nominal duration of a single segment is: $$ \delta t = \frac{T_{base}}{N - 1} $$
To account for smooth acceleration and deceleration without requiring complex spline generation, the implementation applies a temporal boundary condition: the duration of the first and last trajectory segments is doubled (i.e., \(2\delta t\)), while all intermediate segments retain duration \(\delta t\). Consequently, the total movement duration expands to \(T = T_{base} + 2\delta t\).
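The timing equations above can be checked numerically with a short sketch. The function names and the example values are illustrative assumptions; the formulas are the ones given for \(r\), \(D\), \(T_{base}\), \(\delta t\), and \(T\).

```python
import math
from typing import List, Tuple


def screw_timing(p0: Tuple[float, float, float], delta_x: float, delta_theta: float,
                 v_max: float, omega_max: float,
                 num_waypoints: int) -> Tuple[float, float, float, float, float]:
    """Return (r, D, T_base, delta_t, T) for a screw motion about the local x-axis."""
    if num_waypoints < 3:
        raise ValueError("need at least 3 waypoints so first and last segments differ")
    _, y0, z0 = p0
    r = math.hypot(y0, z0)                       # radial distance from the screw axis
    distance = math.hypot(r * delta_theta, delta_x)
    t_base = max(abs(delta_theta) / omega_max, distance / v_max)
    delta_t = t_base / (num_waypoints - 1)
    total = t_base + 2.0 * delta_t               # first and last segments are doubled
    return r, distance, t_base, delta_t, total


def segment_durations(t_base: float, num_waypoints: int) -> List[float]:
    """Per-segment durations with the first and last segments doubled."""
    delta_t = t_base / (num_waypoints - 1)
    middle = [delta_t] * (num_waypoints - 3)
    return [2.0 * delta_t] + middle + [2.0 * delta_t]


r, distance, t_base, delta_t, total = screw_timing(
    p0=(0.0, 0.3, 0.4), delta_x=0.0, delta_theta=1.0,
    v_max=0.25, omega_max=1.0, num_waypoints=11)
print(r, distance, t_base, delta_t, total)
print(sum(segment_durations(t_base, 11)))
```

In this example the linear velocity limit is the binding constraint, since the hand sweeps a 0.5 m arc that takes longer at \(v_{max}\) than the rotation takes at \(\omega_{max}\).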
Trajectory Construction
The trajectory maintains constant base velocities throughout the central portion of the motion. The scalar velocities are derived from the base duration:
- Axial velocity: \(v_a = \frac{\Delta x}{T_{base}}\)
- Tangential velocity: \(v_t = \frac{r \Delta \theta}{T_{base}}\)
- Rotational velocity: \(\omega_x = \frac{\Delta \theta}{T_{base}}\)
At any intermediate waypoint \(i\), let the current position of the end-effector in the screw frame be \(\mathbf{p}_i = [x_i, y_i, z_i]^T\). The purely radial vector is \(\mathbf{r}_{\perp, i} = [0, y_i, z_i]^T\).
The unit tangent vector \(\mathbf{u}_{t,i}\) describing the instantaneous direction of rotation is found via the cross product of the screw axis and the radial vector: $$ \mathbf{u}_{t,i} = \frac{\hat{\mathbf{x}} \times \mathbf{r}_{\perp, i}}{|\hat{\mathbf{x}} \times \mathbf{r}_{\perp, i}|} $$
The spatial velocity commands at waypoint \(i\), expressed in the local screw frame, are formulated as: $$ \mathbf{v}_i = v_a \hat{\mathbf{x}} + v_t \mathbf{u}_{t,i} $$ $$ \boldsymbol{\omega}_i = \omega_x \hat{\mathbf{x}} $$ Note that for the initial (\(t=0\)) and final (\(t=T\)) waypoints, the spatial velocities are strictly clamped to zero.
The rigid-body poses are iteratively generated by applying incremental transformations in the screw frame. For \(K\) total segments (determined by spatial resolution limits), the incremental roll rotation \(\delta \theta = \Delta \theta / K\) and translation \(\delta x = \Delta x / K\) are calculated.
The pose of the end-effector at step \(k\), expressed in the screw frame and denoted as \(\mathbf{T}_k \in SE(3)\), is generated by prepending (left-multiplying) the incremental transformation to the previous pose, so that the increment acts about the screw axis: $$ \mathbf{T}_k = \begin{bmatrix} \mathbf{R}_x(\delta \theta) & \begin{bmatrix} \delta x \\ 0 \\ 0 \end{bmatrix} \\ \mathbf{0}^T & 1 \end{bmatrix} \mathbf{T}_{k-1} $$ This sequence of generated poses and spatial velocities is finally fed into an inverse kinematics solver to compute the joint-space trajectories.
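The waypoint construction can be sketched for positions only. This is a simplified illustration under stated assumptions: the function names are hypothetical, orientation is omitted, and each increment is applied in the screw frame, that is, the point is rotated about the local x-axis by \(\delta \theta\) and shifted along it by \(\delta x\).

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def screw_waypoint_positions(p0: Point, delta_x: float, delta_theta: float,
                             num_segments: int) -> List[Point]:
    """Positions in the screw frame after each incremental screw step."""
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    step_theta = delta_theta / num_segments
    step_x = delta_x / num_segments
    c, s = math.cos(step_theta), math.sin(step_theta)
    x, y, z = p0
    points = [(x, y, z)]
    for _ in range(num_segments):
        # Rotate about the x-axis, then translate along it.
        x, y, z = x + step_x, c * y - s * z, s * y + c * z
        points.append((x, y, z))
    return points


def unit_tangent(p: Point) -> Point:
    """Unit vector along x_hat cross [0, y, z], the direction of rotation."""
    _, y, z = p
    norm = math.hypot(y, z)
    if norm == 0.0:
        raise ValueError("point lies on the screw axis; tangent is undefined")
    return (0.0, -z / norm, y / norm)


points = screw_waypoint_positions((0.0, 0.5, 0.0), delta_x=0.1,
                                  delta_theta=math.pi / 2, num_segments=10)
print(points[-1])
print(unit_tangent(points[0]))
```

A quick sanity check on such a sketch is that the radial distance \(\sqrt{y^2 + z^2}\) stays constant along the whole path, which is what makes \(r\) a valid constant in the duration formulas.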
Tuning and Preview
As seen in 1, the total axial translation, total rotation, max linear velocity, and max angular velocity can be specified using sliders. The 3D trajectory visualization is updated in real time as the user drags these sliders to help dial in the motion. As with the single pose trajectory arm action, the screw primitive has settings for the position and orientation tolerance. If the final goal pose is not reached within these tolerances, the action fails.
In order to provide a rich intuition about the motion, there is also a motion preview slider for the screw primitive. The slider may be dragged manually or played back at real time speed in a loop. The arm is blue when the IK solution is good and red when it is bad. The behavior author should strive to keep it blue throughout the motion.
Behavior Scene Management Example
Scene Setup
In the next example, we’ll cover basic behavior scene management by locking on to a tennis ball and authoring a grasp approach arm pose. The point of our behavior scene functionality is to provide authorable perception, which can increase the capability and adaptability of the robot-human team. Authorable perception allows the human operator to encode context-specific and expert knowledge into the behavior. This flexibility can be used to work around several categories of issues, whether theoretical or practical, and whether they originate in the environment or in the software. Some examples are the hand occluding an object, a YOLO model having low confidence, and a latency or timing issue in the system.
1 shows an example setting being played back in simulation from a real data recording. The real environment contains a table with some transparent storage containers and tennis balls on it. An extension cord is on the table to prevent the balls from rolling off when the robot disturbs them.
On the left side of the figure, we see the behavior tree view with a look down action to get the table centered in the camera field of view and two scene actions, denoted by the clapperboard icon inspired by the film industry. In the bottom left, some scene action settings for the “configure YOLO” action type are shown. In the top center is the 3D scene with a live colored point cloud from the ZED and 3D coordinate frames for the current stable detections. The bottom center shows the first person ZED camera view with the YOLO detections and segmentations drawn over it. The detections are annotated with the class name and the confidence of the detection. In the top right, the behavior scene is shown with no privileged objects yet but with 3 stable detections: storage container, bottle, and robot hand. Finally, in the bottom right, the YOLO module can be turned on and off, and each available model is listed; each model can be toggled on and off and expanded to tune class specific options.
YOLO Configuration
One of the first things we notice here is that the tennis ball is being detected as a bottle.
This is because the initial YOLO model that is enabled, “best_multi_02_17_2026”, is one that we trained only for a specific set of classes which do not include the tennis ball.
We aren’t sure why it’s detecting the ball as a bottle at 87% confidence.
In any case, to fix this, as shown in 1, we author a scene action with type CONFIGURE_YOLO to switch the YOLO model to the “yolov8n-seg” model.
Additionally, we disable all classes except for the “sports ball” class by unchecking the “Enabled” checkbox next to where it says “UNIVERSAL ADJUSTER”.
Then, we scroll to find the sports ball row and check that box.
Unfortunately, the yolov8n-seg model isn’t very confident in detecting our tennis balls. To work around this, on the sports ball row, we lower the minimum “Confidence” value from 0.7 to 0.1. The minimum confidence value acts as a gating filter that determines which detections are allowed into the persistent detections. Lowering this value allows us to track the tennis balls at very low confidence levels. This is acceptable because the configure YOLO action adjusts the minimum confidence of each class individually, so other classes keep their stricter thresholds. In behaviors that grasp tennis balls, we make the low confidence threshold safe by requiring each tennis ball object to be within a reachable boundary before grasping it.
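The per-class confidence gate and the reachability guard can be sketched as two small filters. Everything here is illustrative: the `Detection` type and function names are hypothetical, classes without an entry are treated as disabled, and the reachable boundary is modeled as a simple sphere around a robot-fixed origin, which is an assumption rather than the boundary used on the robot.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Detection:
    class_name: str
    confidence: float
    position: Tuple[float, float, float]  # metres, in a robot-fixed frame (assumed)


def gate_detections(detections: List[Detection],
                    min_confidence_by_class: Dict[str, float]) -> List[Detection]:
    """Keep detections whose class is enabled and whose confidence passes its own gate."""
    return [d for d in detections
            if d.class_name in min_confidence_by_class
            and d.confidence >= min_confidence_by_class[d.class_name]]


def is_reachable(detection: Detection, max_reach_m: float) -> bool:
    """Spherical reach check (assumption) used as a safety guard before grasping."""
    x, y, z = detection.position
    return math.sqrt(x * x + y * y + z * z) <= max_reach_m


detections = [
    Detection("sports ball", 0.15, (0.5, 0.1, 0.0)),
    Detection("sports ball", 0.55, (1.8, -0.4, 0.0)),
    Detection("bottle", 0.87, (0.5, 0.1, 0.0)),
]
gated = gate_detections(detections, {"sports ball": 0.1})
graspable = [d for d in gated if is_reachable(d, max_reach_m=0.8)]
print(len(gated), len(graspable))
```

The low threshold admits both ball detections, and the reach check then rules out the one that is too far away, which is the role the reachable boundary plays in the grasping behaviors.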
The configure YOLO action also supports enabling multiple YOLO models at once. The behavior author picks the set of YOLO models that will be available at this time and all the class settings for each. This is useful because sometimes we want to perform tasks involving objects which are detected by different YOLO models.
Persistent Detections
Another thing we want to do is configure the persistent detection parameters themselves.
This is accomplished via the CONFIGURE_PERSISTENT_DETECTIONS scene action type, seen in 2.
There are four parameters for the persistent detections: pose filter alpha, acceptance confidence, stability frequency, and history duration.
3 illustrates the instant and persistent detection management process, which happens continuously in the behavior scene.
The parameter \(\alpha\) is the pose filter coefficient used when initializing each persistent detection, \(c_{acc}\) is the acceptance confidence threshold, \(f_{stab}\) is the required stability frequency, and \(T_{hist}\) is the history duration maintained by the persistent detection.
For the tennis balls, we set the acceptance confidence threshold to 0.1 and the stability frequency to 0.5 and execute the scene action. The result is shown in 2 where 3 sports balls are tracked as persistent detections. Note the confidence levels are not too bad in this case at around 50–60% and the 3 labelled coordinate frames are shown in the 3D view.
Setting up an Object
Now that our list of stable persistent detections is populated, we will use a SETUP_OBJECT scene action to “lock on” to the closest tennis ball.
In 4 you can see we’ve added a scene action node called “Lock Onto Sports Ball”.
In the settings area for that node, we’ve selected the SETUP_OBJECT type, the yolov8n-seg YOLO model, and the sports ball YOLO class.
A timeout setting is also available.
Because there may not be matching stable detections in the scene when this action is run, the timeout is a way to wait for one.
The setup object scene action will wait up to the timeout while continuously searching for a match.
If it finds one, the action will immediately complete successfully.
Else, the action will fail when the timeout is reached.
A minimum history size setting is available to further filter our selection of persistent detections.
A smaller minimum history size increases responsiveness but raises the risk of false positives, while a larger one requires an object to be tracked for longer before it becomes an object candidate.
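The wait-until-timeout behavior and the minimum history filter can be sketched together. This is a hedged illustration, not the scene action's code: all names are hypothetical, the clock and sleep functions are injected so the loop can be exercised without real waiting, and "closest" is reduced to a precomputed distance field.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class PersistentDetection:
    class_name: str
    history_size: int
    distance_m: float


def closest_candidate(detections: List[PersistentDetection], class_name: str,
                      min_history_size: int) -> Optional[PersistentDetection]:
    """Closest detection of the class that has been tracked long enough."""
    candidates = [d for d in detections
                  if d.class_name == class_name and d.history_size >= min_history_size]
    return min(candidates, key=lambda d: d.distance_m) if candidates else None


def wait_for_match(find_match: Callable[[], Optional[PersistentDetection]],
                   timeout_s: float, clock: Callable[[], float],
                   sleep: Callable[[float], None],
                   poll_period_s: float = 0.05) -> Optional[PersistentDetection]:
    """Return a match as soon as one exists, or None once the timeout has passed."""
    deadline = clock() + timeout_s
    while True:
        match = find_match()
        if match is not None:
            return match          # success: complete immediately
        if clock() >= deadline:
            return None           # failure: timeout reached
        sleep(poll_period_s)


class FakeClock:
    """Deterministic stand-in for time.monotonic and time.sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
scene = [
    PersistentDetection("sports ball", 12, 0.9),
    PersistentDetection("sports ball", 3, 0.4),   # closer, but too little history
    PersistentDetection("bottle", 30, 0.2),
]
locked = wait_for_match(lambda: closest_candidate(scene, "sports ball", 10),
                        timeout_s=2.0, clock=clock.time, sleep=clock.sleep)
print(locked)
```

Note how the history filter changes the outcome: the nearer ball is skipped because it has not been tracked long enough to count as a candidate.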
The setup object scene action, when successful, creates or updates an “object”. In our usage, an object does not have to be an actual physical object, even though it often is. An object is a privileged maintainer of a reference frame that is subsequently usable in action definitions. There are two types of objects: direct and derived. The types are shown in 5. The derived types allow us to model articulated objects like doors and to implement heuristic environmental feature extraction, such as for table edges.
SETUP_OBJECT scene action. Direct objects are based on persistent detections, whereas derived objects are constructed from named frames, paired detections, or depth-based geometric calculations.

4 also shows the result of executing the setup object scene action. In the top right, a sports ball can be seen in the “Objects:” area. For each object, we show its world frame pose information and any referenced persistent detections. Here we can see the YOLOv8 persistent detection that’s attached. That persistent detection is still in the stable detections list, too; it doesn’t get removed from there. However, when a persistent detection attached to an object stops tracking, it is not removed from the object the way it would be from the stable detections area. Instead, it renders as grayed out but is still usable by behavior actions in its last seen pose. A larger coordinate frame is shown for the sports ball object in the 3D view.
Freezing Scene Objects
Another scene action type is FREEZE_OBJECT.
Freezing an object disables its pose from being updated, causing it to stay in place with respect to world frame regardless of further perceptual tracking.
The ability to freeze an object at any point in the course of a behavior is important for manipulation and dead reckoning.
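The effect of freezing can be sketched with a minimal object model. The class and method names are hypothetical and the pose is reduced to a world-frame position; the only point illustrated is that perception updates are ignored once the object is frozen.

```python
from dataclasses import dataclass
from typing import Tuple

Position = Tuple[float, float, float]


@dataclass
class SceneObject:
    name: str
    position_world: Position
    frozen: bool = False

    def update_from_detection(self, detected_position: Position) -> bool:
        """Apply a perception update unless frozen; return whether it was applied."""
        if self.frozen:
            return False
        self.position_world = detected_position
        return True

    def freeze(self) -> None:
        self.frozen = True


ball = SceneObject("sports ball", (0.60, 0.10, 0.80))
ball.update_from_detection((0.61, 0.10, 0.80))   # tracked normally
ball.freeze()                                    # e.g. just before the hand occludes it
ball.update_from_detection((0.55, 0.10, 0.80))   # ignored: pose stays put
print(ball.position_world)
```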
When grasping an object, it’s best to freeze it just before occluding it in any way. This is because partial occlusion can corrupt the object’s pose at the time when it matters most. For example, we usually have several pre-grasp hand poses with a freeze in between. The first pre-grasp gets the hand as close as possible without occluding the object. The second pre-grasp pose gets the hand where the fingers can grab the object. It’s best to put a freeze action between these two.
The pose is corrupted because our YOLO detections contain the 3D depth points that lie in the segmentation area, and a YOLO persistent detection is posed at the centroid of those depth points. If part of the object is occluded, the segmentation is cropped, causing the centroid to move away from the hand. As the hand closes in on the object, the estimated position of the object can vary drastically, depending on the shape of the object, the hand, and the viewpoint.
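A small worked example makes the centroid shift concrete. The numbers below are invented for illustration: a row of depth points spanning 10 cm along x, with a hand approaching from the +x side and hiding the nearest points.

```python
def centroid(points):
    """Mean of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


# Hypothetical depth points on an object, spaced 1 cm apart from
# x = 0.00 m to x = 0.10 m.
points = [(x / 100.0, 0.0, 0.0) for x in range(0, 11)]
full = centroid(points)

# The hand approaches from +x and occludes every point with x > 0.06 m,
# so those points drop out of the segmentation.
visible = [p for p in points if p[0] <= 0.06]
cropped = centroid(visible)

# Negative shift: the estimated position moves away from the hand.
shift_x = cropped[0] - full[0]
```

In this toy case the centroid moves from about x = 0.05 m to about x = 0.03 m, a 2 cm error introduced purely by occlusion. Freezing the object before the hand enters the view avoids it.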
Freezing objects can also be useful for dead reckoning with respect to objects that were seen in the past. For example, our door traversal footstep plan is authored with respect to the door frame, which is ultimately based on the detection of the door opening mechanism. After the door is opened, we usually can no longer see the mechanism, or choose not to look at it. The frozen frame from the handle pre-grasp is usually used throughout the remainder of the behavior.
Setting up a freeze scene action is easy.
As shown in 6, we created a scene action named “Freeze Sports Ball”, set the action type to FREEZE_OBJECT, and set the YOLO model and class for the sports ball.
When executed, if there is a matched object in the list, it is frozen; otherwise it will wait until the timeout and fail if there is still no match.
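The match-or-timeout rule can be sketched as a polling loop. This is only an illustration under stated assumptions: the `SceneObject` fields, the function names, and the polling period are hypothetical, and the real system may be event driven rather than polled.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SceneObject:
    """Hypothetical scene object keyed by YOLO model and class."""
    name: str
    yolo_model: str
    yolo_class: str
    frozen: bool = False


def find_match(objects: List[SceneObject], yolo_model: str,
               yolo_class: str) -> Optional[SceneObject]:
    for obj in objects:
        if obj.yolo_model == yolo_model and obj.yolo_class == yolo_class:
            return obj
    return None


def execute_freeze_object(objects: List[SceneObject], yolo_model: str,
                          yolo_class: str, timeout_s: float,
                          poll_s: float = 0.005) -> bool:
    """Freeze the matched object, or fail once the timeout has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        match = find_match(objects, yolo_model, yolo_class)
        if match is not None:
            match.frozen = True
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


scene = [SceneObject("Sports Ball", "yolov8", "sports ball")]
succeeded = execute_freeze_object(scene, "yolov8", "sports ball", timeout_s=0.05)
```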
In the top right of 6, you can see the sports ball object is now marked as “FROZEN”.
Other Scene Action Types
There are a few other simple scene action types at the moment.
DELETE_OBJECT removes a matched object from the list, making it unavailable to actions.
CLEAR_SCENE removes all objects from the list.
FREEZE_SCENE freezes every object in the scene.
CONFIGURE_FOUNDATION_POSE is an unimplemented placeholder for configuring FoundationPose.
Authoring a Frame Based Arm Action
Now that we have locked onto a sports ball, we’ll show how to move the arm with respect to it. We’ll go over a full grasp sequence later in the real robot example. In 7, we’ve added an arm action. When using the taskspace mode, as opposed to the “Use Predefined Joint Angles” mode, there is a “Parent frame” drop down. The list of available frames includes a frame for each privileged object in the scene. In this case, it’s just the sports ball. We select the sports ball from the list and that’s it! The action is now defined as a hand pose relative to the object.
For position-only objects like the YOLO sports ball, we use the orientation of the robot’s chest so frame-relative actions are always valid for the robot’s current approach angle. In 7, the sports ball frame is frozen, but when objects are not frozen and are actively tracked, the arm goal pose will continuously update based on the object’s current pose. The inverse kinematics solution for the arm will likewise continuously update and be displayed in the 3D view.
When we save this behavior, the JSON will contain the hand’s translation and rotation relative to the object. This makes the action reusable for later runs and future instances of this object.
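To illustrate how an object-relative goal resolves to a world-frame goal for a position-only object, here is a minimal sketch. It makes several simplifying assumptions: the chest orientation is reduced to a yaw angle, only the translation part of the hand pose is handled, and the JSON field names `parentFrame` and `translation` are invented for the example rather than taken from the actual file format.

```python
import json
import math


def yaw_rotate(v, yaw):
    """Rotate vector v about the world z axis by yaw radians."""
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)


def hand_goal_in_world(object_position, chest_yaw, relative_translation):
    """Object position plus the saved offset, expressed in a frame that
    borrows its orientation from the chest (yaw only in this sketch)."""
    offset = yaw_rotate(relative_translation, chest_yaw)
    return tuple(o + d for o, d in zip(object_position, offset))


# Hypothetical saved action: 10 cm behind and 5 cm above the object.
saved = json.loads(
    '{"parentFrame": "Sports Ball", "translation": [-0.10, 0.0, 0.05]}'
)

# The robot approaches with the chest yawed 90 degrees in world frame.
goal = hand_goal_in_world(
    object_position=(1.0, 2.0, 0.8),
    chest_yaw=math.pi / 2,
    relative_translation=tuple(saved["translation"]),
)
```

Because the offset is rotated by the chest yaw, "10 cm behind the object" always means behind it from the robot's current approach direction. Here the goal lands at roughly (1.0, 1.9, 0.85).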
Door Traversal Footstep Plan Example
In this section we’ll recount an online, real robot session where we re-authored a set of door traversal footsteps. In 1, the robot is situated in front of a doorway and an existing footstep plan action is selected. At this point, we had tried executing this footstep plan a couple of times, but the robot fell on each attempt. Since this footstep plan was designed to be robust in the presence of a spring closer, it has an overly difficult side stepping sequence in the beginning. In this case, the door was open and didn’t have a spring closer, so in the interest of task success we decided to re-author an easier to execute footstep plan on the spot.
Manually Defining a Footstep Plan
In the walk action settings for the “Manually place steps” mode, there are several features that allow the operator to manage the footstep plan. To enter the editing mode, which makes the feet graphics selectable in the 3D view with gizmos, the “Edit Manually Placed Steps” box is checked. The settings area displays the current number of footsteps in the plan, which lets the operator verify the plan and be sure there isn’t a step off the screen somewhere. On the next line, there are buttons to append a left or right footstep to the end of the plan. There is also a button to remove the last step of the plan. These editor features are simplistic but enough to get the job done.
To clear the plan, we first click the remove last button repeatedly until the plan is empty. Then, we click the “Right” button to add a right footstep to the plan. We click the footstep in the 3D view and move it to the desired location using the gizmo, as seen in 2. This process is repeated until the plan is complete. The completed door traversal footstep plan can be seen in 3. This plan took about 1 minute and 30 seconds to create.
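The editing operations described here (append left, append right, remove last, step count) amount to a simple list. The following sketch is illustrative only; the `Footstep` and `ManualFootstepPlan` names and the planar pose fields are assumptions, and the pose values stand in for gizmo edits.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Footstep:
    side: str          # "LEFT" or "RIGHT"
    x: float = 0.0     # planar pose, edited with the gizmo in the real UI
    y: float = 0.0
    yaw: float = 0.0


@dataclass
class ManualFootstepPlan:
    steps: List[Footstep] = field(default_factory=list)

    def append(self, side: str) -> Footstep:
        step = Footstep(side)
        self.steps.append(step)
        return step

    def remove_last(self) -> None:
        if self.steps:
            self.steps.pop()

    def count(self) -> int:
        # Shown in the settings area so a stray step is easy to notice.
        return len(self.steps)


plan = ManualFootstepPlan([Footstep("LEFT"), Footstep("RIGHT"), Footstep("LEFT")])

# Clearing the old plan is just repeated "remove last".
while plan.count() > 0:
    plan.remove_last()

# Re-author: add a right step, place it, then a left step.
first = plan.append("RIGHT")
first.x, first.y = 0.30, -0.12
second = plan.append("LEFT")
second.x, second.y = 0.60, 0.12
```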
Fallback Mechanism Example
Now we will cover fallback node operation. A fallback node is like an if-else statement, but since evaluating the if condition itself performs an action, we call it a try-catch. A fallback node therefore has two parts: the try and the catch. Each part can contain one or more actions. However, unlike a Behavior Tree fallback node from the literature, our fallback node has only one try clause, and the catch clause is just an action sequence of unlimited length.
We also support concurrent actions in the fallback node. To do this, we expand the try to contain the first concurrent action set. This could be one or more actions with the rule that they must execute together. If any action of the concurrent set fails, the try fails and the catch is executed once all try nodes are finished executing. Otherwise, if the entire concurrent try set succeeds, the catch is skipped and the node after the fallback is next for execution.
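The try-catch rule can be summarized in a short sketch. It is an assumption-laden illustration rather than the actual executor: actions are modeled as plain callables returning success or failure, the concurrent try set is run one after another as a stand-in for true concurrency, and a failure inside the catch is assumed to fail the whole node.

```python
from typing import Callable, List

Action = Callable[[], bool]  # returns True on success, False on failure


def run_fallback(try_set: List[Action], catch_sequence: List[Action]) -> bool:
    # Every try action runs to completion before the outcome is decided,
    # mirroring "the catch is executed once all try nodes are finished".
    try_results = [action() for action in try_set]
    if all(try_results):
        return True  # catch is skipped; execution continues after the node

    # At least one try action failed: run the catch as a plain sequence.
    for action in catch_sequence:
        if not action():
            return False  # assumed: a failing catch action fails the node
    return True


log: List[str] = []


def make_action(name: str, succeeds: bool) -> Action:
    def action() -> bool:
        log.append(name)
        return succeeds
    return action


result = run_fallback(
    [make_action("try A", True), make_action("try B", False)],
    [make_action("catch 1", True), make_action("catch 2", True)],
)
```

Note that `try A` still runs even though `try B` fails, and the catch only starts after both have finished.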
Our fallback node does not currently support nesting, but supporting it would be a desirable improvement. Nesting try-catch blocks could increase behavior expressiveness. In general, it would be good to look through the most used mechanisms in programming languages and think about how they could be included as runtime-editable behavior elements.
In 1, we have started on a fallback example demonstration, where we have added a fallback node and a try node. In this demonstration, we want to check if the right hand is raised to a certain position and raise it if it isn’t. Once the hand is raised, we will yaw the spine. To check the hand position, we have added a “shape contains” condition node as the try action and named it “Check hand”.
Condition Nodes
We currently have six types of condition nodes: ALWAYS_FAIL, ALWAYS_SUCCEED, COUNTER, LLM, PROXIMITY, and SHAPE_CONTAINS.
A condition node is different from an action in that it does not directly perform an action, but instead is responsible for making a decision resulting in success or failure.
Success and failure for a condition node, however, have the same effect on behavior execution as they do for actions: if the node succeeds, the behavior keeps going and, if the node fails, execution halts and autonomous mode is disabled.
The exception is that when a node fails in the try part of a fallback node, the behavior keeps going by executing the catch.
The “always succeed” and “always fail” condition node types are useful for testing and operation. The “always fail” node is especially useful if you want to test out a part of the behavior in autonomous mode but have it stop at a certain point. In this case, we’ll often add a temporary “always fail” condition node named “STOP”, as we did in [sec:wait_nodes_dependencies].
The counter node maintains a persistent count and a tunable limit. Each time it is executed, the count is increased. If the count reaches the limit, the counter condition node fails.
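A counter condition of this kind is only a few lines. In this sketch the class name is hypothetical, and the choice to increment first and then compare against the limit is an assumption about the ordering, which the text does not pin down.

```python
class CounterCondition:
    """Hypothetical counter condition with a persistent count."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.count = 0

    def execute(self) -> bool:
        # Assumed ordering: increment, then fail once the limit is reached.
        self.count += 1
        return self.count < self.limit


counter = CounterCondition(limit=3)
results = [counter.execute() for _ in range(3)]
```

Under this ordering, a limit of 3 yields two successes followed by a failure, so the node can be used to bound the number of passes through a loop.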
The LLM condition node provides the ability to query a large language model to determine success or failure. It has an authorable system prompt, repeated prompt, and boolean response matcher. We covered the LLM node in more detail near [fig:2025_llm_condition_node].
The proximity node checks the distance between two behavior reference frames. If the distance is between the tunable minimum and maximum, the condition succeeds. We also covered the proximity node in some more detail near [fig:2025_proximity_condition].
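The proximity check reduces to a distance test between two frame origins. The sketch below assumes both origins are already expressed in a common frame and treats the bounds as inclusive, which is a guess; the function name is hypothetical.

```python
import math


def proximity_succeeds(frame_a_origin, frame_b_origin,
                       min_distance: float, max_distance: float) -> bool:
    """Succeed when the distance between the two origins lies in
    [min_distance, max_distance] (inclusive bounds assumed)."""
    distance = math.dist(frame_a_origin, frame_b_origin)
    return min_distance <= distance <= max_distance


# Hand 0.5 m from the object, with an allowed band of 0.2 m to 0.6 m.
in_band = proximity_succeeds((0.0, 0.0, 1.0), (0.3, 0.4, 1.0), 0.2, 0.6)
too_close = proximity_succeeds((0.0, 0.0, 1.0), (0.03, 0.04, 1.0), 0.2, 0.6)
```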
For this example, we’ll use the shape contains condition node to check that the right hand frame is within the bounds of a sphere. In 1, our shape contains condition node is selected with the settings shown in the bottom left. There are two modes for the shape contains condition: contains frame and contains points. We’ll be using contains frame in this example, but the contains points functionality, which counts the number of live point cloud points in the virtual shape, is covered near [fig:alex_shape_contains_condition].
Shape Contains Frame Condition
In 1, you’ll see there are four main settings for the shape contains frame functionality: the shape parent frame, the shape pose, the sphere radius, and the frame to check containment of. We allow defining the shape’s pose with respect to any robot or object frame just like we do for actions. Checking the “Adjust Shape Pose” box enables the shape to be moved in the 3D view with a pose gizmo, as shown in the figure. Currently, just spheres are supported, but adding more types of shapes would be a good improvement.
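The four settings map directly onto a sphere containment test. The sketch below is a simplification under stated assumptions: the parent frame is taken to be axis-aligned with the world, so the shape pose is just an offset added to the parent origin, and the function name is hypothetical.

```python
import math


def sphere_contains_frame(parent_origin, shape_offset, radius: float,
                          checked_frame_origin) -> bool:
    """True when the checked frame's origin lies inside the sphere.

    parent_origin: origin of the shape parent frame, in world coordinates
    shape_offset: sphere center relative to the parent (rotation ignored)
    """
    center = tuple(p + o for p, o in zip(parent_origin, shape_offset))
    return math.dist(center, checked_frame_origin) <= radius


# Sphere of radius 10 cm placed 40 cm in front of and 20 cm above a
# hypothetical chest frame origin at (0, 0, 1.2).
chest = (0.0, 0.0, 1.2)
offset = (0.4, 0.0, 0.2)

hand_raised = sphere_contains_frame(chest, offset, 0.10, (0.42, 0.03, 1.38))
hand_at_side = sphere_contains_frame(chest, offset, 0.10, (0.05, -0.25, 0.85))
```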
Pairing this condition node with our fallback gives us a mechanism to react to an unknown state, like where the hand is. To robustify our behavior and apply corrective action if the hand is not where we want it, we will now add an arm action to the fallback catch, as shown in 2. When we add the catch action, because it is not concurrent with the condition in the try, we see a thin horizontal line in the tree editor. This line, between the “Check hand” and “Arm action” nodes, shows us where the try-catch delineation is.
The Goto Node
To show how you can implement while-loop-like behavior, we add a goto node after the arm action in the fallback catch, as shown in 3. The goto node has a goto field that refers to another node by name. When the goto node is executed, the next execution index is set to that node. Here we have selected the “Fallback node” as the target, which is a general way of saying to re-execute the try, as the “Fallback node” is a control node, not an action. When a goto points to a control node, it is effectively the same as pointing to the next action.
Since we want to be able to save this goto reference by name in the JSON, there is the possibility of ambiguity in the node’s name. To handle the case when there are multiple nodes of the same name in the tree, we search for the node to goto by proximity to the goto node. The closest matching node is chosen with levels of priority: children recursively come first, then siblings, then parents recursively. Our choice of priority is somewhat arbitrary. We recommend using unique node names where possible to avoid this ambiguity.
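The name lookup with priorities can be sketched as three passes over the tree. This is one plausible reading of the rule, not the actual implementation: the `Node` class is hypothetical, "children recursively" is taken as a depth-first search of the goto node's descendants, and "parents recursively" is taken as walking up the ancestor chain.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def find_descendant(node: Node, name: str) -> Optional[Node]:
    for child in node.children:
        if child.name == name:
            return child
        found = find_descendant(child, name)
        if found is not None:
            return found
    return None


def resolve_goto(goto_node: Node, target_name: str) -> Optional[Node]:
    # 1. Children, recursively.
    found = find_descendant(goto_node, target_name)
    if found is not None:
        return found
    # 2. Siblings.
    if goto_node.parent is not None:
        for sibling in goto_node.parent.children:
            if sibling is not goto_node and sibling.name == target_name:
                return sibling
    # 3. Parents, recursively.
    ancestor = goto_node.parent
    while ancestor is not None:
        if ancestor.name == target_name:
            return ancestor
        ancestor = ancestor.parent
    return None


root = Node("Root")
fallback = root.add(Node("Fallback node"))
check = fallback.add(Node("Check hand"))
arm = fallback.add(Node("Arm action"))
goto = fallback.add(Node("Goto"))
```

In this tree, `resolve_goto(goto, "Fallback node")` finds no matching descendant or sibling and falls through to the parent, while `resolve_goto(goto, "Check hand")` resolves at the sibling level.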
Reactive Fallback Demonstration
Finally, we add our spine action to turn the spine 17 degrees to the left as shown in 4. Notice this node is not a child of the fallback node, but a sibling. We run through the behavior in the manual mode, inspecting the flow of execution. We run check hand, it fails, then the arm action runs to move the hand into the correct location. The goto node causes us to re-evaluate the check, which now succeeds, sending us on to the spine action, which executes successfully.
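The execution flow of this demonstration can be traced with a toy index-based executor. The sketch hard-codes the four nodes and models the robot state as a single boolean, so it illustrates only the control flow, not the real executor or any robot API.

```python
from typing import List


def run_demo(hand_raised: bool, max_steps: int = 20) -> List[str]:
    """Trace the flow: try (check), catch (arm, goto), then spine."""
    trace: List[str] = []
    # Execution indices in this toy model:
    # 0 = "Check hand" (try), 1 = "Arm action" (catch),
    # 2 = goto "Fallback node" (catch), 3 = "Spine action" (sibling)
    index = 0
    for _ in range(max_steps):
        if index == 0:
            trace.append("Check hand: " + ("success" if hand_raised else "failure"))
            # Success skips the catch; failure enters it.
            index = 3 if hand_raised else 1
        elif index == 1:
            hand_raised = True  # the arm action moves the hand into place
            trace.append("Arm action")
            index = 2
        elif index == 2:
            trace.append("Goto: Fallback node")
            index = 0  # re-execute the try
        else:
            trace.append("Spine action")
            return trace
    raise RuntimeError("step budget exceeded")


trace_from_lowered_hand = run_demo(hand_raised=False)
trace_from_raised_hand = run_demo(hand_raised=True)
```

Starting with the hand lowered, the trace is check (failure), arm action, goto, check (success), spine action. Starting with the hand already raised, the catch is skipped entirely.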
Real Robot Example: Repeated Door Opening
In this final section of the usage guide, we’ll walk through a real robot authoring session in which we created a repeated door opening behavior from scratch in 32 minutes. For this experiment, we used the Unitree H1-2 robot. At the end of the authoring phase, we immediately entered a reliability test where we performed over 30 successful door openings with continuous autonomy. We didn’t run it until failure, just until we got bored.
Starting Things Up
Starting things up typically requires powering on the robot and the operator computer. In this guide we’ll assume the robot is tetherless but starts on a hoist to get it into its initial standing state. We’ll also assume that powering on the robot boots its on-board computers and supplies power to the motors.
After the robot’s on-board computers have started, we will deploy the latest version of the control and autonomy software if necessary. Once that is done, we launch the control and autonomy code on the robot via an SSH command line session and start the robot operator interface. The next step is to set the robot down and transition the controller to walking. We have a stand prep state and a walking state, with buttons in the operator interface to perform those transitions.
Standing and the Operator Interface
The robot should now be standing, and the operator interface should be showing the robot state and perception data. It should look like 1. At this point we check a couple of things to make sure everything is ready to go:
- The robot visualizer is enabled via the checkbox, the robot is visible in the 3D scene, and the update frequency is as expected, in this case 100 Hz.
- YOLOv8 is enabled with the checkbox and is running at some frequency in the 5-20 Hz range.
- The ZED X Mini colored point cloud is enabled with the checkbox, is running at 30 Hz, and is being updated in the 3D scene.
- The YOLO annotated image is showing, with the detections you would expect (in this case, the door lever) at the expected confidence level.
- The behavior tree view shows the synchronization frequency at around 30 Hz.
- Stable detections appear for the YOLO detections in the scene panel.
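The frequency items in this checklist could also be checked programmatically. The sketch below is hypothetical: the stream names, the 10% tolerance, and the idea of feeding it measured rates are all assumptions, and only the expected frequencies come from the checklist itself.

```python
from typing import Dict, List

# Expected (low, high) rates in Hz, taken from the checklist above.
EXPECTED_RATES_HZ = {
    "robot_visualizer": (100.0, 100.0),
    "yolov8": (5.0, 20.0),
    "zed_point_cloud": (30.0, 30.0),
    "behavior_tree_sync": (30.0, 30.0),
}

TOLERANCE = 0.10  # hypothetical slack around each bound


def check_rates(measured: Dict[str, float]) -> List[str]:
    """Return a list of human-readable problems; empty means ready."""
    problems = []
    for name, (low, high) in EXPECTED_RATES_HZ.items():
        rate = measured.get(name)
        if rate is None:
            problems.append(f"{name}: no data")
        elif not (low * (1 - TOLERANCE) <= rate <= high * (1 + TOLERANCE)):
            problems.append(f"{name}: {rate:.1f} Hz outside expected range")
    return problems


healthy = check_rates({
    "robot_visualizer": 99.0,
    "yolov8": 12.0,
    "zed_point_cloud": 30.0,
    "behavior_tree_sync": 29.0,
})
slow_yolo = check_rates({
    "robot_visualizer": 100.0,
    "yolov8": 2.0,
    "zed_point_cloud": 30.0,
    "behavior_tree_sync": 30.0,
})
```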
To start, we’ve created sequence nodes for our door opening behavior, with the top level being saved to JSON with the “*.json” extension. Saving the behavior frequently is important to avoid losing work. We’ve also loaded the go home skill as a convenient way to reset the robot when we need to.
The Ability Hand Action
We will be using the left arm and hand to turn the lever and open the door.
The first thing we need to do is specify that the finger configuration should be “flat hand” at the start.
To do this, we add a left Ability Hand action, as shown in 2, and set the “grip” field to FLAT.
There are several common grip types available as presets, including OPEN, CLOSE, PINCH, FLAT, HOOK, RELAX, DOOR_LEVER_OPEN, DOOR_LEVER_CLOSE, DOOR_LEVER_CRUSH, KEY_OPEN, and KEY_CLOSE.
It is also possible to set the individual finger joint angles for an Ability Hand action, as shown in 3. The Ability Hand has six degrees of freedom, each with a slider and a velocity setting. If the Ability Hand action comes before an arm action and is concurrent with that arm action, it can be previewed as it is tuned, as shown in the figure.
Arm Ready Action
Next, we’ll add a “left arm ready” action, as shown in 4, which gets the hand and arm into a state where we can start to approach the handle from a known configuration. Constraining the grasp approach in this way helps to robustify the behavior. Otherwise, the hand might come from varying angles, which may result in finger-handle collisions or unreliable inverse kinematics solutions. The ready action brings the hand up from the robot’s side to roughly the orientation used for grasping the handle, while keeping the hand about 10 cm away from the handle and clear of any collisions.
Additionally, we typically will use joint angles or a robot-relative hand pose to define the arm ready actions, as they don’t require a perceived object. This helps us in using the action as a reset, regardless of the state of the environment. We’ll reset back to this action later when testing out the door opening components. In this case, we define the hand pose in pelvis frame.
Next, we’ll create a scene action to lock on to the lever handle and another Ability Hand action to curl the fingers and pull the thumb out and back in preparation for handle contact. In 5, we’ve already run the scene action and have tuned the finger positions. We used the sliders on the hand action to form our desired grasp for the lever handle. The Ability Hand rubber is actually grippy enough that we won’t need to close our fingers around the handle. We then execute the Ability Hand action.
Pre-Grasp Actions
In 6, we create our “Pre-grasp 1” action. For this grasp, we will use two pre-grasp actions. The first one hovers the hand 3-4 cm above the handle so the fingers don’t hit the handle when getting into position. Avoiding finger collisions is important with our system, as it will try hard to get to the desired position. Given the somewhat delicate Ability Hand fingers, if you clip the fingers on things while moving the hands around, you run a real risk of breaking a finger. Plus, it’s just nice to not have unnecessary collisions.
This pre-grasp action is the first time in this behavior we need to be fairly precise in the range of a centimeter or two. It can be difficult to gauge the hand’s pose with respect to the lever using the desktop monitor point cloud and first person view. It’s a little easier in VR with stereo vision. Since we weren’t using VR for this demo, we addressed the problem by sitting in line-of-sight to the robot. The tuning process for this pre-grasp action is presented in 7.
Manipulation Action Tuning Process
We tune the first pre-grasp action until the only thing left to do to grasp the handle is to move the hand directly down such that it rests on the lever handle. 8 shows us authoring this action. We repeat the visual guess-execute-inspect process for this pre-grasp action.
Turning a Door Handle
Now that our hand is resting on the door lever handle, it’s time to turn the lever and unlatch the door. This is accomplished using a screw primitive action, as shown in 9. To tune a screw primitive, the first thing to do is set up the screw axis, which is the white dotted line shown in the figure. It is moved using a pose 3D gizmo, just like the hand. In general, the screw axis should be aligned with the rotational axis of the object you are manipulating. In practice, we tend to have to move it some to get the desired robot motion, which is, in the end, the important part. In addition to the axis, there are the translation and rotation amounts. In practice, we just have to use a tuning loop like in 7 to guess at these values and re-execute until we get the desired result.
For door lever turning we typically leave translation at zero and adjust the rotation to be more than necessary. We also have to move the axis above and to the side opposite to the hand in order to produce a position-controlled motion that achieves the necessary forces to turn the lever properly.
The iterative tuning process for a screw primitive requires an extra element for step two. Since the screw primitive is a motion relative to the hand’s current position, playing it back a second time will cause the hand to travel further and further along the helical motion profile, rather than resetting to the beginning. For this reason, we need to reset back to the immediately preceding arm action, in this case pre-grasp 2, in order to re-execute the door lever turn action.
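The geometry of a screw motion, and why replaying it accumulates, can be shown with Rodrigues' rotation formula. This sketch only moves the hand position (not its orientation) and applies the full rotation in one step rather than as a trajectory; the function name and the numbers are illustrative assumptions, not the controller's interface.

```python
import math


def screw_transform_point(point, axis_point, axis_direction,
                          rotation_rad: float, translation_m: float):
    """Rotate `point` about the axis by rotation_rad and slide it
    translation_m along the axis (Rodrigues' rotation formula)."""
    norm = math.sqrt(sum(c * c for c in axis_direction))
    k = tuple(c / norm for c in axis_direction)
    v = tuple(p - a for p, a in zip(point, axis_point))
    cos_t, sin_t = math.cos(rotation_rad), math.sin(rotation_rad)
    k_cross_v = (
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    )
    k_dot_v = sum(a * b for a, b in zip(k, v))
    rotated = tuple(
        v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1.0 - cos_t)
        for i in range(3)
    )
    return tuple(
        axis_point[i] + rotated[i] + k[i] * translation_m for i in range(3)
    )


origin = (0.0, 0.0, 0.0)
z_axis = (0.0, 0.0, 1.0)
hand_start = (0.1, 0.0, 0.0)

# First execution: a quarter turn about the axis, zero translation.
once = screw_transform_point(hand_start, origin, z_axis, math.pi / 2, 0.0)

# Re-executing from where the hand ended up keeps travelling along the
# same profile instead of repeating the first motion.
twice = screw_transform_point(once, origin, z_axis, math.pi / 2, 0.0)

# Flipping the axis direction reverses the sense of a positive rotation.
flipped = screw_transform_point(
    hand_start, origin, (0.0, 0.0, -1.0), math.pi / 2, 0.0
)
```

The first execution moves the hand from (0.1, 0, 0) to about (0, 0.1, 0), and the second carries it on to about (-0.1, 0, 0), which is why the hand is reset to the preceding arm action before each tuning attempt. The flipped case ends near (0, -0.1, 0), showing how the axis direction sets the sign of the rotation value.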
Additionally, for the door lever turning action, we must make sure the handle is turned enough to unlatch the door sufficiently. This is why we also perform a direct visual inspection of the latch as part of this tuning process.
Opening a Door
The door opening is done as another screw primitive action, but along a different axis: the door panel hinge. Aligning the screw axis to the door panel hinge is done using the point cloud as shown in 10. It doesn’t have to be super precise; within a few centimeters in the X-Y plane and 5 degrees rotationally is fine.
For door opening, the tuning process is a bit easier than the lever turning. This is because the hand is fairly compliant to the door since it’s applying a force on the handle and there is no unlatching requirement on this one. The direction of the axis alignment will determine whether the screw rotation value for opening is positive or negative.
Looping a Behavior
Once we finish up the door opening action and execute it successfully, the repeated door opening behavior is pretty much done! We set the “execute after” field of the pre-grasp 1 action to the scene action so that the arm moves while the fingers curl. Then, we add a goto action at the end that goes back to the beginning, to create an infinite loop. We then check the execute “Autonomously” checkbox and let it spin. This behavior executed 33 times in a row successfully before we stopped it.
In this guide, we have covered what all the main action types do and how to use them. All behaviors presented in this thesis build on these basic concepts to create more complete and longer-horizon behaviors. In the next chapters we will cover the scientific and theoretical contributions of our work.