7.5. Evaluation

So how well did our hypotheses hold up? First, let’s state them again:

  1. Robot-local execution with synchronized UI state, concurrent action layering, reactive tree logic, and behavior-time semantic perception yield door behaviors that are faster and more reliable than prior IHMC baselines and competitive with reported reinforcement learning systems on overlapping door tasks.

  2. Runtime-editable behaviors and perception modules reduce the iteration loop required to diagnose failures, modify logic, and re-test on the robot, relative to redeploy, restart, or retrain workflows.

  3. Decomposing behaviors into reusable primitives, subtrees, and scene actions allows new door and loco-manipulation variants to be brought up by editing a small part of a working behavior rather than rebuilding it from scratch.

We can say that our first hypothesis holds. Figure 7.1 shows that our system, with robot-local execution, synchronized UI state, concurrent action layering, reactive tree logic, and behavior-time semantic perception, is sufficient to yield door behaviors that are faster than a prior IHMC baseline: the 2021 Atlas door behavior took 3.9x longer to execute than our 2024 Nadia pull door behavior. Figure 7.29 shows that our system is also competitive in speed with reinforcement learning policies. Our door behaviors are in the same tens-of-seconds regime as Zhang et al. [49] and DoorMan [75].

For reliability, unfortunately, there is a lack of data for prior IHMC baselines. In Figure 7.14, however, we show that our loco-manipulation behaviors, namely door approach and opening, were repeated 11 and 12 times without failure. Evidence for door behavior reliability on learned systems is likewise limited. Zhang et al. [49] show the most convincing reliability with their 20/20 and 18/20 trial sets for full door traversals. DoorMan [75] claims “83%” reliability, but does not share the data from which that number is calculated. We can say that the reliability of our system is competitive with learned door systems, especially when the resilience capabilities, also presented in Figure 7.14, are factored in.
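Because all of these trial sets are small, it can help to look at how much a run of successes actually constrains the underlying success rate. The following is a minimal sketch, not an analysis performed in this work, that computes the lower end of a 95% Wilson score interval for the trial counts quoted above. The function name, the dictionary labels, and the choice of the Wilson interval are our own illustrative assumptions.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    z2 = z * z
    denominator = 1 + z2 / trials
    center = p_hat + z2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials))
    return (center - margin) / denominator


# Trial counts quoted in the text; the labels are illustrative only.
trial_sets = {
    "ours, 11 repeats": (11, 11),
    "ours, 12 repeats": (12, 12),
    "Zhang et al., 20/20": (20, 20),
    "Zhang et al., 18/20": (18, 20),
}

for label, (successes, trials) in trial_sets.items():
    bound = wilson_lower_bound(successes, trials)
    print(f"{label}: observed {successes}/{trials}, 95% lower bound {bound:.2f}")
```

Under these assumptions, a perfect run of n trials only bounds the success rate from below at roughly n / (n + 3.84), which is why we describe the systems as competitive rather than ranking them.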

We can also say that our second and third hypotheses hold. Runtime-editable behaviors and perception have reduced our behavior authoring times to the hours regime, as shown in Figure 7.20. Creating new behaviors from scratch requires failure diagnosis, logic modification, and re-testing on the robot, all of which we have described and documented in our authoring videos and tables. While we don’t have direct comparisons with prior IHMC baselines, Figure 7.32 suggests a drastic reduction in the iteration loop required to create new behaviors. This figure spans a decade of IHMC real-robot behavior milestones from the 2015 DRC Finals through the Alex loco-manipulation demos in 2026. What we see is a drastic increase in the rate of new behavior demonstrations over time, given the same or fewer resources allocated to creating them.

The literature does not provide data on the creation or modification timelines of loco-manipulation behaviors. Measuring this, where no comparable measurements exist, is one of the most novel aspects of our work. However, given knowledge of the architecture of DoorMan [75], we can estimate that our behavior authoring process is 50 to 100 times faster, as illustrated in Figure 7.31. We credit this speed to our architectural decision to decompose behaviors into reusable primitives, subtrees, and scene actions. This allows us to edit a small part of a working behavior rather than rebuilding the whole thing from scratch. The ability to perform these edits and re-test the behavior at runtime speeds up the process even further.

Figure 7.32. Calendar timeline of IHMC real-robot behavior milestones since the 2015 DRC Finals. Colors distinguish teleoperation, hybrid or supervised operation, legacy hard-coded autonomy, and runtime-authored door, manipulation, and loco-manipulation behaviors. Entries include teleoperated and non-loco-manipulation demos from the development story in Building our Behavior Architecture.

7.5.1. Desirable Characteristics

So how well did we achieve the desirable characteristics we defined in Desirable Characteristics? We think we did pretty well.

Our system is highly capable. We demonstrated dozens of different task varieties, from door traversal types and sorting objects on tables to building exploration. A taxonomy is illustrated in Figure 1.3.

There are, however, classes of tasks we cannot do, such as grasping objects while the robot is walking, dynamic bracing, tasks that require proprioception, and fast manipulation tasks such as playing ping-pong. We cannot yet do them because the necessary control and perception components do not yet exist. Dancing, swimming, and sports are probably classes of tasks that this system is not well suited to address because they are too continuous, too dynamic, and even artistic.

Demonstrating our system was certainly feasible, as is shown by our results on real hardware. However, we did not show that it was feasible to take our system off-site and reproduce the results.

Our system definitely supports fast behaviors. We feel that the included videos of our real-robot demonstrations are watchable at 1x speed.

It also supports parallelism through moving multiple parts of the body at once, including being able to manipulate a door panel while walking.

We showed that our system is capable of producing reliable repeated-run behaviors for door approach and opening and for sorting balls on tables. Our walking controller was, however, not very reliable for the door traversal walk-throughs.

We showed how our system supports robustness to environmental disturbances through our pull door reactivity demo and our ball sorting demo with human disturbance. This goes hand in hand with our resilience capabilities. By using fallback nodes with condition nodes, common failure modes can be handled.
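The fallback-plus-condition pattern mentioned above can be sketched in a few lines. This is a generic, textbook-style behavior tree illustration and not the interface of our system: the class names, the `world` dictionary, and the `pull_door_open` action are all hypothetical stand-ins.

```python
from enum import Enum
from typing import Callable, List


class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3


class Node:
    def tick(self) -> Status:
        raise NotImplementedError


class Condition(Node):
    """Succeeds when the predicate holds, fails otherwise."""

    def __init__(self, predicate: Callable[[], bool]):
        self.predicate = predicate

    def tick(self) -> Status:
        return Status.SUCCESS if self.predicate() else Status.FAILURE


class Action(Node):
    """Wraps a callable that performs work and reports a status."""

    def __init__(self, effect: Callable[[], Status]):
        self.effect = effect

    def tick(self) -> Status:
        return self.effect()


class Fallback(Node):
    """Ticks children in order and returns the first non-failure status."""

    def __init__(self, children: List[Node]):
        self.children = children

    def tick(self) -> Status:
        for child in self.children:
            status = child.tick()
            if status != Status.FAILURE:
                return status
        return Status.FAILURE


# Hypothetical world state and recovery action, for illustration only.
world = {"door_open": False}
log: List[str] = []


def pull_door_open() -> Status:
    log.append("pull_door_open")
    world["door_open"] = True
    return Status.SUCCESS


ensure_door_open = Fallback([
    Condition(lambda: world["door_open"]),  # already satisfied? then do nothing
    Action(pull_door_open),                 # otherwise run the recovery action
])

first = ensure_door_open.tick()    # door closed, so the action runs
second = ensure_door_open.tick()   # door open, so the condition short-circuits
world["door_open"] = False         # simulated disturbance: someone closes the door
third = ensure_door_open.tick()    # the condition fails again, so the action reruns
```

Because the condition is re-evaluated on every tick, the same small subtree both skips unnecessary work and recovers when a disturbance undoes earlier progress.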

We support longer-tail resilience through operator-robot teaming. When behaviors fail, the human operator might be able to solve the problem with edits or additions to the behavior.

Our system is independent from external systems when it is in autonomous mode. It does not rely on external comms or compute in this mode. Our perception is entirely from on-board color vision.

Our system is designed around adaptability of the human-robot team. We have shown this through our results in behavior authoring. To achieve this, we made our system observable, predictable, and directable.

Our system is learnable, and we have a handful of trained expert operators. We have provided a usage guide in Usage Guide.

We hope the reader will find our system understandable through reading this thesis, watching the videos, and browsing the source code.

We think the interface is easy to use, as suggested by our fast from-scratch authoring times.

Our system can be analyzed after behavior runs. We have done this via screen recordings of the operator interface and the log system shown in Section 4.6.9.

Our system is debuggable by looking at logged data and by rerunning behavior logic in an application we call the “behavior test facilitator”. This application uses a kinematic simulation combined with the playback of real perception data from a logged run. Very often, bugs can be reproduced and fixed in this environment. We also use this application to automate behavior tests.

Lastly, we view our system as being very extendable. As laid out in Building our Behavior Architecture, the story of building it has been feature after brainstormed feature. This train of extension is still in active motion.

References cited on this page

[49] M. Zhang, Y. Ma, T. Miki, and M. Hutter, “Learning to open and traverse doors with a legged manipulator,” arXiv preprint arXiv:2409.04882, 2024. Available: https://arxiv.org/abs/2409.04882

[75] H. Xue et al., “Opening the sim-to-real door for humanoid pixel-to-action policy transfer,” arXiv preprint arXiv:2512.01061, 2025.