8. Discussion

We are pretty happy with what we’ve been able to accomplish with a model-based approach. The combination of DARPA Robotics Challenge-inspired Coactive Design for humanoid robot operation, affordance templates, behavior trees, and a behavior-managed perception scene has proved to work very well. Over a 10-year period, we went from a 20-person team working tirelessly over years to get a robot doing 8 tasks via teleoperation, to essentially one person being able to set up a fully autonomous multi-station ball sorting behavior in hours.

Much of the field, and especially startup robotics companies, has now abandoned these fundamental approaches and jumped ship for “end-to-end” learning. However, we are convinced that a lot of tasks can be solved both ways. In fact, with the increased prevalence of intelligent coding assistance, our more classical, model-based approach may even be able to keep pace with learned systems.

8.1. Strong Points

We think our robot-local, runtime-editable design is really the strongest success of our system. Running the compute locally makes the system more robust by decoupling it from external digital disturbances such as communications failures and compute failures on remote servers. Editing behaviors online has enabled rapid bringup of new behaviors as well as modification, composition, and extension of existing behaviors, as we have shown.

We also think we did well with the interface between the behavior system and the robot controller. The ability to schedule asynchronous footstep, arm, leg, spine, and neck commands enables a high level of behavioral expressiveness. It also supports the integration of classical planners and AI-assisted behavior composition.

The next most successful part of our system is probably the authorable behavior-time perception through scene actions. Perception systems have a lot of important and non-intuitive quality levels. Furthermore, these quality levels can vary with lighting, situation, and hardware differences. The metrics of quality include depth accuracy, semantic object detection confidence, latency, and frequency. Our design designates an expert human operator to evaluate and exploit available quality levels for each behavior situation to achieve the goal at hand.

We also think a strong point of our system is its ability to be combined with learned systems. For example, as mentioned in Building Our Behavior Architecture, we built an action node that executed a learned mimic policy, such as a dance or a door traversal walkthrough. More work in this direction could enable more robust door walkthroughs. We think this is a good way to get the best of both worlds for the time being.

8.2. Weak Points

There are important caveats, however, with our current system. For one, our object-on-table manipulation demos center around a bottle pickup and balls. We haven’t cracked the nut on perceiving the orientation of objects. This is why our premier sorting demos use spherical balls. Since they are symmetrical objects, only position is required to grasp them from any direction. We can even sometimes grasp the balls while they are moving. We have preliminary results using FoundationPose for object orientation estimation, but it is not yet working well enough for repeatable tests.

Another major weak point is the lack of support for executing sequence subtrees that return when they are done. This would solve two existing problems in the system. For one, it would solve the problem of having two of the same JSON file loaded in the tree at once. When the same JSON file is loaded twice in the tree, changes to one instance are not reflected in the other, and saving one can overwrite the other. If sequences could be run as a subtree and return when done, identical subtrees throughout the system could be instantiated once instead of duplicated.
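
As a rough illustration of what we mean, the sketch below shows one way a subtree could be held once and referenced from several places. It is not code from our system: the class names `ActionNode`, `SequenceNode`, and `SubtreeCall`, the `library` dictionary, and the action names are all hypothetical, and we assume for simplicity that every action runs to completion before returning.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ActionNode:
    """Leaf node: a named action that runs to completion (hypothetical)."""
    name: str

    def run(self, log: List[str], library: Dict[str, "SequenceNode"]) -> None:
        # A real action would command the robot; here we only record the name.
        log.append(self.name)


@dataclass
class SequenceNode:
    """Runs its children in order and returns when the last one is done."""
    name: str
    children: list = field(default_factory=list)

    def run(self, log: List[str], library: Dict[str, "SequenceNode"]) -> None:
        for child in self.children:
            child.run(log, library)


@dataclass
class SubtreeCall:
    """References a shared subtree by name instead of holding its own copy."""
    target: str

    def run(self, log: List[str], library: Dict[str, SequenceNode]) -> None:
        # Look up the single shared definition at run time, then return.
        library[self.target].run(log, library)


def build_demo():
    """Build a toy sorting tree that uses the same pick subtree twice."""
    library = {
        "pick_ball": SequenceNode("pick_ball",
                                  [ActionNode("reach"), ActionNode("grasp")]),
    }
    root = SequenceNode("sort", [
        SubtreeCall("pick_ball"), ActionNode("drop_left"),
        SubtreeCall("pick_ball"), ActionNode("drop_right"),
    ])
    return root, library
```

Because both call sites look up the same entry in `library`, an edit to `pick_ball` shows up everywhere it is used, and there is only one definition to save.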

We have also lacked the ability to call a subroutine with one line. In general, making the behavior tree more closely mirror the structure of computer programs would greatly increase the expressiveness of the system. We think a good strategy would be to keep thinking of the next most useful thing to add, guided by operator frustration in trying to create new behaviors.

We also have a scaling problem with large trees, both in the view and in the data synchronization. We were able to use trees consisting of well over 100 nodes, as we presented in Building Our Behavior Architecture with the ONR Demo that consisted of 178 nodes. However, we don’t think this could grow much further without some optimization. Two issues stand out. First, when loading in large subtrees, there can be a loading delay that can cause a glitch in the network synchronization. Second, for trees with a large maximum depth, the behavior tree editor view entries near the leaves of the tree can become indented so far that they are hard to work with. We think both of these issues are very solvable.

8.3. Natural Next Steps

Throughout this work, we really wanted to add force-based action primitives. The codebase even includes a “wrench action” node, which was used to pick up boxes. Ultimately, we just didn’t have the bandwidth to approach it from a controls perspective.

The wrench actions for box lifting worked only under very special circumstances. They depended on prior position-based arm actions to place the hands on either side of the box. Then, the wrench action would command inward accelerations for the hands, resulting in a force on the sides of the box. The wrench action was never completed with a termination condition, so it was always a hack when demonstrated.
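
To make the missing piece concrete, here is a minimal sketch of what a termination condition for such a squeeze could look like. This is not something we implemented: the class name `SqueezeTermination`, its parameters, and the idea of counting control ticks are all assumptions made for illustration, and we assume a per-hand inward force measurement is available.

```python
from dataclasses import dataclass


@dataclass
class SqueezeTermination:
    """Hypothetical termination check for a two-handed squeeze."""
    target_force: float   # assumed minimum inward force per hand, in newtons
    hold_samples: int     # consecutive ticks the force must be held
    max_samples: int      # timeout, expressed in control ticks
    _held: int = 0
    _seen: int = 0

    def update(self, left_force: float, right_force: float) -> str:
        """Feed one force sample per hand; returns the action status."""
        self._seen += 1
        # Both hands must reach the target, so check the weaker one.
        if min(left_force, right_force) >= self.target_force:
            self._held += 1
        else:
            self._held = 0  # any dip restarts the hold count
        if self._held >= self.hold_samples:
            return "succeeded"
        if self._seen >= self.max_samples:
            return "failed"
        return "running"
```

Requiring several consecutive samples is meant to keep a single noisy reading from ending the action early, and the timeout keeps the action from running forever if the box is never gripped.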

It would be a natural next step to develop a force-based action definition in combination with good support for it in the whole body controller. One area where we would like to use it is for turning door handles. Door handles have different spring constants and might require a torque-based ramp if the handle isn’t moving. A hybrid force-position action might do the trick well: force-controlled angular velocity along the screw trajectory until the end stop is hit. Lots of tasks, like opening drawers and sliding things against surfaces, could take advantage of such an approach.
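
The torque ramp idea can be sketched in a few lines. This is only an illustration under assumptions: the function name `next_handle_torque` and all of its parameters are hypothetical, we assume a fixed-rate control loop with a measured handle angular velocity, and nothing here reflects an existing interface in our whole body controller.

```python
def next_handle_torque(torque, measured_velocity, *, ramp_rate, dt,
                       moving_threshold, max_torque):
    """Return the torque to command on the next control tick.

    torque            -- torque commanded on the previous tick (N*m)
    measured_velocity -- measured handle angular velocity (rad/s)
    ramp_rate         -- how fast to raise torque while stalled (N*m/s)
    dt                -- control period (s)
    moving_threshold  -- speed above which the handle counts as moving (rad/s)
    max_torque        -- safety limit on the commanded torque (N*m)
    """
    if abs(measured_velocity) >= moving_threshold:
        # The handle is turning, so hold the torque that got it moving.
        return torque
    # The handle is stalled: ramp up, but never past the safety limit.
    return min(torque + ramp_rate * dt, max_torque)
```

In this sketch, a handle that stays stalled once the torque has saturated at `max_torque` would have to be interpreted by the surrounding action, for instance as having reached the end stop or as being locked.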

In general, force-controlled primitives can be a more natural way to interact with the world. In fact, they can even be used as a form of proprioception. For example, whether doors are push or pull, and whether they are locked, is not always visually identifiable. One often has to try pushing in various directions to figure out what is going on with articulated objects.

Another area of work that would be nice to explore would be grasp planning. We have seen that there may be some promising development in grasp planning, but haven’t explored it at all yet. For example, when going to operate tools designed for humans, it would be helpful and, we would think, feasible to plan a 5-finger grasp for them instead of having to manually specify finger configurations.

The problem doesn’t have to be solved generally. For example, if the behavior authoring interface were adapted to virtual or augmented reality, the user could simply use their hands to demonstrate an example grasp technique to seed the planner.

Our screw primitive action, inspired by [14], is a very versatile manipulation planning abstraction. We think there could be useful constructs like this that extend beyond positional sliding and turning. There may be interesting ways to model forces, torques, or the way objects can deform or move, using a similar approach. For example, there is some work like this on clothes folding where geometrical primitives are used to define folding patterns [90].

Another fruitful area to explore would be navigation. In the behaviors presented in this thesis, even for the “building exploration” ones, we didn’t ever form a map and plan from point A to point B. They were more like a slightly reactive scripting of which door to go through next, with some stand-in-place visual searches. Our building exploration demos have been limited to a maximum of 3 doors separating spaces in an open lab environment. The main reason for that is that we haven’t had a fully tetherless robot to work with yet, so we couldn’t take it into non-lab spaces.

We think our approach to navigation would follow a similar pattern to our behavior-time scene: keep a behavior-local map that has authorable behavior-time interactions. This is because the behaviors are the primary consumer of this information and are the most informed about what they need and when. We would likely maintain very different map representations for different task levels. For navigating from room to room in a building, a topological graph structure would be the most useful. For traversing a space, a heightmap or octree might be the most useful. For locating objects in dense spaces, a detailed scene graph might be the most useful. It would be exciting and interesting to explore options for extending our behavior scene to include both maps and behavior-specific planners.
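
For the room-to-room level, a minimal sketch of planning over a topological graph might look like the following. We have not built this: the function name `plan_rooms`, the representation of doors as pairs of room names, and the choice of breadth-first search are all assumptions made for illustration.

```python
from collections import deque


def plan_rooms(doors, start, goal):
    """Breadth-first search over a room graph.

    doors -- iterable of (room_a, room_b) pairs, one per door (hypothetical)
    Returns the list of rooms from start to goal, or None if unreachable.
    """
    # Build an undirected adjacency list: a door connects rooms both ways.
    graph = {}
    for a, b in doors:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)

    parents = {start: None}
    queue = deque([start])
    while queue:
        room = queue.popleft()
        if room == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while room is not None:
                path.append(room)
                room = parents[room]
            return path[::-1]
        for neighbor in graph.get(room, []):
            if neighbor not in parents:
                parents[neighbor] = room
                queue.append(neighbor)
    return None
```

Breadth-first search returns a route with the fewest door traversals, which seems like a reasonable first cost for a robot whose door behaviors are its most expensive steps. A weighted search would be the natural replacement if doors differ in difficulty.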

It would also be great if we could use virtual reality while authoring to record arm motions. These recordings could then be used directly as arm actions. This would enable the operator to quickly define complicated motions that are hard to specify with geometric authoring alone. That capability could also be extended by filtering the motions or by collecting repeated demonstrations and training a vision-action policy. It would be valuable to demonstrate and dispatch vision-action model training runs from within the authoring interface. On training completion, the resulting policies could be used as an alternative mode of the arm action.
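
As one example of what filtering the motions could mean, the sketch below smooths a recorded joint trajectory with a centered moving average. The function name `smooth_trajectory`, the representation of a recording as a list of joint-angle tuples, and the choice of filter are assumptions for illustration only.

```python
def smooth_trajectory(samples, window):
    """Centered moving average over a recorded joint trajectory.

    samples -- list of tuples, one tuple of joint angles per time step
    window  -- odd number of time steps to average over
    The window shrinks at the ends so the output has the same length.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        chunk = samples[lo:hi]
        # zip(*chunk) groups the values of each joint across the window.
        smoothed.append(tuple(sum(joint) / len(chunk) for joint in zip(*chunk)))
    return smoothed
```

A filter like this would mostly remove hand tremor and tracking jitter from the recording. It also rounds off sharp, intentional motions, so the window size would need to be something the operator can adjust.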

Taking a step back and looking at our system as a whole, there’s also an important gap in handling open world environments. We focus on a human operator’s ability to author robot behaviors rather than robot-generated ones. The lack of a generative ability on the robot, where it could “author” its own behavior, limits our system to well-thought-out, human-constructed scenarios.

8.4. Why Not Fully Embrace Behavior Trees

One interesting part of this thesis is that we use a “behavior tree” architecture, but leave out some of the key mechanisms from the literature version [60]. The main reason is simply that we built what appeared to work best for us. The other aspect is that the lone result from the literature just wasn’t that convincing [62], being ultimately very slow and locomotion only.

The reason we don’t tick from the root node is that, for authoring behavior at runtime, we have the robot in a particular configuration and doing a particular thing and we are trying to author a particular part. This means we want a “cursor”, just like in a text editor, that marks our current editing location.
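
A toy version of the cursor idea is sketched below. It flattens the tree into an ordered list of actions to keep the example short, and it assumes that execution resumes from the cursor. The class name `CursorExecutor`, its methods, and the action names are hypothetical and are not taken from our codebase.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CursorExecutor:
    """Hypothetical executor that runs from a cursor instead of the root."""
    actions: List[str] = field(default_factory=list)
    cursor: int = 0  # index of the next action to execute or edit

    def insert_at_cursor(self, action: str) -> None:
        """Author a new action at the current editing location."""
        self.actions.insert(self.cursor, action)

    def step(self) -> Optional[str]:
        """Execute the action under the cursor and advance past it."""
        if self.cursor >= len(self.actions):
            return None  # nothing left to execute
        action = self.actions[self.cursor]
        self.cursor += 1
        return action
```

For example, after stepping through a walk to the door, the operator can insert a new action at the cursor and execute it right away, without re-running the walk that already put the robot where it is.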

It may be beneficial to introduce the root node tick system for autonomous mode. This would still offer the prospect of enabling the behavior to drastically change goals or strategies at basically any point in the behavior. However, behavior trees don’t really address the full problem here, as the robot might be in a situation, such as when going through a door, that it can’t immediately get out of. In dynamical situations, the robot really has to finish what it’s doing before aborting the mission. In these cases, it needs to execute specific “abort mission” behaviors that are very dependent on the current robot and task configuration.

This is where we guess Behavior Trees would suggest using the blackboard. However, the Behavior Tree blackboard is an unstructured design component and we would strive for a more formal mechanism for doing things like this. One promising theory for this sort of reactivity is sequential composition [59].

8.5. The Future

We would like to see a near future where technical people can buy or build their own robots and put them towards useful work. To us, it is most convincing if the system is open source, understandable, and tinkerable. We would like to be able to modify and extend the capabilities of personally owned robots without being beholden to subscriptions or AI datacenters. We think it is possible to build such a system using classical approaches for planning and control and simpler neural nets like YOLO and FoundationPose for vision. This is in direct contrast to where much of the industry is going. We imagine that if we extended our runtime-editable system to a portable device like an augmented reality headset, and with more development, we could author behaviors for common household chores and basic outdoor contracting work. It could be that the real world is too diverse for anything other than large-scale end-to-end learned systems, but we’d like to explore and try to find out where the boundaries between classical and learned systems really are.

Given the above dream, we would definitely like to keep using our system into the future and keep expanding it to new capabilities and modalities. We are passionate about the approach taken in this thesis and think it has a bright future. A lot of smart people contributed to this system over more than a decade and we are honored to present such positive results.

Thank you to the reader for hanging in there for what turned out to be a very long thesis. We hope we have sparked some interest in our unique approaches to problems and that you got some value out of it. Cheers!

References cited on this page

[14] A. Pettinger, F. Alambeigi, and M. Pryor, “A versatile affordance modeling framework using screw primitives to increase autonomy during manipulation contact tasks,” IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7224–7231, 2022, doi: 10.1109/LRA.2022.3181732.

[59] R. R. Burridge, A. A. Rizzi, and D. E. Koditschek, “Sequential composition of dynamically dexterous robot behaviors,” The International Journal of Robotics Research, vol. 18, no. 6, pp. 534–555, 1999, doi: 10.1177/02783649922066385.

[60] M. Colledanchise and P. Ögren, Behavior trees in robotics and AI: An introduction. CRC Press, Taylor & Francis Group, 2018. doi: 10.1201/9780429489105.

[62] A. De Luca, L. Muratore, and N. G. Tsagarakis, “Autonomous navigation with online replanning and recovery behaviors for wheeled-legged robots using behavior trees,” IEEE Robotics and Automation Letters, vol. 8, no. 10, pp. 6803–6810, 2023, doi: 10.1109/LRA.2023.3313052.

[90] M. Moletta, M. K. Wozniak, M. C. Welle, and D. Kragic, “A virtual reality framework for human-robot collaboration in cloth folding,” in 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids), Austin, TX, USA, Dec. 2023.