Bootstrapping Automation with Teleoperation and Data-Driven Reinforcement Learning
A new kind of job is quietly emerging: the teleoperator, a human piloting a robot remotely. While remotely completing a task is useful in itself, there is much more to it. Every successful trial can be logged, building a dataset of experiences. Done properly, this becomes an endlessly reusable learning resource to train any number of autonomous robots to perform the same tasks. Here I will go over the potential and unsolved problems of teleoperation, review selected projects in data-driven learning and speculate on the evolution and opportunities of the nascent teleoperation industry, with a focus on manipulation.
Introduction
Building useful AI agents is hard. Since we didn’t figure out a priori how to build general intelligence, the best we can do as of today is to take a statistical, regression-based approach: build a huge dataset that incorporates the behaviour we would like to see, take a predictive function with a lot of free parameters, and write an algorithm to tune those parameters so that the function is often correct when faced with a data point similar to the ones in the dataset. You may not like it, but the bitter lesson is that it works reasonably well, since we have powerful computers.
Now, building useful physical AI agents, a.k.a. robots, is harder. Traditional approaches require online interaction: the robot learns while performing actions. While this works well in virtual environments such as video games, deploying a baby robot to learn in the real world is costly and dangerous; moreover, such online algorithms are not suited to reusing past experiences. To overcome these problems, a promising direction is data-driven reinforcement learning (also called offline RL or batch RL), in which we train agents offline on a dataset of already collected experiences. The training can be done virtually, and the learned skills, usually in the form of a neural network, can then be deployed to a real robot. If we assume that offline RL works well we are now able to reuse past experiences indefinitely, but we still haven’t solved the most crucial problem: access to a large dataset. As of today, we simply lack the large datasets of robot experiences needed to power the learning, in the same way that large amounts of labelled pictures and text powered the advances in computer vision and natural language processing. Teleoperation has entered the chat.
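To make the offline setting concrete, here is a minimal sketch of what “learning from a fixed dataset” means: a pile of logged transitions and the simplest possible learner, a linear behaviour-cloning policy fitted by least squares. All names, shapes and numbers are illustrative and not taken from any specific library.

```python
import numpy as np

rng = np.random.default_rng(0)

# A logged "experience": (observation, action, reward, next_observation, done).
dataset = [
    (rng.normal(size=4), rng.normal(size=2), 1.0, rng.normal(size=4), False)
    for _ in range(1000)
]

# Behaviour cloning as the simplest offline learner: a linear policy fitted by
# least squares to reproduce the logged actions from the logged observations.
obs = np.stack([o for o, a, r, o2, d in dataset])
act = np.stack([a for o, a, r, o2, d in dataset])
W, *_ = np.linalg.lstsq(obs, act, rcond=None)

def policy(observation):
    """Predict an action for a new observation, using only the offline data."""
    return observation @ W
```

The crucial point is that the robot never has to act in the world while learning: everything above runs on logged data.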
Teleoperation
Teleoperation, or telerobotics, is about separating the brain and the body: the human operator controls the movements and makes the decisions, while the robot executes them. In practice there are different degrees of teleoperation:
- Direct Control: The operator controls the motion of the robot directly, without any automated help.
- Shared Control: Some degree of autonomy or automated help assists the operator. The level of autonomy is set in advance and stays fixed.
- Shared Autonomy: Same as Shared Control, but the level of autonomy is adjusted dynamically (and autonomously!) according to the situation, as in the toy sketch after this list.
- Supervisory Control: The control happens at a very high level; the robot executes nearly all functions autonomously.
(to go deeper, check Autonomy in Physical Human-Robot Interaction: a Brief Survey and the classic reference (chapter 43)).
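To make the difference between shared control and shared autonomy concrete, here is a toy sketch: both blend the operator command with an autonomous command, but only shared autonomy adapts the blending weight online. The commands and the confidence score below are made up for illustration.

```python
import numpy as np

def blend(human_cmd, auto_cmd, alpha):
    """Convex combination of the two commands: alpha = 1 is pure direct control."""
    return alpha * np.asarray(human_cmd) + (1.0 - alpha) * np.asarray(auto_cmd)

# Shared control: the blending weight is chosen in advance and kept fixed.
fixed_cmd = blend(human_cmd=[0.2, 0.0], auto_cmd=[0.3, 0.1], alpha=0.7)

# Shared autonomy: the weight is recomputed at every step, for instance from
# how confident the assistance is about the operator's goal (a made-up score here).
def shared_autonomy_step(human_cmd, auto_cmd, confidence):
    alpha = 1.0 - confidence   # trust the automation more when it is confident
    return blend(human_cmd, auto_cmd, alpha)

print(fixed_cmd, shared_autonomy_step([0.2, 0.0], [0.3, 0.1], confidence=0.9))
```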
The robot can really be anything: a drone, a manipulator, a vehicle or a humanoid. The robot hardware often dictates how we command it; a non-exhaustive list of controllers includes joysticks, steering wheels, virtual reality kits, twin robotic arms, haptic controllers and suits, electromyography sensors, and tracking the operator’s body with a camera, fiducial markers or AI body tracking!
Despite being old, with roots going back to the 1940s and 1950s in the remote manipulation of radioactive waste, teleoperation is still not a mature technology. My favourite overview of teleoperation shortcomings remains this 2007 paper, which mentions 8 limiting factors: the cameras’ narrow field of view, the difficulty of figuring out the robot’s orientation and attitude, the logistics of multi-camera setups, low frame rates, degradation due to motion, egocentric-versus-exocentric camera view tradeoffs, poor depth perception and stream latency. While all of these still need to be perfected, in my experience latency and its unpredictability, that is the time delays in the sensor stream, remain the largest bottleneck to a fluid teleoperation experience. In this spirit, teleoperating fixed robots such as manipulators is going to be much easier than mobile robots in outdoor environments, since the former can use wired connections while the latter are forced to rely on cellular connections.
Currently there is a significant amount of research into improving the operator user interface. For instance, to alleviate latency it’s possible to overlay a predictive model on the real-time feed, such as a “ghost” of the future robot state. In this way the operator has the illusion of controlling a zero-latency robot. Virtual fixtures and augmented reality markers can also help.
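As a rough illustration of the ghost idea, here is a minimal sketch that extrapolates the last received robot state forward by the measured latency with a constant-velocity model; the variable names and numbers are purely illustrative, and real systems use far better predictors.

```python
import numpy as np

def predict_ghost_state(joint_pos, joint_vel, latency_s):
    """Constant-velocity extrapolation of the joint positions."""
    return np.asarray(joint_pos) + np.asarray(joint_vel) * latency_s

# Last state received from the (delayed) sensor stream.
delayed_pos = np.array([0.10, -0.52, 1.30])
delayed_vel = np.array([0.05, 0.00, -0.10])

ghost_pos = predict_ghost_state(delayed_pos, delayed_vel, latency_s=0.25)
# The UI would draw the robot at `ghost_pos` (the ghost) on top of the video,
# which still shows it at `delayed_pos`, giving the illusion of zero latency.
```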
It is also possible to train an AI to map low-dimensional inputs into complex actions, so that the operator can perform complex tasks with a simple joystick. For instance, if a manipulator needs to grasp a cup on a table, the operator can simply steer the manipulator closer to the cup; the AI infers from the camera scene that the operator intends to grasp it and takes over the fine motor control required for the grasp. As a side effect, easier controls allow low-skilled teleoperators to handle complex scenarios, which matters because today there are few expert operators.
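A toy version of this kind of intent inference might look as follows: the joystick only provides a rough direction, and the assistance picks the detected object that best matches it before triggering a grasp primitive. The object names and positions are made up for illustration.

```python
import numpy as np

def infer_goal(gripper_pos, joystick_dir, candidate_objects):
    """Return the object whose direction best matches the joystick command."""
    joystick_dir = np.asarray(joystick_dir, dtype=float)
    joystick_dir /= np.linalg.norm(joystick_dir)
    scores = {}
    for name, pos in candidate_objects.items():
        to_obj = pos - gripper_pos
        scores[name] = float(np.dot(joystick_dir, to_obj / np.linalg.norm(to_obj)))
    return max(scores, key=scores.get)

objects = {"cup": np.array([0.4, 0.2, 0.1]), "bowl": np.array([0.1, -0.3, 0.1])}
goal = infer_goal(np.zeros(3), [1.0, 0.4, 0.0], objects)
# goal == "cup": the system would now run its grasping skill on the cup,
# handling all the fine motor control on the operator's behalf.
```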
It’s worth stressing that this whole area of shared control and UI design needs careful engineering: paradoxically, shared control can be harder than full automation, since we need to take into account how the operator reacts to the partial automation. It must be done well, keeping automated and human tasks separate yet smoothly glued together, so that the automation does not surprise the operator and cause the typical wait-and-see behaviour (input a command, wait for the robot to finish the movement, input another command, repeat).
In closing this intro, we can very crudely divide teleoperation tasks into two macro categories: driving and manipulation. The two classes present opposite challenges. The difficulty of driving (which includes piloting drones, cars and walking robot dogs) comes from the unpredictability of the environment and the need for fast reaction times, while the controls themselves are simple (brake, accelerate, turn the steering wheel and not much more). Manipulation, instead, usually happens in slowly varying or fixed environments and in non-time-critical scenarios, but the controls are nuanced and high dimensional. For the rest of the article I will therefore focus on robot manipulators, since that is where the biggest learning challenges lie.
The Nascent Teleoperation Industry and Bootstrapping Automation
Today teleoperation is mainly used in high-touch use cases, such as medical surgery, nuclear decommissioning, space and undersea robotics. The operators are expensive professional domain experts, but their cost is justified since the alternatives are even more expensive or dangerous.
With the increasing quality and dropping cost of collaborative robots, and the advances in artificial intelligence, in the next few years teleoperation will expand to service use cases such as warehouses, light manufacturing, commercial kitchens and labs. How is this possible? Having a teleoperator, even a low-skilled one, continuously piloting the robot will rarely make sense. What will happen instead is that teleoperation will be used to bootstrap automation and then for remote assistance.
To understand why, it’s important to recall why automation is hard in the first place. Robots already have superhuman precision, so automating a fixed scenario is always possible, with superhuman performance. The problem is that in real life no two environments are the same: different lighting, different objects to interact with, different arrangements, different success criteria. Considering these needs for autonomy and flexibility, here is an example of how the development and deployment of a kitchen robot manipulator tasked with assembling rice bowls might look:
1. The robot manipulator is trained on a mixture of teleoperated, simulated and unsupervised experiences in a standardised kitchen, using popular ingredients, tools and appliances. Every experience is divided into a set of basic actions, such as pick&place, mix, pour and sprinkle, and is logged as camera streams, positions and velocities of the robot joints, force sensor readings and any other available sensor measurement. A score is also assigned to every experience, so that the robot learns what an acceptable end state is. After months of development the robot can reliably prepare bowls in the training scenario, but its performance in a different kitchen would be poor.
2. The system is deployed in a real kitchen, but for the first weeks a teleoperator has direct control of the robot, getting feedback from the kitchen owner. Every single experience with the new setup is logged and the AI is trained to imitate the operator. The AI’s performance is continuously evaluated by asking it in real time which action it would take and comparing it with the action the operator actually takes (see the sketch after this list). After a few weeks the AI is accurate enough to be left in control.
3. The teleoperator is called a few times a day to iron out the edge cases where the AI keeps making mistakes. After some time the error rate is so low that a single teleoperator can assist more than 50 deployed robots at the same time.
4. The robot is routinely retrained offline on the aggregated experiences and its firmware is updated accordingly, so that it eventually reaches superhuman performance.
5. Future deployments proceed as in step 2, but the coaching time required from the teleoperator keeps decreasing as the global dataset of experiences grows. Eventually a few hours of demonstrations are enough to onboard a new kitchen. Deploying the robot for different tasks also becomes gradually easier, as basic actions such as pick&place can be reused.
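Here is a minimal sketch of the logging-and-evaluation loop of step 2, with placeholder stubs standing in for the real robot, operator and learner; only the structure of the loop is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real system:
def get_observation():            # cameras, joint states, force sensors...
    return rng.normal(size=8)

def operator_action(obs):         # what the teleoperator actually commands
    return obs[:2] * 0.1

def policy(obs):                  # the learner being silently evaluated
    return obs[:2] * 0.1 + rng.normal(scale=0.01, size=2)

def agreement(a, b, tol=0.05):
    """Crude agreement metric: mean absolute difference below a tolerance."""
    return float(np.mean(np.abs(np.asarray(a) - np.asarray(b))) < tol)

log, scores = [], []
for _ in range(1000):
    obs = get_observation()
    human_act = operator_action(obs)           # the operator stays in charge
    scores.append(agreement(policy(obs), human_act))
    log.append((obs, human_act))               # every experience is kept for training

# Hand over control only once the learner agrees with the operator often enough.
ready_for_autonomy = float(np.mean(scores)) > 0.95
```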
At first, robotics companies will vertically integrate and run their own fleets of teleoperators. Eventually, specialised infrastructure providers will leverage economies of scale to offer teleoperation-as-a-service, providing flexible fleets of operators on demand. This is similar to how companies providing dataset-labelling services operate today, but teleoperation is destined to be a much larger industry, as the markets being automated are bigger and the need for teleoperation assistance persists after the initial training. Where the teleoperators will actually be located will depend on how critical latency is and on the regulations and liabilities around remote work for physical tasks, which today is still pretty niche. Besides professional services with strict accuracy standards and trained operators, it will be possible to crowdsource demonstrations from the public, perhaps even inside gamified environments.
In the long run, as teleoperation tooling and humanoid robots get cheaper, teleoperation will be rolled out to consumers for tele-existence. Hopefully by then we will not have to worry about work, and the main use cases will be entertainment, social connection and exploration.
Data-Driven Reinforcement Learning Today
As mentioned, offline reinforcement learning will be critical to scale, since having a real robot learn in a real environment is too slow and dangerous. Here I will only scratch the tip of the research iceberg and mention a few approaches to the offline reinforcement learning pipeline.
A crucial element in building a dataset for offline RL is establishing the reward of each experience. This is in my view the strongest explanation of why a teleoperator is needed to bootstrap automation: the operator needs to understand what “good” means for every single deployment and act accordingly, iterating on the feedback of the new robot owner. In this light, I find the ideas in Scaling data-driven robotics with reward sketching and batch reinforcement learning pretty interesting.
Firstly, they provide an intuitive mechanism to sketch the reward of a given experience, so that every single camera frame is rated according to how close it is to the desired goal. This gives a more granular reward signal than just rating whole trajectories as good or bad, even though it introduces some degree of subjectivity, since different operators will have different definitions of being close to the goal. More importantly, based on the human-labelled rewards, they propose a mechanism to automatically relabel the whole dataset accumulated over previous experiences, so that a large amount of data can be leveraged to learn a new task out of a few initial demonstrations. Basically, the reward annotations produced by the sketching procedure are used to train a reward model, which is then used to predict the reward of all the past data according to the new definition of success. Ideally a dataset of 10,000 demonstrations for task A can be relabelled into a dataset of 10,000 demonstrations for task B, assuming the tasks are not too different. This approach is somewhat opposite to the usual deep learning paradigm of pretraining an AI agent on a large heterogeneous dataset and then fine-tuning on a small amount of data from the use case of interest.
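A stripped-down sketch of the relabelling idea (not the paper’s actual code) might look as follows, with random feature vectors standing in for camera frames and a ridge regression standing in for the reward network.

```python
import numpy as np

rng = np.random.default_rng(0)

# A few freshly sketched (frame_features, reward in [0, 1]) annotations.
sketched_frames = rng.normal(size=(50, 16))
sketched_rewards = rng.uniform(size=50)

# Reward model: ridge-regularised linear regression as a stand-in for the
# neural network used in practice.
lam = 1e-2
X, y = sketched_frames, sketched_rewards
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def reward_model(frame_features):
    return float(np.clip(frame_features @ w, 0.0, 1.0))

# Relabel the large dataset of past experiences under the new notion of success.
past_frames = rng.normal(size=(10_000, 16))
relabelled_rewards = np.array([reward_model(f) for f in past_frames])
```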
A more traditional approach is followed in Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets, where a medium-sized dataset (7200 demonstrations) collected from 71 kitchen tasks is used to bootstrap training on 10 unseen tasks, resulting in a 2x performance improvement with respect to training the new tasks from scratch. It remains an open research question how much further performance can be pushed with a truly large dataset containing millions of demonstrations.
To my knowledge the easiest resource to get started with offline RL is Robomimic, an open-source framework for learning from demonstrations. It provides a set of standardised datasets containing action-state-reward trajectories, with an emphasis on human-provided demonstrations and support for multiple observation spaces, including visuomotor policies.
The available datasets cover different sources, such as single expert teleoperators, multiple teleoperators and machine-generated trajectories, across several simulated and real-world tasks. The framework also contains implementations of several offline learning and imitation learning algorithms, including Behaviour Cloning, Behaviour Cloning-RNN, HBC, IRIS, BCQ, CQL and TD3-BC (by the way, when should we use offline reinforcement learning vs imitation learning?). In the same ecosystem we find RoboTurk, a project to lower the barrier to creating large-scale crowdsourced datasets, and RoboSuite, a MuJoCo-based simulation framework with benchmark environments.
Robomimic is well structured, with clear documentation on how to create a dataset and train an agent. It is still a bit rough around the edges, but hopefully the project will keep being maintained so that other projects can avoid duplicating efforts when training robots.
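As a taste of the data format, the robomimic documentation describes an HDF5 file with one group per demonstration; the snippet below reads such a file with h5py, assuming that documented layout. The file path and exact keys are assumptions and may differ between dataset versions.

```python
import h5py

with h5py.File("path/to/demo.hdf5", "r") as f:
    demos = list(f["data"].keys())           # e.g. ["demo_0", "demo_1", ...]
    first = f["data"][demos[0]]
    actions = first["actions"][:]            # (T, action_dim) array
    rewards = first["rewards"][:]            # (T,) array
    obs_keys = list(first["obs"].keys())     # proprioception, camera images, ...
    print(len(demos), actions.shape, obs_keys)
```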
Finally I want to mention d3rlpy, a library for offline and online reinforcement learning. The best thing about it is that it’s intuitive and well documented, with clear examples already available in the README.
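A minimal example in the spirit of its getting-started guide could look as follows; this follows the v1-style API, and the exact signatures have changed across releases, so treat the calls as an assumption and check the README of the installed version.

```python
from sklearn.model_selection import train_test_split
from d3rlpy.datasets import get_cartpole   # toy offline dataset + environment
from d3rlpy.algos import DiscreteCQL       # Conservative Q-Learning, discrete actions

# Load a small offline dataset and split its episodes for evaluation.
dataset, env = get_cartpole()
train_episodes, test_episodes = train_test_split(dataset, test_size=0.2)

# Train entirely offline; the environment is only needed if you want to roll
# out the learned policy afterwards.
cql = DiscreteCQL()
cql.fit(train_episodes, eval_episodes=test_episodes, n_epochs=1)

# The learned policy can then be queried with cql.predict(observations).
```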
Outro
Bringing artificial intelligence to the real world is going to be hard, as many, many practitioners agree.
But the takeaway message I want to convey is the following: there is a way to bootstrap automation in the short term, relying heavily on human-in-the-loop teleoperation. Such reliance on humans should be seen as a feature, not as a bug. Commercialising a novel product is always a massive undertaking, but in robotics this is exacerbated by the slow hardware development cycle. A teleop-first approach shortens the iteration cycles, incorporates feedback from the end-user and creates a pool of experiences on which to build scalable solutions.