Originally published on Medium.
Plastic Dinosaur’s normal living space is a big warehouse with a lot of interesting features for him to clamber over and peer behind. But he’s noticed that the bipeds he shares the space with come and go through parts of the barriers that contain him: they swing part of the barrier open, pass through the opening, and the barrier closes behind them. He’s seen space behind the barriers that he hasn’t explored. Yet.
Plastic Dinosaur wanders over to the door, peering at things as he goes to see if anything more interesting shows up. Nothing does, and he’s at the barrier. There’s a feature on the barrier, a horizontal piece that sticks out. Plastic Dinosaur peers closely at it, sniffs it and nudges it with his nose. He pushes it this way and that, seeing some movement, but nothing happens. He fails to get past the barrier.
He starts getting hungry, i.e. his battery charge is getting low. Off Plastic Dinosaur trots to his charging block, where he settles down and starts dreaming. He dreams a thousand virtual, parallelized dreams of the barrier and the horizontal bar, pushing it this way and that, pulling it this way and that, rotating it, using his hands and nose and tail. And in one of them, it turns and the barrier opens.
Plastic Dinosaur’s charging is over. He stops dreaming and wakes up, still curious about what’s behind the barrier. So off he goes, and this time Plastic Dinosaur reaches out with his paw, fumbles at the door lever and opens the door. Plastic Dinosaur walks through into a space he’s never seen before.
Shrieking ensues. Plastic Dinosaur has learned to open doors.
This is an article in the series that David Clement, Co-Founder of Senbionic, and I are collaborating on about the state of the art of neural networks and machine learning, using a fictional robotic velociraptor as a fun foil. The first article dealt with his body, the second with his neural network brains, the third with attention loops and features and how they can be used to train a neural network, and the fourth with a potentially adverse effect of simple pattern matching: the velociraptor becoming terrified of white people.
Suffice it to say that he’s a sensor-laden, lithium-ion-battery-powered, induction-charged, aluminum and plastic robot wrapped in smart fabric, and that he’s run by three neural networks. The first is cerebellumnet, which pays attention to the inside of his body and controls his motion, just as our autonomic nervous system and cerebellum control ours. The second is amygdalanet, which replicates the fight-or-flight response, emotions (especially fear) and unconscious decision making of our amygdala. The third is curiousnet, which pays attention to the world outside his body and is wired to explore new things and identify them.
This article is a deeper dive into the dreaming model introduced in the second piece, the one that defined cerebellumnet, amygdalanet and curiousnet. That concept, a virtual simulation making high-speed Monte Carlo attempts until it achieves a desired outcome, is key to this conceptual model.
When David and I were discussing this early on, we had different intellectual positions about evolutionary simulation for robots. His premise was that evolution would have to be recreated with accreting functions and capabilities, basically with life evolving in simulation from single-celled to multi-celled to vertebrate structures in order for an integrated organism to be created. My premise was that a highly sensor-laden and physically constrained object could be created both physically and virtually with high degrees of similarity, allowing for evolution of control but not form. He may very well be right for anything which evolves consciousness and sophisticated behaviors, but for the Plastic Dinosaur series we agreed to my constraints, partly because an AI-driven velociraptor is much more engaging for most people than watching single-celled organisms divide.
And he had form with this simulation. Meet TRPO.
Click on the video and you’ll see the simple human form learn to stand, with significant improvements after 50,000 iterations. That’s an implementation of Schulman and Levine’s trust region policy optimization (TRPO, natch). They used it to create ongoing improvements in swimming, walking, hopping, and, of all things, playing Atari video games. As he had with subsumption, David internalized it by recreating their work with a simple human form and running the simulation until it could stand. He put simple constraints on the angles of rotation of the joints, established the goal and let it run.
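The core trick in TRPO is to improve the policy with gradient steps while refusing any step that changes the policy’s behavior too much, measured by KL divergence. Here’s a heavily simplified sketch of that intuition on a two-armed bandit rather than a standing humanoid. To be clear, this is not Schulman and Levine’s actual algorithm, which takes a constrained natural-gradient step; it’s just the trust-region idea in a dozen lines, with made-up rewards and bounds:

```python
import math
import random

# Trust-region intuition on a two-armed bandit: take a plain
# policy-gradient step, then shrink it until the KL divergence between
# the old and new policies stays under a small bound. All numbers here
# are illustrative.

def policy(theta):
    """Sigmoid policy: probabilities of picking arm 0 and arm 1."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return [1.0 - p1, p1]

def kl(p, q):
    """KL divergence between two 2-arm policies."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

REWARDS = [0.2, 0.8]        # arm 1 pays better
rng = random.Random(0)
theta, max_kl = 0.0, 0.01

for _ in range(2000):
    p = policy(theta)
    arm = 1 if rng.random() < p[1] else 0
    reward = REWARDS[arm]
    grad = reward * (arm - p[1])     # REINFORCE gradient for this policy
    step = 1.0
    # trust region: backtrack until the policy change is acceptably small
    while kl(p, policy(theta + step * grad)) > max_kl:
        step *= 0.5
    theta += step * grad

print("learned probability of the better arm: %.3f" % policy(theta)[1])
```

The backtracking loop is the whole point: the policy is free to improve, but never to lurch, which is what keeps long training runs like the standing simulation stable.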
This is what dreaming looks like for Plastic Dinosaur: a thousand virtual PDs trying a thousand different things a thousand different times until the virtual is good enough to be decanted into the physical. These simulations are trivial in comparison to full-scale atmospheric modeling and can run on common, reasonably priced hardware available today, including rented cycles from AWS or Azure. Three months of a massive GPU on Azure costs about $680, for example.
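The simplest possible version of that dreaming loop is just randomized search over a cheap virtual model of the problem. Here’s a minimal sketch, with the door lever reduced to a toy one-dimensional model; the environment, thresholds and numbers are all invented for illustration, not taken from any real robotics simulator:

```python
import random

# A toy stand-in for Plastic Dinosaur's dreaming loop: the problem he
# failed at (a door lever that opens once rotated far enough) is
# replayed in a cheap virtual model, and thousands of randomized
# attempts are scored until one works.

def simulate_attempt(actions):
    """Score one virtual attempt: apply each push/pull to the lever."""
    angle = 0.0
    for a in actions:
        angle += a                  # degrees of rotation per nudge
    return angle >= 45.0            # the latch releases past 45 degrees

def dream(n_dreams=1000, seq_len=5, seed=0):
    """Run many virtual 'dreams'; return the first plan that works."""
    rng = random.Random(seed)
    for _ in range(n_dreams):
        plan = [rng.uniform(-15.0, 15.0) for _ in range(seq_len)]
        if simulate_attempt(plan):
            return plan
    return None

plan = dream()
print("found a working plan:", plan is not None)
```

A real version would replace `simulate_attempt` with a physics simulation seeded from recorded sensor data, and replace blind random search with a learning optimizer like the TRPO approach described above, so each batch of dreams improves on the last.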
But the kicker is the Rubik’s Cube. In Grade 11, I brute-forced my way through the countless experiments necessary to learn to solve the cube with the neural net in my head, without outside guidance, because I like beating my head against brick walls and math was boring. I managed to get under 70 seconds, which is glacially slow by speed-cubing standards, but I was happy with it. Far more people have picked up the Cube and put it down without ever solving it than have solved it. The number of people who can do it one-handed is smaller still, yet the world record is about 8 seconds. Speed solving is a high-memorization process like chess, but it’s not the solution that’s interesting; it’s the incredibly complex hand doing the solving.
We never think about our hands, but there are 27 bones in each one, they are capable of remarkable degrees of movement, and they are among the most sensor-laden components of our bodies. They have evolved to a remarkable level of competence, and speed cubers and magicians constantly amaze us with the art of the possible.
So solving the cube isn’t the interesting part of this, although it’s a neat trick. The interesting part is getting a human-hand equivalent to do it one-handed.
Watch that video and you’ll see them training a virtual hand, identical in constraints to a physical robotic hand, with reinforcement learning and their innovation of automatic domain randomization, performed tens of thousands of times in the virtual for every attempt in the physical. Then they start interfering with the hand with, among other things, a stuffed giraffe.
There are two pieces to unpack there. The first is reinforcement learning, the reward mechanism for positive outcomes. It’s used widely, but it came to global prominence with AlphaGo, the neural net solution from DeepMind, the Alphabet-acquired company, that beats everybody at Go, including world masters with decades of playing experience. Go is a much harder game than chess, and unlike chess you can’t memorize your way to success. Reinforcement learning is used in a variety of domains; among other things, Google uses it to make its data centers more energy efficient.
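To make the reward mechanism concrete, here’s a minimal sketch of one classic reinforcement learning algorithm, tabular Q-learning, on a five-cell corridor where only the rightmost cell pays out. This is illustration only; AlphaGo’s actual pipeline layers deep networks, self-play and tree search on top of the same reward idea:

```python
import random

# Minimal tabular Q-learning on a 5-cell corridor. The agent starts at
# cell 0; only reaching cell 4 pays a reward, and from reward alone it
# learns that "move right" is the best action everywhere. A toy, not
# AlphaGo.

N, GOAL = 5, 4
ACTIONS = (-1, 1)                       # step left, step right
Q = {(s, a): 1.0 for s in range(N) for a in ACTIONS}  # optimistic init
rng = random.Random(0)
alpha, gamma, eps = 0.5, 0.9, 0.1

for _ in range(500):                    # training episodes
    s = 0
    while s != GOAL:
        if rng.random() < eps:          # occasional random exploration
            a = rng.choice(ACTIONS)
        else:                           # otherwise act greedily
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N - 1)  # walls at both ends
        r = 1.0 if s2 == GOAL else 0.0
        # Q-learning update: nudge toward reward + discounted future value
        best_next = 0.0 if s2 == GOAL else max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

learned = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N - 1)]
print("best action per cell (1 = right):", learned)
```

The update line is the whole trick: actions that eventually led to reward accumulate higher values, and the `gamma` discount propagates that credit backwards along the corridor, with no one ever telling the agent what the right move is.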
The second is automatic domain randomization. In OpenAI’s case, the team kept changing the physical rules as the virtual and physical simulations got better at the easiest case. They made the cubes heavier and lighter, and made them rotate more and less smoothly. In the physical, they put a rubber glove over the hand, dropped scraps of paper on it, and nudged it with a stuffed giraffe to explore the limits of its ability to focus on the task through increasing degrees of noise.
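The “automatic” part can be sketched as a feedback loop: sample each episode’s physics from a range, and widen the range whenever the policy handles the current range consistently. The sketch below fakes the policy with a fixed tolerance threshold, and the parameter names and numbers are invented for illustration; OpenAI’s real system randomizes dozens of parameters and measures success on actual rollouts:

```python
import random

# A sketch of automatic domain randomization: every episode draws its
# "physics" (here just the cube's mass) from a range, and when the
# policy keeps succeeding, the range is widened so training keeps
# getting harder. The policy is faked with a fixed tolerance.

rng = random.Random(0)
lo, hi = 0.09, 0.11             # current cube-mass range, in kg
WIDEN = 0.01                    # range expansion per promotion
streak = 0

def policy_succeeds(mass):
    """Stand-in for a real rollout: this fake policy copes up to 0.3 kg."""
    return mass <= 0.3

for episode in range(200):
    mass = rng.uniform(lo, hi)  # randomize the world for this episode
    if policy_succeeds(mass):
        streak += 1
    else:
        streak = 0              # failure: stay at the current difficulty
    if streak >= 10:            # consistently good: widen both ends
        lo, hi = max(0.01, lo - WIDEN), hi + WIDEN
        streak = 0

print("final mass range: %.2f kg to %.2f kg" % (lo, hi))
```

The appeal of the loop is that nobody has to design a curriculum: the difficulty ratchets up exactly as fast as the policy earns it, and stalls when it doesn’t.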
This is the equivalent of the massive virtual parallel attempts that Plastic Dinosaur makes to solve problems. Every time he fails to achieve an objective, the next time he sleeps he dreams a bunch of dreams that allow him to improve. And conceptually, because his reinforcement learning includes getting more efficient at things, he’ll become smoother at opening doors as time passes too.
The Rubik’s Cube-solving robotic hand is still not perfect. It succeeds most of the time, drops the cube sometimes, and isn’t as fast as human cubers. But this is an incredibly sophisticated degree of control over a robotic facsimile of the most sophisticated physical manipulator on our bodies. And it’s solving the Rubik’s Cube. It also shows the state of the art: one hand, replicated in virtual and physical, doing one task that’s already highly modeled. That took 2.5 years. And it uses two neural networks, compared to PD’s hypothetical three.
The concept of automatically instantiating multiple parallel simulations based on recorded sensor data and spinning through a lot of variations is considerably more sophisticated in a couple of different ways, but it’s easy to see how that might become an accessible target. The Rubik’s Cube model suggests that many more than three neural networks would be required, which is something David and I agree on but have limited for the sake of the fictional foil.
Another interesting question is how it would choose what to simulate. Think about your daily routine as you walk through the world, encountering dozens of doors of dozens of weights with dozens of different door mechanisms, including automatic ones. Think about a paper coffee cup vs. a ceramic one vs. a glass one with a serviette around it. Think about encountering an out-of-service escalator and walking up its unusually tall steps, or another set of steps that, due to architectural and aesthetic choices, differ in depth and tread width.
We adapt incredibly smoothly and quickly, exploring the absurd variance of the physical world without really noticing how clever we are. Until we try to imagine doing it with neural nets and a robotic velociraptor.