AI Alignment is the 21st Century's Version of Perpetual Motion
Only philosophical thinking can demonstrate the impossibility of moral machines.
Two Impossible Machines: Perpetual, and Moral
For about 700 years, a lot of very good scientific thinkers and engineers poured their efforts into trying to develop machines that could run forever once they were started up. The aim of the perpetual motion project was to achieve unlimited free energy for humanity. If they could get machines to run forever on finite energy, it would almost entirely eliminate the need for human labor. Most of us could live lives of leisure while a handful of engineers steered machines as they did the real work.
This should sound familiar in the early years of the AI age. The parallel problem for AI researchers is not getting the machines to run forever, but getting them to run indefinitely without destroying anything we really care about. The solution? Make AI that is “aligned” to human values, so we can trust it to run free without killing us, or enslaving us, or deceiving us, or destroying the world. This project of AI alignment is the perpetual motion project of the 21st century: it’s a highly technical project, run by some of the smartest people on the planet, in pursuit of a goal that they don’t realize is ultimately impossible, because they haven’t thought deeply enough about their end goal.
Let’s return to the fascinating history of perpetual motion to build the parallels.
It’s important to know that the people doing perpetual motion research weren’t eccentric retired uncles tinkering in their basements. They were accomplished engineers and scientists, and a few of them--like Leibniz, the inventor of calculus and binary code; and Boyle, the father of modern chemistry--were stone cold geniuses.
Despite perpetual motion’s noble motivations and brilliant researchers, the project was doomed from the start. This was far from obvious at the time, though, and it took the better part of a thousand years for the futility of the project to become evident.
What was it that finally convinced these brilliant scientists and engineers that their 700-year project was doomed? You might think that it was the fact that all this technical work failed to ever produce an actual perpetual motion machine, but you’d be wrong. They produced consistent, significant improvements in the efficiency of their machines: flywheels that could spin for longer, bearings that produced less friction, and a deeper and more precise understanding of mechanics. And that stream of consistent improvements made it seem like they were constantly getting closer to cracking the problem. What was actually failure seemed exactly like progress.
Even if they’d never made any of that progress, though, failure can never really prove that your end goal is impossible. At most, it just shows that you haven’t yet succeeded.
What can show that your end goal is impossible, though, is philosophical analysis--the kind of thinking that seems, to many engineers, like eye-rollingly abstract armchair philosophy. (I know, because I taught engineers to be philosophers for almost a decade.)
In the end--1850, to be exact--Lord Kelvin laid out a philosophical argument that effectively ended the perpetual motion project forever. This line of reasoning, which he fired like a silver bullet from his philosopher’s armchair, simply clarified the meaning of a single concept: energy.
Up to that point, the term energy had been used pretty vaguely, much in the way we use it today to mean everything from electricity, to how lively our bodies feel, to whatever souls or ghosts are supposedly made of.
To understand how a single clear definition killed a 700-year research project, consider the core fantasy of perpetual motion: a machine that could turn a finite amount of energy into an infinite amount of work. The foundation of that fantasy was the lack of clarity about what energy is. Kelvin cured the scientific world of this confusion by clarifying the concept of energy with the following definition:
Energy is that which is transferred out of a system through work.
This means that:
Work is the transference of energy out of a system.
The clear implication of this conceptual clarification was that, in order for any machine to run forever--that is, for it to do infinite work--it would have to start with infinite energy. So, making a flywheel spin forever would require an initial push infinitely more powerful than the Big Bang, or an amount of fuel infinitely larger than the entire universe. But this is clearly impossible.
In this way, a single conceptual clarification killed a technical engineering project that had produced what seemed like substantial progress for over 700 years. The reason so many brilliant people had done so much work in the service of an impossible aim is not that they were bad engineers, but that they weren’t very good philosophers. Because they were technically-minded, they had assumed--for seven centuries--that the obstacle to achieving their goal must be the technical work of engineering better machines.
The true obstacle, however, was the philosophical work of clearly understanding the nature of the concepts their technical project was built upon.
Once that philosophical clarification was done, it was clear that a perpetual motion machine is the same kind of incoherent non-thing as a square circle. If you weren’t clear on the concept of a square circle, you could waste your life trying to draw one, believing you were getting closer and closer, and constantly getting genuinely better at technical drawing along the way.
AI alignment is the perpetual motion project of the 21st century. It is a highly productive, highly technical project that harnesses some of our most brilliant people in pursuit of a goal that is literally impossible. Just as the scientists of the 19th century hadn’t understood what a perpetual motion machine really meant, we haven’t yet understood what an AI aligned to human values really means. It would mean something like this:
A set of silicon chips that uses etched logic gates to arrange electrons into a matrix of floating point numbers that we can trust to function like a moral person.
That last thing--moral person--is the key here. Morally good people are not what they are because their brains are running morally good software with the right set of numerically-weighted values. Good people are what they are because they’re healthy, whole persons who have grown into their humanity through the experience of living their own human life. They behave the way they do precisely because they are real humans with real human experiences and values, and not anything else--not a tree, or a storm, or a set of numbers on a computer chip. We can rely on good people because their actions arise not out of a set of numeric model weights run on a transformer with the right temperature setting, but out of a distinctively human way of experiencing and living in the world.
There’s simply no reason to think that it’s possible to distill human goodness into a set of numbers--and every reason to think that it’s not possible. Yet this is exactly what AI alignment basically requires. (Footnote for the nerds.[1])
When it comes to morality, function follows form. Humans cannot run Excel because we are not computers. A computer cannot run human morality because it is not a human.
The AI space is chasing an impossible fantasy because it doesn’t understand the nature of human values and the moral way of being that they give rise to. The technical achievements that come out of the alignment project are just as real as those that came out of the perpetual motion project, but the belief that they are achievements toward the goal of alignment is just mistaken.
So when you hear that we’re getting ever closer to AI that we can trust to educate our kids, control our national defense systems, or be our romantic partners because we’ve made sure it’s running on human values, you can be certain that, no matter how smart the people working on this project are, or how much apparent progress they’ve made, we’re always infinitely far from the goal of moral machines.
Something Good
You have never heard anyone play guitar the way Jon Gomm does. It’s like he was a feral child who found an abandoned guitar in a cave the wolves were raising him in, and then spent the next thirty years just doing whatever felt good with it. The results are astonishing, beautiful, and strange. It will make you glad that there are some people who make their way through the world without giving in to the demand to not do anything too weird.
[1] This is a sufficiently simple way of saying it for this piece. But, for the tech-savvy who worry I’m just ignorant of how sophisticated the work on this problem is: what aligned AI requires, a bit more technically, is several things: (1) Nailing down a reward function that corresponds to actual human preferences in novel scenarios, which we don’t have a reliable method for, given that human preferences are not consistently predictable from features of the context. (2) Training the model in a way that is *knowably* invulnerable to reward hacking, which is maybe not possible in principle and almost certainly not possible in practice, given continual advancement in AI’s strategic capabilities, the impossibility of qualitatively defining the notion of deception in a way that could sufficiently guide an AI that *wanted* to avoid deception, and the extreme difficulty of getting an AI to optimize for any value well enough to trust it with our weapons without rendering it useless for anything you can charge consumers for. (3) Producing from probably-impossible (1) and (2) a set of weights that corresponds so robustly to the actual set of human values that the behavior of a model running on them in the wild can’t wander far enough from, or through a loophole in, human preferences to cause any catastrophic harm.

