A lucid account of modern machine learning research. Christian does a phenomenal job tracing the historical development of ideas in machine learning, particularly in how the field has interacted with neighboring disciplines. I particularly enjoyed the sequence describing the exchange between reinforcement learning and neuroscience. That Sutton’s idea of temporal-difference learning could model the role of dopamine is a great scientific achievement, and it gives real bones to the idea that artificial intelligence can be a “transcendental psychology” in the Kantian sense, studying not how intelligence manifests in humans but rather the conditions for the possibility of intelligence.
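
To gloss the technical core of that exchange (my own summary, not the book’s notation): the temporal-difference error compares how good a state was predicted to be with how good it turns out to be one step later,

δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t),

and it is this reward-prediction-error signal that phasic dopamine firing appears to encode.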

The struggle to learn actions that accomplish explicitly specified goals, and the subsequent development of imitation learning and inverse reinforcement learning in response, is really interesting. I think these developments parallel earlier developments in cognitive science, namely embodied and embedded cognition: if beliefs and desires aren’t just “symbols in the head” but rather are a bit more mushy, then it would make sense that we don’t bother representing such beliefs and desires explicitly in the AI systems that we build. What does it mean to do a backflip? I’m not sure, but I know it when I see it; and sometimes I can show you directly. That’s as much as we can do for a lot of concepts, a tantalizing possibility that Christian contemplates:

I tell them that what makes the result feel, to me, not just so impressive but so hopeful, is that it’s not such a stretch to imagine replacing the nebulous concept of “backflip” with an even more nebulous and ineffable concept, like “helpfulness.” Or “kindness.” Or “good” behavior. “Exactly,” says Leike. “And that’s the whole point, right?”

Of course, you can take this to be a practical limitation. Maybe it’s in principle possible to define exactly what we mean and represent it formally as a goal for an RL agent to follow. But I would rather take this as a Wittgensteinian picture of learning, one where the definitions of concepts are given by their use and thus cannot be disentangled from it. We play language games in which “kindness” comes into play, and what we mean by “kindness” is nothing but the enumeration of these games (if we can enumerate them at all). So it would not be surprising at all that we can’t give necessary and sufficient conditions for kindness, let alone a formal representation of it; but we sure can give examples of being kind, and we can demonstrate it. Under this view, then, to teach an RL agent is nothing less than to assimilate it into our “forms of life.” (Given this, efforts to advocate for the ethical treatment of RL agents sound a bit less eggheaded.)
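
To make the “show, don’t define” point concrete with a toy example of my own (this is not from the book; the linear reward model and the features are illustrative assumptions): the backflip result fit a reward model to pairwise human preferences, so that behaviors the human judged more backflip-like score higher, without anyone ever writing down what a backflip is. A minimal sketch in that spirit:

```python
# Toy sketch of reward learning from pairwise preferences (Bradley-Terry style).
# A "human" with a hidden notion of the concept says which of two behaviors
# looks better; we fit a reward model to agree with those comparisons.
import numpy as np

rng = np.random.default_rng(0)

def reward(features, w):
    """Scalar reward for a behavior summarized by a feature vector."""
    return features @ w

def preference_grad(w, a, b, a_preferred):
    """Gradient of the negative log-likelihood under P(a > b) = sigmoid(r(a) - r(b))."""
    logit = np.clip(reward(a, w) - reward(b, w), -30.0, 30.0)
    p_a = 1.0 / (1.0 + np.exp(-logit))
    target = 1.0 if a_preferred else 0.0
    return (p_a - target) * (a - b)

w_true = np.array([2.0, -1.0, 0.5])  # the judge's hidden notion of the concept
w = np.zeros(3)                      # the learned reward model

for _ in range(2000):
    a, b = rng.normal(size=3), rng.normal(size=3)        # two candidate behaviors
    human_prefers_a = reward(a, w_true) > reward(b, w_true)  # "I know it when I see it"
    w -= 0.05 * preference_grad(w, a, b, human_prefers_a)

print("learned reward weights:", np.round(w, 2))
```

The learned weights end up pointing roughly in the same direction as the judge’s hidden ones, which is the whole trick: the concept is recovered from comparisons, not from a definition.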

This leads me to a more general point: I think it would be very fruitful for the field to look into philosophy for insights. The discussion of fairness clearly echoes a lot of work in moral and political philosophy (and I’ve seen more than one fairness paper cite A Theory of Justice and then proceed not to engage with its contents at all); the tangle of specifying goals in reinforcement learning clearly echoes debates in the philosophy of mind. To his credit, Christian does discuss connections between AI and philosophy, particularly with ethics, but I think there are many more connections to be made.

As far as the alignment problem itself goes, I am partial to Christian’s take on it:

Even if we—that is, everyone working on AI and ethics, AI and technical safety—do our jobs, if we can avoid the obvious dystopia and catastrophes, which is far from certain—we still have to overcome the fundamental and possibly irresistible progression into a world that increasingly is a formalism. We must do this even as, inevitably, we are shaped—in our lives, in our imaginations, in our bodies—by those very models. This is the dark side of Rodney Brooks’s famous robotics manifesto: “The world is its own best model.” Increasingly, this is true, but not in the spirit Brooks meant it. The best model of the world stands in for the world itself, and threatens to kill off the real thing. We must take great care not to ignore the things that are not easily quantified or do not easily admit themselves into our models. The danger, paraphrasing Hannah Arendt, is not so much that our models are false but that they might become true.

It’s not the existential risk of paperclip maximizers that I’m worried about. Rather, it is that the world is becoming hyperreal, with the distinction between the world and its representations increasingly blurred. Those who long for the Singularity and for uploading our consciousness to the internet are really longing for an acceleration of hyperreality, which is why I don’t understand them. They see this terrible thing happening and they want more of it!