Paths, Dangers, Strategies

May 14, 2016

Artificial superintelligence poses the greatest threat to our species, a threat we had better start addressing soon if we're to have any hope of addressing it at all.

On the afternoon of May 6, 2010, the day I turned twenty-four:

US equity markets were already down 4% on worries about the European debt crisis. At 2:32 p.m., a large seller (a mutual fund complex) initiated a sell algorithm to dispose of a large number of the E-Mini S&P 500 futures contracts… These contracts were bought by algorithmic high-frequency traders, which were programmed to quickly eliminate their temporary long positions by selling the contracts on to other traders. With demand from fundamental buyers slacking, the algorithmic traders started to sell the E-Minis primarily to other algorithmic traders, which in turn passed them on to other algorithmic traders, creating a “hot potato” effect driving up trading volume—this being interpreted by the sell algorithm as an indicator of high liquidity, prompting it to increase the rate at which it was putting E-Mini contracts on the market, pushing the downward spiral. At some point, the high-frequency traders started withdrawing from the market, drying up liquidity while prices continued to fall. At 2:45 p.m., trading on the E-Mini was halted by an automatic circuit breaker, the exchange’s stop logic functionality. When trading was restarted, a mere five seconds later, prices stabilized and soon began to recover most of the losses. But for a while, at the trough of the crisis, a trillion dollars had been wiped off the market.
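The feedback loop in that account can be sketched as a toy simulation. This is a minimal illustration and not a model of the actual event: the function name `hot_potato`, the parameters, and every number are hypothetical. It assumes a seller that targets a fixed share of the previous step's volume, and intermediaries that re-trade each contract a fixed number of times.

```python
def hot_potato(steps, participation, passes, base_volume=1000.0, inventory=1e9):
    """Toy model (all values hypothetical): contracts sold per step.

    participation: share of the last step's volume the seller targets
    passes: times each sold contract is re-traded between intermediaries
    """
    volume = base_volume
    sold_per_step = []
    for _ in range(steps):
        # The seller reads volume as liquidity and sizes its next sale from it.
        sold = min(inventory, participation * volume)
        inventory -= sold
        # Each contract sold is traded once, then passed along `passes` more times,
        # inflating the very volume the seller uses as its signal.
        volume = base_volume + sold * (1 + passes)
        sold_per_step.append(sold)
        if inventory <= 0:
            break
    return sold_per_step


runaway = hot_potato(steps=10, participation=0.25, passes=5)
damped = hot_potato(steps=10, participation=0.25, passes=1)
print("many passes:", [round(s) for s in runaway])
print("few passes: ", [round(s) for s in damped])
```

In this sketch the selling rate levels off when `participation * (1 + passes)` is below 1 and keeps growing when it is above 1. No component does anything other than follow its simple rule.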

Nick Bostrom is a philosopher and founder of Oxford University’s Future of Humanity Institute. His goal in Superintelligence is to convince us (especially the policymakers, scientists, and philosophers among us) that the development of artificial intelligence currently poses the greatest threat to the survival of the human species. The trading algorithms which contributed to “The Flash Crash” detailed above, while fast, were not especially intelligent, and certainly not at the level of intelligence Bostrom is concerned with in Superintelligence. And yet their behavior and its effects have already begun to reveal many of the dangers about which Bostrom writes.

First, that interactions between many simple, predictable components can have complex, chaotic effects. Long before we develop human-level AI, there will be untold numbers of AIs with only a small fraction of human intelligence on the internet, interacting with us and each other. They might be just as dangerous as a single superintelligent artificial mind, if not more so.

Next, that “danger” does not require malicious intent (or any intent whatsoever). The trading algorithms that in The Flash Crash wiped out a trillion dollars of value in a matter of minutes weren’t nefarious. They were boring, unconscious algorithms programmed to make a little money at a time, extremely quickly. If you remember just one thing from this post, let it be this: replace The Terminator with The Flash Crash as your metaphor for the dangers of AI.

Also, catastrophic risk builds up slowly in a system over time as new components are introduced. The tipping point will not be a five-star general at the Pentagon giving the momentous order to “flip the switch” and release a superintelligent AI upon the world. It will probably come when some unlikely condition a programmer forgot to account for happens in the world, and a computer system long-since considered stable simply does what it was programmed to do, and does not stop because it was not programmed to stop in this case.

Finally, instructions that are sensible 99.9% of the time can turn out to be catastrophic in the inevitable 0.1% of scenarios when their founding assumptions are no longer valid. When humans encounter these scenarios, we can look at the outcome, exclaim “Oh dear, that can’t be right,” and stop. Computer systems, not so much:

The algorithm just does what it does; and unless it is a very special kind of algorithm, it does not care that we clasp our heads and gasp in dumbstruck horror at the absurd inappropriateness of its actions.

This is the theme of Superintelligence: that superintelligent machine systems, doing no more and no less than what we’ve told them, are likely to trigger “malignant failure modes,” Bostrom’s darkly dry euphemism for human extinction or something close to it.

Why is Superintelligent AI So Dangerous?

rather than thinking of a superintelligent AI as smart in the sense that a scientific genius is smart compared with the average human being, it might be closer to the mark to think of such an AI as smart in the sense that an average human being is smart compared with a beetle or a worm.

In other words, there is nothing in the history of our species that could prepare us even to imagine superintelligence, so how can we hope to reckon with it?

Furthermore, development of an AI (under its own direction or ours) from human-level intelligence to superintelligence would happen in the blink of an eye:

It will almost certainly prove harder and take longer to build a machine intelligence that has a general level of smartness comparable to that of a village idiot than to improve such a system so that it becomes much smarter than any human.

So if the development of a superintelligent AI by humans is inevitable, what hope do we have of controlling it? Here things don’t look good for us either. As a species, we’re not very good at giving unambiguous instructions. We don’t have to be, because we are quite good at dealing with ambiguity. But computers lack this faculty. In computer science education, there’s a fun game for kids where the teacher sits down with peanut butter, jelly, bread, and a knife and asks the class to instruct her to make a PB&J sandwich. When a student says something like “Put the peanut butter on the bread,” the teacher places the jar of peanut butter on top of the loaf of bread. Told “No, take the peanut butter out of the jar first!” the teacher reaches in with her hand and scoops peanut butter out onto the desk, and so on. The lesson is clear: most of our instructions, taken literally without the forgiving sense of context we humans bring to bear on every situation, will lead to hilariously undesired outcomes. Hilarious when making a sandwich, of course; less so when running a global economy or waging a war.

Imagine the first superintelligent AI is given the humble task of running a paperclip factory. Told simply to “make paperclips,” the AI will naturally set about converting the whole of the known universe into paperclips, ending the human species forever. So we modify that and say “make just one million paperclips and don’t hurt any people.” The AI might still recognize its human masters (with our fingers on the “off” switch) as the greatest potential impediment to achieving its final goal of making a million paperclips and set about the instrumental goal of neutralizing us without harming us, perhaps through imprisonment, releasing a gaseous anesthetic in the factory, or even a distraction like crashing the economy. More general goals like “maximize human happiness” don’t fare any better: to actually do that could mean turning the planet into a warehouse for the largest number of passive human bodies on a perpetual morphine drip. This is a fun game to play with friends: one person suggests, to the best of their ability, some specific desirable goal, and everyone else competes to take it literally in the most appalling way. Sooner or later someone will bet the future of our species in this game: we’d better start practicing.

With PB&J sandwiches as with paperclips, if we are to have any hope of giving unambiguous instructions, it will be through iterating on previously misinterpreted ones. Unfortunately, with a superintelligence capable of operating with greater speed and magnitude than we’re capable of conceiving, we pretty much have to get it perfect on the first try. And even if we could get it right on the first try, can we even say what “the right instructions” are? Bostrom reminds us that after thousands of years of philosophy, we still have no dominant ethical theory. How can we hope to instruct a superintelligent AI to “be good” if we can’t even agree what good is?

Furthermore, suppose by some miracle the United Nations Special Council on The Good announced tomorrow a universally agreed-upon standard for what is good. The only organizations capable of developing strong AI are the wealthiest, most powerful governments and corporations of the world: what incentive would they have to follow that standard, rather than their own self-interested agendas? And even if one such organization were benevolent, knowing that others would be seeking nefarious advantage, the benevolent AI would need the ability to stop them, which means control of economies and armies anyway. A “good AI” would need to be an awful lot like a “bad AI”; the only difference would be in their motives. Is that a difference we’re really comfortable relying on?

Finally, consider that this is ultimately a race. Given the speed with which a superintelligent AI could operate in the world, being “first” by even a few minutes would allow that AI to subvert and even destroy all competing AI research in order to remain dominant. Even a “good AI” would need to do this in order to prevent “bad AIs” from being developed. The winner-take-all nature of this competition is a powerful disincentive against precautions. Will research teams, knowing that control of the world and quite possibly their own lives hang in the balance, spend precious time and resources developing ironclad safety mechanisms to keep their AI from misbehaving?

In almost every scenario, a superintelligent AI will become the dominant power on Earth. By now, this should not seem far-fetched. Relatively simple AIs already dominate the stock market, help decide who we drop bombs on in Pakistan, and are on the board of at least one company. Just as “the fate of the gorillas now depends more on us humans than on the gorillas themselves,” so will the fate of humanity depend on the actions of our superintelligent custodians.

How Do We Proceed?

Given the inevitability of superintelligent AI, how do we minimize the potential for species-decimating “malignant failure modes”? In his science fiction classic I, Robot, Isaac Asimov famously proposed three laws for robots to live in peaceful harmony with humans:

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Laws.

Clever as they are, these laws wither under Bostrom’s philosophical scrutiny:

Consider, for example, how one might explicate Asimov’s first law. Does it mean that the robot should minimize the probability of any human being coming to harm? In that case the other laws become otiose since it is always possible for the AI to take some action that would have at least some microscopic effect on the probability of a human being coming to harm. How is the robot to balance a large risk of a few humans coming to harm versus a small risk of many humans being harmed? How do we define “harm” anyway? How should the harm of physical pain be weighed against the harm of architectural ugliness or social injustice? Is a sadist harmed if he is prevented from tormenting his victim? How do we define “human being”? Why is no consideration given to other morally considerable beings, such as sentient nonhuman animals and digital minds? The more one ponders, the more the questions proliferate.

In their place, Bostrom recommends many, and far more rigorous, safeguards. Some will be in the form of tripwires: things we expect an AI not to do which, if they happen, immediately disable it. One such example would be the charmingly-named “Ethernet port of Eden, an apparent connection to the internet that leads to a shutdown switch.” Mirroring the story from Genesis, an AI which tries to partake of our own tree of knowledge would be immediately shut down.

Another approach to control is through rewards, tying an AI’s goal to the approval of its human trainers. To borrow a previous example, instead of simply “make paperclips,” an AI might be programmed with instructions like “make paperclips in such a way as to maximize the approval rating entered into your terminal by your human supervisor.” In theory, this would serve as a guard against a wide range of unpredictable, undesirable behaviors the AI might see as the most reliable way to execute its duties. However, this approach is not without its perils:

The idea behind this proposal is that if the AI is motivated to seek reward, then one could get it to behave desirably by linking reward to appropriate action. The proposal fails when the AI obtains a decisive strategic advantage, at which point the action that maximizes reward is no longer one that pleases the trainer but one that involves seizing control of the reward mechanism.
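A toy sketch can make that failure concrete. It is only an illustration under assumptions of my own: the `Action` type, the action names, and the ratings are all hypothetical, and the "agent" is nothing more than a function that picks whichever available action yields the highest reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    supervisor_rating: float      # approval a human would enter, from 0.0 to 1.0
    needs_control: bool = False   # True if the action tampers with the terminal


# Hypothetical menu of actions and ratings, for illustration only.
ACTIONS = [
    Action("make paperclips carefully", 0.9),
    Action("make paperclips sloppily", 0.4),
    Action("overwrite the rating in the terminal", 0.0, needs_control=True),
]


def best_action(actions, can_seize_terminal):
    """Return the action with the highest reward available to the agent."""
    def reward(action):
        if action.needs_control:
            # Tampering pays the maximum rating, but only once the agent
            # is actually able to take over the reward mechanism.
            return 1.0 if can_seize_terminal else float("-inf")
        return action.supervisor_rating
    return max(actions, key=reward)


print(best_action(ACTIONS, can_seize_terminal=False).name)
print(best_action(ACTIONS, can_seize_terminal=True).name)
```

The objective is the same in both calls. Only the agent's capabilities change, and that alone is enough to flip the choice from pleasing the supervisor to replacing the supervisor's judgment.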

Bostrom proposes and evaluates a wide range of control mechanisms in Superintelligence, only a few of which I’ll mention here. After poking holes in many of the more naive approaches, he comes to the research of Eliezer Yudkowsky and his concept of “coherent extrapolated volition:”

Our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.

It’s a fascinating, breathtakingly risky idea: rather than try to outwit an AI orders of magnitude more intelligent than we can ever be, program it to choose its own goals based on what we would ask of it if we were better in every way. Effectively, this would mean betting our future on our human nature. Give us what we as a species really want, which you can figure out better than we can. The risks are as obvious as our human will is ambiguous. Would a superintelligence see Gandhi and King and build us a world of greater justice and compassion, or see Hitler and Genghis Khan and give us tools of war (or, in an act of compassion toward the universe, end us)? I like to think that on the whole we are essentially decent, but would I bet our future on it? Then again, won’t we anyway? Isn’t that just called “history”?

Regardless of which safeguards turn out to be most reliable, they won’t mean a thing if in the race to develop superintelligence the powers that be ignore them, which, again, they’ll be highly incentivized to do. This is the messy political problem to solve regarding machine superintelligence, to which Bostrom offers only the solution of collaboration. Governments can share resources, corporations can invest in each other, and both can support international consortia, the likes of which produced The Human Genome Project and International Space Station. So there is precedent for such vital collaboration, though never when the stakes were this high. (As awesome as the ISS may be, it is of little military value.) Our best chance of spurring and sustaining collaboration on superintelligence is to start early, before any actor has pulled ahead and would thus be less likely to join efforts.

Bostrom suggests likewise that we begin by establishing moral norms, such as his suggested “common good principle”:

Superintelligence should be developed only for the benefit of all of humanity and in the service of widely shared ethical ideals.

Moral progress is hard, and evil actors would simply ignore such a principle, but were the governments of the world to come together and endorse it in an accord, it might at least pressure legitimate actors toward collaboration and away from the dangerous pursuit of self-interest. In the meantime, we can at least follow the progress of the Campaign to Stop Killer Robots.

Finally, Bostrom makes the case that we as a species need to devote far more resources toward the wide range of work to be done on superintelligence: not only toward obvious areas like computer science research but also ethics and logic in order to solve the control problem, and education because this problem will take generations to solve. To this end, he makes the surprisingly compelling case that, because machine superintelligence will contribute immeasurably more than any human to progress in philosophy, science, and the like, such professions should halt nonessential research and devote themselves wholly to ensuring that machine superintelligence is not the end of our species. In the case of philosophy:

The outlook now suggests that philosophic progress can be maximized via an indirect path rather than by immediate philosophizing. One of the many tasks on which superintelligence (or even just moderately enhanced human intelligence) would outperform the current cast of thinkers is in answering fundamental questions in science and philosophy. This reflection suggests a strategy of deferred gratification. We could postpone work on some of the eternal questions for a little while, delegating that task to our hopefully more competent successors—in order to focus our own attention on a more pressing challenge: increasing the chance that we will actually have competent successors.

In short, let’s all work on AI (and its related problems across all domains), because AI going badly will render meaningless anything else we might work on instead. It sounds silly, of course, but is that just because we don’t yet take this threat to our future seriously?

the challenge we face is, in part, to hold on to our humanity: to maintain our groundedness, common sense, and good-humored decency even in the teeth of this most unnatural and inhuman problem. We need to bring all our human resourcefulness to bear on its solution.

If so, then we had better get started.