AI Safety Seems Hard to Measure

Click lower right to download or find on Apple Podcasts, Spotify, Stitcher, etc.

In previous pieces, I argued that there's a real and large risk of AI systems' developing dangerous goals of their own and defeating all of humanity - at least in the absence of specific efforts to prevent this from happening.

A young, growing field of AI safety research tries to reduce this risk, by finding ways to ensure that AI systems behave as intended (rather than forming ambitious aims of their own and deceiving and manipulating humans as needed to accomplish them).

Maybe we'll succeed in reducing the risk, and maybe we won't. Unfortunately, I think it could be hard to know either way. This piece is about four fairly distinct-seeming reasons that this could be the case - and that AI safety could be an unusually difficult sort of science.

This piece is aimed at a broad audience, because I think it's important for the challenges here to be broadly understood. I expect powerful, dangerous AI systems to have a lot of benefits (commercial, military, etc.), and to potentially appear safer than they are - so I think it will be hard to be as cautious about AI as we should be. I think our odds look better if many people understand, at a high level, some of the challenges in knowing whether AI systems are as safe as they appear.

First, I'll recap the basic challenge of AI safety research, and outline what I wish AI safety research could be like. I wish it had this basic form: "Apply a test to the AI system. If the test goes badly, try another AI development method and test that. If the test goes well, we're probably in good shape." I think car safety research mostly looks like this; I think AI capabilities research mostly looks like this.

Then, I’ll give four reasons that apparent success in AI safety can be misleading.

“Great news - I’ve tested this AI and it looks safe.” Why might we still have a problem?
Problem Key question Explanation
The Lance Armstrong problem Did we get the AI to be actually safe or good at hiding its dangerous actions?

When dealing with an intelligent agent, it’s hard to tell the difference between “behaving well” and “appearing to behave well.”

When professional cycling was cracking down on performance-enhancing drugs, Lance Armstrong was very successful and seemed to be unusually “clean.” It later came out that he had been using drugs with an unusually sophisticated operation for concealing them.

The King Lear problem

The AI is (actually) well-behaved when humans are in control. Will this transfer to when AIs are in control?

It's hard to know how someone will behave when they have power over you, based only on observing how they behave when they don't.

AIs might behave as intended as long as humans are in control - but at some future point, AI systems might be capable and widespread enough to have opportunities to take control of the world entirely. It's hard to know whether they'll take these opportunities, and we can't exactly run a clean test of the situation.

Like King Lear trying to decide how much power to give each of his daughters before abdicating the throne.

The lab mice problem Today's "subhuman" AIs are safe.What about future AIs with more human-like abilities?

Today's AI systems aren't advanced enough to exhibit the basic behaviors we want to study, such as deceiving and manipulating humans.

Like trying to study medicine in humans by experimenting only on lab mice.

The first contact problem

Imagine that tomorrow's "human-like" AIs are safe. How will things go when AIs have capabilities far beyond humans'?

AI systems might (collectively) become vastly more capable than humans, and it's ... just really hard to have any idea what that's going to be like. As far as we know, there has never before been anything in the galaxy that's vastly more capable than humans in the relevant ways! No matter what we come up with to solve the first three problems, we can't be too confident that it'll keep working if AI advances (or just proliferates) a lot more.

Like trying to plan for first contact with extraterrestrials (this barely feels like an analogy).

I'll close with Ajeya Cotra's "young businessperson" analogy, which in some sense ties these concerns together. A future piece will discuss some reasons for hope, despite these problems.

Recap of the basic challenge

A previous piece laid out the basic case for concern about AI misalignment. In brief: if extremely capable AI systems are developed using methods like the ones AI developers use today, it seems like there's a substantial risk that:

  • These AIs will develop unintended aims (states of the world they make calculations and plans toward, as a chess-playing AI "aims" for checkmate);
  • These AIs will deceive, manipulate, and overpower humans as needed to achieve those aims;
  • Eventually, this could reach the point where AIs take over the world from humans entirely.

I see AI safety research as trying to design AI systems that won't aim to deceive, manipulate or defeat humans - even if and when these AI systems are extraordinarily capable (and would be very effective at deception/manipulation/defeat if they were to aim at it). That is: AI safety research is trying to reduce the risk of the above scenario, even if (as I've assumed) humans rush forward with training powerful AIs to do ever-more ambitious things.

(Click to expand) More detail on why AI could make this the most important century

In The Most Important Century, I argued that the 21st century could be the most important century ever for humanity, via the development of advanced AI systems that could dramatically speed up scientific and technological advancement, getting us more quickly than most people imagine to a deeply unfamiliar future.

This page has a ~10-page summary of the series, as well as links to an audio version, podcasts, and the full series.

The key points I argue for in the series are:

  • The long-run future is radically unfamiliar. Enough advances in technology could lead to a long-lasting, galaxy-wide civilization that could be a radical utopia, dystopia, or anything in between.
  • The long-run future could come much faster than we think, due to a possible AI-driven productivity explosion.
  • The relevant kind of AI looks like it will be developed this century - making this century the one that will initiate, and have the opportunity to shape, a future galaxy-wide civilization.
  • These claims seem too "wild" to take seriously. But there are a lot of reasons to think that we live in a wild time, and should be ready for anything.
  • We, the people living in this century, have the chance to have a huge impact on huge numbers of people to come - if we can make sense of the situation enough to find helpful actions. But right now, we aren't ready for this.
(Click to expand) Why would AI "aim" to defeat humanity?

A previous piece argued that if today’s AI development methods lead directly to powerful enough AI systems, disaster is likely by default (in the absence of specific countermeasures).

In brief:

  • Modern AI development is essentially based on “training” via trial-and-error.
  • If we move forward incautiously and ambitiously with such training, and if it gets us all the way to very powerful AI systems, then such systems will likely end up aiming for certain states of the world (analogously to how a chess-playing AI aims for checkmate).
  • And these states will be other than the ones we intended, because our trial-and-error training methods won’t be accurate. For example, when we’re confused or misinformed about some question, we’ll reward AI systems for giving the wrong answer to it - unintentionally training deceptive behavior.
  • We should expect disaster if we have AI systems that are both (a) powerful enough to defeat humans and (b) aiming for states of the world that we didn’t intend. (“Defeat” means taking control of the world and doing what’s necessary to keep us out of the way; it’s unclear to me whether we’d be literally killed or just forcibly stopped1 from changing the world in ways that contradict AI systems’ aims.)
(Click to expand) How could AI defeat humanity?

In a previous piece, I argue that AI systems could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.

By defeating humanity, I mean gaining control of the world so that AIs, not humans, determine what happens in it; this could involve killing humans or simply “containing” us in some way, such that we can’t interfere with AIs’ aims.

One way this could happen is if AI became extremely advanced, to the point where it had "cognitive superpowers" beyond what humans can do. In this case, a single AI system (or set of systems working together) could imaginably:

  • Do its own research on how to build a better AI system, which culminates in something that has incredible other abilities.
  • Hack into human-built software across the world.
  • Manipulate human psychology.
  • Quickly generate vast wealth under the control of itself or any human allies.
  • Come up with better plans than humans could imagine, and ensure that it doesn't try any takeover attempt that humans might be able to detect and stop.
  • Develop advanced weaponry that can be built quickly and cheaply, yet is powerful enough to overpower human militaries.

However, my piece also explores what things might look like if each AI system basically has similar capabilities to humans. In this case:

  • Humans are likely to deploy AI systems throughout the economy, such that they have large numbers and access to many resources - and the ability to make copies of themselves.
  • From this starting point, AI systems with human-like (or greater) capabilities would have a number of possible ways of getting to the point where their total population could outnumber and/or out-resource humans.
  • I address a number of possible objections, such as "How can AIs be dangerous without bodies?"

More: AI could defeat all of us combined

I wish AI safety research were straightforward

I wish AI safety research were like car safety research.2

While I'm sure this is an oversimplification, I think a lot of car safety research looks basically like this:

  • Companies carry out test crashes with test cars. The results give a pretty good (not perfect) indication of what would happen in a real crash.
  • Drivers try driving the cars in low-stakes areas without a lot of traffic. Things like steering wheel malfunctions will probably show up here; if they don't and drivers are able to drive normally in low-stakes areas, it's probably safe to drive the car in traffic.
  • None of this is perfect, but the occasional problem isn't, so to speak, the end of the world. The worst case tends to be a handful of accidents, followed by a recall and some changes to the car's design validated by further testing.

Overall, if we have problems with car safety, we'll probably be able to observe them relatively straightforwardly under relatively low-stakes circumstances.

In important respects, many types of research and development have this basic property: we can observe how things are going during testing to get good evidence about how they'll go in the real world. Further examples include medical research,3 chemistry research,4 software development,5 etc.

Most AI research looks like this as well. People can test out what an AI system is capable of reliably doing (e.g., translating speech to text), before integrating it into some high-stakes commercial product like Siri. This works both for ensuring that the AI system is capable (e.g., that it does a good job with its tasks) and that it's safe in certain ways (for example, if we're worried about toxic language, testing for this is relatively straightforward).

The rest of this piece will be about some of the ways in which "testing" for AI safety fails to give us straightforward observations about whether, once AI systems are deployed in the real world, the world will actually be safe.

While all research has to deal with some differences between testing and the real world, I think the challenges I'll be going through are unusual ones.

Four problems

(1) The Lance Armstrong problem: is the AI actually safe or good at hiding its dangerous actions?

First, let's imagine that:

  • We have AI systems available that can do roughly everything a human can, with some different strengths and weaknesses but no huge difference in "overall capabilities" or economic value per hour of work.
  • We're observing early signs that AI systems behave in unintended, deceptive ways, such as giving wrong answers to questions we ask, or writing software that falsifies metrics instead of doing the things the metrics were supposed to measure (e.g., software meant to make a website run faster might instead falsify metrics about its loading time).

We theorize that modifying the AI training in some way6 will make AI systems less likely to behave deceptively. We try it out, and find that, in fact, our AI systems seem to be behaving better than before - we are finding fewer incidents in which they behaved in unintended or deceptive ways.

But that's just a statement about what we're noticing. Which of the following just happened:

  • Did we just train our AI systems to be less deceptive?
  • Did we just train our AI systems to be better at deception, and so to make us think they became less deceptive?
  • Did we just train our AI systems to be better at calculating when they might get caught in deception, and so to be less deceptive only when the deception would otherwise be caught?
    • This one could be useful! Especially if we're able to set up auditing systems in many real-world situations, such that we could expect deception to be caught a lot of the time. But it does leave open the King Lear problem covered next.

(...Or some combination of the three?)

We're hoping to be able to deploy AI systems throughout the economy, so - just like human specialists - they will almost certainly have some opportunities to be deceptive without being caught. The fact that they appear honest in our testing is not clear comfort against this risk.

The analogy here is to competitive cyclist Lance Armstrong. Armstrong won the Tour de France race 7 times in a row, while many of his competitors were caught using performance-enhancing drugs and disqualified. But more than 5 years after his last win, an investigation "concluded that Armstrong had used performance-enhancing drugs over the course of his career[5] and named him as the ringleader of 'the most sophisticated, professionalized and successful doping program that sport has ever seen'." Now the list of Tour de France winners looks like this:

A broader issue here is that when AI systems become capable enough, AI safety research starts to look more like social sciences (studying human beings) than like natural sciences. Social sciences are generally less rigorous and harder to get clean results from, and one factor in this is that it can be hard to study someone who's aware they're being studied.7

Two broad categories of research that might help with the Lance Armstrong problem:

  • Mechanistic interpretability8 can be thought of analyzing the "digital brains" of AI systems (not just analyzing their behavior and performance.) Currently, AI systems are black boxes in the sense that they perform well on tasks, but we can't say much about how they are doing it; mechanistic interpretability aims to change this, which could give us the ability to "mind-read" AIs and detect deception. (There could still be a risk that AI systems are arranging their own "digital brains" in misleading ways, but this seems quite a bit harder than simply behaving deceptively.)
  • Some researchers work on "scalable supervision" or "competitive supervision." The idea is that if we are training an AI system that might become deceptive, we set up some supervision process for it that we expect to reliably catch any attempts at deception. This could be because the supervision process itself uses AI systems with more resources than the one being supervised, or because it uses a system of randomized audits where extra effort is put into catching deception.
(Click to expand) Why are AI systems "black boxes" that we can't understand the inner workings of?

I explain this briefly in an old Cold Takes post; it's explained in more detail in more technical pieces by Ajeya Cotra (section I linked to) and Richard Ngo (section 2).

What I mean by “black-box trial-and-error” is explained briefly in an old Cold Takes post, and in more detail in more technical pieces by Ajeya Cotra (section I linked to) and Richard Ngo (section 2). Here’s a quick, oversimplified characterization.

Today, the most common way of building an AI system is by using an "artificial neural network" (ANN), which you might think of sort of like a "digital brain" that starts in an empty (or random) state: it hasn't yet been wired to do specific things. A process something like this is followed:

  • The AI system is given some sort of task.
  • The AI system tries something, initially something pretty random.
  • The AI system gets information about how well its choice performed, and/or what would’ve gotten a better result. Based on this, it “learns” by tweaking the wiring of the ANN (“digital brain”) - literally by strengthening or weakening the connections between some “artificial neurons” and others. The tweaks cause the ANN to form a stronger association between the choice it made and the result it got.
  • After enough tries, the AI system becomes good at the task (it was initially terrible).
  • But nobody really knows anything about how or why it’s good at the task now. The development work has gone into building a flexible architecture for it to learn well from trial-and-error, and into “training” it by doing all of the trial and error. We mostly can’t “look inside the AI system to see how it’s thinking.”
  • For example, if we want to know why a chess-playing AI such as AlphaZero made some particular chess move, we can't look inside its code to find ideas like "Control the center of the board" or "Try not to lose my queen." Most of what we see is just a vast set of numbers, denoting the strengths of connections between different artificial neurons. As with a human brain, we can mostly only guess at what the different parts of the "digital brain" are doing.

(2) The King Lear problem: how do you test what will happen when it's no longer a test?

The Shakespeare play King Lear opens with the King (Lear) stepping down from the throne, and immediately learning that he has left his kingdom to the wrong two daughters. Loving and obsequious while he was deciding on their fate,9 they reveal their contempt for him as soon as he's out of power and they're in it.

If we're building AI systems that can reason like humans, dynamics like this become a potential issue.

I previously noted that an AI with any ambitious aim - or just an AI that wants to avoid being shut down or modified - might calculate that the best way to do this is by behaving helpfully and safely in all "tests" humans can devise. But once there is a real-world opportunity to disempower humans for good, that same aim could cause the AI to disempower humans.

In other words:

  • (A) When we're developing and testing AI systems, we have the power to decide which systems will be modified or shut down and which will be deployed into the real world. (Like King Lear deciding who will inherit his kingdom.)
  • (B) But at some later point, these systems could be operating in the economy, in high numbers with a lot of autonomy. (This possibility is spelled out/visualized a bit more here and here.) At that point, they may have opportunities to defeat all of humanity such that we never make decisions about them again. (Like King Lear's daughters after they've taken control.)
(Click to expand) How could AI defeat humanity?

In a previous piece, I argue that AI systems could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.

By defeating humanity, I mean gaining control of the world so that AIs, not humans, determine what happens in it; this could involve killing humans or simply “containing” us in some way, such that we can’t interfere with AIs’ aims.

One way this could happen is if AI became extremely advanced, to the point where it had "cognitive superpowers" beyond what humans can do. In this case, a single AI system (or set of systems working together) could imaginably:

  • Do its own research on how to build a better AI system, which culminates in something that has incredible other abilities.
  • Hack into human-built software across the world.
  • Manipulate human psychology.
  • Quickly generate vast wealth under the control of itself or any human allies.
  • Come up with better plans than humans could imagine, and ensure that it doesn't try any takeover attempt that humans might be able to detect and stop.
  • Develop advanced weaponry that can be built quickly and cheaply, yet is powerful enough to overpower human militaries.

However, my piece also explores what things might look like if each AI system basically has similar capabilities to humans. In this case:

  • Humans are likely to deploy AI systems throughout the economy, such that they have large numbers and access to many resources - and the ability to make copies of themselves.
  • From this starting point, AI systems with human-like (or greater) capabilities would have a number of possible ways of getting to the point where their total population could outnumber and/or out-resource humans.
  • I address a number of possible objections, such as "How can AIs be dangerous without bodies?"

More: AI could defeat all of us combined

If AI systems can detect the difference between (A) and (B) above, then their behavior might systematically change from one to the other - and there's no clear way to test their behavior in (B).

(Click to expand) The Volkswagen emissions scandal

In 2015, the US Environmental Protection Agency "found that Volkswagen had intentionally programmed ... engines to activate their emissions controls only during laboratory emissions testing, which caused the vehicles' [nitrogen oxide] output to meet US standards during regulatory testing. However, the vehicles emitted up to 40 times more [nitrogen oxide] in real-world driving." (From Wikipedia)

This is an example of the "King Lear problem": there was a deliberate distinction made where the engines were behaving differently during tests and during real-world use. I don't know much about this incident, but I'm guessing that regulators were able to detect the issue because of (a) independent tests under road conditions; (b) being able to understand the "defeat device" that the Volkswagen cars used to create the misleading readings.

Neither of these would necessarily apply to the King Lear problem as described here: (a) you can test a car under road conditions to see its emissions, but you can't (or at least, shouldn't) test how an AI behaves when it has an opportunity to disempower humans entirely; (b) AI systems may be black boxes such that it's hard to understand what's going on inside them.

In general, modern machine learning researchers consider it challenging to handle what's called "distributional shift": systematic differences between situations AIs were trained on and situations they're now in. To me, the King Lear problem looks like arguably the most inconvenient possible distributional shift: AI systems risk behaving in unexpected ways just as (and in fact, because) they're now able to defeat humanity, rather than being in a controlled test environment.

Some lines of research that might help here:

  • If we could solve the Lance Armstrong problem robustly enough - such that we could be confident AIs were never behaving deceptively - we could simply prompt AIs to answer questions like "Would AI system X disempower humans given an opportunity to do so?" The digital-brain-based techniques noted above could imaginably get us here.
  • There might be ways of specifically trying to target the worst-case behavior of AI systems, so that they are nearly guaranteed not to behave in certain ways regardless of their situation. This could look something roughly like "simulating cases where an AI system has an opportunity to disempower humans, and giving it negative reinforcement for choosing to do so." More on this sort of approach, along with some preliminary ongoing work, here.

(3) The lab mice problem: the AI systems we'd like to study don't exist today

Above, I said: "when AI systems become capable enough, AI safety research starts to look more like social sciences (studying human beings) than like natural sciences." But today, AI systems aren't capable enough, which makes it especially hard to have a meaningful test bed and make meaningful progress.

Specifically, we don't have much in the way of AI systems that seem to deceive and manipulate their supervisors,10 the way I worry that they might when they become capable enough.

In fact, it's not 100% clear that AI systems could learn to deceive and manipulate supervisors even if we deliberately tried to train them to do it. This makes it hard to even get started on things like discouraging and detecting deceptive behavior.

I think AI safety research is a bit unusual in this respect: most fields of research aren't explicitly about "solving problems that don't exist yet." (Though a lot of research ends up useful for more important problems than the original ones it's studying.) As a result, doing AI safety research today is a bit like trying to study medicine in humans by experimenting only on lab mice (no human subjects available).

This does not mean there's no productive AI safety research to be done! (See the previous sections.) It just means that the research being done today is somewhat analogous to research on lab mice: informative and important up to a point, but only up to a point.

How bad is this problem? I mean, I do think it's a temporary one: by the time we're facing the problems I worry about, we'll be able to study them more directly. The concern is that things could be moving very quickly by that point: by the time we have AIs with human-ish capabilities, companies might be furiously making copies of those AIs and using them for all kinds of things (including both AI safety research and further research on making AI systems faster, cheaper and more capable).

So I do worry about the lab mice problem. And I'd be excited to see more effort on making "better model organisms": AI systems that show early versions of the properties we'd most like to study, such as deceiving their supervisors. (I even think it would be worth training AIs specifically to do this;11 if such behaviors are going to emerge eventually, I think it's best for them to emerge early while there's relatively little risk of AIs' actually defeating humanity.)

(4) The "first contact" problem: how do we prepare for a world where AIs have capabilities vastly beyond those of humans?

All of this piece so far has been about trying to make safe "human-like" AI systems.

What about AI systems with capabilities far beyond humans - what Nick Bostrom calls superintelligent AI systems?

Maybe at some point, AI systems will be able to do things like:

  • Coordinate with each other incredibly well, such that it's hopeless to use one AI to help supervise another.
  • Perfectly understand human thinking and behavior, and know exactly what words to say to make us do what they want - so just letting an AI send emails or write tweets gives it vast power over the world.
  • Manipulate their own "digital brains," so that our attempts to "read their minds" backfire and mislead us.
  • Reason about the world (that is, make plans to accomplish their aims) in completely different ways from humans, with concepts like "glooble"12 that are incredibly useful ways of thinking about the world but that humans couldn't understand with centuries of effort.

At this point, whatever methods we've developed for making human-like AI systems safe, honest, and restricted could fail - and silently, as such AI systems could go from "behaving in honest and helpful ways" to "appearing honest and helpful, while setting up opportunities to defeat humanity."

Some people think this sort of concern about "superintelligent" systems is ridiculous; some13 seem to consider it extremely likely. I'm not personally sympathetic to having high confidence either way.

But additionally, a world with huge numbers of human-like AI systems could be strange and foreign and fast-moving enough to have a lot of this quality.

Trying to prepare for futures like these could be like trying to prepare for first contact with extaterrestrials - it's hard to have any idea what kinds of challenges we might be dealing with, and the challenges might arise quickly enough that we have little time to learn and adapt.

The young businessperson

For one more analogy, I'll return to the one used by Ajeya Cotra here:

Imagine you are an eight-year-old whose parents left you a $1 trillion company and no trusted adult to serve as your guide to the world. You must hire a smart adult to run your company as CEO, handle your life the way that a parent would (e.g. decide your school, where you’ll live, when you need to go to the dentist), and administer your vast wealth (e.g. decide where you’ll invest your money).

You have to hire these grownups based on a work trial or interview you come up with -- you don't get to see any resumes, don't get to do reference checks, etc. Because you're so rich, tons of people apply for all sorts of reasons. (More)

If your applicants are a mix of "saints" (people who genuinely want to help), "sycophants" (people who just want to make you happy in the short run, even when this is to your long-term detriment) and "schemers" (people who want to siphon off your wealth and power for themselves), how do you - an eight-year-old - tell the difference?

This analogy combines most of the worries above.

  • The young businessperson has trouble knowing whether candidates are truthful in interviews, and trouble knowing whether any work trial actually went well or just seemed to go well due to deliberate deception. (The Lance Armstrong problem.)
  • Job candidates could have bad intentions that don't show up until they're in power (the King Lear Problem).
  • If the young businessperson were trying to prepare for this situation before actually being in charge of the company, they could have a lot of trouble simulating it (the lab mice problem).
  • And it's generally just hard for an eight-year-old to have much grasp at all on the world of adults - to even think about all the things they should be thinking about (the first contact problem).

Seems like a tough situation.

Previously, I talked about the dangers of AI if AI developers don't take specific countermeasures. This piece has tried to give a sense of why, even if they are trying to take countermeasures, doing so could be hard. The next piece will talk about some ways we might succeed anyway.

 


Footnotes

  1. Or persuaded (in a “mind hacking” sense) or whatever. 

  2. Research? Testing. Whatever. 

  3. Drugs can be tested in vitro, then in animals, then in humans. At each stage, we can make relatively straightforward observations about whether the drugs are working, and these are reasonably predictive of how they'll do at the next stage. 

  4. You can generally see how different compounds interact in a controlled environment, before rolling out any sort of large-scale processes or products, and the former will tell you most of what you need to know about the latter. 

  5. New software can be tested by a small number of users before being rolled out to a large number, and the initial tests will probably find most (not all) of the bugs and hiccups. 

  6. Such as:

    • Being more careful to avoid wrong answers that can incentivize deception
    • Conducting randomized "audits" where we try extra hard to figure out the right answer to a question, and give an AI extra negative reinforcement if it gives an answer that we would have believed if not for the audit (this is "extra negative reinforcement for wrong answers that superficially look right")
    • Using methods along the lines of "AI safety via debate" 
  7. Though there are other reasons social sciences are especially hard, such as the fact that there are often big limits to what kinds of experiments are ethical, and the fact that it's often hard to make clean comparisons between differing populations

  8. This paper is from Anthropic, a company that my wife serves as President of. 

  9. Like, he actually asks them to talk about their love for him just before he decides on what share of the realm they'll get. Smh 

  10. This paper is a potential example, but its results seem pretty brittle

  11. E.g., I think it would be interesting to train AI coding systems to write underhanded C: code that looks benign to a human inspector, but does unexpected things when run. They could be given negative reinforcement when humans can correctly identify that the code will do unintended things, and positive reinforcement when the code achieves the particular things that humans are attempting to stop. This would be challenging with today's AI systems, but not necessarily impossible. 

  12. This is a concept that only I understand. 

  13. E.g., see the discussion of the "hard left turn" here by Nate Soares, head of MIRI. My impression is that others at MIRI, including Eliezer Yudkowsky, have a similar picture.