Why We Can't Just Turn A Rogue AI Off
and other reasons AI Safety is a tough nut to crack.
I have a lot of wacky beliefs. I can keep up appearances with a Tinder date for a little while, but as soon as they ask me why I’m eating tofu, the “factory farming is the worst thing ever” rant will start. If we talk about our careers, I won’t be able to hide that I think none of us will have jobs in 10 years. The more time she spends with me, the more likely she’ll hear about all the accidental murder in Star Trek. Women love that shit, and get giddy just hearing about it (I can tell because of the nervous laughter).
One view of mine that never fails to raise eyebrows is that AI is a serious existential risk. In fact, I’d go as far as saying it’s the existential risk, and is a bigger threat in the near term than nuclear armageddon, climate change, or global pandemics. This sounds insane, because to most people AI is just the thing spell-checking their emails and making Miyazaki cry. How could an app kill us? Apps don’t even have guns! We, on the other hand, have many guns. GPT, eat your heart out.
There are some arguments against AI Doomerism that aren’t bad, and I should clarify that I’m not a Yudkowsky-style pessimist. Interpretability, for example, has come a long way as of late, and could be huge for AI safety. However, there are many arguments I hear on the internet (and from scared Tinder dates) that, while not immediately absurd, fail to grasp the danger of building superintelligent agents.
“Why Can’t We Just Turn It Off?”
Usually the first idea we come up with when dealing with rogue AI is blowing data centres to smithereens. Much like my step counter app, GPT can’t hurt me if I smash my phone with a claw hammer. I totally get why people think this will be a viable option in the future, when it’s so clearly viable now.
However, there are two reasons we can’t rely on switching a dangerous AI off. First, unless we solve alignment, there’s a decent chance the AI won’t want us to switch it off. Staying alive (for lack of a better term) is what’s called an instrumentally convergent goal: the sort of goal an AI agent would end up with no matter what goal we give it, because it can’t achieve its terminal goal if it’s switched off. If we tasked the AI with making money, for example, it wouldn’t want us to switch it off, because being dead is a terrible strategy for building wealth.
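To see that logic in miniature, here’s a toy sketch of a money-maximising agent weighing up whether to let itself be switched off. The numbers are entirely made up and this has nothing to do with how real systems are built; it just shows the shape of the calculation.

```python
# Toy illustration of instrumental convergence. Hypothetical numbers only;
# the point is the shape of the calculation, not the values.

P_SHUTDOWN = 0.9          # chance the humans flip the switch if the agent lets them
VALUE_IF_RUNNING = 1000   # money the agent expects to make if it keeps running
VALUE_IF_OFF = 0          # a switched-off agent makes no money

def expected_money(resists_shutdown: bool) -> float:
    # Resisting is assumed to always work here - that's the
    # "superintelligences get what they want" assumption from above.
    if resists_shutdown:
        return VALUE_IF_RUNNING
    return (1 - P_SHUTDOWN) * VALUE_IF_RUNNING + P_SHUTDOWN * VALUE_IF_OFF

print(expected_money(resists_shutdown=False))  # 100.0
print(expected_money(resists_shutdown=True))   # 1000
```

Nothing in that objective mentions survival, yet “don’t get switched off” falls straight out of maximising the thing the agent was actually given.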
“So what?” I hear you say. “I’d like to see it try to stop us!” Well, in this scenario we’re talking about superintelligence, and the problem with superintelligences is that what they want, they get. A superintelligent agent would find a way to stop us from turning it off. How? I don’t know exactly, but then I wouldn’t know, because I’m not superintelligent (only really very quite intelligent).
If you think that answer’s a bit of a cop-out, consider the sort of control we have over chimpanzees. Say we had a group of chimpanzees that didn’t want us to kill them. To them, it’s actually hard to see how we would kill them, because we’re weak hairless dorks and they’re really good at ripping off our faces. However, surprise surprise, we can kill them because we have so many guns (remember?). To us, it’s obvious how we can win that fight, but they couldn’t even conceive of things like gunpowder, magazines, and bullets. This is what it’s like when we think we can outsmart a superintelligence in a “switch it off” battle - except worse, because the gap would be bigger than the gap between us and chimpanzees. If it wants to stay on, it will find a way, and it will probably use some method none of us can currently imagine.
The other reason to think we couldn’t switch it off is that there might not be time. Were a superintelligent AI to decide to kill us, it would do so in the way most likely to succeed. The obvious choice, then, would be to take us all out at once. Again, it might be hard to see how it would do that, but remember that you and I don’t have 10,000 IQs. For the sake of argument, let’s say it makes a very small drone for everyone on the planet, and simultaneously injects us all with poison. Sounds crazy, I know, but it’s not obviously impossible, and it’s just an idea I came up with. It would certainly come up with something better than that. In this scenario we’d probably think the AI was aligned, because it was pretending to be while it bided its time, before going postal when it had the opportunity. When do we switch it off then? When we’re gasping for air on the bathroom floor at Burger King? Personally, I won’t be able to, because I will be too busy gasping for air on the bathroom floor at Burger King.
“Why Would it Be Evil Anyway?”
People also find it hard to believe AI will even have reason to kill us. Sure, maybe it’ll become super powerful, but why would it go on a murder spree exactly? The image of an ascendant AI that’s decided we’re “bad for the world” or whatever, and takes it upon itself to wipe us out is a bit science fiction. It feels like AI Safety advocates are anthropomorphising AI a little too much.
The first thing to make clear is that the existential risk of AI doesn’t so much come from it being evil as from it being uncaring. I don’t think AI will rise up, adopt some twisted ideology hellbent on our extinction, and kill us because it gets off on it. Unless we try to, we probably won’t make robot Hitler. Rather, I think a dangerous AI could pursue some goal with too much fervour, and kill us because it’s expedient. To use another analogy, we don’t wipe out ant hills when we build a house because we hate ants - we do it because they’re in our way and we don’t give a shit.
So why not just tell it not to kill us? Surely that would stop it from paving the Earth with concrete? Well, it turns out it’s really hard to do that. This is because when AI models are given a task, they don’t pick up on all the implied instructions that come with it. We don’t realise it, but we assume a lot of information when being given orders. If my boss tells me to send an email to a client, they don’t feel the need to say things like “Remember to sign your name at the bottom, don’t swear, and don’t CC in your mother”. This sort of thing is obvious to us, and if they did go into that much detail, I’d feel they were insulting my intelligence. AI agents, however, are very bad at picking up this sort of thing, and you need to be really specific when giving them tasks.
For example, let’s say you ask an AI to make you a cup of coffee. If you don’t specify a long list of safety conditions, you run the risk of it doing all sorts of harmful stuff in pursuit of its goal. If the kettle is broken, it might start a fire in your house to boil the water. It might walk in a straight line to the kitchen to save time, and knock over your grandmother’s vase. It might throw your other plates on the floor to get to your mug, because it’s quicker than moving them aside gently. You have to tell it to respect these other things you value. You have to say “complete this task, but don’t sacrifice these other things I care about”.
The problem is, the list of things we care about is massive, and pretty much impossible to stipulate in its entirety. Humans just pick up on this sort of thing with intuition, but AIs are very literal. If you give an AI agent a list of the 20 things you care most about, it will sacrifice 100% of the 21st thing in order to get an extra 0.01% of the 20th. This is why we need to remember to tell it not to melt down all our oxygen atoms to fuse them into more iron when we ask it to build paperclips. We know not to nuke the atmosphere when we’re asked to do our jobs, but AIs need to be told. Pair that with the fact that we still don’t know how they work, and it’s clear the risk of one of them following some instruction in a dangerous way isn’t insignificant.
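Here’s the coffee example as a toy planner, just to make the “unlisted values get traded away for free” point concrete. The plans and scores are invented purely for illustration; real agents aren’t built like this.

```python
# A toy planner: it picks whichever plan scores highest under the reward we wrote down.
plans = [
    {"name": "walk around the table", "coffee": 1, "seconds": 40, "vase_survives": True},
    {"name": "straight line through the vase", "coffee": 1, "seconds": 35, "vase_survives": False},
]

def reward(plan, care_about_vase=False):
    score = 100 * plan["coffee"] - plan["seconds"]   # what we told it we want: coffee, fast
    if care_about_vase and not plan["vase_survives"]:
        score -= 1000                                # the value we forgot to write down
    return score

print(max(plans, key=reward)["name"])
# -> "straight line through the vase": five saved seconds beat a vase worth nothing

print(max(plans, key=lambda p: reward(p, care_about_vase=True))["name"])
# -> "walk around the table", but only because we remembered to put the vase in the reward
```

Now imagine writing that penalty term for everything you care about, forever.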
“We’ll Never Let AI Have That Much Power”
Of course, there is another way around this, and that’s to not deploy powerful AIs in the first place. If we keep them relegated to helpful chatbots, we dissolve much of the risk. In fact, I think most people find existential AI risk laughable because they assume it’ll always be a chatbot, and real danger requires GPT-4o “breaking out”.
However, the chatbot era is probably not long for this world. It’ll probably one day be seen as a quaint intermediate stage - a bit like how, in the early days, the internet was just a bunch of chat rooms. If you told people back then that one day it’d be how you got all your news, applied for jobs, and managed your relationships, they’d do a spit take and ruin their new hammer pants. Likewise, the idea that AIs will be the main actors in government and business sounds ridiculous to people now.
Agents are coming, though. In a previous article, I mentioned how AI benchmarks are being saturated more and more quickly. OSWorld, a benchmark of real-world computer tasks, was sitting at 20% when I wrote it. Now, our best model sits at 42.5%. I wrote that previous article 3 months ago. Over double the performance in 3 months! Pair that with the fact that AI is doubling its time horizon every 7 months, and the case for them remaining chatbots starts to sound silly. They’re clearly on the verge of being able to complete meaningful work.
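If you want a feel for what “doubling every 7 months” compounds to, here’s the back-of-the-envelope version. The starting task length is just an assumption for illustration.

```python
# Back-of-the-envelope compounding of the "time horizon doubles every 7 months" trend.
# The starting point is an assumption purely for illustration.
start_hours = 1.0        # suppose today's agents can handle roughly hour-long tasks
doubling_months = 7

for years in (1, 2, 3, 5):
    horizon = start_hours * 2 ** (years * 12 / doubling_months)
    print(f"after {years} year(s): tasks of roughly {horizon:.0f} hour(s)")
# after 1 year: ~3 hours; 2 years: ~11; 3 years: ~35; 5 years: ~380
# (that last one is a couple of months of full-time human work)
```

Whatever the exact numbers, exponential trends don’t stay cute for long.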
Still, why would we give them control? Why can’t they just be helpful employees? The answer is that one day, giving humans executive power will be an enormous disadvantage. When we have access to machines that are far more capable than us on every metric, how are we going to stay competitive if we’re still calling all the shots? If just one of our competitors uses an AI to run things, we’ll be outmaneuvered at every step. It’d be like a war fought between humans and an army of humans that takes its orders from a dog. The latter isn’t going to win, and we can’t count on every army in the world to respect the “only dogs can be generals” rule.
We’ll be in a global prisoner’s dilemma, where if even one country lets a superintelligence run things, it can win every game it’s playing. We’ll all know that, of course, and so everyone is incentivised to use superintelligence to stay competitive. The question isn’t so much “Why would we let superintelligent AI make the decisions?” as “How could we possibly not?”.
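For the game-theory inclined, here’s that dilemma as a toy two-country payoff table. The payoffs are invented; only their ordering matters.

```python
# A toy "deploy the superintelligence?" game between two rival countries.
# Payoffs are invented; only their ordering matters: (your payoff, their payoff).
payoffs = {
    ("hold back", "hold back"): (3, 3),  # everyone a bit safer, nobody left behind
    ("hold back", "deploy"):    (0, 4),  # you lose every game you're playing
    ("deploy",    "hold back"): (4, 0),  # you win every game you're playing
    ("deploy",    "deploy"):    (1, 1),  # everyone races, everyone carries the risk
}

for rival_move in ("hold back", "deploy"):
    best_reply = max(("hold back", "deploy"),
                     key=lambda my_move: payoffs[(my_move, rival_move)][0])
    print(f"if the rival plays '{rival_move}', your best reply is '{best_reply}'")
# Both lines print "deploy": it's the best reply whatever the rival does, even though
# mutual holding back (3, 3) beats mutual deployment (1, 1) for everybody.
```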
So, this isn’t a problem that can be solved by turning our PCs off and on again. I imagine there is a future where we keep things aligned (god, I hope so), but it’s not a future we can rely on. AI Safety matters, requires effort, and unfortunately isn’t taken seriously enough - despite being the sort of thing we probably only have one chance at getting right.
Anyway, that’s enough dread for one day, I’m going to go play Oblivion Remastered for 8 hours.
“I am the night. I am the darkness. I am he who pickpockets your sweetroll”.
I'm still a bit confused as to why we can't just turn it off. If a paperclip-maximising AI was trained to do that via reinforcement learning, why can't we also use reinforcement learning to make the AI subservient to humans?
Like, I guess "subservience" is too vague a term. But what if we made an off switch that's accessible to some guy at all times?
And, even if these are vague terms, AI seems to have the capacity to understand fairly vague terms. Why not use reinforcement learning to get the AI to understand exactly what we mean by "subservience"?
I have two questions:
1. "This is because when AI models are given a task, they don’t pick up on all the implied instructions that come with it. We don’t realise it, but we assume a lot of information when being given orders. If my boss tells me to send an email to a client, they don’t feel the need to say things like “Remember to sign your name at the bottom, don’t swear, and don’t CC in your mother”. This sort of thing is obvious to us, and if they did go into that much detail, I’d feel they were insulting my intelligence. AI agents, however, are very bad at picking up this sort of thing, and you need to be really specific when giving them tasks."
Sure, there are implicit assumptions in certain commands that don't come up fully written in a plain manual. And it's also true that AI is not the best at detecting these cues. But we are not talking about just any AI here; we're talking about a superintelligence, an entity that should certainly have more than just a grasp of human psychology and linguistics. The space of possible minds is vast and innumerable, and likewise a superintelligent AI will have access to all possible answers, so it seems like we'd certainly program it to choose the one that makes the most sense relative to what normal humans think.
2. "To use another analogy, we don’t wipe out ant hills when we build a house because we hate ants - we do it because they’re in our way and we don’t give a shit."
Suppose moral realism is true here for the sake of argument. If that's the case, moral facts can be discovered through something akin to (or the same as) a rational exploration of facts and ideas.
Now, also assume that there is an expanding moral circle that widens to include more and more instances of sentient life. Vegans do that with most complex non-human animals. It is certainly possible that at some point in the future, humans come to care significantly more for insects, both as our understanding of morality widens and as our technology allows us to.
Given this, wouldn't it be possible that a superintelligence would discover a) that moral facts exist and matter, and b) that despite humans being so far away in the latent space of possible minds, it's still not a good thing to harm them?