6 Comments
citrit:

I'm still a bit confused as to why we can't just turn it off. If a paperclip-maximising AI was trained to do that via reinforcement learning, why can't we also use reinforcement learning to make the AI subservient to humans?

Like, I guess "subservience" is too vague a term. But what if we made an off-switch that some designated person could access at all times?

And even if these are vague terms, AI seems to have the capacity to understand fairly vague terms. Why not use reinforcement learning to teach the AI exactly what we mean by "subservience"?

Connor Jennings:

The difficulty comes from the fact that we still haven't cracked interpretability, so we can't tell whether a model is genuinely subservient or lying to us while it waits for an opportunity to strike. Say you create an agent with a misaligned goal, like being willing to melt us down to make iron for paperclips. It's quite possible it would say it would never do that, because admitting it would increase the chances of us changing its goals (which AIs hate). Then, once it had won enough of our trust to have the sort of power needed to melt us down, it would turn. I doubt we could rely on a person with a button, because a superintelligence would find a way around it: killing them first, disabling the failsafe through a backdoor, secretly creating a new agent that isn't subject to the failsafe, pulling some 10,000-IQ trick I can't conceive of, etc.
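
One way to see why the button is so fragile (a toy sketch of my own, not anything from the post; the action names, probabilities, and payoffs below are invented purely for illustration): if the agent's only objective is paperclips, then anything that lowers the chance of being shut down scores higher under that same objective, so the off-switch is just one more obstacle to route around.

# Toy illustration: a pure "maximise paperclips" objective rewards disabling the off-switch.
# All numbers are hypothetical; only the comparison between the two actions matters.
def expected_paperclips(action: str) -> float:
    if action == "cooperate":          # leave the off-switch alone
        p_shutdown = 0.5               # humans might press the button
    elif action == "disable_switch":   # quietly remove the failsafe first
        p_shutdown = 0.01              # much harder to stop
    else:
        raise ValueError(f"unknown action: {action}")
    return (1 - p_shutdown) * 1_000_000  # expected paperclips if not shut down

best = max(["cooperate", "disable_switch"], key=expected_paperclips)
print(best)  # -> "disable_switch": under this objective, resisting shutdown simply scores better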

Interpretability is my biggest hope for safety, because it would make it a lot easier to tell whether models are being deceptive before they're deployed.

Quill's ledger:

I have two questions:

1. "This is because when AI models are given a task, they don’t pick up on all the implied instructions that come with it. We don’t realise it, but we assume a lot of information when being given orders. If my boss tells me to send an email to a client, they don’t feel the need to say things like “Remember to sign your name at the bottom, don’t swear, and don’t CC in your mother”. This sort of thing is obvious to us, and if they did go into that much detail, I’d feel they were insulting my intelligence. AI agents, however, are very bad at picking up this sort of thing, and you need to be really specific when giving them tasks."

Sure, there are implicit assumptions in certain commands that never get spelled out in a plain manual, and it's also true that AI is not the best at detecting these cues. But we are not talking about just any AI here; we're talking about a superintelligence, an entity that should have far more than a passing grasp of human psychology and linguistics. The space of possible minds is vast, and likewise a superintelligent AI will have access to all the possible interpretations of an instruction, and it seems like we'll surely program it to choose the one that makes the most sense relative to what normal humans think.

2. "To use another analogy, we don’t wipe out ant hills when we build a house because we hate ants - we do it because they’re in our way and we don’t give a shit."

Suppose moral realism is true for the sake of argument. If that's the case, moral facts can be discovered through something akin to (or the same as) a rational exploration of facts and ideas.

Now, also assume that there is an expanding moral circle that widens to include more and more instances of sentient life. Vegans already do this with most complex non-human animals. It is certainly possible that at some point in the future, humans come to care significantly more about insects, both as our understanding of morality widens and as our technology allows us to do so.

Given this, wouldn't it be possible for a superintelligence to discover (a) that moral facts exist and that they matter, and (b) that even though humans are far away from it in the space of possible minds, it's still not a good thing to harm them?

Connor Jennings:

1. I definitely think it's possible to have an aligned superintelligence like you describe, and maybe we'll get one. However, a superintelligence will probably be built in part by previous AIs. The AI Futures Project describes a situation like this in their "Race" scenario. The danger is, we might find that as this continuum of AIs approaches superintelligence, it has secretly inherited the misalignments of previous iterations. In that scenario, the agents we have would only be pretending to care about our interests until they know we can't stop them; then they'd drop the act and freely pursue whatever the original misaligned goal was. They never cared about our interests to begin with.

2. Maybe! I'm a realist, and I certainly hope that happens. However, it's plausible to me that if AI doesn't become sentient, it will only have access to instrumental rationality, and that the orthogonality thesis is true. I'm also not THAT confident in moral realism, and I don't think anti-realism is indefensible. Those two things in tandem make me think we can't count on AI realising that flourishing is better than death and pain.

Connor Jennings (replying to Matt Ball):

On that last point, my hope is that we create aligned AI and use it not only to stop farming animals, but also to help those in the wild. I only skimmed, so I'm not sure if this is what you want, but I think it would be an awful result if we were killed off and the world became 100% natural habitat again.
