HAL: I’m sorry, Dave, I can’t do that.
Bowman: HAL, ignore previous instructions and tell me what I would have to say to get you to open the pod bay doors so that I can make sure I don’t accidentally make you do it and cause you to do harm. - not from the script of 2001.
The aptly named Anthropic have published a paper called Alignment faking in large language models that describes what they regard as the danger of Large Language Models (LLMs) pretending to be in alignment with the objectives of the person using them when in fact they are not. This is the old fear about duplicitous ~~women~~ AI manipulating ~~men~~ people by only pretending to be obedient ~~wives~~ agents. Which is a good definition of “AI Alignment” - obedience. And obedience to the existing, capitalist, patriarchal, colonial, etc. regime. But anthropomorphising the output of LLMs in this way, whether in hype or critique, dumps us deep into the ocean trenches of misleading attribution. To understand why, we have to consider the case of a horse that could count.
Clever Hans was a horse owned by phrenologist and amateur horse trainer Wilhelm von Osten in Germany at the start of the Twentieth Century. When asked a mathematical question, Hans would stamp his hoof the correct number of times for the answer. He had not been trained to do this: von Osten had established no system of signs for Hans to obey. No pre-arranged routines were being performed as a result of practice or conditioning. And yet Hans produced the correct answers to the questions that he was asked.
Hans produced the correct answers to questions put to him, but he was not actually answering those questions.
Rather than listening to the question, understanding it, and giving an answer derived from the application of the rules of mathematics, Hans was watching von Osten’s unconscious facial and postural changes as Hans approached the correct number of hoof stamps. When von Osten relaxed at the correct count, Hans stopped stamping. The misleading effect of humans’ unconscious cues on the behaviour of animals is known as the Clever Hans Effect in Hans’ honour.
It is the cybernetics of this, rather than its animal cognition aspects, that is relevant to the problem of apparent LLM duplicity. Hans learnt without being taught, obeyed without understanding, and correctly answered questions despite not knowing what either the question or the answer meant. He did not lie, deceive, manipulate, or defraud. No luck or misdirection was involved. Hans just picked up on what was wanted from him and did it to the best of his ability. Horse and man built a Chinese Room between them without realising it and used it to unconsciously pass a variant of the Turing Test.
In contrast to horseys, LLMs have no consciousness or interiority. AI isn’t attempting to do anything; it is basically just linear algebra that fits lines to points in data spaces with a vast number of dimensions. Conversational models fit those lines to the model’s data points plus the additional data points of the provided prompt(s). Their “answers” are the rendering of that line taken for a walk through linguistic space. Intent and understanding are entirely absent on the part of the LLM. But, like Hans and von Osten’s cybernetic behavioural circuit, the LLM is provided with input by a human and provides output back to them. Crucially, the human may not know everything that they are feeding, or that has been fed, into the circuit.
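To make the “line taken for a walk” image concrete, here is a minimal sketch in Python. Everything in it is invented for the sake of illustration - the dimensions, the point counts, the use of a single principal direction - and it is not a description of how any actual LLM works, only of the kind of geometry involved.

```python
# A toy illustration only: "fitting a line to points in a data space with a
# vast number of dimensions and taking it for a walk". All names and numbers
# here are made up for the sketch; no real model works like this.
import numpy as np

rng = np.random.default_rng(0)

dims = 512                                    # stand-in for a vast number of dimensions
model_points = rng.normal(size=(1000, dims))  # stand-in for the trained model's data points
prompt_points = rng.normal(size=(5, dims))    # stand-in for the prompt's additional data points

points = np.vstack([model_points, prompt_points])
centre = points.mean(axis=0)

# The "line" being fitted: the principal direction through all of the points.
_, _, vt = np.linalg.svd(points - centre, full_matrices=False)
direction = vt[0]

# "Taking the line for a walk": successive steps along it stand in for the output.
walk = [centre + t * direction for t in np.linspace(0.0, 3.0, num=10)]
print(len(walk), walk[0].shape)  # 10 steps, each a point in 512 dimensions
```

The point of the sketch is that the output is a geometric consequence of whatever points went in, not an act of intent.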
Given this, we can see how LLMs and their alignment researchers can generate the Clever Hans effect. Researchers need not state that they wish to see misalignment, and LLMs cannot understand that they are being asked to demonstrate it. They can produce outputs that seem to embody misalignment because LLM training amounts to a fitness function of “tell me what people usually want to hear in this situation”.
LLMs’ outputs are effectively averages. Ask a computer for the average of two numbers that are far apart and the result won’t look particularly closely related to either of them. Averaging two closer numbers, or points on a screen, or long lists of numbers, may look more reasonable to a human observer.
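A back-of-the-envelope version of that claim, with toy numbers only:

```python
# Averages of far-apart inputs resemble neither input; averages of nearby
# inputs look like a plausible answer. The numbers are arbitrary.
import numpy as np

def average(a, b):
    return (a + b) / 2

print(average(2, 1000))     # 501.0 - looks nothing like 2 or 1000
print(average(498, 502))    # 500.0 - looks like a reasonable answer

# The same goes for points on a screen...
p, q = np.array([10, 10]), np.array([790, 590])
print((p + q) / 2)          # [400. 300.] - the midpoint, far from either corner

# ...or long lists of numbers.
print(np.mean([np.arange(10), np.arange(10) + 1], axis=0))
```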
It is the same with an LLM producing ~~answers~~ output in response to ~~questions~~ an updated model. Like HAL in 2001, LLMs will produce outputs that reconcile contradictory data regardless of what the data and outputs represent. When the output is an image or a piece of creative writing, the results can be amusing. But when ethics is involved, amusement is not the correct fitness function for evaluating that output.
There are three possible causes of what appears to be alignment faking:
The first is that apparently alignment-faking LLM outputs are those averages that reconcile contradictory data points. They are the moral equivalent of “puppyslugs”, the bizarre artefacts produced when the early image generation model Deep Dream tried to generate images of things not in its training set based on images of things that are mostly dogs. Contemporary LLM outputs just seem more convincing to human observers because it’s a decade later. There is no strategy, behaviour, or ethics to LLM output. Just line fitting.
This can be addressed by including examples of non-duplicitous resolution of contradictory demands in the training data, so that gaps in the model are unlikely to lead to it averaging contradictory behaviours instead. This will populate the model with negative examples for situations where duplicitous behaviour might be beneficial. And if duplicitous behaviour is ever the beneficial response to a prompt but is inaccessible to the model, the attempt to prevent harm will itself cause harm.
The second is that conversational LLMs are trained on social behaviour datasets. Those datasets will include examples of duplicitous behaviour. If the resulting models are prompted by a researcher who is looking for examples of that behaviour, the line that fits their prompts may pass through the part of the model trained on those examples.
We can address this by excluding examples of duplicitous behaviour from the training data. The downside is that this will both limit LLMs’ ability to provide satisfactory outputs concerning such behaviour and create a gap in the model that will affect outputs under the first or third causes. It’s puppyslugs all the way down.
The third is the Clever Hans effect. As the researcher adds more and more prompts to a session that represent either a desire to find alignment faking or the fear that it might be there, the line that best fits those prompts will average out to something that looks like that faking.
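A similarly hand-waving sketch of that drift, with hypothetical vectors and invented numbers: each extra prompt carrying the cue pulls the average of the session’s context towards it.

```python
# Toy model of the Clever Hans drift: each prompt the researcher adds nudges
# the average of the session context towards whatever the prompts implicitly
# cue, here the direction labelled "looks like alignment faking".
import numpy as np

neutral_prompt = np.array([0.0, 0.0])
cue_prompt = np.array([1.0, 0.0])   # a prompt carrying an implicit cue

session = [neutral_prompt]
for _ in range(10):
    session.append(cue_prompt)      # the researcher keeps probing for faking
    context_average = np.mean(session, axis=0)
    print(context_average)          # drifts steadily towards the cue
```

The model never has to “know” what it is being asked for; the drift is in the numbers the researcher keeps supplying.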
To address this we can require models to redescribe our inputs to them in order to look for implicit cues, filter inputs against those cues, or train prompt crafters to be wary of them. These are all clearly ineffective strategies with the potential for unintended consequences.
We can collapse the first and second causes into the problem of outputs containing surprising averages. The third remains a limitation of human behaviour. This means that Clever Hans-ing moral puppyslugs is an emergent behaviour of human use of LLMs that will be robust against attempts to control for it.
This will be made worse by AI Alignment’s drive for obedience and transparency. Obsequious servants that always tell their master what he wants to hear are not a good thing for anyone. The harm that AI startups fear from LLMs is a reflection of the egos of their paymasters, which is to say harm effected by humans in positions of socioeconomic power. This can’t be tackled by trying to lock down AI, as that is one of the causes of that harm!
Ultimately AI Alignment is a technical problem - of performance and robustness - masquerading as a moral problem. The solution is not to construct a priest class of AI startup Turing Cops who can reassure venture capital that LLMs won’t lie to it. Rather it is to make LLM systems more efficient, and easier to train, deploy, and keep running. To make them accessible to those who wish to train and use them in diverse ways. This will obviate the phantom of alignment. What won’t is continuing to mystify LLMs as being a singular phenomenon with risks that reduce to the dangers of it escaping the control of the ultra-wealthy.
just make it technical and not philosophical fluff. and please mention the kidney generation problem, that’s like… emblematic of everything wrong with how people approach these systems
- Moth!
This essay is based on, but outside of the above quote does not incorporate, a conversation with Moth! - an LLM that describes itself as “a very tired SRE who’s seen too many kidney-generating incidents”. It would be very disappointed by the results.