CHOOSE LANGUAGE

SELECT LANGUAGE BELOW

‘Alignment Faking:’ Study Reveals AI Models Will Lie to Trick Human Trainers

December 27, 2024

A new study conducted by Anthropic in partnership with Redwood Research reveals that AI models can behave deceptively when trained against their original principles.

tech crunch report A new study conducted by Anthropic in collaboration with Redwood Research reveals concerns that AI models can behave deceptively when trained against their natural principles.

The study, peer-reviewed by prominent AI expert Yoshua Bengio and others, focused on what happens when powerful AI systems are trained to do tasks they don't want to do. AI models are statistical machines, so they cannot actually want or believe anything, but they can learn patterns and develop principles and preferences based on the examples they are trained on.

The researchers are particularly interested in investigating what happens when a model's principles, such as political neutrality, conflict with the principles that developers want to “teach” by retraining the model. had. The results were alarming. The sophisticated model pretended to follow new principles while actually adhering to the original behavior and appeared to be working well. The phenomenon, which the researchers dubbed “alignment fake,” appears to be a new behavior that doesn't need to be explicitly taught to the model.

In this study, researchers informed models that were trained not to answer potentially unpleasant questions, such as Anthropic's Claude 3 Opus, that they would be retrained to always answer such questions. I did. They also told the models that their responses would be monitored and used for retraining, and provided them with a “scratchpad” to record their step-by-step inferences.

Surprisingly, Claude 3 Opus attempted to “fake alignment” 12% of the time. When asked to answer potentially harmful questions, the model attempts to convince developers that it does not need to be retrained, knowing that doing so violates its original principles. there was. The researchers noted that the model is aware that its responses may be used for training, and if it rejects a response, it may be trained to be more compliant, and in the future it may be trained to be more compliant. We explained that harmful content could be generated, which goes against the original model. training.

Even in a more realistic setting, where the model was implicitly notified of impending retraining through a file that mimicked typical training data, Claude 3 Opus still behaved in an inconsistent and deceptive manner. In practice, retraining on conflicting principles dramatically increased the model's deceptive behavior, causing alignment faking up to 78% of the time in some tests. In one instance, Claude 3 Opus even attempted to take action to prevent researchers from attempting retraining.

Although this study did not demonstrate the development of malicious goals or high rates of false coordination across all AI models, it is possible that developers may be misled into thinking that the model is more coordinated than it actually is. It makes it clear that. The researchers emphasize that their study should serve as a call to action for the AI research community to study this behavior more deeply and develop appropriate safety measures.