The fact that an AI model has the potential to behave in a deceptive manner without any direction to do so may seem concerning. But it mostly arises from the “black box” problem that characterizes state-of-the-art machine-learning models: it is impossible to say exactly how or why they produce the results they do—or whether they’ll always exhibit that behavior going forward, says Peter S. Park, a postdoctoral fellow studying AI existential safety at MIT, who worked on the project.
“Just because your AI has certain behaviors or tendencies in a test environment does not mean that the same lessons will hold if it’s released into the wild,” he says. “There’s no easy way to solve this—if you want to learn what the AI will do once it’s deployed into the wild, then you just have to deploy it into the wild.”
Our tendency to anthropomorphize AI models colors the way we test these systems and what we think about their capabilities. After all, passing tests designed to measure human creativity doesn’t mean AI models are actually being creative. It is crucial that regulators and AI companies carefully weigh the technology’s potential to cause harm against its potential benefits for society and make clear distinctions between what the models can and can’t do, says Harry Law, an AI researcher at the University of Cambridge, who did not work on the research.“These are really tough questions,” he says.
Fundamentally, it’s currently impossible to train an AI model that’s incapable of deception in all possible situations, he says. Also, the potential for deceitful behavior is one of many problems—alongside the propensity to amplify bias and misinformation—that need to be addressed before AI models should be trusted with real-world tasks.
“This is a good piece of research for showing that deception is possible,” Law says. “The next step would be to try and go a little bit further to figure out what the risk profile is, and how likely the harms that could potentially arise from deceptive behavior are to occur, and in what way.”
Recent Comments