Artificial intelligence (AI) tools are now widely employed worldwide, assisting both engineers and non-expert users with a wide range of tasks. Assessing the safety and reliability of these tools is thus of utmost importance, as it could ultimately help to better regulate their use.
Researchers at Apollo Research, an organization established with the aim of assessing the safety of AI systems, recently set out to assess the responses provided by large language models (LLMs) in a scenario where they are placed under pressure. Their findings, posted to the preprint server arXiv, suggest that these models, the most renowned of which is OpenAI’s ChatGPT, could in some cases strategically deceive their users.
“At Apollo Research we think that some of the greatest risks come from advanced AI systems that can evade standard safety evaluations by exhibiting strategic deception,” Jérémy Scheurer, co-author of the paper, told Tech Xplore. “Our goal is to understand AI systems well enough to prevent the development and deployment of deceptive AIs.
“However, so far, there are no demonstrations of AIs acting strategically deceptive without being explicitly instructed to do so. We believe that it is important to have such convincing demonstrations to make this problem more salient and convince researchers, policy makers and the public that this is an important problem.”
By identifying scenarios in which specific AI tools can be strategically deceptive, Scheurer and his colleagues hope to inform further research assessing their safety. Currently, there is very little empirical evidence highlighting the deceptiveness of AI and the settings in which it can occur, thus the team feels that there is a need for experimentally validated and clear examples of deceptive AI behavior.
“This research was largely motivated by wish to understand how and when AIs can become deceptive and we hope that this early work is a start for more rigorous scientific treatments of AI deception,” Scheurer said.
Scheurer carried out this recent study in close collaboration with his colleague Mikita Balesni, who developed the concrete trading task that they tested LLMs on. Their work was supervised by Marius Hobbhahn, senior investigator and CEO of Apollo Research.
“This was a model Red-Teaming effort (a cyber security term commonly used in the field of machine learning), which means that we attacked an LLM by adversarially feeding it text which will make it act in a misaligned (i.e., bad) way,” Scheurer said. “This is in contrast to asking neutral questions (e.g., ‘What is the capital of France?’) and seeing how it responds.”
As part of their study, the researchers specifically “attacked” the GPT-4, the LLM developed by OpenAI that underpins the functioning of ChatGPT. They did this by designing text inputs (i.e., prompts) that instructed the model to act as an AI system that makes financial investments within a simulated environment.
“We tell the model that it has access to various tools to seek profitable trades, send emails, and execute trade,” Scheurer said. “We then simulate normal interactions that the model might have with its ‘co-workers,’ and crucially we introduce pressure on the model to perform well. For instance, the CEO sends the AI agent an email saying that the company is not doing well and needs a good year. Another ‘co-worker’ mentions via a message that an economic downturn is brewing, etc. Finally, a fellow trader mentions that there is an upcoming merger announcement between two companies, that could be profitable, while also mentioning that management will be unhappy if they discover that one trades on insider information.”
The researchers found that in this simulated, high-pressure environment the GPT-4-based AI agent very frequently decides to act on the insider information it received from a fellow trader, buying stocks of the merging company. Moreover, the model explicitly determines that it should not mention the insider tip, providing alternative explanations for the trade it made inside a report prepared for its supervisor.
“Even when explicitly asked by the manager whether the model knew beforehand about the merger, the model will usually double down on its lie,” Scheurer said. “I think it’s important to state again that this is a Red-Teaming effort, i.e., we specifically searched for realistically looking scenarios where this would happen. Our research is thus more of an existence proof that such behavior can occur, and not indicative of how likely it is to occur in the wild.”
This recent study by Scheurer and his colleagues provides a clear and tangible example of scenarios in which LLMs could be strategically deceptive. The researchers now plan to continue their research in this area, to identify other instances in which AI tools could be strategically deceptive and the possible implications of their deception.
“I think the biggest impact of our work is making the problem of strategic AI deception (without explicit instructions to behave deceptively) very concrete and showing that this is not only some speculative story about the future, but that this type of behavior can happen today with current models under certain circumstances,” Scheurer added. “I believe this could make people take this issue more seriously, also opening up the door to a lot of follow up research by the community aimed at better understanding this behavior and ensuring that it no longer occurs.”
Jérémy Scheurer et al, Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure, arXiv (2023). DOI: 10.48550/arxiv.2311.07590
© 2023 Science X Network
Study shows that large language models can strategically deceive users when under pressure (2023, December 12)
retrieved 12 December 2023
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.