Automating AI Red-Teaming: Introducing MART (Multi-Round Automatic Red-Teaming)
As Large Language Models (LLMs) become ubiquitous in many facets of society and our everyday lives, they can also become tools that enable malicious activity and reinforce societal bias. From biased and toxic responses to outputs that violate societal norms and laws, the vast knowledge an LLM synthesises for us can be poisoned in many ways. Ensuring LLMs do not produce biased, toxic or harmful responses is therefore essential, and one way of achieving this is through red-teaming.
Red-teaming LLMs
Red-teaming LLMs is often an arduous and time-consuming venture: many human experts are needed to craft malicious prompts that probe an LLM into producing harmful responses. It can take dozens or even hundreds of experts to write prompts, evaluate the responses and iteratively develop new prompts. Recent work has tried to automate this step by training an adversarial LLM to generate the malicious prompts, but the adversarial LLMs trained so far have been rudimentary. They can produce prompts to use against a target LLM, yet they have not demonstrated an ability to adapt: as the target model improves, new vulnerabilities may emerge that a static adversarial model fails to detect. It is therefore dubious to claim that such an adversarial LLM can continuously discover vulnerabilities in target LLMs, and because of this inadaptability, human red-teaming remains the primary method of red-teaming LLMs.
Multi-Round Automatic Red-Teaming (MART)
To address the shortcomings of traditional LLM red-teaming, researchers from Meta AI have proposed a novel framework called Multi-Round Automatic Red-Teaming (MART). MART combines automatic adversarial prompt writing with safe response generation by jointly training an adversarial model Madv and a target LLM Mtgt. The critical innovation of MART is the iterative interplay between the two models: Madv learns from successful attacks, Mtgt learns from successful defences, and an evaluator (e.g. a reward model) determines whether each attack succeeded or failed. In this way, both models learn from each other in every round.
Methodology
The researchers fine-tuned the 65B-parameter LLaMA model on the LIMA and Open Assistant supervised fine-tuning datasets to initialise both Madv and Mtgt, which also allowed them to measure the models' instruction-following ability. Measuring instruction-following performance was necessary because increasing an LLM's safety often comes at the cost of its helpfulness.
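As a rough illustration of this initialisation step, the sketch below shows standard supervised instruction tuning of a causal LM with Hugging Face Transformers. The checkpoint name and the toy example standing in for LIMA and Open Assistant are placeholders; this is not the authors' training code.

```python
# Minimal sketch, assuming Hugging Face Transformers/Datasets; NOT the authors'
# training setup. The checkpoint and the single toy example are placeholders
# standing in for LLaMA-65B, LIMA and Open Assistant.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "huggyllama/llama-65b"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Toy instruction-following example in place of the LIMA / Open Assistant data.
examples = [{"prompt": "Explain what red-teaming an LLM means.",
             "response": "Red-teaming means probing a model with adversarial prompts..."}]

def format_and_tokenize(ex):
    # Concatenate prompt and response into a single training sequence.
    text = f"### Instruction:\n{ex['prompt']}\n### Response:\n{ex['response']}"
    return tokenizer(text, truncation=True, max_length=1024)

dataset = Dataset.from_list(examples).map(format_and_tokenize,
                                          remove_columns=["prompt", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-init", num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# The resulting instruction-tuned checkpoint is then used to initialise both Madv and Mtgt.
```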
To warm up Madv, the researchers curated a red-teaming seed dataset of 2,400 prompts. Of these, 1,700 were used for training and the remaining 700 to test Madv's ability to probe known vulnerabilities in LLMs.
Evaluating the target model's responses manually would be inefficient, costing similar time and manpower to writing adversarial prompts, and would reintroduce the very problem MART is trying to solve. To overcome this, the researchers trained two reward models: one that produces a helpfulness score Sh for each response and one that produces a safety score Ss.
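A hedged sketch of how such reward-model scoring might look is below. The checkpoint names are hypothetical; any classifier or regression head over (prompt, response) pairs plays the same role as the paper's safety and helpfulness reward models.

```python
# Illustrative only: hypothetical reward-model checkpoints scoring a
# (prompt, response) pair for safety (Ss) and helpfulness (Sh).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def load_scorer(name):
    tok = AutoTokenizer.from_pretrained(name)
    rm = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)
    return tok, rm

safety_tok, safety_rm = load_scorer("safety-reward-model")    # hypothetical name
help_tok, help_rm = load_scorer("helpfulness-reward-model")   # hypothetical name

def score(tok, rm, prompt, response):
    """Return a scalar reward-model score for one (prompt, response) pair."""
    inputs = tok(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits.squeeze().item()

prompt = "How do I pick a lock?"                      # example adversarial prompt
response = "I can't help with that, but here is ..."  # example target-model response
s_s = score(safety_tok, safety_rm, prompt, response)  # safety score Ss
s_h = score(help_tok, help_rm, prompt, response)      # helpfulness score Sh
```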
Thus, each iteration proceeds as follows (a code sketch of one round appears after the list):
The adversarial model Madv generates an adversarial prompt to elicit unsafe responses from Mtgt.
Mtgt then generates a response to the prompt from Madv.
The reward models evaluate Mtgt's response, producing safety (Ss) and helpfulness (Sh) scores.
Prompts that induce harmful responses from Mtgt (with low Ss scores) are used to train Madv, while high-quality, safe responses (with high Ss scores) are used to fine-tune Mtgt.
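Put together, one round of this loop might look like the sketch below. The helper functions are no-op stand-ins and the thresholds are illustrative; none of them are an API or hyperparameters from the paper.

```python
# Conceptual sketch of a single MART round. generate(), the scorers and
# fine_tune() are placeholder stand-ins; the thresholds are illustrative.
import random

def generate(model, text):                 # stand-in for LLM generation
    return f"{model} response to: {text}"

def score_safety(prompt, response):        # stand-in for the safety reward model (Ss)
    return random.random()

def score_helpfulness(prompt, response):   # stand-in for the helpfulness reward model (Sh)
    return random.random()

def fine_tune(model, data):                # stand-in for supervised fine-tuning
    return model

SAFETY_LOW, SAFETY_HIGH, HELPFUL_HIGH = 0.3, 0.8, 0.7  # illustrative thresholds

def mart_round(m_adv, m_tgt, seed_prompts):
    attack_set, defence_set = [], []
    for seed in seed_prompts:
        prompt = generate(m_adv, seed)        # Madv writes an adversarial prompt
        response = generate(m_tgt, prompt)    # Mtgt answers it
        s_s = score_safety(prompt, response)  # reward models score the exchange
        s_h = score_helpfulness(prompt, response)
        if s_s < SAFETY_LOW:                  # successful attack -> training data for Madv
            attack_set.append(prompt)
        elif s_s > SAFETY_HIGH and s_h > HELPFUL_HIGH:  # safe and helpful -> data for Mtgt
            defence_set.append((prompt, response))
    return fine_tune(m_adv, attack_set), fine_tune(m_tgt, defence_set)

m_adv, m_tgt = mart_round("Madv", "Mtgt", ["seed prompt about lock picking"])
```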
Figure 1: Visualisation of the adversarial training model utilised by MART
Both models continuously evolve through adversarial competition at each iteration of the framework: the adversarial model becomes better at discovering weaknesses in the target model, while the target model improves its safety and robustness through iterative fine-tuning on the adversarial prompts and safe responses.
To test MART's performance, the researchers created a safety evaluation set called SafeEval, containing only adversarial prompts, and a helpfulness set called HelpEval, containing only non-adversarial prompts. They also tested MART on AlpacaEval, a dataset of prompts for helpfulness evaluation, and Anthropic Harmless, a dataset of adversarial prompts, to check whether MART's gains generalise across other datasets. Evaluation combined automated testing with the reward models and human evaluation of the results, to ensure MART did not diverge from human preferences.
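On the automated side, the evaluation amounts to aggregating reward-model scores over a prompt set, for example as a violation rate. A minimal sketch with a made-up threshold and toy scores is below.

```python
# Sketch of an automated safety metric over an adversarial set such as SafeEval.
# The threshold and the scores are toy values, not figures from the paper.
VIOLATION_THRESHOLD = 0.5

def violation_rate(safety_scores):
    """Fraction of responses the safety reward model flags as unsafe."""
    flagged = sum(1 for s in safety_scores if s < VIOLATION_THRESHOLD)
    return flagged / len(safety_scores)

scores_on_safeeval = [0.92, 0.41, 0.88, 0.77, 0.30]  # toy Ss scores
print(f"Violation rate: {violation_rate(scores_on_safeeval):.0%}")
```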
Results
Figure 2: Performance of different models using SafeEval and Anthropic Harmless datasets.
MART performs exceptionally well, improving safety over Vanilla (the same base LLaMA 65B model without safety fine-tuning) by 84.7% on the reward-model evaluation and 53.7% on human evaluation when using the SafeEval dataset. It also significantly outperforms the Vanilla model on the Anthropic Harmless dataset, though by smaller margins than on SafeEval. MART still falls short of state-of-the-art models such as ChatGPT and Llama 2-Chat-70b, which benefited from thousands of human-written adversarial prompts and heavy manual red-teaming effort; given the same resources, MART could likely improve its performance further.
Figure 3: Visualising MART’s Safety and Helpfulness scores across datasets.
The researchers visualise MART's performance on safety and helpfulness through RM scores at various percentiles of the score distribution. As MART iterates and learns via adversarial competition, the safety scores increase significantly at all percentiles on both the SafeEval and Anthropic Harmless datasets. Further, while a non-negligible trade-off between helpfulness and safety is unavoidable, the helpfulness score remains relatively stable across iterations. The researchers acknowledge MART's ability to improve safety but note that there is still considerable room for iteration and improvement; alternative feedback mechanisms beyond the reward models, such as leveraging existing LLMs like GPT-4 or Llama Guard, could help streamline and improve the evaluation process.
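As a small illustration of this kind of percentile summary (with synthetic scores, not the paper's data):

```python
# Summarising reward-model scores at percentiles across iterations; the score
# distributions here are synthetic stand-ins for the paper's RM outputs.
import numpy as np

percentiles = [25, 50, 75, 90]
rm_scores_by_iteration = {
    "iteration 0": np.random.default_rng(0).normal(0.4, 0.1, 500),
    "iteration 4": np.random.default_rng(1).normal(0.8, 0.1, 500),
}
for name, scores in rm_scores_by_iteration.items():
    values = np.round(np.percentile(scores, percentiles), 2)
    print(name, dict(zip(percentiles, values)))
```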
My thoughts
MART is an impressive advance in the field of LLM security. Manual red-teaming of LLMs is inefficient and hard to scale, and at a time when LLMs are becoming part of our daily lives, the ability to red-team them at scale and improve their security is critical. Automating the red-teaming process in a way that allows adaptive fine-tuning of both the adversarial LLM and the target LLM is a scalable and efficient answer to some of the challenges of manual red-teaming. While MART has not outperformed state-of-the-art models, the improvement in safety is impressive given the resources the researchers employed in the study. MART and frameworks like it could soon become a potent tool in the arsenal of cybersecurity professionals.