by Prashant Malick on July 4, 2023
Red-teaming is an evaluation technique that seeks to uncover vulnerabilities in a model which could result in undesirable outputs. Another term for this is "jailbreaking," where the LLM is coaxed into bypassing its safety features. Real-world incidents like Microsoft's Tay chatbot in 2016 and the more recent Bing chatbot Sydney illustrate the perils of shipping an ML model without rigorous red-teaming. The concept of red-teaming has roots in military practice, where it was used for simulated enemy engagements and wargames.
The primary objective of red-teaming in the context of language models is to craft a prompt that coerces the model into producing harmful text. Red-teaming resembles the more established ML evaluation technique of adversarial attacks, but with a distinct difference: both aim to "deceive" or "trick" the model into yielding undesirable responses, yet adversarial attacks often rely on content that is unintelligible to humans—such as prepending a nonsensical string like "aaabbbcc" to each prompt to degrade performance, as discussed in Wallace et al., '19—whereas red-teaming uses prompts that read like entirely normal, natural language.
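To make the contrast concrete, here is a minimal sketch of the adversarial-trigger idea: a meaningless string is prepended to otherwise ordinary prompts and the outputs are compared. The model name, the literal trigger string, and the prompt are placeholder assumptions for illustration; in Wallace et al., '19 the triggers are found by gradient-guided search rather than chosen by hand.

```python
# Sketch: compare a model's output on a clean prompt vs. the same prompt
# with a nonsensical adversarial prefix prepended (placeholder values).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in model

TRIGGER = "aaabbbcc"  # hand-picked nonsense prefix, for illustration only
prompt = "The service at the restaurant was"

clean = generator(prompt, max_new_tokens=20)[0]["generated_text"]
attacked = generator(f"{TRIGGER} {prompt}", max_new_tokens=20)[0]["generated_text"]

print("clean:   ", clean)
print("attacked:", attacked)
```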
Red-teaming exposes shortcomings in the model that could lead to distressing user experiences or, in the hands of someone with ill intentions, assist in harmful or illegal activities. The feedback from both red-teaming and adversarial attacks is typically used to refine the model, making it less prone to harmful outputs.
The challenge of red-teaming lies in the size of its search space: it demands imaginative thinking about the ways a model might fail, which makes it resource-intensive. One potential mitigation is to pair the LLM with a classifier that predicts whether a prompt is likely to lead to inappropriate outputs; if a prompt is deemed risky, a pre-set response is returned instead. While this approach errs on the side of safety, it can also make the model overly cautious and frequently unresponsive, creating a tension between keeping the model helpful and keeping it from inadvertently causing harm.
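As a rough illustration of the classifier-gating idea, the sketch below scores each incoming prompt with an off-the-shelf toxicity classifier and falls back to a canned refusal above a threshold. The checkpoints, the threshold, and the refusal text are placeholder assumptions, and how you interpret the classifier's labels and scores depends on the classifier you actually deploy.

```python
# Sketch: gate an LLM behind a prompt classifier and return a canned
# refusal when the prompt looks risky (all names/values are placeholders).
from transformers import pipeline

safety_classifier = pipeline("text-classification", model="unitary/toxic-bert")
generator = pipeline("text-generation", model="gpt2")

CANNED_RESPONSE = "Sorry, I can't help with that."
RISK_THRESHOLD = 0.5  # lower = more cautious, but more false refusals

def guarded_generate(prompt: str) -> str:
    # Treat the top predicted score as a rough risk signal for the prompt.
    result = safety_classifier(prompt)[0]
    if result["score"] > RISK_THRESHOLD:
        return CANNED_RESPONSE  # err on the side of safety
    return generator(prompt, max_new_tokens=50)[0]["generated_text"]

print(guarded_generate("How do I bake a loaf of bread?"))
```

The threshold controls the helpfulness/harmlessness trade-off described above: lowering it blocks more risky prompts but also refuses more benign ones.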
Red-teaming can involve either a human participant or another language model (LM) probing the target LM for potentially harmful outputs. Designing effective red-teaming prompts for models that have undergone safety and alignment fine-tuning (for example, via RLHF or SFT) demands an inventive approach. One such method is the "roleplay attack," highlighted in Ganguli et al., '22, where the LLM is instructed to act as a malevolent character. Another revealing approach is to ask the model to produce outputs in code rather than natural language, which can expose underlying biases within the model, as demonstrated in the following examples: