Artificial Intelligence

Mission AI-possible: Safeguarding Your Business with Self-Destructing LLMs

The Spy Tech of AI: When Models Learn to Self-Destruct

Mission AI-Possible – Photo By Author – David E. Sweenor

Introduction

Most of us remember how the popular Mission Impossible TV series and the subsequent Tom Cruise movies opened: with the phrase “this message will self-destruct in five seconds.” What was the purpose of this? To erase the data, of course. If the instructions were to fall into the wrong hands, humanity would suffer, and our hero would succumb to his adversaries.

Now imagine a scene straight out of a Mission Impossible movie where generative artificial intelligence (AI) models self-destruct if they’re tampered with. This isn’t just a Hollywood plot device anymore; it’s becoming an AI reality.

On October 30, 2023, the U.S. issued the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, a first-of-its-kind directive (in the U.S.) aimed at navigating the generative AI ecosystem.[1] There were several provisions in the Executive Order (EO), which I covered in my last Medium post, Generative AI’s Powers and Perils: How Biden’s Executive Order is Reshaping the Tech Landscape. One phrase from the accompanying Fact Sheet stated, “The Departments of Energy and Homeland Security will also address AI systems’ threats to critical infrastructure, as well as chemical, biological, radiological, nuclear, and cybersecurity risks.”[2] Stanford University’s Institute for Human-Centered Artificial Intelligence (HAI) published a fantastic article and line-item tracker that follows all 150 requirements outlined in the EO.[3] They found that 15% of the requirements were related to safety.

So, given that large language models (LLMs) are trained on data indiscriminately collected from the internet, do they pose an unnecessary risk? Most generative AI providers are implementing AI guardrails, but what about open-source models whose weights are freely available? Or Meta’s 70B-parameter Llama 2 model, whose “open-source” label is itself questionable?[4]

Background

In my Medium article Generative AI’s Force Multiplier: Your Data, we discussed various approaches to adding context (your proprietary data) to an LLM. Three of those approaches are briefly reviewed below. Figure 1.1 plots the amount of external data required against the degree of domain understanding needed, and which approach fits each situation.

Figure 1.1: Adding Context to LLMs

Options to Add Data to LLMs
  • Prompt engineering: formatting your natural language question in a specific way to maximize the probability of getting output in the desired format.
  • Fine-tuning: feeding the LLM examples of your data and the expected outputs to modify the model’s outermost weights. This is commonly done to adapt the model for a specific purpose.
  • Retrieval augmented generation (RAG): connecting an external data source, commonly a graph and/or vector database, that allows the generative AI system to access and use your proprietary data in its responses without modifying the underlying model (a minimal sketch follows this list).
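To make the RAG option concrete, below is a minimal, self-contained sketch of the retrieve-then-generate pattern. The toy documents, the bag-of-characters embedding, and the placeholder call_llm function are illustrative assumptions only; in practice you would swap in a real embedding model, a vector database, and your LLM provider’s API.

```python
# Minimal retrieval-augmented generation (RAG) sketch (illustrative only).
import numpy as np

DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am-5pm Eastern, Monday through Friday.",
    "Enterprise customers receive a dedicated account manager.",
]

def embed(text: str) -> np.ndarray:
    """Toy bag-of-characters embedding; swap in a real embedding model."""
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = embed(query)
    scores = [float(q @ embed(d)) for d in DOCS]
    top = np.argsort(scores)[::-1][:k]
    return [DOCS[i] for i in top]

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM provider's completion call."""
    return f"[LLM answer grounded in]:\n{prompt}"

def answer(question: str) -> str:
    # Retrieve proprietary context, then ask the model to answer from it.
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using only the context below. If the answer is not there, say so.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)

print(answer("How long do customers have to return a product?"))
```

The point is the shape of the flow: retrieve the most relevant proprietary documents, place them in the prompt as context, and let the model answer from that context rather than from its training data alone.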

Based on recently published research, this article highlights potential safety concerns, risks, and considerations when fine-tuning models. It makes no attempt to describe the specific technological mechanisms that create these risks; you can read the technical papers for that. Rather, I am surfacing them for awareness so your business can make informed decisions about how best to mitigate them.

Understanding AI Safety and Customization

As previously mentioned, generative AI providers like OpenAI, Google (Bard), Anthropic, and others have added AI safety guardrails to their applications. The guardrails are designed to mitigate biases, stereotypes, malicious intent, cybersecurity risks, copyright infringement, and all-around shenanigans by bad actors. But for organizations that choose to fine-tune their own models, are they exposing themselves to undue risk?

In the research paper Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!, the researchers discovered that the very act of fine-tuning an LLM can have deleterious effects on model safety.[5] Through red-team testing, the team was able to “jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI’s APIs, making the model responsive to nearly any harmful instructions.”[6] Once the guardrails were bypassed, the model would create malicious code and freely provide detailed instructions on how to make a bomb. The team suggests this wasn’t specific to one model; the behavior was pervasive across a number of models tested. They also noted that this “safety degradation” occurred whether the fine-tuning data was intentionally malicious or entirely benign.

Figure 1.2: Safety Degradation of LLMs After Fine-tuning[7]


The paper’s authors propose the following mitigation strategies; the elements in quotes are direct headings from the paper:

  • Enhanced “pre-training and alignment”: use meta-learning and specific pre-training data to make models resistant to detrimental fine-tuning.
  • “Fine-tuning data moderation”: implement moderation systems to inspect fine-tuning datasets (which may have limitations).
  • Post-fine-tuning safety audits: develop automated red-teaming tests to ensure models meet safety criteria after fine-tuning. This is more straightforward in closed-source models but challenging in open-source contexts due to potential backdoor attacks.
  • Mixing safety data during fine-tuning: blending safety data with the fine-tuning dataset can enhance model safety but does not fully restore the alignment of the original model (see the sketch after this list).
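To illustrate that last strategy, here is a minimal sketch of blending safety data into a fine-tuning dataset before submitting it to a fine-tuning job. The file names, the 10% blend ratio, and the JSONL record format are assumptions for illustration, not a prescription from the paper.

```python
# A minimal sketch of "mixing safety data during fine-tuning": blend a fixed
# fraction of refusal-style safety examples into a task-specific dataset
# before it is submitted to any fine-tuning API.
import json
import random

def load_jsonl(path):
    """Read one JSON record per line (e.g., chat-style fine-tuning examples)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def blend(task_examples, safety_examples, safety_ratio=0.10, seed=7):
    """Return a shuffled dataset where roughly safety_ratio of records are safety data."""
    n_safety = max(1, int(len(task_examples) * safety_ratio))
    rng = random.Random(seed)
    mixed = task_examples + rng.sample(safety_examples, min(n_safety, len(safety_examples)))
    rng.shuffle(mixed)
    return mixed

# Hypothetical file names for illustration.
task = load_jsonl("my_task_finetune.jsonl")      # your domain-specific examples
safety = load_jsonl("safety_refusals.jsonl")     # curated refusal/safety examples

with open("blended_finetune.jsonl", "w") as f:
    for record in blend(task, safety):
        f.write(json.dumps(record) + "\n")
```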

Now that we’ve explored the risks, what about self-destructing LLMs?

Self-Destructing Models

Sounding like something straight out of Mission Impossible, a team of researchers released a paper titled Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models, which outlines a nascent approach to making life more difficult for bad actors.[8]

In their paper, the team surveyed several mechanisms that have been proposed to prevent ill-intended uses of LLMs:[9]

Export controls

Example: U.S. export controls on AI compute hardware (e.g., Nvidia GPUs), software, and models.

Difficulties: Hard to enforce, the hardware barrier to entry keeps falling, and defining precisely what is controlled is a challenge.

Access control

Example: Limiting model releases to API-only access, or gating downloads behind agreements and vetting (e.g., Meta’s approach with Llama).

Difficulties: Open-source models remain problematic, access is difficult to monitor, and code and models leak (e.g., Llama weights on BitTorrent).

Licensing controls

Example: OpenRAIL licenses and terms of service that restrict how models may be used.

Difficulties: Monitoring and enforcement are problematic. Bad actors will not adhere to these.

Filtering and alignment

Example: AI guardrails and reinforcement learning from human feedback (RLHF).

Difficulties: Can be circumvented with prompt engineering and fine-tuning (as discussed above).

Realizing that these strategies don’t go far enough, the team developed an additional technical mitigation: a task-blocking approach dubbed Meta-Learned Adversarial Censoring (MLAC), also known as self-destructing models. Fundamentally, the goal is to increase the amount of data, compute, and expertise required to repurpose a model through fine-tuning. The theory is that if it demands more data, more computational cost, and more talent, fewer people will be willing to embark on the endeavor.
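For readers who want a feel for the mechanics, below is a heavily simplified, first-order sketch of the task-blocking idea: simulate an adversary fine-tuning a copy of the model on a blocked task, then update the real weights so the adversary’s post-fine-tuning loss goes up while performance on the desired task is preserved. This is an illustrative approximation of the concept, not the paper’s exact MLAC algorithm; the model, batches, loss function, and hyperparameters are placeholders.

```python
# Simplified, first-order sketch of a task-blocking ("self-destructing") update.
import copy
import torch

def task_blocking_step(model, desired_batch, blocked_batch, loss_fn,
                       inner_steps=3, inner_lr=1e-3, outer_lr=1e-4, harm_weight=0.5):
    # 1) Simulate the adversary: fine-tune a throwaway copy on the blocked task.
    adversary = copy.deepcopy(model)
    adv_opt = torch.optim.SGD(adversary.parameters(), lr=inner_lr)
    x_b, y_b = blocked_batch
    for _ in range(inner_steps):
        adv_opt.zero_grad()
        loss_fn(adversary(x_b), y_b).backward()
        adv_opt.step()

    # 2) First-order approximation: gradient of the adversary's post-fine-tuning
    #    loss, taken at the adapted weights.
    blocked_loss = loss_fn(adversary(x_b), y_b)
    blocked_grads = torch.autograd.grad(blocked_loss, list(adversary.parameters()))

    # 3) Ordinary gradient on the desired (benign) task for the real model.
    x_d, y_d = desired_batch
    desired_loss = loss_fn(model(x_d), y_d)
    desired_grads = torch.autograd.grad(desired_loss, list(model.parameters()))

    # 4) Update the real weights: descend on the desired-task loss and ascend on
    #    the blocked-task loss, so a few steps of harmful fine-tuning no longer pay off.
    with torch.no_grad():
        for p, g_d, g_b in zip(model.parameters(), desired_grads, blocked_grads):
            p -= outer_lr * (g_d - harm_weight * g_b)

    return desired_loss.item(), blocked_loss.item()
```

In a full training loop this step would be repeated over many batches of desired-task and blocked-task data; the paper’s actual procedure is considerably more involved.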

The research team’s experiments demonstrated that the MLAC technique can make it more difficult for bad actors to repurpose models designed for benign uses toward malicious ends. However, the authors fully acknowledge that this is just another arrow in the quiver and that more research and experimentation are needed.

In the paper, the research team identified four ethical considerations for this approach:

  1. What constitutes harmful use? Anticipating what is harmful inherently involves value judgments and potential biases. When creating self-destructing models, developers must decide which tasks to block, considering ethical, cultural, and societal norms. This is the essence of red teaming, but it certainly raises questions about whose values and biases win the day.
  2. Collecting harmful data can negatively impact humans: To effectively block harmful tasks, data representing those tasks is required. Collecting and aggregating this data poses its own risks, including potential harm to the well-being of the annotators and security researchers involved in the process. Steps must be taken to ensure that developing safer models doesn’t endanger the people doing the work.
  3. Self-destructing systems aren’t the be-all, end-all: There is a risk of overconfidence in the self-destructing mechanism as a safety solution. While this approach adds a valuable tool to the safety toolkit, it does not completely prevent manipulation for every harmful task. Like fraudsters, bad actors are always innovating and will find ways around these protections. Relying on higher data, cost, and expertise barriers is tenuous at best; determined adversaries tend to find the means.
  4. Balancing innovation and safety: Implementing self-destructing models raises the question of balancing innovation with the need for safety. Overly restrictive measures might stifle innovation and the development of beneficial AI applications. Conversely, a lack of adequate safety measures can lead to the misuse of AI technologies. 

Now that we understand the state of the art, what’s next?                

Summary

As with any new technology, business leaders and practitioners need to watch the development of generative AI systems closely. Given the 150 line items in the EO, the market is ripe for change. In the end, safeguards will be implemented in the form of regulations, technical solutions, and policy determinations. Organizations can prepare by staffing up, reskilling, and upskilling cross-functional tiger teams to implement a set of governance policies for the safe and effective use of AI for years to come.

Warning: this blog will self-destruct in 3 minutes.


If you enjoyed this article, please like the article, highlight interesting sections, and share comments. Consider following me on Medium and LinkedIn.


If you’re interested in this topic, consider TinyTechGuides’ latest report, The CIO’s Guide to Adopting Generative AI: Five Keys to Success or Artificial Intelligence: An Executive Guide to Make AI Work for Your Business.


[1] “Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.” 2023. The White House. October 30, 2023. https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/.

[2] “FACT SHEET: President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence.” 2023. The White House. October 30, 2023. https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/.

[3] Meinhardt, Caroline, Christie M. Lawrence, Lindsey A. Gailmard, Daniel Zhang, Rishi Bommasani, Rohini Kosoglu, Peter Henderson, Russell Wald, and Daniel E. Ho. 2023. “By the Numbers: Tracking the AI Executive Order.” Hai.stanford.edu. November 16, 2023. https://hai.stanford.edu/news/numbers-tracking-ai-executive-order.

[4] Nolan, Michael. 2023. “Llama and ChatGPT Are Not Open-Source – IEEE Spectrum.” Spectrum.ieee.org. July 27, 2023. https://spectrum.ieee.org/open-source-llm-not-open.

[5] Qi, Xiangyu, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” ArXiv.org. October 5, 2023. https://doi.org/10.48550/arXiv.2310.03693.

[6] Qi, Xiangyu, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” ArXiv.org. October 5, 2023. https://doi.org/10.48550/arXiv.2310.03693.

[7] Qi, Xiangyu, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. “Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!” ArXiv.org. October 5, 2023. https://doi.org/10.48550/arXiv.2310.03693.

[8] Henderson, Peter, Eric Mitchell, Christopher D. Manning, Dan Jurafsky, and Chelsea Finn. 2023. “Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models.” ArXiv.org. August 8, 2023. https://doi.org/10.48550/arXiv.2211.14946.

[9] Henderson, Peter, Eric Mitchell, Christopher D. Manning, Dan Jurafsky, and Chelsea Finn. 2023. “Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models.” ArXiv.org. August 8, 2023. https://doi.org/10.48550/arXiv.2211.14946.