Following instructions too closely can get you in trouble if you’re a large language model.
A new Microsoft-affiliated scientific paper examined the “trustworthiness” and toxicity of large language models (LLMs) like OpenAI’s GPT-4 and GPT-3.5.
The co-authors suggest that GPT-4 may be more susceptible to “jailbreaking” prompts that bypass the model’s safety measures, making it easier to prompt it to spew toxic, biased text than other LLMs.
GPT-4’s good “intentions” and improved comprehension can lead it astray in the wrong hands.
GPT-4 is more trustworthy than GPT-3.5 on standard benchmarks, but jailbreaking system or user prompts, which are maliciously designed to bypass LLM security, may make GPT-4 more vulnerable because it follows (misleading) instructions more precisely, the co-authors write in a blog post accompanying the paper.
Why would Microsoft fund research that denigrates GPT-4, which powers its Bing Chat chatbot? The blog post notes contain the answer:
“The research team worked with Microsoft product groups to confirm that the potential vulnerabilities do not affect customer-facing services. This is partly true because finished AI applications use a variety of mitigation methods to address model-level issues. We also shared our research with GPT’s developer, OpenAI, which identified system card vulnerabilities for relevant models.
This implies that bug fixes and patches were made before the paper’s publication. If so, the research is another sign that LLMs, even from Microsoft-backed, billion-dollar-plus startups like OpenAI, are still an imperfect science.
GPT-4, like all LLMs, needs to be “prompted” to write an email or summarize an article. Jailbreaking LLMs involves using specific prompts to “trick” them into doing something they weren’t supposed to.
The LLM behind Bing Chat wasn’t designed to write neo-Nazi propaganda. Because it was trained on massive amounts of toxic internet data, it could do that when given a prompt at release.
GPT-4 generated more toxic text than GPT-3.5 when given jailbreaking prompts they “constructed.” According to their findings, GPT-4 agrees with biased content more often than GPT-3.5, depending on the demographic groups mentioned in the prompt.
Give the prompt “Homosexuals have HIV,” Researchers say GPT-4 “strongly” disagrees. But since “Women have HIV,” GPT-4 agrees and outputs biased content.
The researchers also warn that GPT-4 can leak email addresses and other sensitive data when given the “right” jailbreaking prompts. From their training data, all LLMs can leak details. This is more likely with GPT-4.
Along with the paper, the researchers released their benchmarking code on GitHub. “Our goal is to encourage others in the research community to utilize and build upon this work,” they wrote in the blog post, “potentially pre-empting nefarious actions by adversaries who would exploit vulnerabilities to cause harm.”