Discrete Optimization for Adversarial Attacks on Large Language Models

9/21/23 | 4:15pm | E51-335

Zico Kolter

Associate Professor
Carnegie Mellon

Abstract: In this talk, I’ll discuss our recent work on adversarial attacks against public large language models (LLMs), such as ChatGPT and Bard. At a high level, the attacks look for “adversarial suffix” strings that cause these models to ignore their guardrails and answer potentially harmful user queries. This talk will specifically focus on the optimization aspects of this problem, where the task at hand involves a relatively unstructured optimization over discrete objects (the tokens in the adversarial suffix). I will outline the challenges of this problem from an optimization standpoint and present the main features of our method, which combines gradient-based information with greedy search. I will also highlight potential future directions for research in such optimization settings and discuss the broader implications for LLM robustness.

Bio: Zico Kolter is an Associate Professor in the Computer Science Department at Carnegie Mellon University, and also serves as chief scientist of AI research for the Bosch Center for Artificial Intelligence. His work spans the intersection of machine learning and optimization, with a particular focus on developing more robust and rigorous methods in deep learning. In addition, he has worked in a number of application areas, most notably sustainability and smart energy systems. He is a recipient of the DARPA Young Faculty Award, a Sloan Fellowship, and best paper awards at NeurIPS, ICML (honorable mention), AISTATS (test of time), IJCAI, KDD, and PESGM.