Abstract

Despite efforts to align large language models (LLMs) with human intentions, popular LLMs such as GPT, Llama, Claude, and Gemini remain susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. For this reason, interest has grown in improving the robustness of LLMs against such attacks. In this talk, we review the current state of the jailbreaking literature, including open questions about robust generalization, new black-box attacks on LLMs, defenses against jailbreaking attacks, and a new leaderboard for evaluating the robust generalization of production LLMs.

Video Recording