Abstract
Foundation models have garnered widespread attention due to their impressive performance across a range of applications. However, our understanding of the trustworthiness and risks of these models remains limited. The temptation to deploy proficient foundation models in sensitive domains like healthcare and finance, where errors carry significant consequences, underscores the need for rigorous safety evaluation, enhancement, and guarantees. Recognizing the urgent need to develop safe and beneficial AI, our recent research seeks to design a unified platform to evaluate the safety of LLMs and multimodal foundation models (MMFMs) from both a regulatory-compliance risk-assessment perspective and a use-case-driven perspective covering toxicity, stereotype bias, adversarial robustness, out-of-distribution (OOD) robustness, ethics, privacy, and fairness. Building on our understanding of these risks, we have focused on enhancing the safety of foundation models through knowledge integration and on providing safety guardrails and certifications. In this talk, I will first outline our foundational principles for safety evaluation, detail our red-teaming tactics, and share insights gleaned from evaluating foundation models, including closed-source, open-source, and compressed models, on our DecodingTrust, AIR, and MMDT platforms for LLMs and MMFMs. I will then delve into our work on enhancing model safety, such as hallucination mitigation. I will also explain how knowledge integration helps align models and show that the retrieval-augmented generation (RAG) framework achieves provably lower conformal generation risks than vanilla LLMs. Finally, I will briefly discuss our efficient and resilient guardrail framework for risk mitigation in practice.