Models that prove their own correctness

Workshop

Alignment, Trust, Watermarking, and Copyright Issues in LLMs

Speaker(s)

Orr Paradise (UC Berkeley)

Location

Calvin Lab Auditorium

Date

Tuesday, Oct. 15, 2024

Time

11:45 a.m. – 12:30 p.m. PT

Abstract

This talk introduces Self-Proving models, a new class of models that formally prove the correctness of their outputs via an Interactive Proof system. After reviewing some related literature, I will formally define Self-Proving models and their per-input (worst-case) guarantees. I will then present algorithms for learning these models and explain how the complexity of the proof system affects the complexity of the learning algorithms. Finally, I will show experiments where Self-Proving models are trained to compute the Greatest Common Divisor of two integers, and to prove the correctness of their results to a simple verifier. Joint work with Noga Amit, Shafi Goldwasser, and Guy N. Rothblum.
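
As a rough illustration of what it means to prove a GCD result to a simple verifier, here is a minimal sketch. It rests on assumptions not stated in the abstract: the certificate is a pair of Bézout coefficients, the proof is a single message rather than a multi-round interaction, and a classical extended Euclidean algorithm stands in for the learned model. The names `extended_gcd` and `verify` are hypothetical.

```python
def extended_gcd(a: int, b: int) -> tuple[int, int, int]:
    """Stand-in 'prover' (an assumption, not a learned model).

    For positive integers a and b, returns (g, s, t) such that
    s*a + t*b == g and g == gcd(a, b).
    """
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t


def verify(a: int, b: int, g: int, s: int, t: int) -> bool:
    """Simple verifier: accepts only if g is the GCD of a and b.

    If g divides both a and b, then g is a common divisor.
    If also s*a + t*b == g, every common divisor of a and b divides g,
    so g is the greatest one. The check uses only a few arithmetic
    operations and never runs Euclid's algorithm itself.
    """
    if g <= 0:
        return False
    return a % g == 0 and b % g == 0 and s * a + t * b == g


if __name__ == "__main__":
    a, b = 48, 18
    g, s, t = extended_gcd(a, b)
    print(g, s, t, verify(a, b, g, s, t))
    # A wrong claim is rejected whatever coefficients accompany it,
    # because s*48 + t*18 is always a multiple of 6.
    print(verify(a, b, 3, 1, -2))
```

The point of the sketch is the asymmetry: the verifier is far simpler than the computation it checks, and it accepts a claimed output only when that specific output is correct, which is the per-input flavor of guarantee the abstract describes.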

Attachment

LLM24-2 Slides - Orr Paradise.pdf