Language models are injective and hence invertible
Best AI papers explained - A podcast by Enoch H. Kang
The academic paper argues that decoder-only Transformer language models, such as GPTs, are almost surely injective: distinct input prompts map to distinct internal hidden states, so no input information is lost. This contrasts with the common assumption that non-linear components make these models lossy. The authors prove mathematically that injectivity is a structural property established at initialization and preserved by standard training procedures such as gradient descent. To exploit this finding, the paper introduces SIPIT (Sequential Inverse Prompt via ITerative updates), an algorithm that exactly reconstructs the original input text from the model's hidden activations, achieving 100% accuracy in linear time in empirical tests on state-of-the-art models. Ultimately, the work establishes invertibility as a foundational and exploitable property of these models, with implications for interpretability and safety.
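To make the inversion idea concrete, below is a minimal sketch of sequential prompt recovery in the spirit of SIPIT, assuming access to the model's last-layer hidden states. The model choice (gpt2), the layer used, and the naive per-position vocabulary scan are illustrative assumptions, not the paper's exact procedure. The key property it relies on is causality: the hidden state at position t depends only on tokens up to t, so tokens can be recovered one at a time.

```python
# Sketch of sequential prompt inversion in the spirit of SIPIT.
# Assumptions (not from the paper): model = gpt2, last-layer states,
# brute-force candidate scan instead of the paper's iterative updates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works for the demo
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def hidden_states(ids):
    """Last-layer hidden states for a 1 x T tensor of token ids."""
    out = model(ids, output_hidden_states=True)
    return out.hidden_states[-1][0]  # shape (T, d_model)

@torch.no_grad()
def invert(target_h, vocab_size=None):
    """Recover token ids whose hidden states match target_h (T, d_model).
    If the map is injective, exactly one candidate matches at each step."""
    vocab_size = vocab_size or model.config.vocab_size
    recovered = []
    for t in range(target_h.shape[0]):
        best_id, best_dist = None, float("inf")
        # Naive scan over the vocabulary; very slow, purely illustrative.
        for v in range(vocab_size):
            ids = torch.tensor([recovered + [v]])
            # Causal masking means the state at position t from this prefix
            # equals the state at position t from the full sequence.
            h_t = hidden_states(ids)[t]
            dist = torch.norm(h_t - target_h[t]).item()
            if dist < best_dist:
                best_id, best_dist = v, dist
        recovered.append(best_id)
    return recovered

prompt_ids = tok("hello world", return_tensors="pt").input_ids
h = hidden_states(prompt_ids)
print(tok.decode(invert(h)))  # should reproduce "hello world"
```

The scan over all ~50k tokens per position makes this sketch impractical at scale (SIPIT's iterative updates avoid it), but it illustrates why injectivity suffices for exact recovery: each observed hidden state has a unique matching token given the already-recovered prefix.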
