CM3: A Causal Masked Multimodal Model of the Internet (Paper Explained w/ Author Interview)
Yannic Kilcher Videos (Audio Only) - A podcast by Yannic Kilcher
 
   Categories:
#cm3 #languagemodel #transformer This video contains a paper explanation and an incredibly informative interview with first author Armen Aghajanyan. Autoregressive Transformers have come to dominate many fields in Machine Learning, from text generation to image creation and many more. However, there are two problems. First, the collected data is usually scraped from the web and uni- or bi-modal and throws away a lot of structure of the original websites, and second, language modelling losses are uni-directional. CM3 addresses both problems: It directly operates on HTML and includes text, hyperlinks, and even images (via VQGAN tokenization) and can therefore be used in plenty of ways: Text generation, captioning, image creation, entity linking, and much more. It also introduces a new training strategy called Causally Masked Language Modelling, which brings a level of bi-directionality into autoregressive language modelling. In the interview after the paper explanation, Armen and I go deep into the how and why of these giant models, we go over the stunning results and we make sense of what they mean for the future of universal models. OUTLINE: 0:00 - Intro & Overview 6:30 - Directly learning the structure of HTML 12:30 - Causally Masked Language Modelling 18:50 - A short look at how to use this model 23:20 - Start of interview 25:30 - Feeding language models with HTML 29:45 - How to get bi-directionality into decoder-only Transformers? 37:00 - Images are just tokens 41:15 - How does one train such giant models? 45:40 - CM3 results are amazing 58:20 - Large-scale dataset collection and content filtering 1:04:40 - More experimental results 1:12:15 - Why don't we use raw HTML? 1:18:20 - Does this paper contain too many things? Paper: https://arxiv.org/abs/2201.07520 Abstract: We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of their original positions. The casual masking object provides a type of hybrid of the more common causal and masked language models, by enabling full generative modeling while also providing bidirectional context when generating the masked spans. We train causally masked language-image models on large-scale web and Wikipedia articles, where each document contains all of the text, hypertext markup, hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they appear in the original HTML source (before masking). The resulting CM3 models can generate rich structured, multi-modal outputs while conditioning on arbitrary masked document contexts, and thereby implicitly learn a wide range of text, image, and cross modal tasks. They can be prompted to recover, in a zero-shot fashion, the functionality of models such as DALL-E, GENRE, and HTLM. We set the new state-of-the-art in zero-shot summarization, entity linking, and entity disambiguation while maintaining competitive performance in the fine-tuning setting. We can generate images unconditionally, conditioned on text (like DALL-E) and do captioning all in a zero-shot setting with a single model. Authors: Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer
