Attention Is All You Need
Introduces the Transformer, an architecture built entirely on attention with no recurrence or convolutions; it has become the foundation of most modern LLMs (the core attention operation is sketched below).
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
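For reference, a minimal NumPy sketch of the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V; the variable names and toy shapes are illustrative, not from any reference implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)        # (..., seq_q, seq_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # (..., seq_q, d_v)

# Toy usage: 4 positions, model dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)        # (4, 8)
```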
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Presents BERT, a bidirectional Transformer encoder pretrained with masked language modeling and next-sentence prediction, which set new state-of-the-art results across a wide range of NLP benchmarks (the masking scheme is sketched below).
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
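BERT's masked-language-modeling objective corrupts 15% of input tokens (80% become [MASK], 10% become a random token, 10% are left unchanged) and trains the model to recover the originals. A toy sketch of that masking policy, with made-up token IDs:

```python
import random

MASK_ID = 103            # illustrative [MASK] token id
VOCAB_SIZE = 30522       # size of BERT's WordPiece vocabulary

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted_ids, labels) following BERT's 15% / 80-10-10 masking rule."""
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)          # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                   # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted[i] = MASK_ID                        # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.randrange(VOCAB_SIZE)   # 10%: random token
            # remaining 10%: keep the original token unchanged
    return corrupted, labels

print(mask_tokens([2023, 2003, 1037, 7099, 6251]))  # ids are illustrative
```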
Language Models are Few-Shot Learners
Introduces GPT-3 (175B parameters) and shows that scaling language models yields strong zero- and few-shot performance from in-context examples alone, without gradient updates (a few-shot prompt is sketched below).
Authors: Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
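The paper's "few-shot" setting simply conditions the model on a handful of in-context demonstrations. A minimal sketch of assembling such a prompt; the translation task mirrors the paper's examples, but the helper itself is illustrative:

```python
def build_few_shot_prompt(examples, query, instruction="Translate English to French."):
    """Concatenate k in-context demonstrations followed by the new query."""
    lines = [instruction, ""]
    for src, tgt in examples:
        lines += [f"English: {src}", f"French: {tgt}", ""]
    lines += [f"English: {query}", "French:"]     # the model completes from here
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
print(build_few_shot_prompt(demos, "peppermint"))
```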
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Proposes T5, which casts every NLP task as mapping an input text to an output text, so one model, objective, and training procedure can serve many tasks (example task prefixes are sketched below).
Authors: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
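In the text-to-text framing, the task is encoded in the input string itself and the model always emits target text. A small sketch using task prefixes of the kind shown in the paper; the helper function is illustrative:

```python
# In the text-to-text framing, task identity lives in the input string itself.
def to_text_to_text(task, **fields):
    if task == "translation":
        return f"translate English to German: {fields['text']}"
    if task == "summarization":
        return f"summarize: {fields['text']}"
    if task == "mnli":   # natural language inference
        return f"mnli premise: {fields['premise']} hypothesis: {fields['hypothesis']}"
    raise ValueError(f"unknown task: {task}")

print(to_text_to_text("translation", text="That is good."))
# -> "translate English to German: That is good."   (expected target: "Das ist gut.")
```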
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Shows that BERT was significantly undertrained and that longer training with more data, larger batches, dynamic masking, and no next-sentence prediction objective yields state-of-the-art results.
Authors: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov
LLaMA: Open and Efficient Foundation Language Models
Introduces LLaMA, a family of open and efficient foundation language models from 7B to 65B parameters, trained exclusively on publicly available data.
Authors: Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Presents BLOOM, a large multilingual language model with 176 billion parameters.
Authors: BigScience Workshop
PaLM: Scaling Language Modeling with Pathways
Introduces PaLM, a 540-billion-parameter language model trained with the Pathways system, demonstrating continued gains from scale on few-shot benchmarks.
Authors: Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al.
Training Compute-Optimal Large Language Models
Shows that compute-optimal training scales model size and training tokens in roughly equal proportion, and introduces Chinchilla, a 70B-parameter model trained on 1.4T tokens that outperforms much larger models such as Gopher and GPT-3 (a back-of-the-envelope version of the recipe is sketched below).
Authors: Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al.
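The headline recipe works out to roughly 20 training tokens per parameter at the compute budgets studied. A back-of-the-envelope sketch using the common C ≈ 6·N·D approximation for training FLOPs; the 20:1 ratio is the widely quoted rule of thumb, not an exact constant from the paper's fits:

```python
def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Split a training FLOPs budget into (params, tokens) using C ~ 6*N*D and D ~ 20*N."""
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's budget (~5.9e23 FLOPs) -> ~70B params, ~1.4T tokens
n, d = compute_optimal(5.88e23)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.2f}T tokens")
```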
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Introduces FlashAttention, an IO-aware algorithm that computes exact attention in tiles so the full attention matrix never has to be written to slow GPU memory, reducing memory use and wall-clock time (the underlying online-softmax idea is sketched below).
Authors: Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
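The core trick is computing exact softmax attention block by block while carrying running softmax statistics, so the N×N score matrix is never materialized. A simplified single-query NumPy sketch of that online-softmax accumulation; the real kernel also tiles over queries and manages GPU SRAM explicitly:

```python
import numpy as np

def online_softmax_attention(q, K, V, block=2):
    """Exact softmax(q @ K.T / sqrt(d)) @ V for one query, visiting K/V in blocks."""
    d = q.shape[-1]
    m = -np.inf                      # running max of scores seen so far
    l = 0.0                          # running sum of exp(score - m)
    acc = np.zeros(V.shape[-1])      # running (unnormalized) output
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q / np.sqrt(d)               # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                # rescale previous accumulators
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l

# Sanity check against the naive computation
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(6, 8)), rng.normal(size=(6, 4))
s = K @ q / np.sqrt(8)
naive = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(online_softmax_attention(q, K, V), naive)
```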
LoRA: Low-Rank Adaptation of Large Language Models
Presents LoRA, which freezes the pretrained weights and injects trainable low-rank update matrices into each layer, cutting the number of trainable parameters for fine-tuning by orders of magnitude (sketched below).
Authors: Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
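LoRA keeps the pretrained weight W frozen and learns a low-rank update ΔW = B·A, so the adapted layer computes W·x + (α/r)·B·A·x with only A and B trainable. A minimal PyTorch sketch; the class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                          # freeze pretrained W (and bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
x = torch.randn(4, 768)
print(layer(x).shape)                                                  # torch.Size([4, 768])
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))   # 12288: only A and B train
```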
The Llama 3 Herd of Models
Presents the Llama 3 family of foundation models, including a flagship 405B-parameter model, with support for multilinguality, coding, reasoning, and tool use.
Authors: Llama Team, AI @ Meta