Original author(s) | Google AI |
---|---|
Initial release | October 31, 2018 |
Repository | https://github.com/google-research/bert |
Type | |
License | Apache 2.0 |
Website | arxiv |
Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google.[1][2] It uses a transformer encoder architecture and is trained by self-supervised learning to represent text as a sequence of vectors. It was notable for its dramatic improvement over previous state-of-the-art models and as an early example of a large language model. As of 2020, BERT was a ubiquitous baseline in natural language processing (NLP) experiments.[3]
BERT is trained by masked token prediction and next sentence prediction. As a result of this training process, BERT learns latent representations of tokens in their context, similar to ELMo and GPT-2.[4] It found applications in many natural language processing tasks, such as coreference resolution and polysemy resolution.[5] It is an evolutionary step over ELMo, and it spawned the study of "BERTology", which attempts to interpret what BERT learns.[3]
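The two training objectives can be illustrated with a minimal sketch of how training examples might be constructed. This is not the released preprocessing code. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token follow the procedure described in the original paper and are assumptions here, since this article does not state them. The function names `mask_tokens` and `make_nsp_example` are hypothetical, and the next-sentence sampler is simplified to draw its negative example from the same list of sentences rather than from a different document.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Corrupt a token list for masked token prediction.

    Returns (corrupted, labels). labels[i] holds the original token at
    every position the model must predict, and None everywhere else.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # assumed: 80% replaced by [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # assumed: 10% random token
        # assumed: remaining 10% keep the original token
    return corrupted, labels


def make_nsp_example(sentences, index, rng):
    """Build a (sentence_a, sentence_b, is_next) triple.

    Half of the time sentence_b is the true successor of sentence_a;
    otherwise it is a different sentence drawn at random (simplified).
    """
    sentence_a = sentences[index]
    true_next = sentences[index + 1]
    if rng.random() < 0.5:
        return sentence_a, true_next, True
    candidates = [
        s for j, s in enumerate(sentences) if j not in (index, index + 1)
    ]
    return sentence_a, rng.choice(candidates), False


rng = random.Random(0)
tokens = "the bank raised interest rates again".split()
vocab = ["river", "money", "dog", "rates", "bank"]
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted, labels)

sentences = ["I went to the bank.", "It was closed.", "Dogs bark.", "Rain fell."]
print(make_nsp_example(sentences, 0, rng))
```

During training, the model would see `corrupted` as input and be scored only at positions where `labels` is not `None`, while the `is_next` flag serves as the target for the sentence-pair classifier.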
BERT was originally implemented for the English language at two model sizes, BERT<sub>BASE</sub> (110 million parameters) and BERT<sub>LARGE</sub> (340 million parameters). Both were trained on the Toronto BookCorpus[6] (800M words) and English Wikipedia (2,500M words). The weights were released on GitHub.[7] On March 11, 2020, 24 smaller models were released, the smallest being BERT<sub>TINY</sub> with just 4 million parameters.[7]
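The quoted parameter counts can be roughly reproduced with a back-of-the-envelope calculation. The sketch below makes several assumptions that this article does not state: the layer counts and hidden sizes (12 layers and 768 hidden units for the base model, 24 and 1024 for the large model, 2 and 128 for the tiny model), a 30,522-entry WordPiece vocabulary, 512 position embeddings, 2 segment types, a feed-forward width of four times the hidden size, and a final pooler layer. The pretraining output heads are left out, and the function name `bert_param_count` is hypothetical.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, type_vocab=2, ffn_mult=4):
    """Approximate parameter count of a BERT-style encoder (no task heads)."""
    layer_norm = 2 * hidden  # scale and bias vectors

    # Token, position, and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections with biases.
    attention = 4 * (hidden * hidden + hidden)

    # Position-wise feed-forward network: two linear layers with biases.
    ffn_dim = ffn_mult * hidden
    feed_forward = (hidden * ffn_dim + ffn_dim) + (ffn_dim * hidden + hidden)

    # Each encoder layer applies LayerNorm after attention and after the FFN.
    per_layer = attention + layer_norm + feed_forward + layer_norm

    # Pooler: one dense layer applied to the first token's vector.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * per_layer + pooler


# Assumed (num_layers, hidden) pairs; see the note above.
SIZES = {"base": (12, 768), "large": (24, 1024), "tiny": (2, 128)}

for name, (layers, hidden) in SIZES.items():
    total = bert_param_count(layers, hidden)
    print(f"{name}: {total:,} parameters ({total / 1e6:.1f}M)")
```

Under these assumptions the totals come to about 109.5 million, 335.1 million, and 4.4 million parameters, consistent with the rounded figures of 110 million, 340 million, and 4 million.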