ELMo

Architecture of ELMo. It first converts input tokens into embedding vectors with an embedding layer (essentially a lookup table), then applies a pair of forward and backward LSTMs to produce two sequences of hidden vectors, then applies another pair of forward and backward LSTMs, and so on.
How a token is transformed across successive layers of ELMo. At the start, the token is converted to its embedding vector by the embedding layer. In the first LSTM layer, a forward LSTM produces one hidden vector while a backward LSTM produces another. In the next layer, the two LSTMs produce a further pair of hidden vectors, and so on.
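The stacking described in these captions can be sketched in a few lines of PyTorch. The module below is a simplified illustration rather than the original implementation: the class name `ElmoStyleEncoder`, the dimensions, and the layer count are assumptions, and the real model adds character convolutions, projections, and residual connections between layers.

```python
import torch
import torch.nn as nn

class ElmoStyleEncoder(nn.Module):
    """Sketch of the ELMo layer structure: an embedding lookup followed by
    stacked pairs of forward and backward LSTMs, keeping every layer's output."""

    def __init__(self, vocab_size, dim=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # layer 0: a lookup table
        self.fwd = nn.ModuleList([nn.LSTM(dim, dim, batch_first=True) for _ in range(num_layers)])
        self.bwd = nn.ModuleList([nn.LSTM(dim, dim, batch_first=True) for _ in range(num_layers)])

    def forward(self, token_ids):                       # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)
        layers = [torch.cat([emb, emb], dim=-1)]        # duplicated so all layers share one width
        f_in = b_in = emb
        for f_lstm, b_lstm in zip(self.fwd, self.bwd):
            f_out, _ = f_lstm(f_in)                     # left-to-right hidden vectors
            b_out, _ = b_lstm(torch.flip(b_in, [1]))    # right-to-left hidden vectors
            b_out = torch.flip(b_out, [1])              # restore original token order
            layers.append(torch.cat([f_out, b_out], dim=-1))
            f_in, b_in = f_out, b_out                   # each direction feeds its own next LSTM
        return layers                                   # one (batch, seq_len, 2*dim) tensor per layer
```

Calling the encoder on a batch of token ids returns one tensor per layer (the embedding layer plus each biLSTM layer), which is the set of vectors the captions above describe.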

ELMo (embeddings from language model) is a word embedding method for representing a sequence of words as a corresponding sequence of vectors.[1] It was created by researchers at the Allen Institute for Artificial Intelligence[2] and the University of Washington and first released in February 2018. It is a bidirectional LSTM that takes character-level inputs and produces word-level embeddings, trained on a corpus of about 30 million sentences and 1 billion words.
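Concretely, pretraining maximizes a pair of language-modelling likelihoods over the corpus, one per direction. Following the paper's notation, with token-representation parameters $\Theta_x$ and softmax parameters $\Theta_s$ shared between the two directions, the joint objective for a sentence of $N$ tokens $t_1,\ldots,t_N$ is

$$\sum_{k=1}^{N}\Big(\log p\big(t_k \mid t_1,\ldots,t_{k-1};\,\Theta_x,\overrightarrow{\Theta}_{\mathrm{LSTM}},\Theta_s\big)+\log p\big(t_k \mid t_{k+1},\ldots,t_N;\,\Theta_x,\overleftarrow{\Theta}_{\mathrm{LSTM}},\Theta_s\big)\Big).$$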

The architecture of ELMo produces a contextual representation of each token: the vector assigned to a word depends on the whole sentence in which it appears. Such deep contextualized word representations are useful for many natural language processing tasks, such as coreference resolution and polysemy resolution.
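A toy illustration of what "contextualized" means, using an untrained bidirectional LSTM as a stand-in for ELMo (the vocabulary and sentences are invented for the example): the same word receives different vectors in different sentences, whereas a static lookup table would assign it a single vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"the": 0, "bank": 1, "river": 2, "account": 3, "was": 4, "steep": 5, "frozen": 6}
embed = nn.Embedding(len(vocab), 8)                       # static, context-free lookup
bilstm = nn.LSTM(8, 8, batch_first=True, bidirectional=True)

def contextual_vectors(words):
    ids = torch.tensor([[vocab[w] for w in words]])
    out, _ = bilstm(embed(ids))                           # each token's vector depends on the whole sentence
    return out[0]

a = contextual_vectors(["the", "river", "bank", "was", "steep"])[2]    # "bank" near "river"
b = contextual_vectors(["the", "bank", "account", "was", "frozen"])[1] # "bank" near "account"
print(torch.cosine_similarity(a, b, dim=0))  # below 1: the two "bank" vectors differ with context
```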

ELMo was historically important as a pioneer of self-supervised generative pretraining followed by fine-tuning: a large model is first trained to model a large unlabeled corpus, then augmented with additional task-specific weights and fine-tuned on supervised task data. It was an instrumental step in the evolution towards transformer-based language modelling.
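In the paper, those task-specific weights collapse the stack of layer representations into a single vector per token: one softmax-normalized weight per layer plus an overall scalar, learned together with the downstream model while the pretrained network is kept frozen. A minimal sketch of that mixing step (the layer count and tensor shapes are assumptions here):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific combination of layer outputs:
    ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}."""

    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # one learned weight per layer
        self.gamma = nn.Parameter(torch.ones(1))        # learned overall scale

    def forward(self, layer_outputs):                   # list of (batch, seq_len, dim) tensors
        weights = torch.softmax(self.s, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_outputs))
        return self.gamma * mixed                       # (batch, seq_len, dim), fed to the task model
```

The downstream task model then consumes the mixed vectors in place of, or alongside, static word embeddings.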

  1. ^ Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018). "Deep contextualized word representations". arXiv:1802.05365 [cs.CL].
  2. ^ "AllenNLP - ELMo — Allen Institute for AI".