ELMo

Architecture of ELMo. It first converts input tokens into embedding vectors with an embedding layer (essentially a lookup table), then applies a pair of forward and backward LSTMs to produce two sequences of hidden vectors, then applies another pair of forward and backward LSTMs, and so on.
How a token is transformed across successive layers of ELMo. At the start, the token is converted to its embedding vector by the embedding layer. In the first LSTM layer, a forward LSTM produces one hidden vector while a backward LSTM produces another. In the next layer, the two LSTMs produce a further pair of hidden vectors, and so on.
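The stacking described in these captions can be sketched in a few lines of PyTorch. The module below is a simplified illustration rather than the original implementation: the class name `ElmoStyleEncoder`, the dimensions, and the layer count are assumptions, and the real model adds character convolutions, projections, and residual connections between layers.

```python
import torch
import torch.nn as nn

class ElmoStyleEncoder(nn.Module):
    """Sketch of the ELMo layer structure: an embedding lookup followed by
    stacked pairs of forward and backward LSTMs, keeping every layer's output."""

    def __init__(self, vocab_size, dim=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # layer 0: a lookup table
        self.fwd = nn.ModuleList([nn.LSTM(dim, dim, batch_first=True) for _ in range(num_layers)])
        self.bwd = nn.ModuleList([nn.LSTM(dim, dim, batch_first=True) for _ in range(num_layers)])

    def forward(self, token_ids):                       # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)
        layers = [torch.cat([emb, emb], dim=-1)]        # duplicated so all layers share one width
        f_in = b_in = emb
        for f_lstm, b_lstm in zip(self.fwd, self.bwd):
            f_out, _ = f_lstm(f_in)                     # left-to-right hidden vectors
            b_out, _ = b_lstm(torch.flip(b_in, [1]))    # right-to-left hidden vectors
            b_out = torch.flip(b_out, [1])              # restore original token order
            layers.append(torch.cat([f_out, b_out], dim=-1))
            f_in, b_in = f_out, b_out                   # each direction feeds its own next LSTM
        return layers                                   # one (batch, seq_len, 2*dim) tensor per layer
```

Calling the encoder on a batch of token ids returns one tensor per layer (the embedding layer plus each biLSTM layer), which is the set of vectors the captions above describe.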

ELMo (embeddings from language model) is a word embedding method for representing a sequence of words as a corresponding sequence of vectors.[1] It was created by researchers at the Allen Institute for Artificial Intelligence[2] and the University of Washington and first released in February 2018. It is a bidirectional LSTM that takes character-level inputs and produces word-level embeddings, trained on a corpus of about 30 million sentences and 1 billion words.
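Concretely, pretraining maximizes a pair of language-modelling likelihoods over the corpus, one per direction. Following the paper's notation, with token-representation parameters $\Theta_x$ and softmax parameters $\Theta_s$ shared between the two directions, the joint objective for a sentence of $N$ tokens $t_1,\ldots,t_N$ is

$$\sum_{k=1}^{N}\Big(\log p\big(t_k \mid t_1,\ldots,t_{k-1};\,\Theta_x,\overrightarrow{\Theta}_{\mathrm{LSTM}},\Theta_s\big)+\log p\big(t_k \mid t_{k+1},\ldots,t_N;\,\Theta_x,\overleftarrow{\Theta}_{\mathrm{LSTM}},\Theta_s\big)\Big).$$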

The architecture of ELMo produces a contextual representation of each token: the vector assigned to a word depends on the whole sentence in which it appears. Such deep contextualized word representations are useful for many natural language processing tasks, such as coreference resolution and polysemy resolution.
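A toy illustration of what "contextualized" means, using an untrained bidirectional LSTM as a stand-in for ELMo (the vocabulary and sentences are invented for the example): the same word receives different vectors in different sentences, whereas a static lookup table would assign it a single vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"the": 0, "bank": 1, "river": 2, "account": 3, "was": 4, "steep": 5, "frozen": 6}
embed = nn.Embedding(len(vocab), 8)                       # static, context-free lookup
bilstm = nn.LSTM(8, 8, batch_first=True, bidirectional=True)

def contextual_vectors(words):
    ids = torch.tensor([[vocab[w] for w in words]])
    out, _ = bilstm(embed(ids))                           # each token's vector depends on the whole sentence
    return out[0]

a = contextual_vectors(["the", "river", "bank", "was", "steep"])[2]    # "bank" near "river"
b = contextual_vectors(["the", "bank", "account", "was", "frozen"])[1] # "bank" near "account"
print(torch.cosine_similarity(a, b, dim=0))  # below 1: the two "bank" vectors differ with context
```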

ELMo was historically important as a pioneer of self-supervised generative pretraining followed by fine-tuning: a large model is first trained to model a large unlabeled corpus, then augmented with additional task-specific weights and fine-tuned on supervised task data. It was an instrumental step in the evolution towards transformer-based language modelling.
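In the paper, those task-specific weights collapse the stack of layer representations into a single vector per token: one softmax-normalized weight per layer plus an overall scalar, learned together with the downstream model while the pretrained network is kept frozen. A minimal sketch of that mixing step (the layer count and tensor shapes are assumptions here):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific combination of layer outputs:
    ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}."""

    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # one learned weight per layer
        self.gamma = nn.Parameter(torch.ones(1))        # learned overall scale

    def forward(self, layer_outputs):                   # list of (batch, seq_len, dim) tensors
        weights = torch.softmax(self.s, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_outputs))
        return self.gamma * mixed                       # (batch, seq_len, dim), fed to the task model
```

The downstream task model then consumes the mixed vectors in place of, or alongside, static word embeddings.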

  1. ^ Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018). "Deep contextualized word representations". arXiv:1802.05365 [cs.CL].
  2. ^ "AllenNLP - ELMo — Allen Institute for AI".