# ELMo - Deep Contextualized Word Representations

# Motivation

Ideally, a word representation should model both:

• complex characteristics of word use (e.g., syntax and semantics), and
• how these uses vary across linguistic contexts (i.e., polysemy).

# Model Architecture

Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence, as described in this section.

## ELMo

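For inclusion in a downstream model, ELMo collapses all layers in $R$ into a single vector. In the general case, it computes a task-specific weighting of all biLM layers:

$$\mathbf{E L M o}_{k}^{\text {task}} = E(R_k; \Theta^{\text {task}}) = \gamma^{\text {task}} \sum_{j=0}^{L} s_{j}^{\text {task}} \mathbf{h}_{k, j}^{LM}$$

Here $R_k$ is the set of layer representations for token $k$, $\mathbf{h}_{k, 0}^{LM}$ is the token layer, $\mathbf{h}_{k, j}^{LM}$ for $j \geq 1$ is the concatenation of the forward and backward LSTM states at layer $j$, $s^{\text {task}}$ are softmax-normalized weights, and $\gamma^{\text {task}}$ is a scalar that rescales the whole vector.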
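The collapse described above can be sketched as a softmax-weighted sum over the layer representations, rescaled by a single scalar. This is a minimal NumPy sketch, not the reference implementation: the function name `scalar_mix`, the argument names, and the toy shapes are assumptions made for illustration.

```python
import numpy as np

def scalar_mix(layers, s_logits, gamma=1.0):
    """Collapse (L + 1) layer representations into one vector per token.

    layers:   array of shape (L + 1, num_tokens, dim)
    s_logits: array of shape (L + 1,), unnormalized layer weights
    gamma:    scalar that rescales the mixed vector
    """
    layers = np.asarray(layers, dtype=float)
    s_logits = np.asarray(s_logits, dtype=float)
    # Softmax-normalize the layer weights (subtract the max for stability).
    exp = np.exp(s_logits - s_logits.max())
    s = exp / exp.sum()
    # Weighted sum over the layer axis, then rescale by gamma.
    return gamma * np.tensordot(s, layers, axes=(0, 0))

# Toy usage: 3 layers (token layer + 2 biLSTM layers), 4 tokens, 1024 dims.
rng = np.random.default_rng(0)
layers = rng.normal(size=(3, 4, 1024))
elmo = scalar_mix(layers, s_logits=np.zeros(3))
print(elmo.shape)
```

With all logits equal, the mix reduces to a plain average of the layers; in a real model the logits and `gamma` would be learned together with the downstream task.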

## biLMs for supervised NLP tasks

• Freeze the weights of the biLM.
• Concatenate the ELMo vector $\mathbf{E L M o}_{k}^{\text {task}}$ with the token representation $x_k$ and pass the ELMo-enhanced representation $[x_k ; \mathbf{E L M o}_{k}^{\text {task}}]$ into the task RNN.
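The concatenation step above amounts to joining the two vectors along the feature axis for every token. A minimal NumPy sketch, assuming the token embeddings and ELMo vectors are already computed as arrays (the helper name `enhance_inputs` and the dimensions are hypothetical):

```python
import numpy as np

def enhance_inputs(x, elmo):
    """Build [x_k ; ELMo_k] for every token k.

    x:    array of shape (num_tokens, x_dim), context-independent embeddings
    elmo: array of shape (num_tokens, elmo_dim), vectors from the frozen biLM
    """
    x = np.asarray(x)
    elmo = np.asarray(elmo)
    if x.shape[0] != elmo.shape[0]:
        raise ValueError("x and elmo must cover the same number of tokens")
    # Concatenate along the feature axis; the result feeds the task RNN.
    return np.concatenate([x, elmo], axis=-1)

# Toy usage: 5 tokens, 100-dim task embeddings, 1024-dim ELMo vectors.
x = np.ones((5, 100))
elmo_vectors = np.zeros((5, 1024))
enhanced = enhance_inputs(x, elmo_vectors)
print(enhanced.shape)
```

Because the biLM is frozen, only the task model's parameters (and any layer-mixing weights) would receive gradients in training; the task RNN's input size simply grows by the ELMo dimension.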

## Final Model

• The final model uses $L = 2$ biLSTM layers with 4096 units and 512-dimensional projections, plus a residual connection from the first to the second layer.
• The context-insensitive type representation uses 2048 character n-gram convolutional filters, followed by two highway layers and a linear projection down to a 512-dimensional representation.
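To make the second bullet concrete, the sketch below passes pooled convolutional features through two highway layers and a linear projection. It is only an illustration under stated assumptions: it uses the standard highway formulation with a ReLU nonlinearity, the function names are hypothetical, and the toy sizes (16 and 4) stand in for the 2048 filters and 512 output dimensions mentioned above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(x, w_h, b_h, w_t, b_t):
    """One highway layer: gate between a transformed input and the input itself."""
    h = np.maximum(0.0, x @ w_h + b_h)   # candidate transform (ReLU assumed)
    t = sigmoid(x @ w_t + b_t)           # transform gate in (0, 1)
    return t * h + (1.0 - t) * x         # carry the rest of x through unchanged

def token_representation(conv_features, highway_params, w_proj, b_proj):
    """Apply the highway layers in order, then project down linearly."""
    x = conv_features
    for w_h, b_h, w_t, b_t in highway_params:
        x = highway(x, w_h, b_h, w_t, b_t)
    return x @ w_proj + b_proj

# Toy sizes; the document's model would use in_dim = 2048 and out_dim = 512.
in_dim, out_dim, num_tokens = 16, 4, 3
rng = np.random.default_rng(0)
highway_params = [
    (rng.normal(size=(in_dim, in_dim)), np.zeros(in_dim),
     rng.normal(size=(in_dim, in_dim)), np.zeros(in_dim))
    for _ in range(2)
]
w_proj = rng.normal(size=(in_dim, out_dim))
b_proj = np.zeros(out_dim)

conv_features = rng.normal(size=(num_tokens, in_dim))
tokens = token_representation(conv_features, highway_params, w_proj, b_proj)
print(tokens.shape)
```

The gate lets each layer fall back to the identity when the transform is not useful, which is why highway layers keep the input and output dimensions equal until the final projection.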

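As a result, the biLM provides three layers of representation for each input token (the character-based token layer plus the two biLSTM layers), including tokens outside the training vocabulary, because the input is purely character-based. In contrast, traditional word embedding methods only provide one layer of representation for tokens in a fixed vocabulary.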

Once pretrained, the biLM can compute representations for any task. In some cases, fine-tuning the biLM on domain-specific data leads to significant drops in perplexity and an increase in downstream task performance.

# Experimental Results

• Textual Entailment: Textual entailment is the task of determining whether a “hypothesis” is true, given a “premise”.
• Semantic Role Labeling: A semantic role labeling (SRL) system models the predicate-argument structure of a sentence, and is often described as answering “Who did what to whom”.
• Coreference Resolution: Coreference resolution is the task of clustering mentions in text that refer to the same underlying real-world entities.
• Named Entity Extraction
• Sentiment Analysis

• Alternate layer weighting schemes.
• Where to include ELMo? (ELMo can also be concatenated with the internal state at the output of the task RNN, not only with its input.)
• What information is captured by the biLM’s representations? (Intuitively, the biLM must be disambiguating the meaning of words using their context.)

• Sample efficiency: ELMo-enhanced models use smaller training sets more efficiently than models without ELMo.
• Visualization of learned weights