N1H111SM's Miniverse

ELMo - Deep Contextualized Word Representations

2020/05/03


paper: Deep Contextualized Word Representations


The paper proposes a new type of deep contextualized word representation that simultaneously models:

  • complex characteristics of word use (syntax & semantics)
  • how these uses vary across linguistic contexts (polysemy)

The word representations here are a parameterized function mapping the internal states of a language model to vectors. Pretrained on a large corpus, these representations give further improvements when added to existing models. The paper also presents an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

Model Architecture

Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence, as described in this section.

Bidirectional Language Model

The parameters $\Theta_x$ that map tokens to vectors and the softmax-layer parameters $\Theta_s$ are shared between the two directions of the $L$-layer bidirectional LSTM; beyond these, the forward and backward LSTM parameters are independent. The formulation jointly maximizes the log likelihood of the forward and backward directions:

$$\sum_{k=1}^{N}\left(\log p(t_k \mid t_1,\dots,t_{k-1};\,\Theta_x,\overrightarrow{\Theta}_{LSTM},\Theta_s)+\log p(t_k \mid t_{k+1},\dots,t_N;\,\Theta_x,\overleftarrow{\Theta}_{LSTM},\Theta_s)\right)$$

For a token $t_k$, an $L$-layer biLM computes a set $R_k$ of $2L+1$ representations:

$$R_k=\{\,x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM}\mid j=1,\dots,L\,\}=\{\,h_{k,j}^{LM}\mid j=0,\dots,L\,\}$$

where $h_{k,0}^{LM}$ is the token layer and $h_{k,j}^{LM}=[\overrightarrow{h}_{k,j}^{LM};\overleftarrow{h}_{k,j}^{LM}]$ for each biLSTM layer.

ELMo

For inclusion in a downstream model, ELMo collapses all layers in $R_k$ into a single vector:

$$\mathrm{ELMo}_k^{task}=E(R_k;\Theta^{task})$$

A task-specific weighting of all biLM layers can be learned:

$$\mathrm{ELMo}_k^{task}=E(R_k;\Theta^{task})=\gamma^{task}\sum_{j=0}^{L}s_j^{task}\,h_{k,j}^{LM}$$

where $s_j^{task}$ are softmax-normalized weights and $\gamma^{task}$ scales the entire ELMo vector.
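
The layer mixing above can be sketched in a few lines of numpy; the dimensions ($L = 2$ layers, $d = 4$) and the initial weight values are illustrative assumptions, not the paper's trained parameters.

```python
import numpy as np

# Minimal sketch of ELMo's task-specific scalar mixing for one token.
L, d = 2, 4
h = np.random.randn(L + 1, d)   # h_{k,j}^{LM} for j = 0..L (toy values)

s_raw = np.zeros(L + 1)         # learnable pre-softmax scalars (assumed init)
gamma = 1.0                     # learnable task-level scale gamma^task

s = np.exp(s_raw) / np.exp(s_raw).sum()        # softmax-normalized s_j^task
elmo_k = gamma * (s[:, None] * h).sum(axis=0)  # weighted sum over layers
```

With the all-zero initialization above, the softmax weights are uniform and `elmo_k` is simply the average of the layer representations; training moves the weights toward the layers most useful for the task.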

biLMs for supervised NLP tasks

Given a pre-trained biLM and a supervised architecture for a target NLP task, a single forward pass of the biLM yields the $L$ layers of context-dependent representations for each word.

The paper summarizes the common structure of supervised NLP models: word tokens are first mapped through pre-trained (context-independent) embeddings, from which the model forms a context-sensitive representation $h_k$. To add ELMo to a supervised model:

  • freeze the weights of the biLM and then
  • concatenate the ELMo vector $\mathbf{E L M o}_{k}^{\text {task}}$ with $x_k$ and pass the ELMo enhanced representation $[x_k ; \mathbf{E L M o}_{k}^{\text {task}}]$ into the task RNN.
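
The two steps above amount to a per-token concatenation; in this sketch the shapes and the random stand-ins for the pre-trained embeddings and the frozen biLM output are illustrative assumptions.

```python
import numpy as np

# Sketch of step 2: concatenate each token's ELMo vector with its
# context-independent embedding x_k before feeding the task RNN.
T, d_word, d_elmo = 3, 5, 4             # toy sentence length and dimensions
x = np.random.randn(T, d_word)          # pre-trained word embeddings x_k
elmo = np.random.randn(T, d_elmo)       # ELMo vectors from the frozen biLM

# [x_k ; ELMo_k^task] for every token in the sentence
enhanced = np.concatenate([x, elmo], axis=1)
```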

Final Model

  • The final model uses $L = 2$ biLSTM layers with 4096 units and 512-dimensional projections, plus a residual connection from the first layer to the second.
  • The context-insensitive type representation uses 2048 character n-gram convolutional filters followed by two highway layers and a linear projection down to a 512-dimensional representation.
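
As a rough illustration of the highway layers in the token encoder, here is one highway layer applied to a pooled character-CNN feature vector; the dimension and the random weights are placeholder assumptions, not the model's parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One highway layer: y = t * H(x) + (1 - t) * x, where t is a learned gate.
d = 8                                    # toy feature dimension (assumed)
rng = np.random.default_rng(0)
W_h, b_h = 0.1 * rng.standard_normal((d, d)), np.zeros(d)
W_t, b_t = 0.1 * rng.standard_normal((d, d)), np.full(d, -2.0)  # carry-biased gate

def highway(x):
    t = sigmoid(x @ W_t + b_t)           # transform gate in (0, 1)
    h = np.maximum(0.0, x @ W_h + b_h)   # candidate transform (ReLU)
    return t * h + (1.0 - t) * x         # mix transform and carry paths

x = rng.standard_normal(d)               # stand-in for pooled char-CNN features
y = highway(x)
```

The negative gate bias makes the layer initially close to an identity function, a common choice that eases optimization of stacked highway layers.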

In contrast, traditional word embedding methods only provide one layer of representation for tokens in a fixed vocabulary.

Once pretrained, the biLM can compute representations for any task. In some cases, fine tuning the biLM on domain specific data leads to significant drops in perplexity and an increase in downstream task performance.

Experimental Results


  • Question Answering
  • Textual Entailment: Textual entailment is the task of determining whether a “hypothesis” is true, given a “premise”.
  • Semantic Role Labeling: A semantic role labeling (SRL) system models the predicate-argument structure of a sentence, and is often described as answering “Who did what to whom”.
  • Coreference Resolution: Coreference resolution is the task of clustering mentions in text that refer to the same underlying real world entities.
  • Named Entity Extraction
  • Sentiment Analysis


  • Alternate layer weighting schemes.
  • Where to include ELMo? (ELMo can also be concatenated with the task RNN's hidden states at the output, not only with the input.)
  • What information is captured by the biLM’s representations? (Intuitively, the biLM must be disambiguating the meaning of words using their context.)


  • Sample efficiency (In addition, ELMo-enhanced models use smaller training sets more efficiently than models without ELMo.)
  • Visualization of learned weights