N1H111SM's Miniverse

On MI Maximization for Representation Learning

2020/05/22



In this paper we argue, and provide empirical evidence, that the success of these methods cannot be attributed to the properties of MI alone, and that they strongly depend on the inductive bias in both the choice of feature extractor architectures and the parametrization of the employed MI estimators.

Structure of Introduction

  • What is MI: definition
  • Fundamental properties of MI
  • Firstly, MI is invariant under reparametrization of the variables - namely, if $X^\prime = f_1(X)$ and $Y^\prime = f_2(Y)$ where $f_1$ and $f_2$ are homeomorphisms (i.e. smooth invertible maps), then $I (X ; Y ) = I (X ^\prime ; Y ^\prime )$.
    • Secondly, estimating MI in high-dimensional spaces is a notoriously difficult task, and in practice one often maximizes a tractable lower bound on this quantity.
    • Any distribution-free high-confidence lower bound on entropy requires a sample size exponential in the size of the bound.
  • In fact, we show that maximizing tighter bounds on MI can result in worse representations.
  • In addition, we establish a connection to deep metric learning and argue that this interpretation may be a plausible explanation of the success of the recently introduced methods.
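The reparametrization invariance is easy to verify exactly for discrete variables; a minimal numpy sketch (the joint distribution table below is made up for illustration):

```python
import numpy as np

def mutual_information(joint):
    """Exact MI (in nats) of a discrete joint pmf given as a 2-D table."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])))

# A correlated binary joint distribution.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
mi = mutual_information(joint)

# Relabeling the outcomes of X is a bijection on its alphabet: it permutes the
# rows of the table but leaves the mutual information unchanged.
mi_relabeled = mutual_information(joint[::-1, :])
print(np.isclose(mi, mi_relabeled))  # True
```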

Multi-view formulation of the MI maximization

In the image classification setting, for a given image $X$, let $X^{(1)}$ and $X^{(2)}$ be different, possibly overlapping views of $X$, for instance the top and bottom halves of the image. These are encoded using encoders $g_1$ and $g_2$ respectively, and the MI between the two representations $g_1(X^{(1)})$ and $g_2(X^{(2)})$ is maximized:

$$\max_{g_1 \in \mathcal{G}_1,\, g_2 \in \mathcal{G}_2} \hat{I}\left(g_1(X^{(1)}); g_2(X^{(2)})\right)$$

where $\hat I$ represents a sample-based estimator of the true MI $I(X;Y)$, and the function classes $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ can be used to specify structural constraints on the encoders. Note that the two encoders $g_1$ and $g_2$ often share parameters.
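The view construction can be sketched as follows (a minimal numpy sketch; the batch size, feature dimension, and random linear encoders are illustrative stand-ins, not the paper's actual models):

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of eight 28x28 "images"; the two views are the top and bottom halves.
images = rng.normal(size=(8, 28, 28))
x1 = images[:, :14, :].reshape(8, -1)  # view X^(1): top halves
x2 = images[:, 14:, :].reshape(8, -1)  # view X^(2): bottom halves

# Hypothetical linear encoders g1, g2 (random weights stand in for trained ones).
W1 = rng.normal(size=(x1.shape[1], 32))
W2 = rng.normal(size=(x2.shape[1], 32))
z1, z2 = x1 @ W1, x2 @ W2  # representations fed to the MI estimator
print(z1.shape, z2.shape)  # (8, 32) (8, 32)
```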

This formulation gives plenty of modeling flexibility, as the two views can be chosen to capture completely different aspects and modalities of the data. For example:

  • In the basic form of DeepInfoMax, $g_1$ extracts global features from the entire image $X^{(1)}$ and $g_2$ local features from image patches $X^{(2)}$, where $g_1$ and $g_2$ correspond to activations in different layers of the same convolutional network (i.e., the global/local MI maximization introduced in the previous section).
  • Contrastive multiview coding (CMC) generalizes the objective above to consider multiple views $X^{(i)}$, where each $X^{(i)}$ corresponds to a different image modality (e.g., different color channels, or the image and its segmentation mask).
  • Contrastive predictive coding (CPC) incorporates a sequential component of the data. Concretely, one extracts a sequence of patches from an image in some fixed order, maps each patch using an encoder, aggregates the resulting features of the first $t$ patches into a context vector, and maximizes the MI between the context and the features extracted from the patch at position $t+k$.
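The CPC setup above can be sketched in a few lines (mean pooling stands in for CPC's autoregressive aggregator, and the bilinear critic weights are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# A sequence of T patch features, as produced by a patch encoder.
T, k, d = 10, 2, 16
features = rng.normal(size=(T, d))

t = 5
# Aggregate the features of the first t patches into a context vector
# (mean pooling is a simple stand-in for CPC's autoregressive aggregator).
context = features[:t].mean(axis=0)

# MI is then maximized between the context and the features of the patch at
# position t + k, scored here with a hypothetical bilinear critic W_k.
W_k = rng.normal(size=(d, d))
score = context @ W_k @ features[t + k]
```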

MI Estimators

InfoNCE

$$I_{\mathrm{NCE}}(X; Y) \triangleq \mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K} \log \frac{e^{f(x_i, y_i)}}{\frac{1}{K} \sum_{j=1}^{K} e^{f(x_i, y_j)}}\right] \leq I(X; Y)$$

where $\{(x_i, y_i)\}_{i=1}^K$ are $K$ samples drawn from the joint distribution. The underlying motivation is to train a discriminator (the critic $f$) to distinguish corresponding pairs $(x_i, y_i)$ from non-corresponding pairs $(x_i, y_j)$.
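Given a $K \times K$ matrix of critic scores, the InfoNCE bound can be computed directly; a minimal numpy sketch:

```python
import numpy as np

def info_nce(scores):
    """InfoNCE lower bound (in nats) from a K x K critic score matrix,
    where scores[i, j] = f(x_i, y_j) and the diagonal holds positive pairs."""
    K = scores.shape[0]
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(np.mean(np.diag(log_softmax)) + np.log(K))

# Even a critic that strongly prefers matched pairs cannot push the bound
# past log K -- the well-known saturation of InfoNCE at log(batch size).
K = 8
scores = 5.0 * np.eye(K)
print(info_nce(scores) <= np.log(K))  # True
```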

KL divergence from NWJ

The NWJ (Nguyen-Wainwright-Jordan) estimator lower-bounds MI via its variational characterization as a KL divergence:

$$I_{\mathrm{NWJ}}(X; Y) \triangleq \mathbb{E}_{p(x, y)}[f(x, y)] - e^{-1}\, \mathbb{E}_{p(x) p(y)}\left[e^{f(x, y)}\right] \leq I(X; Y)$$

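A numpy sketch of the NWJ-style bound $\mathbb{E}_{p(x,y)}[f] - e^{-1}\mathbb{E}_{p(x)p(y)}[e^{f}]$, using the off-diagonal entries of a score matrix to approximate samples from the product of marginals:

```python
import numpy as np

def i_nwj(scores):
    """NWJ lower bound (in nats) from a K x K critic score matrix: diagonal
    entries score joint samples, off-diagonal entries approximate samples
    from the product of marginals."""
    K = scores.shape[0]
    joint_term = np.mean(np.diag(scores))
    marginal_term = np.exp(-1.0) * np.mean(np.exp(scores[~np.eye(K, dtype=bool)]))
    return float(joint_term - marginal_term)

# With a constant critic f = 1 the bound evaluates to 0, its value for
# independent variables.
print(np.isclose(i_nwj(np.ones((4, 4))), 0.0))  # True
```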
Biases in Approximate MI Maximization

Current approaches to approximate MI maximization for representation learning consist of the following components:

  • Encoders: determined by the choice of network architecture and its parameters.
  • Critics: the function trained to distinguish samples from the joint distribution from samples from the product of marginals.
  • Estimators: the lower bound on MI that is actually optimized.

Around these components, the paper presents four sets of experiments probing how they affect the quality of the representations learned by approximate MI maximization:

  • bijective encoders (not reported in this blog);
  • encoders that can model both invertible and non-invertible functions (not reported in this blog);
  • different critics: looser bounds with simpler critics can outperform tighter bounds with high-capacity critics;
  • the encoder architecture matters more than the specific estimator.


Motivation: Our goal is to provide a minimal set of easily reproducible empirical experiments to understand the role of MI estimators, critic architectures, and encoder architectures when learning representations via the MI maximization objective.

Dataset: To this end, we consider a simple setup: learning a representation of MNIST handwritten digit images by maximizing the MI between the representations of the top and bottom halves of each image, using a bilinear critic $f(x, y) = x^{\top} W y$.

Evaluation: the widely adopted downstream linear evaluation protocol is followed: a linear classifier is trained on top of the frozen representations, and its downstream test accuracy measures representation quality.
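A minimal numpy sketch of the linear evaluation protocol (a least-squares probe on one-hot labels stands in for the logistic regression that is typically used; the toy data are synthetic):

```python
import numpy as np

def linear_probe_accuracy(z_train, y_train, z_test, y_test):
    """Fit a linear classifier on frozen representations and report
    downstream test accuracy (least-squares probe with a bias term)."""
    classes = np.unique(y_train)
    onehot = (y_train[:, None] == classes[None, :]).astype(float)
    add_bias = lambda z: np.hstack([z, np.ones((len(z), 1))])
    W, *_ = np.linalg.lstsq(add_bias(z_train), onehot, rcond=None)
    preds = classes[np.argmax(add_bias(z_test) @ W, axis=1)]
    return float(np.mean(preds == y_test))

# Toy sanity check: a linearly separable representation probes to high accuracy.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
z = np.stack([y + 0.05 * rng.normal(size=200), rng.normal(size=200)], axis=1)
acc = linear_probe_accuracy(z[:100], y[:100], z[100:], y[100:])
print(acc > 0.9)  # True
```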

Higher Capacity Critics: Worse Downstream Tasks

In the previous section we have established that MI and downstream performance are only loosely connected. Clearly, maximizing MI is not sufficient to learn good representations and there is a non-trivial interplay between the architectures of the encoder, critic, and the underlying estimators.

This part studies how the critic architecture affects representation quality. Normally, a higher-capacity critic allows a tighter lower bound on MI, yet this section shows that looser bounds (i.e., simpler critic functions) lead to better representation quality.

Three critic architectures are compared:

  • a bilinear critic $f(x, y) = x^{\top} W y$,
  • a separable critic $f(x, y) = \phi_1(x)^{\top} \phi_2(y)$, where $\phi_1$ and $\phi_2$ are MLPs with a single hidden layer of 100 units and ReLU activations, followed by a linear layer with 100 units (about 40k parameters in total),
  • an MLP critic with a single hidden layer of 200 units and ReLU activations, applied to the concatenated input $[x, y]$ (about 40k trainable parameters).
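The three critic families can be sketched in numpy as follows (weights are random placeholders and the input dimension is illustrative; only the hidden sizes follow the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 32, 100  # representation size (illustrative) and hidden width
relu = lambda a: np.maximum(a, 0.0)

# Bilinear critic: f(x, y) = x^T W y.
W_bi = rng.normal(size=(d, d))
bilinear = lambda x, y: x @ W_bi @ y

# Separable critic: f(x, y) = phi_1(x)^T phi_2(y), with one-hidden-layer MLPs
# followed by a linear layer of 100 units.
W1a, W1b = rng.normal(size=(d, h)), rng.normal(size=(h, h))
W2a, W2b = rng.normal(size=(d, h)), rng.normal(size=(h, h))
phi1 = lambda x: relu(x @ W1a) @ W1b
phi2 = lambda y: relu(y @ W2a) @ W2b
separable = lambda x, y: phi1(x) @ phi2(y)

# MLP critic: a single hidden layer of 200 units on the concatenated input [x, y].
Wm1, wm2 = rng.normal(size=(2 * d, 2 * h)), rng.normal(size=(2 * h,))
mlp = lambda x, y: relu(np.concatenate([x, y]) @ Wm1) @ wm2

x, y = rng.normal(size=d), rng.normal(size=d)
scores = [critic(x, y) for critic in (bilinear, separable, mlp)]
print([np.asarray(s).shape for s in scores])  # [(), (), ()]
```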

The reported experimental results support the conclusion that looser bounds (i.e., simpler critic functions) lead to better representation quality.


Encoder Architecture: More Important than Specific Estimator

To briefly summarize this part of the experiments: the authors compare two encoder architectures, a ConvNet and an MLP, measuring downstream accuracy under both the $I_{NCE}$ and $I_{NWJ}$ estimators, and find that the encoder architecture has a larger influence than the estimator.

To ensure that both network architectures achieve the same lower bound $I_{\mathrm{EST}}$ on the MI, we minimize $L_{t}\left(g_{1}, g_{2}\right) = \left| I_{\mathrm{EST}}\left(g_{1}\left(X^{(1)}\right); g_{2}\left(X^{(2)}\right)\right) - t \right|$ instead of solving the original information maximization problem, for two different values $t = 2, 4$.


In the reported results, the curves do appear to show an improvement process, but compared with their starting points the gains are relatively marginal. We also have reason to suspect that the authors chose MNIST as the main experimental dataset because InfoMax performs poorly on more complex datasets: for example, for CIFAR-10 in Appendix G the authors report only a very small downstream improvement.


Deep Metric Learning

Given a set of triplets, namely an anchor point $x$, a positive instance $y$, and a negative instance $z$, the goal is to learn a representation $g(x)$ such that the distance between $g(x)$ and $g(y)$ is smaller than the distance between $g(x)$ and $g(z)$, for each triplet. From this perspective it is easy to see why the function in MI maximization is called a critic function. InfoNCE can in turn be rewritten as

$$I_{\mathrm{NCE}} = \log K + \mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K} \log \frac{e^{f(x_i, y_i)}}{\sum_{j=1}^{K} e^{f(x_i, y_j)}}\right],$$

where the second term trains a metric (critic) function that pushes the scores of positive pairs as far above those of negative pairs as possible.
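Up to the constant $\log K$, InfoNCE is the negative of a $K$-way softmax cross-entropy on the critic scores, which is easy to verify numerically with a random score matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 6
scores = rng.normal(size=(K, K))  # scores[i, j] = f(x_i, y_j)

# InfoNCE in its original form ...
nce = np.mean(np.diag(scores) - np.log(np.exp(scores).sum(axis=1) / K))
# ... and as log K minus a K-way softmax cross-entropy on the positive pairs.
cross_entropy = np.mean(np.log(np.exp(scores).sum(axis=1)) - np.diag(scores))
print(np.isclose(nce, np.log(K) - cross_entropy))  # True
```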

Future Work

Alternative measures of information

While MI has appealing theoretical properties, it is clearly not sufficient for this task—it is hard to estimate, invariant to bijections and can result in suboptimal representations which do not correlate with downstream performance. Therefore, a new notion of information should account for both the amount of information stored in a representation and the geometry of the induced space necessary for good performance on downstream tasks. One possible avenue is to consider extensions to MI which explicitly account for the modeling power and computational constraints of the observer, such as the recently introduced $\mathcal{F}$-information.

Going beyond the widely used linear evaluation protocol

While it was shown that learning good representations under the linear evaluation protocol can lead to reduced sample complexity for downstream tasks (Arora et al., 2019), some recent works (Bachman et al., 2019; Tian et al., 2019) report marginal improvements in terms of the downstream performance under a non-linear regime. Related to the previous point, it would hence be interesting to further explore the implications of the evaluation protocol, in particular its importance in the context of other design choices. We stress that a highly-nonlinear evaluation framework may result in better downstream performance, but it defeats the purpose of learning efficiently transferable data representations.
