N1H111SM's Miniverse

Mutual Information Maximization - Theory

字数统计: 1.6k阅读时长: 9 min
2020/05/10 Share

Materials

MI Definition

Recall the mutual information can be computed by information entropy or by KL divergence, the latter of which is often used for estimation.

MI Estimation

When it’s for the discrete variables, mutual information is easy to compute. But when in the continuous case, the standard approach to estimating it is with parametric density models (e.g., Gaussian Distribution), which is unsatisfactory for real-world high-dimension signals.

Given the fundamental limitations of MI estimation, recent work has focused on deriving lower bounds on MI. Intuitively, these bounds are based on the following idea: If a classifier can accurately distinguish between samples drawn from the joint $p(x, y)$ and those drawn from the product of marginals $p(x)p(y)$, then $X$ and $Y$ have a high MI.

The Donsker-Varadhan representation.

The following theorem gives a representation of the KL divergence.

where the supremum is taken over all functions $T$ such that the two expectations are finite.

挑选出一个使$\mathbb{E}_{\mathbb{P}}[T]-\log \left(\mathbb{E}_{\mathbb{Q}}\left[e^{T}\right]\right)$最大的函数,我们只需要构造一个在每一个自变量取值处都取到函数最大值的最优函数$T$. 从而可以证明这条式子是成立的(在离散情况下). Donsker-Varadhan representation提供了一个紧的下界,为了estimate KL divergence,我们只需要将这个实值函数$T$参数化,去maximize对应的函数值即可:

Given pairs of samples $x_n \sim \mathbb P, y_n \sim \mathbb Q, n=1, \cdots, N$. we can estimate the $\sup$ over all mappings with the optimization of a model:

For estimating mutual informaiton, the last thing required is to construct joint distribution and marginal product distribution: given pairs of samples i.i.d $\sim \mu_{A,B}$

with $\sigma$ a random permutation (随机排列), we have the marginal product distribution

as i.i.d samples $\sim \mu_{A} \otimes \mu_{B}$, from which we have the estimation method as an optimization process:

InfoNCE

InfoNCE is another MI estimator used in representation learning, which is defined as,

where the expectation is over $K$ independent samples $\left\{\left(x_{i}, y_{i}\right)\right\}_{i=1}^{K}$ from the joint distribution $p(x, y)$.

JSD

根据fGAN中提出的框架,我们可以构造一个Jensen-Shannon MI Estimator:

where sp is the softplus function:

MI Maximization

Motivation

Refer this section to paper “Learning Deep Representations by Mutual Information Estimation and Maximization”.

Incorporating knowledge about locality in the input into the objective can significantly improve a representation’s suitability for downstream tasks. We further control characteristics of the repre- sentation by matching to a prior distribution adversarially.

We leverage MI estimation for representation learning and show that, depending on the downstream task, maximizing MI between the complete input and the encoder output (i.e., global MI) is often insufficient for learning useful representations.

Representation Learning

Let $\mathcal X$ and $\mathcal Y$ be the domain and range of a continuous and (almost everywhere) differentiable parametric function, $E_{\psi}: \mathcal{X} \rightarrow \mathcal{Y}$ with parameters $\psi$ (e.g., a neural network). Mutual Information Maximization is to: find the set of parameters, $\psi$, such that the mutual information, $\mathcal{I}\left(X ; E_{\psi}(X)\right)$, is maximized. This process is often used for finding representations for a given dataset. (This part refers to “Learning Deep Representations by Mutual Information Estimation and Maximization”)

InfoMax finds that exact KL-based formulation of MI is not necessary, for example, a simple alternative based on the Jensen-Shannon divergence (JSD) is more stable and provides better results. At a high level, we optimize $E_{\psi}$ by simultaneously estimating and maximizing $\mathcal{I}\left(X ; E_{\psi}(X)\right)$, where $G$ represents global maximization:

Global/Local-MI Maximization

The objective above can be used to maximize MI between input and output, but ultimately this may be undesirable depending on the tasks. Suppose the intermediate representation of the input preserves the characteristics of locality, which is represented as $C_{\psi}(x):=\left\{C_{\psi}^{(i)}\right\}_{i=1}^{M}$ , and $E_{\psi}(x)=f_{\psi} \circ C_{\psi}(x)$. Define our MI estimator on global/local pairs, maximizing the average estimated MI:

Multiview-MI Maximization

Multiview-MI Maximization includes various objectives among which the MI is computed into one unified framework. (refer to paper “On Mutual Information Maximization for Representation Learning”)

In the image classification setting, for a given image $X$, let $X^{(1)}$ and $X^{(2)}$ be different, possibly overlapping views of $X$, for instance the top and bottom halves of the image. These are encoded using encoders $g_1$ and $g_2$ respectively, and the MI between the two representationsg $g_1(X^{(1)})$ and $g_2(X^{(2)})$ is maximized,

where $\hat I$ represents the sample based MI estimator of the true MI $I(X;Y)$ and the function classes $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ can be used to specify structural constraints on the encoders. One thing to note here is that two encoders $g_1$ and $g_2$ often share parameters.

It gives us plenty of modeling flexibility, as the two views can be chosen to capture completely different aspects and modalities of the data, for example:

  • In the basic form of DeepInfoMax $g_1$ extracts global features from the entire image $X^{(1)}$ and $g_2$ local features from image patches $X^{(2)}$, where $g_1$ and $g_2$ correspond to activations in different layers of the same convolutional network. (也就是上一节介绍的global/local-MI Maximization)
  • Contrastive multiview coding (CMC) generalizes the objective in to consider multiple views $X^{(i)}$, where each $X^{(i)}$ corresponds to a different image modality (e.g., different color channels, or the image and its segmentation mask).
  • Contrastive predictive coding(CPC) incorporates a sequential component of the data. Concretely, one extracts a sequence of patches from an image in some fixed order, maps each patch using an encoder, aggregates the resulting features of the first $t$ patches into a context vector, and maximizes the MI between the context and features extracted from the patch at position $t + k.$

MI Maximization Papers

Fundamental research work.

  • [x] LEARNING DEEP REPRESENTATIONS BY MUTUAL INFORMATION ESTIMATION AND MAXIMIZATION (ICLR 2019)
  • [ ] Representation Learning with Contrastive Predictive Coding
  • [x] ON MUTUAL INFORMATION MAXIMIZATION FOR REPRESENTATION LEARNING (ICLR 2020)
  • [x] Constrastive Multiview Coding

MI maximization with other constraints

  • [ ] MI maximization in graphs.

  • [ ] DEEP GRAPH INFOMAX (ICLR 2019)

  • [ ] SPATIO-TEMPORAL DEEP GRAPH INFOMAX (RLGM, ICLR 2019)
  • [x] INFOGRAPH: UNSUPERVISED AND SEMI-SUPERVISED GRAPH-LEVEL REPRESENTATION LEARNING VIA MUTUAL INFORMATION MAXIMIZATION (ICLR 2020)
  • [ ] Unsupervised Hierarchical Graph Representation Learning by Mutual Information Maximization
  • [ ] Unsupervised Attributed Multiplex Network Embedding
  • [ ] Graph Representation Learning via Graphical Mutual Information Maximization
  • [ ] Heterogeneous Deep Graph Infomax

Application of the MI in other fields.

  • [ ] Utilizing Edge Features in Graph Neural Networks via Variational Information Maximization

  • [ ] GNNExplainer: Generating Explanations for Graph Neural Networks

  • [ ] MINT: Deep Network Compression via Mutual Information-based Neuron Trimming

CATALOG
  1. 1. MI Definition
  2. 2. MI Estimation
    1. 2.1. The Donsker-Varadhan representation.
    2. 2.2. InfoNCE
    3. 2.3. JSD
  3. 3. MI Maximization
    1. 3.1. Motivation
    2. 3.2. Representation Learning
    3. 3.3. Global/Local-MI Maximization
    4. 3.4. Multiview-MI Maximization
  4. 4. MI Maximization Papers