N1H111SM's Miniverse

2020/05/10 Share

Materials

MI Definition

Recall the mutual information can be computed by information entropy or by KL divergence, the latter of which is often used for estimation.

MI Estimation

When it’s for the discrete variables, mutual information is easy to compute. But when in the continuous case, the standard approach to estimating it is with parametric density models (e.g., Gaussian Distribution), which is unsatisfactory for real-world high-dimension signals.

Given the fundamental limitations of MI estimation, recent work has focused on deriving lower bounds on MI. Intuitively, these bounds are based on the following idea: If a classifier can accurately distinguish between samples drawn from the joint $p(x, y)$ and those drawn from the product of marginals $p(x)p(y)$, then $X$ and $Y$ have a high MI.

The following theorem gives a representation of the KL divergence.

where the supremum is taken over all functions $T$ such that the two expectations are finite.

Given pairs of samples $x_n \sim \mathbb P, y_n \sim \mathbb Q, n=1, \cdots, N$. we can estimate the $\sup$ over all mappings with the optimization of a model:

For estimating mutual informaiton, the last thing required is to construct joint distribution and marginal product distribution: given pairs of samples i.i.d $\sim \mu_{A,B}$

with $\sigma$ a random permutation (随机排列), we have the marginal product distribution

as i.i.d samples $\sim \mu_{A} \otimes \mu_{B}$, from which we have the estimation method as an optimization process:

InfoNCE

InfoNCE is another MI estimator used in representation learning, which is defined as,

where the expectation is over $K$ independent samples $\left\{\left(x_{i}, y_{i}\right)\right\}_{i=1}^{K}$ from the joint distribution $p(x, y)$.

MI Maximization

Let $\mathcal X$ and $\mathcal Y$ be the domain and range of a continuous and (almost everywhere) differentiable parametric function, $E_{\psi}: \mathcal{X} \rightarrow \mathcal{Y}$ with parameters $\psi$ (e.g., a neural network). Mutual Information Maximization is to: find the set of parameters, $\psi$, such that the mutual information, $\mathcal{I}\left(X ; E_{\psi}(X)\right)$, is maximized. This process is often used for finding representations for a given dataset. (This part refers to “Learning Deep Representations by Mutual Information Estimation and Maximization”)

InfoMax finds that exact KL-based formulation of MI is not necessary, for example, a simple alternative based on the Jensen-Shannon divergence (JSD) is more stable and provides better results. At a high level, we optimize $E_{\psi}$ by simultaneously estimating and maximizing $\mathcal{I}\left(X ; E_{\psi}(X)\right)$, where $G$ represents global maximization:

Global/Local-MI Maximization

The objective above can be used to maximize MI between input and output, but ultimately this may be undesirable depending on the tasks. Suppose the intermediate representation of the input preserves the characteristics of locality, which is represented as $C_{\psi}(x):=\left\{C_{\psi}^{(i)}\right\}_{i=1}^{M}$ , and $E_{\psi}(x)=f_{\psi} \circ C_{\psi}(x)$. Define our MI estimator on global/local pairs, maximizing the average estimated MI:

Multiview-MI Maximization

Multiview-MI Maximization includes various objectives among which the MI is computed into one unified framework. (refer to paper “On Mutual Information Maximization for Representation Learning”)

In the image classification setting, for a given image $X$, let $X^{(1)}$ and $X^{(2)}$ be different, possibly overlapping views of $X$, for instance the top and bottom halves of the image. These are encoded using encoders $g_1$ and $g_2$ respectively, and the MI between the two representationsg $g_1(X^{(1)})$ and $g_2(X^{(2)})$ is maximized,

where $\hat I$ represents the sample based MI estimator of the true MI $I(X;Y)$ and the function classes $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ can be used to specify structural constraints on the encoders. One thing to note here is that two encoders $g_1$ and $g_2$ often share parameters.

It gives us plenty of modeling flexibility, as the two views can be chosen to capture completely different aspects and modalities of the data, for example:

• In the basic form of DeepInfoMax $g_1$ extracts global features from the entire image $X^{(1)}$ and $g_2$ local features from image patches $X^{(2)}$, where $g_1$ and $g_2$ correspond to activations in different layers of the same convolutional network. (也就是上一节介绍的global/local-MI Maximization)
• Contrastive multiview coding (CMC) generalizes the objective in to consider multiple views $X^{(i)}$, where each $X^{(i)}$ corresponds to a different image modality (e.g., different color channels, or the image and its segmentation mask).
• Contrastive predictive coding(CPC) incorporates a sequential component of the data. Concretely, one extracts a sequence of patches from an image in some fixed order, maps each patch using an encoder, aggregates the resulting features of the first $t$ patches into a context vector, and maximizes the MI between the context and features extracted from the patch at position $t + k.$