**Materials**

- Wikipedia JS divergence
- paper “InfoGraph - Unsupervised and Semi-Supervised Graph-Level Representation Learning via Mutual Information Maximization”
- paper “Mutual Information Neural Estimation”
- paper “Learning Deep Representations by Mutual Information Estimation and Maximization”
- slides on MI estimation
- paper “On Mutual Information Maximization for Representation Learning”

# MI Definition

Recall the mutual information can be computed by information entropy or **by KL divergence**, the latter of which is often used for estimation.

# MI Estimation

When it’s for the discrete variables, mutual information is easy to compute. But when in the continuous case, the standard approach to estimating it is with parametric density models (e.g., Gaussian Distribution), which is unsatisfactory for real-world high-dimension signals.

Given the fundamental limitations of MI estimation, recent work has focused on deriving lower bounds on MI. Intuitively, these bounds are based on the following idea: If a classifier can accurately distinguish between samples drawn from the joint $p(x, y)$ and those drawn from the product of marginals $p(x)p(y)$, then $X$ and $Y$ have a high MI.

## The Donsker-Varadhan representation.

The following theorem gives a representation of the KL divergence.

where the **supremum is taken over all functions** $T$ such that the two expectations are finite.

挑选出一个使$\mathbb{E}_{\mathbb{P}}[T]-\log \left(\mathbb{E}_{\mathbb{Q}}\left[e^{T}\right]\right)$最大的函数，我们只需要构造一个在每一个自变量取值处都取到函数最大值的最优函数$T$. 从而可以证明这条式子是成立的（在离散情况下）. Donsker-Varadhan representation提供了一个紧的下界，为了estimate KL divergence，我们只需要将这个实值函数$T$参数化，去maximize对应的函数值即可：

Given pairs of samples $x_n \sim \mathbb P, y_n \sim \mathbb Q, n=1, \cdots, N$. we can estimate the $\sup$ over all mappings with the optimization of a model:

For estimating mutual informaiton, the last thing required is to construct joint distribution and marginal product distribution: given pairs of samples i.i.d $\sim \mu_{A,B}$

with $\sigma$ a random permutation (随机排列), we have the marginal product distribution

as i.i.d samples $\sim \mu_{A} \otimes \mu_{B}$, from which we have the estimation method as an optimization process:

## InfoNCE

InfoNCE is another MI estimator used in representation learning, which is defined as,

where the expectation is over $K$ independent samples $\left\{\left(x_{i}, y_{i}\right)\right\}_{i=1}^{K}$ from the joint distribution $p(x, y)$.

# MI Maximization

Let $\mathcal X$ and $\mathcal Y$ be the domain and range of a continuous and (almost everywhere) differentiable parametric function, $E_{\psi}: \mathcal{X} \rightarrow \mathcal{Y}$ with parameters $\psi$ (e.g., a neural network). **Mutual Information Maximization** is to: find the set of parameters, $\psi$, such that the mutual information, $\mathcal{I}\left(X ; E_{\psi}(X)\right)$, is maximized. This process is often used for finding representations for a given dataset. (This part refers to “Learning Deep Representations by Mutual Information Estimation and Maximization”)

**InfoMax** finds that exact KL-based formulation of MI is not necessary, for example, a simple alternative based on the Jensen-Shannon divergence (JSD) is more stable and provides better results. At a high level, we optimize $E_{\psi}$ by **simultaneously estimating and maximizing** $\mathcal{I}\left(X ; E_{\psi}(X)\right)$, where $G$ represents global maximization:

根据fGAN中提出的框架，我们可以构造一个Jensen-Shannon MI Estimator:

## Global/Local-MI Maximization

The objective above can be used to maximize MI between input and output, but ultimately this may be undesirable depending on the tasks. Suppose the intermediate representation of the input preserves the characteristics of locality, which is represented as $C_{\psi}(x):=\left\{C_{\psi}^{(i)}\right\}_{i=1}^{M}$ , and $E_{\psi}(x)=f_{\psi} \circ C_{\psi}(x)$. Define our MI estimator on global/local pairs, maximizing the average estimated MI:

## Multiview-MI Maximization

**Multiview-MI Maximization includes various objectives among which the MI is computed into one unified framework.** (refer to paper “On Mutual Information Maximization for Representation Learning”)

In the image classification setting, for a given image $X$, let $X^{(1)}$ and $X^{(2)}$ be different, possibly overlapping views of $X$, for instance the top and bottom halves of the image. These are encoded using encoders $g_1$ and $g_2$ respectively, and the MI between the two representationsg $g_1(X^{(1)})$ and $g_2(X^{(2)})$ is maximized,

where $\hat I$ represents the sample based MI estimator of the true MI $I(X;Y)$ and the function classes $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ can be used to specify structural constraints on the encoders. One thing to note here is that two encoders $g_1$ and $g_2$ often share parameters.

It gives us plenty of modeling flexibility, as the two **views** can be chosen to capture completely different aspects and modalities of the data, for example:

- In the basic form of
**DeepInfoMax**$g_1$ extracts global features from the entire image $X^{(1)}$ and $g_2$ local features from image patches $X^{(2)}$, where $g_1$ and $g_2$ correspond to activations in different layers of the same convolutional network. (也就是上一节介绍的global/local-MI Maximization) **Contrastive multiview coding (CMC)**generalizes the objective in to consider multiple views $X^{(i)}$, where each $X^{(i)}$ corresponds to a different image modality (e.g., different color channels, or the image and its segmentation mask).**Contrastive predictive coding(CPC)**incorporates a sequential component of the data. Concretely, one extracts a sequence of patches from an image in some fixed order, maps each patch using an encoder, aggregates the resulting features of the first $t$ patches into a context vector, and maximizes the MI between the context and features extracted from the patch at position $t + k.$