N1H111SM's Miniverse

2020/02/17

# Divergence

Definition (Probability Simplex). For any finite alphabet $\mathcal{X}$ with $|\mathcal{X}|=m$, the probability simplex $\mathcal P^{\mathcal X}$ is the set

$$\mathcal P^{\mathcal X}=\Big\{p \in \mathbb R^{m} : p(x) \geq 0 \ \text{for all } x \in \mathcal X,\ \textstyle\sum_{x \in \mathcal X} p(x)=1\Big\}.$$

Definition (Kullback-Leibler Divergence). For $p, q \in \mathcal{P}^{\mathcal X}$, the Kullback-Leibler (KL) divergence of $q$ from $p$ is

$$d(p, q)=\sum_{x \in \mathcal X} p(x) \log \frac{p(x)}{q(x)},$$

where $\log$ denotes the natural logarithm $\ln$, and we adopt the conventions $\log \infty=\infty$, $\log 0=-\infty$, $0/0=0$, and $0\times \infty=0$.
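As a quick numerical illustration (not part of the original notes), the definition can be implemented directly in numpy, handling the $0 \times \log(\cdot) = 0$ convention explicitly:

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence d(p, q) in nats, with the 0 * log(0/q) = 0 convention."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)  # about 0.51 nats
d_qp = kl_divergence(q, p)  # about 0.37 nats
```

Note that $d$ is not symmetric: $d(p, q) \neq d(q, p)$ in general, which is why it is a divergence rather than a metric.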

Theorem (Gibbs’ Inequality). For all $p, q \in \mathcal{P}^{\mathcal{X}}$ , it holds that $d(p, q) \geq 0$, with equality iff $p = q$.

• Equivalently, for any true distribution $p$, $\min_{q} d(p,q) = 0$, with the minimum attained exactly at $q=p$.
• One proof: use Lagrange multipliers on the simplex constraint and find the point where the gradient vanishes.
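A minimal numerical sanity check of Gibbs' inequality (my own sketch, not from the notes): sample random distributions from the simplex and confirm the divergence is nonnegative, and exactly zero at $q = p$.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    # KL divergence in nats, with the 0 * log(0/q) = 0 convention
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def random_simplex(m, rng):
    # Dirichlet(1, ..., 1) samples uniformly from the probability simplex
    return rng.dirichlet(np.ones(m))

p = random_simplex(5, rng)
divs = [kl(p, random_simplex(5, rng)) for _ in range(1000)]
min_div = min(divs)   # strictly positive: no random q matched p
self_div = kl(p, p)   # exactly 0 at q = p
```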

## Jensen-Shannon Divergence

Definition (Jensen-Shannon Divergence). The Jensen–Shannon divergence (JSD) is a symmetrized and smoothed version of the Kullback–Leibler divergence $d(P, Q)$. It is defined by

$$\mathrm{JSD}(P, Q)=\frac{1}{2} d(P, M)+\frac{1}{2} d(Q, M),$$

where $M=\frac{1}{2}(P+Q)$. The Jensen–Shannon divergence is bounded above by 1 for two probability distributions, provided one uses the base-2 logarithm.
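A short sketch (an illustration I added, not from the notes) showing the JSD in bits is symmetric and attains its upper bound of 1 on distributions with disjoint supports:

```python
import numpy as np

def kl2(p, q):
    # KL divergence in bits (base-2 log), with the 0 log 0 = 0 convention
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)  # the mixture M = (P + Q) / 2
    return 0.5 * kl2(p, m) + 0.5 * kl2(q, m)

# Disjoint supports attain the upper bound of 1 bit
a = [1.0, 0.0]
b = [0.0, 1.0]
```

Unlike the raw KL divergence, the JSD is always finite, because both $P$ and $Q$ are absolutely continuous with respect to the mixture $M$.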

# Entropy

Definition (Shannon Entropy). The Shannon entropy of $p \in \mathcal{P}^{\mathcal X}$ is

$$H(p)=-\sum_{x \in \mathcal X} p(x) \log p(x).$$

Theorem The Shannon entropy is related to the divergence according to the formula

$$H(p)=\log m-d(p, u),$$

where $u \in \mathcal P^{\mathcal X}$ is the uniform distribution, $u(x) = 1/m$.
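A numerical check of this identity between entropy and the divergence from the uniform distribution (my own sketch, under the convention that $\log = \ln$):

```python
import numpy as np

def entropy(p):
    # Shannon entropy in nats, with the 0 log 0 = 0 convention
    p = np.asarray(p, float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.25, 0.25])
m = len(p)
u = np.full(m, 1.0 / m)        # uniform distribution on the alphabet
lhs = entropy(p)               # H(p)
rhs = np.log(m) - kl(p, u)     # log m - d(p, u)
```

In particular, entropy is maximized at $\log m$ exactly when $p = u$, since that is where $d(p, u) = 0$.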

## Conditional Entropy

• $p_{X} \in \mathcal{P}^{\mathcal X}$: the distribution of a random variable $X$ on the alphabet $\mathcal X$
• $p_{Y} \in \mathcal{P}^{\mathcal Y}$: the distribution of a random variable $Y$ on the alphabet $\mathcal Y$
• $p_{X,Y} \in \mathcal{P}^{\mathcal X \times \mathcal Y}$: a joint distribution over the two alphabets
• $p_{X} \otimes p_{Y} \in \mathcal{P}^{\mathcal{X} \times \mathcal Y}$: defined as the product of the marginals, i.e. $\left(p_{X} \otimes p_{Y}\right)(x, y)= p_{X}(x) p_{Y}(y)$.

Definition (Conditional Entropy). A first attempt at defining the conditional entropy of $X$ given $Y$ is

$$\tilde H(X \mid Y=y)=H\left(p_{X \mid Y=y}\right)=-\sum_{x \in \mathcal X} p_{X \mid Y}(x \mid y) \log p_{X \mid Y}(x \mid y).$$

However, a moment's thought makes clear that this definition isn't appropriate, because it doesn't include any information about the distribution of $Y$. If $Y$ is concentrated around some very informative (or uninformative) values, then $\tilde H$ won't notice that some values are more valuable than others. The appropriate definition averages over $Y$:

$$H(X \mid Y)=\sum_{y \in \mathcal Y} p_{Y}(y)\, \tilde H(X \mid Y=y)=-\sum_{x, y} p_{X, Y}(x, y) \log p_{X \mid Y}(x \mid y).$$

Theorem The conditional entropy is related to the unconditional entropy as

$$H(X \mid Y)=H(X, Y)-H(Y),$$

where $H(X,Y)$ is the entropy of the joint distribution $p_{X,Y}$.

The uncertainty you have about $X$ given that you’ve been told $Y$ is equal to the uncertainty you had about both $X$ and $Y$ , less the uncertainty that was resolved when you learned $Y$.
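The chain rule can be verified on a small joint distribution; this is a sketch I added, with the joint table chosen arbitrarily for illustration:

```python
import numpy as np

def entropy(p):
    # Shannon entropy in nats; works on joint tables via ravel()
    p = np.asarray(p, float).ravel()
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

# A small joint distribution p_{X,Y}; rows index x, columns index y
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_y = p_xy.sum(axis=0)  # marginal of Y

# Direct definition: H(X|Y) = sum_y p_Y(y) H(X | Y = y)
h_x_given_y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j])
                  for j in range(len(p_y)))

# Chain rule: H(X|Y) = H(X, Y) - H(Y)
chain = entropy(p_xy) - entropy(p_y)
```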

Theorem (Side Information Reduces Uncertainty). For any random variables $X$ and $Y$,

$$H(X \mid Y) \leq H(X),$$

with equality iff $X$ and $Y$ are independent.

# Information

Definition (Mutual Information). The mutual information $I(X,Y)$ in $Y$ about $X$ is

$$I(X, Y)=H(X)-H(X \mid Y).$$

Theorem The mutual information may also be written as

$$I(X, Y)=d\left(p_{X, Y},\, p_{X} \otimes p_{Y}\right)=H(X)+H(Y)-H(X, Y).$$

Mutual information is in many contexts also called information gain, and it measures the dependence between two random variables: the larger the mutual information, the stronger the dependence. The first equality above makes this concrete: the larger the divergence between the joint and the product of the marginals, the stronger the dependence between the two variables. Meanwhile, swapping $X$ and $Y$ in that first equality yields the following:

Corollary The mutual information is symmetric:

$$I(X, Y)=I(Y, X).$$
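The divergence form and the entropy form can both be checked numerically; this sketch (mine, with an arbitrary illustrative joint table) also demonstrates the symmetry by transposing the joint:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float).ravel()
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

def kl(p, q):
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I(X, Y) as the divergence between joint and product of marginals
i_div = kl(p_xy, np.outer(p_x, p_y))
# I(X, Y) = H(X) + H(Y) - H(X, Y)
i_ent = entropy(p_x) + entropy(p_y) - entropy(p_xy)
# Symmetry: transposing the joint table swaps the roles of X and Y
i_sym = kl(p_xy.T, np.outer(p_y, p_x))
```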

Theorem (Data Processing Inequality). Let $X$ and $Y$ be random variables, and let $Z = g(Y)$, where $g$ is any function $g : \mathcal Y \rightarrow \mathcal Z$. Then,

$$I(X, Z) \leq I(X, Y).$$
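Intuitively, deterministic processing of $Y$ can only destroy information about $X$. A sketch I added: take $g$ to merge two values of $Y$, build the induced joint $p_{X,Z}$, and compare mutual informations (the joint table and $g$ are arbitrary choices for illustration).

```python
import numpy as np

def mutual_info(p_xy):
    # I(X, Y) = d(p_{X,Y}, p_X (x) p_Y), in nats
    p_xy = np.asarray(p_xy, float)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    prod = np.outer(p_x, p_y)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / prod[mask])))

# Joint distribution of (X, Y), with Y taking 3 values
p_xy = np.array([[0.2, 0.1, 0.1],
                 [0.1, 0.3, 0.2]])

# Deterministic processing Z = g(Y): merge the last two values of Y
g = [0, 1, 1]
p_xz = np.zeros((2, 2))
for y, z in enumerate(g):
    p_xz[:, z] += p_xy[:, y]

i_xy = mutual_info(p_xy)
i_xz = mutual_info(p_xz)  # data processing: I(X, Z) <= I(X, Y)
```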