Mutual Information Maximization - Experiments

字数统计: 3k阅读时长: 16 min

 2020/05/13   Share

Materials

MI Estimation

Correlated gaussian variables

Let $(X, Y)^T$ be a zero-mean Gaussian random vector with covariance matrix given by

$\mathbf{C}=\left(\begin{array}{cc}\sigma^{2} & \rho \sigma^{2} \\ \rho \sigma^{2} & \sigma^{2}\end{array}\right)$

Then the theoratical MI of the correated Gaussian distribution is given by

$\begin{align} I(X ; Y) &=h(Y)-h(Y | X) \\ &=h(Y)+h(X)-h(X, Y) \\ &=-\frac{1}{2} \ln \left(1-\rho^{2}\right) \end{align}$

需要注意的是，材料当中给出的公式是$\log$，但这log表示的就是以自然对数为底的，这一点和numpy中的log函数规范一致，即log2才表示已2为底的取对数。接着我们通过numpy简单生成correlated guassian samples.

mean = [0, 0]
cov = [[1, 0.6], [0.6,1]]
ro = cov[0][1] / cov[1][1]
mi = -0.5 * np.log(1-ro**2)

X, Y = np.random.multivariate_normal(mean, cov, 500).T
plt.plot(X, Y, '.')
plt.axis('equal')
plt.title('true-mi = {:.4}'.format(mi))
plt.show()

从而我们得到了一个带有Ground-truth MI的样本.

MI estimation with linear discriminator

We built a Linear-Discriminator-based MI estimator from scratch，其中采用的优化方法是梯度下降法。Linear discriminator会将两个sample $(a_n, b_n)$ 进行concatenation，通过一个简单的线性层即可就产生出scalar output. 因为是最简单的模型所以我们可以根据公式直接计算出每一步梯度的函数表示（具体看 KLD_estimate 和 gradient_descent 函数）。

class MIEstimator:
    def __init__(self, X, Y, lr=0.01):
        """
        X: N * D1
        Y: N * D2
        w: (D1+D2+1) * 1
        """
        N = len(X)
#         x_join_y = np.hstack([X, Y, np.ones([N,1])])
        x_join_y = np.hstack([X, Y])

#         x_prod_y = np.hstack([X, Y_, np.ones([N,1])])
        Y_ = np.vstack([np.array(Y) for _ in range(5)])
        np.random.shuffle(Y_)
        x_prod_y = np.hstack([np.vstack([np.array(X) for _ in range(5)]), Y_])
        
        w = np.ones(shape=[x_join_y.shape[1], 1])
        
        self.N = N
        self.x_join_y = x_join_y
        self.x_prod_y = x_prod_y
        self.w = w
        self.lr = lr
    
    @property
    def KLD_estimate(self):
        res =  np.matmul(self.x_join_y, self.w).mean()
        res -= np.log2(np.exp(np.matmul(self.x_prod_y, self.w)).mean())
        return res
    
    def grad_descent(self):
        delta_w = self.x_join_y.mean(axis=0, keepdims=True).T

        exp_fx = np.exp(np.matmul(self.x_prod_y, self.w))
        last_item = 1/np.log(2) * 1/exp_fx.sum() * np.matmul(self.x_prod_y.T, exp_fx)

        delta_w -= last_item
        self.delta_w = delta_w

        self.w += delta_w * self.lr
    
    def optimize(self):
        for _ in range(1000):
            self.grad_descent()
            print('------------------')
            print(self.KLD_estimate)
        
        return self.KLD_estimate

通过实验我们发现，简单的linear discriminator不能够完成MI Estimation的任务（几乎拟合的点都在0左右）。同时还有一个不符合常规的情况，那就是如果我们给这个线性模型加上bias，则bias项的gradient将会变成常数（可以数学推导），使得estimation无限增大。这说明了$sup$在所有函数中找到最优函数是一件非常困难的事情，使用更加复杂的模型来做estimation才是正确的道路。

MI estimation with neural networks

为了使用更加复杂的非线性函数，我们使用github for MINE中pytorch实现Neural Estimation for MI的代码.

class MINE(nn.Module):
    def __init__(self, hidden_size=10):
        super(MINE, self).__init__()
        self.layers= nn.Sequential(nn.Linear(2, hidden_size),
                                   nn.ReLU(),
                                   nn.Linear(hidden_size,hidden_size),
                                   nn.ReLU(),
                                   nn.Linear(hidden_size, 1)
                                  )

    def forward(self, x, y):
        batch_size = x.size(0)
        tiled_x = torch.cat([x, x,], dim=0)
        idx = torch.randperm(batch_size)
        
        shuffled_y = y[idx]
        concat_y = torch.cat([y, shuffled_y], dim=0)
        
        inputs = torch.cat([tiled_x, concat_y], dim=1)
        logits = self.layers(inputs)

        pred_xy = logits[:batch_size]
        pred_x_y = logits[batch_size:]
        loss = -(torch.mean(pred_xy) 
               - torch.log(torch.mean(torch.exp(pred_x_y))))

        return loss

在不同相关性上的Correlated Gaussian Distribution上进行estimation。非常凑巧的是，原文中也做了双变量高斯分布的实验，并且和其他的estimation算法进行比较，得到的结果也非常优秀：几乎和True MI完全重合.

# Hyperparameters
hidden_size=10
n_epoch = 500
data_size = 2000

# Create model
model = MINE(hidden_size)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
all_mi = []
true_mi = []

mean = [0, 0]
for i in range(6):
    # change the covariance matrix
    cov = [[1, 2*i/10], [2*i/10, 1]]
    ro = cov[0][1] / cov[1][1]
    mi = -0.5 * np.log(1-ro**2)
    
    # new training trajactory
    true_mi.append(mi)
    all_mi.append([])
    for epoch in tqdm(range(n_epoch)):
        X, Y = np.random.multivariate_normal(mean, cov, data_size).T
        x_sample = np.expand_dims(X, axis=1)
        y_sample = np.expand_dims(Y, axis=1)

        x_sample = torch.from_numpy(x_sample).float()
        y_sample = torch.from_numpy(y_sample).float()
        loss = model(x_sample, y_sample)

        model.zero_grad()
        loss.backward()
        optimizer.step()

        all_mi[-1].append(-loss.item())

print(true_mi)
print(all_mi)

最终利用以下code将结果进行可视化.

plt.rcParams["figure.figsize"] = (25,10) # Set the figure size
plt.suptitle('Simple NeuralNet MI Estimation') # super title for all subplots
for i in range(5):
    # return the axis object.
    ax = plt.subplot(2, 3, i+1)
    ax.set_ylim([-0.05, 0.8])
    ax.plot(range(len(all_mi[i])), all_mi[i], color='r', label='Estimated MI')
    ax.plot([0, len(all_mi[i])], [true_mi[i],true_mi[i]], color='b', label='Ground-truth MI')
#     ax.set_xlabel('training steps')
    ax.legend(loc='best')
    ax.set_title('ro={:.2f}'.format(0.2*i))

plt.show()

MI Maximization

Inductive unsupervised representation learning via MI maximization

Assume we have the simulated unsupervised dataset $\tilde X = g(X)$, without knowing the form or any other knowledge of generation function $g$.

The target of the representation learning is to learn a model $M_{\psi}: \tilde X \rightarrow \hat X$ which maximize the mutual information between $\tilde X$ and $\hat X$. By doing so, we hope the model $M_\psi$ will help come into use one day in the future (using the knowledge of unlabeled data for coming supervised learning).

Theoretical solution

Recall we use Donsker-Varadhan representation of KL divergence to estimate MI:

$\mathcal{I}(X ; Y):=\mathcal{D}_{K L}(\mathbb{I} \| \mathbb{M}) \geq \widehat{\mathcal{I}}_{\omega}^{(D V)}(X ; Y):=\mathbb{E}_{\mathbb{J}}\left[T_{\omega}(x, y)\right]-\log \mathbb{E}_{\mathbb{M}}\left[e^{T_{\omega}(x, y)}\right]$

The goal of learning the best moddel $M$ becomes simultaneously estimating and maximizing MI.

$(\hat{\omega}, \hat{\psi})_{G}=\underset{\omega, \psi}{\arg \max } \widehat{\mathcal{I}}_{\omega}\left(\tilde X ; M_{\psi}(\tilde X)\right)$

where $\hat I$ is the total loss computed completely without supervision, i.e., given only $\tilde X$.

Model implementation

Now we only need to treat them as a joint optimization problem. The internal logic of this unsupervised training setting is as follows:

we have two smaller models - a inductive representation generator and an MI estimator - residing in the main model;
when given a mini-batch of samples of $\tilde X$, the generator maps $\tilde X$ to $\hat X$;
then the MI estimator treat the $(\tilde X, \hat X)$ as two distribution, estimate the mutual information of them, or more precisely, the lower bound of the their mutual information;
- when optimizing the MI estimation, the gradient of the objective also flows back to the generator;
- that’s why we say they are trained jointly.

import torch.nn as nn 
import torch.nn.functional as F
import torch.optim as optim
from tqdm import tqdm

class MIMaximizer(nn.Module):
    def __init__(self, input_size=5, hidden_size=10):
        super(MIMaximizer, self).__init__()
        self.est_layers = nn.Sequential(nn.Linear(input_size+1, hidden_size),
                                        nn.ReLU(),
                                        nn.Linear(hidden_size,hidden_size),
                                        nn.ReLU(),                                   
                                        nn.Linear(hidden_size, 1)
                                       )
        
        # model that generates representation of given input x
        self.model_layers = nn.Sequential(nn.Linear(input_size, hidden_size),
                                         nn.ReLU(),
                                         nn.Linear(hidden_size,hidden_size),
                                         nn.ReLU(),                                   
                                         nn.Linear(hidden_size, 1)
                                        )
        

    def forward(self, x):
        y = self.model_layers(x)
        
        batch_size = x.size(0)
        tiled_x = torch.cat([x, x,], dim=0)
        idx = torch.randperm(batch_size)
        
        shuffled_y = y[idx]
        concat_y = torch.cat([y, shuffled_y], dim=0)
        
        inputs = torch.cat([tiled_x, concat_y], dim=1)
        logits = self.est_layers(inputs)

        pred_xy = logits[:batch_size]
        pred_x_y = logits[batch_size:]
        loss = -(torch.mean(pred_xy) 
               - torch.log(torch.mean(torch.exp(pred_x_y))))
        
        return y, loss

Simulating high-dimensional data

To simulate high-dimensional data, the generation function $g: x \rightarrow [f_1(x), f_2(x), \cdots, f_n(x)]$ is to generate high dimensional representation of $x$. For the sake of simplicity, the original $x$ is sampled from uniform distribution. And we can also add some noise into the model to make it more close to the real-world scenario.

One reason to play with such a toy setting is that now we can treat the problem of finding function $g^{-1}: [f_1(x), f_2(x), \cdots, f_n(x)] \rightarrow x$ as a supervised regression problem, and we have the ground-truth label of them!

The experimented data generation methods are as follows:

$x \rightarrow [x^1-x^2, x^2-x^3, \cdots, x^k-x^{k+1}, \cdots, x^{n-1}-x^n, x^n]$, when reversed as a regression task, is linear-solvable.
$x \rightarrow [x^1-x^2, x^2-x^3, \cdots, x^k-x^{k+1}, \cdots, x^{n-1}-x^n]$, when reversed as a regression task, is not linear-solvable, but can be easily approximated with linear model when $n$ is large.
$x \rightarrow [\sin(x), \cos(x), \log(2+x), x^2, \sinh(x), \cosh(x), \tanh(x)]$, where each non-linear function $f$ can be further mapped to even more complex space by $J: f \rightarrow [f(x)^{\frac{1}{i+1}}|x: x \in \operatorname{dom}_f]$

Sanity check

Target: check whether the algorithm works.
Method: using LR model to fit on the generated representation and the ground truth $(\tilde X, X)$, to see whether it’s MSE is minimized along the MI maximization.
Data: We use $x$ -> $[x^1-x^2, x^2-x^3, \cdots, x^k-x^{k+1}, \cdots, x^{n-1}-x^n, x^n]$ to generate high-dimensional representation of $x$.

Data Generation:

x = np.random.uniform(low=-1.,high=1.,size=3000).reshape([-1,1])

# generate x_representation based on [x^2, x^3, ... x^n]
n = 5
x1 = np.hstack([np.power(x, i) for i in range(2, 2 + n)])
x2 = np.hstack([np.power(x, i) for i in range(1, 1 + n)])
x_tilde = np.hstack([x2 - x1, np.power(x, n+1)])
input_size = n+1

The result of linear-solvable data $x \rightarrow [x^1-x^2, x^2-x^3, \cdots, x^k-x^{k+1}, \cdots, x^{n-1}-x^n, x^n]$ :

Influence of dimensionality of representation

Target: find how the dimension of the representation matters.
Method: use different output dimension.
Data: We use $x$ -> $[x^1-x^2, x^2-x^3, \cdots, x^k-x^{k+1}, \cdots, x^{n-1}-x^n]$ to generate high-dimensional representation of $x$, which is not linear-solvable.

Data Generation:

x = np.random.uniform(low=-1.,high=1.,size=3000).reshape([-1,1])

# generate x_representation based on [x^2, x^3, ... x^n]
n = 10
x1 = np.hstack([np.power(x, i) for i in range(2, 2 + n)])
x2 = np.hstack([np.power(x, i) for i in range(1, 1 + n)])
x_tilde = np.hstack([x2 - x1])
input_size = n

Define train() function, train the model across different dimensionality setting:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def train(x_tilde, n_epoch=1000, output_size=1, hidden_size=10, lr=0.01, log_every_n_steps=10):
    input_size = x_tilde.shape[1]
    
    # create the model 
    model = MIMaximizer(
        input_size=input_size, 
        hidden_size=hidden_size, 
        output_size=output_size
    )
    
    # print([_.size() for _ in model.parameters()])
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    all_mi = []
    best_linear_model_loss = []

    x_sample =  torch.from_numpy(x_tilde).float()
    for epoch in tqdm(range(n_epoch)):
        y, loss = model(x_sample)

        model.zero_grad()
        loss.backward()
        optimizer.step()

        if epoch % log_every_n_steps == 0:
            try:
                # record the MI
                all_mi.append(-loss.item())

                # find the best simple method modeling generated representation of x_tilde.
                y_ = y.detach().numpy()
                reg = LinearRegression().fit(y_, x)
                x_hat = reg.predict(y_)
                mse = mean_squared_error(x, x_hat)
                best_linear_model_loss.append(mse)

            except:
                print('nan!')
    
    return all_mi, best_linear_model_loss


hidden_size = 10
n_epoch = 10000
output_size = 5
lr = 0.01
log_every_n_steps = 10

all_mi_across_models = []
all_mse_across_models = []

for output_size in range (1,6):
    mis, mses = train(
        x_tilde=x_tilde,
        n_epoch=n_epoch,
        output_size=output_size,
        hidden_size=hidden_size,
        lr=lr,
        log_every_n_steps=10   
    )
    all_mi_across_models.append(mis)
    all_mse_across_models.append(mses)

Use the following code to visualize:

plt.rcParams["figure.figsize"] = (20,10)
plt.suptitle('Study of Dimension of Representation')

ax = plt.subplot(211)
for i in range(5):
    ax.plot(range(len(all_mi_across_models[i])), all_mi_across_models[i], label='Dim={}'.format(i+1))
    ax.legend(loc='best')
    ax.set_title('MI Maximization')
    
ax = plt.subplot(212)
for i in range(5):
    ax.plot(range(len(all_mse_across_models[i])), all_mse_across_models[i], label='Dim={}'.format(i+1))
    ax.legend(loc='best')
    ax.set_title('Best LR-MSE')

(1) Data without noise:

(2) Data with noise (scale=0.05):

Unsupervised Representation Learning

Target: Study the unsupervised learning using MI maximization.
Method: unsupervised training dataset, supervised training dataset, testset.
Data: We create the dataset (unsupervised/supervised/test dataset)

some tricks regarding generating a list of functions: 廖雪峰python教程

Data Generation given a list of functions:

def gen_data(size, g_fns, split=(0.9, 0.05, 0.05), noise_scale=0.05):
    x = np.random.uniform(low=-1.,high=1.,size=size).reshape([-1,1])
    
    # g_fns should be numpy functions which support broadcasting. 
    x_tilde = np.hstack([fn(x) for fn in g_fns])
    noise = noise_scale * np.random.normal(loc=0, scale=1, size=x_tilde.shape)
    x_tilde += noise
    
    # split the data
    unsup_x = x_tilde[:int(split[0]*size),:]
    
    sup_x = x_tilde[int(split[0]*size):int((1-split[2])*size),:]
    sup_y = x[int(split[0]*size):int((1-split[2])*size),:]
    
    test_x = x_tilde[int((1-split[2])*size):,:]
    test_y = x[int((1-split[2])*size):,:]
    
    return unsup_x, sup_x, sup_y, test_x, test_y


# some tricks for generating the function list
g_fns = []
for i in range(1, 20):
    def f(i):
        def j(x):
            return np.power(x, i-1) - np.power(x, i)
        return j
        
    g_fns.append(f(i))

unsup_x, train_x, train_y, test_x, test_y = gen_data(1000, g_fns=g_fns, noise_scale=0.2)

We use the method introduced above to generate data of 20 dimension to evaluate the performance of the MI maximization method. Since the higher the dimension, the more linear-solvable the data is, we can expect that simple Linear Regression would solve it very well. Here is the result on different setting of noise.

Results on linear-solvable data - study of noise

Results on non-linear-solvable data

Replace $\tilde X$ with non-linearity following:

root_g_fns = [np.sin, np.cos, lambda x: np.log(2+x), np.square, np.sinh, np.cosh, np.tanh]
for root_f in root_g_fns:
    for i in range(1, 5):
        def f(i):
            def j(x):
                return np.sign(root_f(x)) * ((np.abs(root_f(x))) ** (1/(i+1)))
            return j
        g_fns.append(f(i))

以上的结果是在数据维度为30维左右的，我们继续增大simulated data的维度到130维，我们可以发现，普通的LR模型会显著过拟合，而采用MI无监督学习过后的模型有更好的generalization能力。

因为维度的高低是和数据量共同影响model效果的，所以我们想再探究一下稍微大一些的数据量的表现，如下图所示，我们发现即便数据量是维度的4倍左右（500/133），SL也并没有经过无监督学习过后的模型表现更好。

Results on shifted and scaled unsupervised data - study of dimensionality

Now all the data points are drawn from the exact same distribution, it’s time to check how the algorithm behave when there is shift/scale of the unsupervised data distribution.

def gen_data(size, g_fns, split=(0.9, 0.05, 0.05), noise_scale=0.05, unsup_scale=(1., 1.5), unsup_shift=0.5):
    # ... omitted. 
    
    # split the data
    unsup_x = x_tilde[:int(split[0]*size),:]
    # add some distribution shift/scale to the unsupervised dataset
    unsup_scale = np.random.uniform(low=unsup_scale[0], high=unsup_scale[1], size=unsup_x.shape[0]).reshape([-1,1])
    unsup_shift = np.random.normal(loc=unsup_shift, scale=0.05)
    unsup_x = unsup_scale * unsup_x + unsup_shift
    
    # ... omitted.