N1H111SM's Miniverse

2020/05/25


# Motivation

## Structure of Introduction

• Autoencoders treat all bits equally.
• We revisit the classic hypothesis that the good bits are the ones that are shared between multiple views of the world. This hypothesis corresponds to the inductive bias that the way you view a scene should not affect its semantics.
• Our goal is therefore to learn representations that capture information shared between multiple sensory channels but that are otherwise compact (i.e. discard channel-specific nuisance factors).
• Our main contribution is to set up a framework to extend these ideas to any number of views.

# Method

## Predictive Learning

Autoencoder-style methods fall under the umbrella of predictive learning. The biggest problem with this family of methods is that the objective assumes the output pixels are independent of one another, thereby reducing their ability to model correlations or complex structure. Why is it called predictive learning? Because in the multiview setting we want to build a mapping view1 → representation → view2 and minimize the loss of this prediction. This form is quite similar to an autoencoder; the relationship between multiview predictive learning and autoencoders is illustrated below:

The good bits are the ones that are shared between multiple views of the world.
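To make the independence assumption concrete, here is a minimal NumPy sketch of multiview predictive learning (not the paper's implementation; the linear `encode`/`decode` maps and all shapes are illustrative assumptions). The per-pixel MSE objective scores every output dimension separately, which is exactly the pixel-independence assumption discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(v1, W_enc):
    # view1 -> latent representation (a linear map with tanh, for illustration)
    return np.tanh(v1 @ W_enc)

def decode(z, W_dec):
    # latent representation -> predicted view2
    return z @ W_dec

def predictive_loss(v1, v2, W_enc, W_dec):
    # Per-pixel MSE: each output dimension contributes an independent term,
    # so correlations between pixels of view2 are never modeled.
    v2_hat = decode(encode(v1, W_enc), W_dec)
    return np.mean((v2_hat - v2) ** 2)

# Toy example: 8 samples, 16-dim views, 4-dim latent bottleneck.
v1 = rng.normal(size=(8, 16))
v2 = rng.normal(size=(8, 16))
W_enc = rng.normal(size=(16, 4)) * 0.1
W_dec = rng.normal(size=(4, 16)) * 0.1
loss = predictive_loss(v1, v2, W_enc, W_dec)
print(loss)
```

In practice the encoder and decoder would be deep networks trained by gradient descent; the structure of the objective (a sum of independent per-pixel terms) is unchanged.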

## Contrastive Learning

The basic idea of multiview contrastive learning is the following: treat different views of the same sample as positive pairs $\left\{v_{1}^{i}, v_{2}^{i}\right\}_{i=1}^{N}$, and, for the $i$-th sample, construct negative pairs from views of different samples, $\left\{v_{1}^{i}, v_{2}^{j}\right\}_{j=1}^{K}$ with $j \neq i$. A critic function $h_\theta$ is then trained to distinguish positive pairs from negative pairs, and the representation is obtained as a byproduct. The contrastive loss is constructed as follows:
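In the notation above, the loss scores the positive pair against the $K$ negatives with a softmax-style objective (hedging on the exact normalization used in the paper):

$$\mathcal{L}_{contrast}^{V_{1}, V_{2}}=-\mathbb{E}\left[\log \frac{h_{\theta}\left(\left\{v_{1}^{i}, v_{2}^{i}\right\}\right)}{\sum_{j=1}^{K+1} h_{\theta}\left(\left\{v_{1}^{i}, v_{2}^{j}\right\}\right)}\right]$$

A minimal NumPy sketch, assuming a cosine-similarity critic with temperature `tau` (a common choice, but an assumption here), where each in-batch sample serves as a negative for the others:

```python
import numpy as np

def critic(z1, z2, tau=0.07):
    # h_theta: cosine similarity between embeddings, scaled by temperature tau
    z1 = z1 / np.linalg.norm(z1, axis=-1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=-1, keepdims=True)
    return z1 @ z2.T / tau

def contrastive_loss(z1, z2):
    # Row i: positive pair (v1^i, v2^i) on the diagonal, negatives (v1^i, v2^j)
    # off the diagonal. Loss is the negative log-softmax of the diagonal.
    logits = critic(z1, z2)                              # (N, N) score matrix
    m = logits.max(axis=1, keepdims=True)                # for numerical stability
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 32))
z2 = z1 + 0.01 * rng.normal(size=(8, 32))   # views that share their content
print(contrastive_loss(z1, z2))
```

When the two views encode the same content (as in the toy example), the positive scores dominate and the loss approaches zero; for unrelated views it sits near $\log N$.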

# Experiments

• Two established image representation learning benchmarks: ImageNet and STL-10
• Video representation learning tasks with 2 views: image and optical flow modalities
• More than 2 views.

### How does mutual information affect representation quality?

Here we see that views with too little or too much MI perform worse; a sweet spot in the middle exists which gives the best representation. That there exists such a sweet spot should be expected. If two views share no information, then, in principle, there is no incentive for CMC to learn anything. If two views share all their information, no nuisances are discarded and we arrive back at something akin to an autoencoder or generative model, that simply tries to represent all the bits in the multiview data.

These experiments demonstrate that the relationship between mutual information and representation quality is meaningful but not direct. Selecting optimal views, which share just the relevant signal, may be a fruitful direction for future research.