N1H111SM's Miniverse

2020/05/22


# Latent Variable Models

## Setting

Formally, say we have a vector of latent variables $z$ in a high-dimensional space $\mathcal Z$ which we can easily sample according to some probability density function (PDF) $P(z)$ defined over $\mathcal Z$. Then, say we have a family of deterministic functions $f(z; \theta)$, parameterized by a vector $\theta$ in some space $\Theta$, where $f:\mathcal{Z} \times \Theta \rightarrow \mathcal{X}$.

## Target

We wish to optimize $\theta$ such that we can sample $z$ from $P(z)$ and, with high probability, $f(z; \theta)$ will be like the $X$'s in our dataset. To make this notion precise mathematically, we are aiming to maximize the probability of each $X$ in the training set under the entire generative process, according to:

$$P(X) = \int P(X|z; \theta) P(z) \, dz$$

In VAEs, the choice of this output distribution is often Gaussian, i.e.,

$$P(X|z; \theta) = \mathcal{N}\left(X \mid f(z; \theta),\, \sigma^2 \cdot I\right)$$
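
This generative process can be sketched directly. The following is a minimal illustration, assuming a toy affine-plus-tanh decoder in place of the neural network $f(z; \theta)$ (the function names and shapes are made up for the example):

```python
import numpy as np

def f(z, theta):
    # Toy deterministic decoder f(z; theta): an affine map followed by tanh.
    # In a real VAE this would be a neural network with parameters theta.
    W, b = theta
    return np.tanh(z @ W) + b

def sample_x(theta, dim_z, sigma=0.1, rng=None):
    # Generative process: z ~ P(z) = N(0, I), then X ~ N(f(z; theta), sigma^2 I).
    rng = rng if rng is not None else np.random.default_rng(0)
    z = rng.standard_normal(dim_z)
    mean = f(z, theta)
    return mean + sigma * rng.standard_normal(mean.shape)

rng = np.random.default_rng(42)
theta = (rng.standard_normal((2, 3)), np.zeros(3))
x = sample_x(theta, dim_z=2)  # one sample X in a 3-dimensional data space
```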

# Variational Autoencoders

The mathematical derivation of VAEs actually has little to do with autoencoders. The reason they are called VAEs is that the training objective derived from this setup consists of two parts, an encoder and a decoder, and therefore takes the form of an autoencoder.

A VAE must solve the following two problems: (1) how to define the latent variable $z$; (2) how to handle the integral over $z$.

### How to define the latent variable

- Avoid deciding by hand what information each dimension of $z$ encodes.
- Avoid explicitly describing the dependencies (i.e., the latent structure) between the dimensions of $z$.
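
The standard VAE answer to both points is to draw $z \sim \mathcal N(0, I)$ and let a sufficiently expressive deterministic map create all the structure. A minimal illustration, with a made-up ring-shaped target distribution (the map itself is purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)

# z is drawn from a structureless standard normal: no dimension is
# hand-assigned a meaning, and the dimensions are independent.
z = rng.standard_normal((10000, 2))

def ring_map(z):
    # A deterministic map pushing N(0, I) onto a ring-shaped distribution.
    # All the latent structure is created by the map, not by P(z).
    radius = 2.0 + 0.1 * z[:, 0]       # radius concentrated near 2
    angle = np.pi * np.tanh(z[:, 1])   # angle spread over (-pi, pi)
    return np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)

x = ring_map(z)
radii = np.linalg.norm(x, axis=1)  # samples concentrate near radius 2
```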

## Setting up the Objective

The key idea behind the variational autoencoder is to attempt to sample values of $z$ that are likely to have produced $X$, and compute $P(X)$ just from those. This means that we need a new function $Q(z|X)$ which can take a value of $X$ and give us a distribution over $z$ values that are likely to produce $X$. With the help of the distribution $Q$, it becomes easy to compute $E_{z \sim Q} P(X|z)$; but this is only an estimate of $P(X)$ under the latent variable distribution $Q$, and there is a gap between it and the true $P(X)$. To reach the final goal of optimizing $P(X)$, we need to relate $E_{z \sim Q} P(X|z)$ and $P(X)$.
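
A sketch of such a $Q(z|X)$, using an illustrative (not canonical) linear encoder that outputs the mean and log-variance of a diagonal Gaussian over $z$:

```python
import numpy as np

def encoder(x, phi):
    # Toy Q(z|X): maps x to the mean and log-variance of a diagonal Gaussian.
    # In practice this is a neural network with parameters phi.
    W_mu, W_logvar = phi
    return x @ W_mu, x @ W_logvar

def sample_z_given_x(x, phi, n_samples, rng):
    # Draw z-values that are likely (under Q) to have produced this x,
    # instead of sampling blindly from the prior P(z).
    mu, logvar = encoder(x, phi)
    std = np.exp(0.5 * logvar)
    return mu + std * rng.standard_normal((n_samples, mu.shape[0]))

rng = np.random.default_rng(0)
phi = (rng.standard_normal((3, 2)), np.zeros((3, 2)))
x = np.ones(3)
zs = sample_z_given_x(x, phi, n_samples=5, rng=rng)
```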

$$\log P(X) - D\left[Q(z|X) \,\|\, P(z|X)\right] = E_{z \sim Q}\left[\log P(X|z)\right] - D\left[Q(z|X) \,\|\, P(z)\right]$$

This equation is the core of the variational autoencoder, and it's worth spending some time thinking about what it says. In two sentences, the left hand side has the quantity we want to maximize: $\log P(X)$ (plus an error term, which makes $Q$ produce $z$'s that can reproduce a given $X$; this term will become small if $Q$ is high-capacity). The right hand side is something we can optimize via stochastic gradient descent given the right choice of $Q$ (although it may not be obvious yet how). Note that the framework, in particular the right hand side of the equation, has suddenly taken a form which looks like an autoencoder, since $Q$ is "encoding" $X$ into $z$, and $P$ is "decoding" it to reconstruct $X$. We'll explore this connection in more detail later.
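
The right hand side can be estimated directly. A sketch with diagonal Gaussians, using the closed-form KL term and a single-sample Monte Carlo estimate of the expectation (the function names, shapes, and the toy decoder are illustrative):

```python
import numpy as np

def gaussian_kl(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dims.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def elbo_estimate(x, mu_q, logvar_q, decode, sigma=1.0, rng=None):
    # One-sample Monte Carlo estimate of E_{z~Q}[log P(X|z)] - KL(Q || P(z)).
    rng = rng if rng is not None else np.random.default_rng(0)
    z = mu_q + np.exp(0.5 * logvar_q) * rng.standard_normal(mu_q.shape)
    mean_x = decode(z)
    d = x.shape[0]
    # Log-density of X under N(f(z), sigma^2 I): the "reconstruction" term.
    log_px_given_z = -0.5 * (np.sum((x - mean_x) ** 2) / sigma**2
                             + d * np.log(2 * np.pi * sigma**2))
    return log_px_given_z - gaussian_kl(mu_q, logvar_q)
```

Maximizing this estimate over the parameters of both the encoder (which produces `mu_q`, `logvar_q`) and the decoder is the VAE training objective.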

## Optimizing the Objective

There is, however, a significant problem with optimizing this objective naively. $E_{z \sim Q}[\log P(X|z)]$ depends not just on the parameters of $P$, but also on the parameters of $Q$. However, if we approximate the expectation by drawing a single sample $z \sim Q$ and taking the gradient of $\log P(X|z)$, this dependency disappears: the gradient no longer accounts for how the sampling itself depends on $Q$'s parameters. In order to make VAEs work, it's essential to drive $Q$ to produce codes for $X$ that $P$ can reliably decode.
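
The standard fix in VAEs is the reparameterization trick: rewrite the sample as a deterministic function of $Q$'s parameters plus parameter-free noise, so that gradients can flow back into $Q$. A minimal sketch (the particular numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, logvar = np.array([0.5, -0.5]), np.array([0.0, 0.0])

# Opaque sampling: the dependence on (mu, logvar) is hidden inside the
# sampler, so no gradient w.r.t. them survives.
z_opaque = rng.normal(mu, np.exp(0.5 * logvar))

# Reparameterized sampling: z = mu + sigma * eps with eps ~ N(0, I).
# Now z is an explicit, differentiable function of (mu, logvar), so the
# gradient of log P(X|z) can propagate back into Q's parameters.
eps = rng.standard_normal(2)
z_reparam = mu + np.exp(0.5 * logvar) * eps
```

Both samplers draw from the same distribution $\mathcal N(\mu, \operatorname{diag}(e^{\text{logvar}}))$; only the reparameterized form exposes the parameters to the gradient.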