Introduction
Autoencoders are unsupervised neural network models commonly used for tasks like dimensionality reduction and representation learning. In a typical autoencoder, the network is trained to minimize the reconstruction error between the input $ \mathbf{x} $ and its reconstruction $ \mathbf{x}' $. Both the encoder and decoder can be designed as single- or multi-layer networks. While a symmetric architecture–using the same number of layers in both the encoder and decoder–is common, it is not mandatory.
In the encoding stage, the network gradually reduces the dimensionality of $ \mathbf{x} $, compressing it into a latent representation $ \mathbf{z} $. The decoder then takes this compressed representation and attempts to reconstruct the original input $ \mathbf{x} $. If the decoder reconstructs $ \mathbf{x} $ successfully, it implies that $ \mathbf{z} $ has captured the essential features of the input, making it an effective representation. This is useful for model construction: we can extract the latent $ \mathbf{z} $ and use it as a set of covariates for regression or classification tasks.
The sole objective of an autoencoder is to minimize the reconstruction error $ || \mathbf{x}-\mathbf{x}' ||^2 $. This means that an autoencoder does not care how the latent space is organized, as long as the reconstruction error is minimized. Without explicit regularization or a probabilistic prior, there is nothing that forces the latent space to be "meaningful", e.g., smooth or continuous.
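As a concrete illustration, here is a minimal sketch of such an autoencoder in PyTorch. The input dimension (784, e.g., a flattened image), latent dimension (32), hidden sizes, and activations are illustrative assumptions, not specifics from this post.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal fully connected autoencoder: x -> z -> x'."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compresses x into the latent representation z.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstructs x' from z.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
x = torch.randn(16, 784)                   # dummy batch of inputs
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)    # mean squared reconstruction error
loss.backward()
```

The latent codes `model.encoder(x)` are what one would extract and reuse as covariates for a downstream regression or classification model.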
Variational autoencoder
The variational autoencoder (VAE) is an extension of the traditional autoencoder that introduces a probabilistic framework. Instead of encoding the input $ \mathbf{x} $ into a single point in the latent space, the encoder outputs a probability distribution $ q_\phi(\mathbf{z}|\mathbf{x}) = N(\mu, \sigma^2\mathbf{I}) $, where the parameters $ \mu(\mathbf{x}) $ and $ \sigma(\mathbf{x}) $ are functions of the input (see the figure above). This means that $ \mathbf{z} $ is treated as a random variable rather than a deterministic node$^1$. To generate a latent representation, we sample from this distribution using the reparameterization trick, expressing $ \mathbf{z} $ as
$$ \mathbf{z} = \mu + \sigma \odot \epsilon \quad \text{with} \quad \epsilon \sim N(0, \mathbf{I}) $$
which allows for backpropagation during training.
$ ^1 $Note that we are not sampling $ \mathbf{z} $ directly from $ N(\mu, \sigma^2 \mathbf{I}) $. We sample $ \epsilon \sim N(0, \mathbf{I}) $ and then compute $ \mu + \sigma \odot \epsilon $, which makes $ \mathbf{z} $ a deterministic function of $ (\mu, \sigma, \epsilon) $ while keeping the probabilistic framework. If we sampled $ \mathbf{z} $ directly, we would not be able to backpropagate the gradients past the latent layer.
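A minimal sketch of the reparameterization step in PyTorch is shown below, assuming (as is common in practice, though not stated above) that the encoder outputs $ \log \sigma^2 $ rather than $ \sigma $ for numerical stability.

```python
import torch

def reparameterize(mu, logvar):
    """Return z = mu + sigma * eps with eps ~ N(0, I).

    Sampling eps (not z) keeps the path from (mu, logvar) to z
    deterministic and differentiable, so gradients can flow back
    into the encoder that produced mu and logvar.
    """
    std = torch.exp(0.5 * logvar)   # sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(std)     # eps ~ N(0, I)
    return mu + std * eps

mu = torch.zeros(16, 32, requires_grad=True)        # toy encoder outputs
logvar = torch.zeros(16, 32, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()                                  # gradients reach mu and logvar
print(mu.grad is not None, logvar.grad is not None) # True True
```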
A key component of VAE is the regularization term, often referred to as structural regularization. Assuming a prior distribution $ p_\theta(\mathbf{z}) = N(0, \mathbf{I}) $, the model minimizes the Kullback-Leibler (KL) divergence between the approximate posterior $ q_\phi(\mathbf{z}|\mathbf{x}) $ (also known as the variational distribution) and prior $ p_\theta(\mathbf{z}) $. This KL divergence loss, along with the reconstruction loss, encourages the latent representations to be distributed close to the prior distribution. As a result, the latent space becomes smooth and continuous, which is beneficial for tasks like data generation and interpolation.
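For this choice of prior and variational distribution, the KL term has a closed form (a standard result for Gaussians; here $ d $ is the latent dimension and $ \mu_j, \sigma_j $ are the encoder outputs):
$$ D_{KL}\left( N(\mu, \sigma^2 \mathbf{I}) \, || \, N(0, \mathbf{I}) \right) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right) $$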
Below is a theoretical explanation of why we minimize $ D_{KL}( q(\mathbf{z}|\mathbf{x}) \, || \, p(\mathbf{z}) ) $.
Variational Bayesian inference
$$ \mathbf{z} \sim p_\theta(\mathbf{z})=N(0, \mathbf{I}) $$
$$ \mathbf{x} \sim p_\theta(\mathbf{x}|\mathbf{z}) $$
- $ \mathbf{z} $ is sampled from prior distribution $ p_\theta(\mathbf{z}) $, an isotropic multivariate Normal.
- $ \mathbf{x} $ is generated from the likelihood $ p_\theta(\mathbf{x}|\mathbf{z}) $.
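As a sketch of this generative (ancestral sampling) process, the snippet below uses an untrained stand-in decoder for $ p_\theta(\mathbf{x}|\mathbf{z}) $; the dimensions, architecture, and noise scale are illustrative assumptions.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 32, 784                 # illustrative sizes
decoder = nn.Sequential(                       # stands in for p_theta(x|z)
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, data_dim),
)

z = torch.randn(8, latent_dim)                 # z ~ p(z) = N(0, I)
x_mean = decoder(z)                            # mean of p_theta(x|z)
x = x_mean + 0.1 * torch.randn_like(x_mean)    # x ~ N(x_mean, 0.1^2 I), assumed noise scale
```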
$$ p_\theta(\mathbf{x})=\int p_\theta(\mathbf{z}, \mathbf{x}) d\mathbf{z} = \int p_\theta(\mathbf{z})p_\theta(\mathbf{x}|\mathbf{z}) d\mathbf{z} $$
$$ p_\theta(\mathbf{z}|\mathbf{x})= \frac{ p_\theta(\mathbf{x}|\mathbf{z}) p_\theta(\mathbf{z}) }{ p_\theta(\mathbf{x}) } = \frac{ p_\theta(\mathbf{x}|\mathbf{z}) p_\theta(\mathbf{z}) }{ \int p_\theta(\mathbf{z}) p_\theta(\mathbf{x}|\mathbf{z})d\mathbf{z} } $$
The likelihood $ p_\theta(\mathbf{x}|\mathbf{z}) $ is represented by the decoder network. Because the decoder is a complex, non-linear function, the integral over $ \mathbf{z} $ has no closed form, so the marginal likelihood $ p_\theta(\mathbf{x}) $ is intractable, and consequently the posterior $ p_\theta(\mathbf{z}|\mathbf{x}) $ is intractable as well. This makes inferring the latent structure of the data a challenge.
$$ \hat{\theta} = \arg\!\max\limits_{\theta} \log p_\theta(\mathbf{x}^{(i)}) $$
Since the true parameters $ \theta $ and the latent variable $ \mathbf{z} $ are unknown, one approach to estimating $ \theta $ is maximum-likelihood estimation, as above.
$$ Q(\theta \mid \theta^{\text{(old)}}) = \mathbb{E}_{p_{\theta^{\text{(old)}}}(\mathbf{z} \mid \mathbf{x})} \left[ \log p_\theta(\mathbf{x}, \mathbf{z}) \right] $$
$$ \theta^{\text{(new)}} = \arg\!\max\limits_{\theta} Q(\theta \mid \theta^{\text{(old)}}) $$
One might consider the expectation-maximization (EM) algorithm, which handles unknown latent variables, but its E-step requires taking an expectation under the posterior $ p_\theta(\mathbf{z} \mid \mathbf{x}) $, which unfortunately is intractable here.
Since $ p_\theta(\mathbf{x}) $ and $ p_\theta(\mathbf{z}|\mathbf{x}) $ are both intractable, we define a variational distribution $ q_\phi(\mathbf{z}|\mathbf{x}) $ that approximates the true posterior $ p_\theta(\mathbf{z}|\mathbf{x}) $. We assume that $ q_\phi(\mathbf{z}|\mathbf{x}) = N(\mu, \sigma^2 \mathbf{I}) $ is a multivariate Normal with a diagonal covariance structure. Minimizing the KL divergence between the variational distribution and the true posterior allows us to derive a tractable lower bound on the marginal likelihood $ p_\theta(\mathbf{x}) $.
Below, the parameters $ \theta $ and $ \phi $ are omitted for clarity.
$$ \begin{align*} D_{KL}( q(\mathbf{z}|\mathbf{x}) \, || \, p(\mathbf{z}|\mathbf{x}) ) &\stackrel{\text{def}}{=} \int q(\mathbf{z}|\mathbf{x}) \log \frac{ q(\mathbf{z}|\mathbf{x}) }{ p(\mathbf{z}|\mathbf{x}) } d\mathbf{z} \\ &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})} [ \log q(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z}|\mathbf{x}) ] \\ &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})} \left[ \log q(\mathbf{z}|\mathbf{x}) - \log \frac{p(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{p(\mathbf{x})} \right] \\ &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})} [ \log q(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z}) ] + \log p(\mathbf{x}) \end{align*} $$
Since $ \log p(\mathbf{x}) $ does not depend on $ \mathbf{z} $, it can be pulled out of the expectation. Negating and rearranging gives
$$ \begin{align*} -D_{KL}( q(\mathbf{z}|\mathbf{x}) \, || \, p(\mathbf{z}|\mathbf{x}) ) &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})} [ \log p(\mathbf{x}|\mathbf{z}) - (\log q(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z})) ] - \log p(\mathbf{x}) \\ &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})} [ \log p(\mathbf{x}|\mathbf{z}) ] - D_{KL}( q(\mathbf{z}|\mathbf{x}) \, || \, p(\mathbf{z}) ) - \log p(\mathbf{x}) \end{align*} $$
$$ \therefore D_{KL}( q(\mathbf{z}|\mathbf{x}) \, || \, p(\mathbf{z}|\mathbf{x}) ) = D_{KL}( q(\mathbf{z}|\mathbf{x}) \, || \, p(\mathbf{z}) ) - \mathbb{E}_{q(\mathbf{z}|\mathbf{x})} [ \log p(\mathbf{x}|\mathbf{z}) ] + \log p(\mathbf{x}) $$
Since $ D_{KL}(q \, || \, p) \geq 0 $, we obtain the evidence lower bound (ELBO):
$$ \log p(\mathbf{x}) \geq \mathbb{E}_{q(\mathbf{z}|\mathbf{x})} [ \log p(\mathbf{x}|\mathbf{z}) ] - D_{KL}( q(\mathbf{z}|\mathbf{x}) \, || \, p(\mathbf{z}) ) $$
We see that minimizing $ D_{KL}( q(\mathbf{z}|\mathbf{x}) \, || \, p(\mathbf{z}) ) $, together with maximizing the reconstruction term, pushes up this lower bound on the log marginal likelihood $ \log p_\theta(\mathbf{x}) $.
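In practice, a VAE is trained by minimizing the negative ELBO, i.e., a reconstruction loss plus the KL term. Below is a minimal sketch of that loss in PyTorch, assuming binary-valued inputs so that the reconstruction term becomes a Bernoulli cross-entropy (for a Gaussian likelihood, MSE would take its place), and using the closed-form KL for a diagonal Gaussian against $ N(0, \mathbf{I}) $ given earlier.

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_logits, mu, logvar):
    """Negative ELBO = reconstruction loss + KL( q(z|x) || N(0, I) ).

    x_logits: decoder outputs before the sigmoid
    mu, logvar: encoder outputs parameterizing q(z|x) = N(mu, diag(exp(logvar)))
    """
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Summing (rather than averaging) over latent and data dimensions keeps the two terms on the scale implied by the ELBO derivation above.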
$ \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})} [ \log p_\theta(\mathbf{x}|\mathbf{z}) ] $ is the reconstruction term. Its negative can be interpreted as a cross-entropy (if $ p_\theta(\mathbf{x}|\mathbf{z}) $ is assumed to be Bernoulli) or as an MSE (if it is assumed to be Gaussian). Recall that the parameters $ \mu_\phi $ and $ \sigma_\phi^2 $ are produced by the encoder network (see the VAE architecture figure above). We sample $ \mathbf{z} $, then with the decoder we compute the reconstruction $ f_\theta(\mathbf{z}) = \mathbf{x}' $. In effect, minimizing the MSE $ ||\mathbf{x} - f_\theta(\mathbf{z})||^2 $ for the sampled $ \mathbf{z} $ is a (single-sample) Monte Carlo estimate of this expectation:
$$ p_\theta(\mathbf{x} | \mathbf{z}) = N(f_\theta(\mathbf{z}), \sigma^2\mathbf{I}) $$
$$ \log p_\theta(\mathbf{x} | \mathbf{z}) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - f_\theta(\mathbf{z})_i)^2 \\ \propto -|| \mathbf{x} - f_\theta(\mathbf{z}) ||^2 $$
Note that we do not estimate the decoder variance $ \sigma^2 $; it is treated as a fixed constant, so maximizing $ \log p_\theta(\mathbf{x}|\mathbf{z}) $ amounts to minimizing the squared reconstruction error, and the mapping from the sampled $ \mathbf{z} $ to the reconstruction $ \mathbf{x}' $ is specified entirely by the decoder. This Monte Carlo estimate of the reconstruction term is also differentiable, allowing gradients to flow back through the decoder and, via the reparameterization trick, through the encoder during training.
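To make the Monte Carlo interpretation concrete, the sketch below compares the usual single-sample estimate of the reconstruction term with an average over many samples; the stand-in decoder, the dimensions, and the choice $ q(\mathbf{z}|\mathbf{x}) = N(0, \mathbf{I}) $ are purely illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
decoder = nn.Linear(32, 784)                 # stands in for f_theta
x = torch.randn(784)                         # a fixed "observation"
mu, sigma = torch.zeros(32), torch.ones(32)  # toy encoder outputs for this x

def recon_estimate(n_samples):
    # Average of -||x - f_theta(z)||^2 over n_samples reparameterized draws of z.
    eps = torch.randn(n_samples, 32)
    z = mu + sigma * eps
    return -((x - decoder(z)) ** 2).sum(dim=1).mean()

print(recon_estimate(1))       # noisy single-sample estimate, as used during training
print(recon_estimate(10_000))  # averaging many samples reduces the variance
```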