vdgec

Combine VAE

Here we describe the generative process of VDGEC. With $K$ clusters, an observed sample $\mathbf{x} \in \mathbb{R}^D$ is generated by the following process:

choose a cluster $c \sim Cat(\pi)$ , i.e.

$p(c) = Cat(c|\mathbf{\pi})$

choose a latent vector $\mathbf{z} \sim \mathcal{N}(\mathbf{\mu_c}, \mathbf{\sigma_c}^2\mathbf{I})$ , i.e.

$p(\mathbf{z}| c) = \mathcal{N}(\mathbf{\mu_c}, \mathbf{\sigma_c}^2\mathbf{I})$

compute $\mathbf{\mu_x}, \mathbf{\sigma_x^2}$ :

$[\mathbf{\mu_x};\log \mathbf{\sigma_x}^2] = f(\mathbf{z};\mathbf{\theta})$

choose a sample $\mathbf{x} \sim \mathcal{N}(\mathbf{\mu_x}, \mathbf{\sigma_x}^2\mathbf{I})$ , i.e.

$p(\mathbf{x|z}) = \mathcal{N}(\mathbf{\mu_x}, \mathbf{\sigma_x}^2\mathbf{I})$

where $K$ is the predefined number of clusters and $\pi_k$ is the prior probability of cluster $k$, with $\mathbf{\pi} \in \mathbb{R}^{K}_{+}$ and $\sum_{k=1}^K\pi_k = 1$. $Cat(\mathbf{\pi})$ is the categorical distribution parameterized by $\mathbf{\pi}$. $\mathbf{\mu_c}$ and $\mathbf{\sigma_c^2}$ are the mean and the variance of the Gaussian distribution corresponding to cluster $c$. $\mathbf{I}$ is the identity matrix, and $f(\mathbf{z};\mathbf{\theta})$ is a neural network with input $\mathbf{z}$ and parameters $\mathbf{\theta}$. $\mathcal{N}(\mathbf{\mu_x}, \mathbf{\sigma_x}^2\mathbf{I})$ is the multivariate Gaussian distribution with mean $\mathbf{\mu_x}$ and covariance $\mathbf{\sigma_x}^2\mathbf{I}$.
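The generative process can be illustrated with a minimal sketch in plain Python. The function `sample_vdgec`, the linear `toy_decoder` standing in for $f(\mathbf{z};\mathbf{\theta})$, and all numeric values are hypothetical and chosen only for illustration; a real model would use a trained neural network as the decoder.

```python
import math
import random

def sample_vdgec(pi, mu, sigma2, decoder, rng):
    """Draw one (c, z, x) triple following the four generative steps."""
    # Step 1: c ~ Cat(pi)
    c = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    # Step 2: z ~ N(mu_c, sigma_c^2 I), sampled coordinate-wise
    z = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu[c], sigma2[c])]
    # Step 3: the decoder returns the mean and log-variance of p(x|z)
    mu_x, log_var_x = decoder(z)
    # Step 4: x ~ N(mu_x, sigma_x^2 I)
    x = [rng.gauss(m, math.exp(0.5 * lv)) for m, lv in zip(mu_x, log_var_x)]
    return c, z, x

def toy_decoder(z):
    """Hypothetical stand-in for f(z; theta): a fixed linear map from J=2 to D=3."""
    weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    mu_x = [sum(w * zj for w, zj in zip(row, z)) for row in weights]
    log_var_x = [-2.0] * len(weights)  # constant variance exp(-2), an assumption
    return mu_x, log_var_x

rng = random.Random(0)
pi = [0.5, 0.3, 0.2]                          # cluster priors, sum to 1
mu = [[-3.0, 0.0], [0.0, 3.0], [3.0, 0.0]]    # cluster means mu_c
sigma2 = [[0.5, 0.5]] * 3                     # cluster variances sigma_c^2
c, z, x = sample_vdgec(pi, mu, sigma2, toy_decoder, rng)
```

Each call returns the cluster index $c$, a $J$-dimensional latent vector $\mathbf{z}$, and a $D$-dimensional sample $\mathbf{x}$.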

According to the generative process above, the joint probability $p(\mathbf{x,z},c)$ can be factorized as:

$p(\mathbf{x,z},c) = p(\mathbf{x|z})p(\mathbf{z}|c)p(c)$

since $\mathbf{x}$ and $c$ are conditionally independent given $\mathbf{z}$ .

To maximize the likelihood of the given data points under this generative process, we use Jensen's inequality to lower-bound the log-likelihood:

$\log p(\mathbf{x}) = \log \int_{\mathbf{z}}\sum_{c}p(\mathbf{x,z},c)d\mathbf{z} \geq \mathcal{L}(\mathbf{x}) = \mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log \frac{p(\mathbf{x,z},c)}{q(\mathbf{z},c|\mathbf{x})}]$

where $\mathcal{L}(\mathbf{x})$ is the evidence lower bound, which can be written as:

$\mathcal{L}(\mathbf{x}) = \int_{\mathbf{z}}q(\mathbf{z|x})\log \frac{p(\mathbf{x|z})p(\mathbf{z})}{q(\mathbf{z|x})}d\mathbf{z} - \int_{\mathbf{z}}q(\mathbf{z|x}) \mathit{D_{KL}}(q(c|\mathbf{x})||p(c|\mathbf{z}))d\mathbf{z}$

or as:

$\mathcal{L}(\mathbf{x}) = \mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log p(\mathbf{x|z})] - \mathit{D_{KL}}(q(\mathbf{z},c|\mathbf{x}) || p(\mathbf{z},c))$

where $q(\mathbf{z},c|\mathbf{x})$ is the variational posterior used to approximate the true posterior $p(\mathbf{z},c|\mathbf{x})$ .

We assume $q(\mathbf{z},c|\mathbf{x})$ to be a mean-field distribution, which can be factorized as:

$q(\mathbf{z},c|\mathbf{x}) = q(\mathbf{z}|\mathbf{x})q(c|\mathbf{x})$

In Equation 7 (the first decomposition of $\mathcal{L}(\mathbf{x})$ above), the first term does not depend on $c$, and the KL divergence in the second term is non-negative.

Hence, to maximize $\mathcal{L}(\mathbf{x})$ , $\mathit{D_{KL}}(q(c|\mathbf{x}) || p(c|\mathbf{z})) \equiv 0$ must hold, which means

$q(c|\mathbf{x}) = p(c|\mathbf{z})$

In Equation 8 (the second form of $\mathcal{L}(\mathbf{x})$ above), the first term is the reconstruction term, which encourages VDGEC to explain the dataset well. The second term is the Kullback-Leibler divergence from the Gaussian mixture model prior $p(\mathbf{z}, c)$ to the variational posterior $q(\mathbf{z}, c | \mathbf{x})$ , which regularizes the latent embedding $\mathbf{z}$ to lie on a mixture-of-Gaussians manifold.

According to Equations 5 and 9 (the factorizations of $p(\mathbf{x,z},c)$ and $q(\mathbf{z},c|\mathbf{x})$ ), $\mathcal{L}(\mathbf{x})$ can be rewritten as:

$\mathcal{L}(\mathbf{x}) = \mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log p(\mathbf{x|z}) + \log p(\mathbf{z}|c) + \log p(c) - \log q(\mathbf{z|x}) - \log q(c|\mathbf{x})]$

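We estimate $\mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log p(\mathbf{x|z})]$ by Monte Carlo sampling, using the log-density of the Gaussian $p(\mathbf{x|z})$ :

$\begin{array}{ll} \mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log p(\mathbf{x|z})] &\approx \frac{1}{L}\sum_{l=1}^L \log p(\mathbf{x}|\mathbf{z}^{(l)}) \\ &=\frac{1}{L}\sum_{l=1}^L\left[-\frac{D}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^D\left(\log (\sigma_x^{(l)})^2|_i + \frac{(\mathbf{x}|_i - \mu_x^{(l)}|_i)^2}{(\sigma_x^{(l)})^2|_i}\right)\right] \end{array}$

where $[\mathbf{\mu_x}^{(l)};\log (\mathbf{\sigma_x}^{(l)})^2] = f(\mathbf{z}^{(l)};\mathbf{\theta})$ , $\mathbf{z}^{(l)}$ is the $l^{th}$ sample from $q(\mathbf{z|x})$ , and $*|_i$ denotes the $i^{th}$ element of $*$ .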
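As a rough sketch of this reconstruction term, the snippet below averages the diagonal-Gaussian log-density $\log p(\mathbf{x}|\mathbf{z}^{(l)})$ over $L$ decoder outputs. The function names `gaussian_log_pdf` and `reconstruction_term` are illustrative assumptions, and the decoder outputs are passed in as precomputed `(mu_x, log_var_x)` pairs rather than produced by a network.

```python
import math

def gaussian_log_pdf(x, mean, log_var):
    """log N(x | mean, diag(exp(log_var))) for a D-dimensional x."""
    total = 0.0
    for xi, mi, lvi in zip(x, mean, log_var):
        # per-dimension term: -0.5 * (log(2*pi) + log sigma^2 + (x - mu)^2 / sigma^2)
        total += -0.5 * (math.log(2.0 * math.pi) + lvi + (xi - mi) ** 2 / math.exp(lvi))
    return total

def reconstruction_term(x, decoder_outputs):
    """Average log p(x | z^(l)) over the L pairs (mu_x, log_var_x)."""
    logs = [gaussian_log_pdf(x, mu_x, lv_x) for mu_x, lv_x in decoder_outputs]
    return sum(logs) / len(logs)

# Hypothetical decoder outputs for L = 2 samples and D = 2
x = [0.5, -1.0]
outputs = [([0.4, -0.9], [-1.0, -1.0]), ([0.6, -1.2], [-1.0, -1.0])]
recon = reconstruction_term(x, outputs)
```

Working with the log-variance, as the decoder output is defined here, avoids having to constrain the network output to be positive.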

We compute $\mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log p(\mathbf{z}|c)]$ as:

$\begin{array}{ll} \mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log p(\mathbf{z}|c)] &= \int_{\mathbf{z}}\sum_{c=1}^Kq(c|\mathbf{x})q(\mathbf{z}|\mathbf{x}) \log{p(\mathbf{z}|c)}d\mathbf{z} \\ &=\sum_{c=1}^Kq(c|\mathbf{x})\int_{\mathbf{z}}\mathcal{N}(\mathbf{z}|\tilde{\mu}, \tilde{\sigma}^2\mathbf{I})\log{\mathcal{N}(\mathbf{z}|\mu_c, \sigma_c^2\mathbf{I})}d\mathbf{z}\\ &= -\sum_{c=1}^Kq(c|\mathbf{x})[\frac{J}{2}\log{(2\pi)} \\ &+ \frac{1}{2}(\sum_{j=1}^J\log\sigma^2_{cj} +\sum_{j=1}^J\frac{\tilde{\sigma}_j^2}{\sigma_{cj}^2} + \sum_{j=1}^J\frac{(\tilde{\mu}_j -\mu_{cj})^2}{\sigma_{cj}^2})] \end{array}$
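The closed form above can be evaluated directly. The following is a minimal sketch, assuming diagonal covariances stored as plain lists; the function name `expected_log_prior_z` and the example values are hypothetical.

```python
import math

def expected_log_prior_z(gamma, mu_t, var_t, mu_c, var_c):
    """Closed form of E_q[log p(z|c)] for diagonal Gaussians.

    gamma: q(c|x) for each of the K clusters
    mu_t, var_t: mean and variance of q(z|x), each of length J
    mu_c, var_c: per-cluster means and variances, each K lists of length J
    """
    J = len(mu_t)
    total = 0.0
    for g, m_c, v_c in zip(gamma, mu_c, var_c):
        inner = sum(
            math.log(v) + vt / v + (mt - m) ** 2 / v
            for mt, vt, m, v in zip(mu_t, var_t, m_c, v_c)
        )
        total -= g * (0.5 * J * math.log(2.0 * math.pi) + 0.5 * inner)
    return total

# Hypothetical example with K = 2 clusters and J = 2 latent dimensions
value = expected_log_prior_z(
    gamma=[0.7, 0.3],
    mu_t=[0.1, -0.2], var_t=[0.5, 0.5],
    mu_c=[[0.0, 0.0], [2.0, 2.0]], var_c=[[1.0, 1.0], [1.0, 1.0]],
)
```

With a single standard-normal cluster and $q(\mathbf{z}|\mathbf{x}) = \mathcal{N}(0, 1)$ in one dimension, the expression reduces to $-\frac{1}{2}\log(2\pi) - \frac{1}{2}$, which is a convenient sanity check.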

We compute $\mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log p(c)]$ as:

$\begin{array}{ll} \mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log p(c)] &= \int_{\mathbf{z}}\sum_{c=1}^Kq(\mathbf{z}|\mathbf{x})q(c|\mathbf{x}) \log{p(c)}d\mathbf{z} \\ &= \int_zq(\mathbf{z}|\mathbf{x})\sum_{c=1}^K q(c|\mathbf{x}) \log{\pi_c} d\mathbf{z} \\ &= \sum_{c=1}^Kq(c|\mathbf{x}) \log \pi_c \end{array}$

Following VAE, we use a neural network $g$ to model $q(\mathbf{z|x})$ :

$[\tilde{\mu}, \log \tilde{\sigma}^2] = g(\mathbf{x};\phi)$

$q(\mathbf{z|x}) = \mathcal{N}(\mathbf{z;\tilde{\mu}, \tilde{\sigma}^2I})$
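Samples $\mathbf{z}^{(l)}$ from this Gaussian $q(\mathbf{z|x})$ are commonly drawn with the reparameterization trick, $\mathbf{z}^{(l)} = \tilde{\mu} + \tilde{\sigma} \circ \epsilon^{(l)}$ with $\epsilon^{(l)} \sim \mathcal{N}(0, \mathbf{I})$. The sketch below assumes the encoder outputs $\tilde{\mu}$ and $\log \tilde{\sigma}^2$ are already available as lists; the function name `reparameterize` and the numbers are illustrative only.

```python
import math
import random

def reparameterize(mu_t, log_var_t, rng):
    """Return z = mu + sigma * eps with eps ~ N(0, I), elementwise."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu_t, log_var_t)
    ]

rng = random.Random(0)
mu_t = [0.2, -0.4]         # hypothetical encoder mean
log_var_t = [-1.0, -1.0]   # hypothetical encoder log-variance
z_samples = [reparameterize(mu_t, log_var_t, rng) for _ in range(5)]  # L = 5
```

Because the randomness enters only through $\epsilon$, the sample is a deterministic function of $\tilde{\mu}$ and $\tilde{\sigma}$, which is what allows gradients to flow back to the encoder parameters $\phi$ in an automatic-differentiation framework.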

Then, we compute $\mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log q(\mathbf{z}|\mathbf{x})]$ as:

$\begin{array}{ll} \mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log q(\mathbf{z}|\mathbf{x})] &= \int_{\mathbf{z}}\sum_{c=1}^Kq(c|\mathbf{x})q(\mathbf{z}|\mathbf{x}) \log{q(\mathbf{z|x})}d\mathbf{z} \\ &=\int_{\mathbf{z}}\mathcal{N}(\mathbf{z}|\tilde{\mu}, \tilde{\sigma}^2\mathbf{I})\log{\mathcal{N}(\mathbf{z}|\tilde{\mu}, \tilde{\sigma}^2\mathbf{I})}d\mathbf{z}\\ &= -\frac{J}{2}\log(2\pi) -\frac{1}{2}\sum_{j=1}^J(1+ \log\tilde{\sigma}^2_j) \end{array}$
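This term is the negative entropy of the diagonal Gaussian $\mathcal{N}(\tilde{\mu}, \tilde{\sigma}^2\mathbf{I})$ and depends only on the log-variances. A small sketch, with the hypothetical function name `expected_log_q_z`:

```python
import math

def expected_log_q_z(log_var_t):
    """Negative entropy of N(mu, diag(exp(log_var_t))); independent of the mean."""
    J = len(log_var_t)
    return -0.5 * J * math.log(2.0 * math.pi) - 0.5 * sum(1.0 + lv for lv in log_var_t)

# Hypothetical encoder log-variances for J = 3
neg_entropy = expected_log_q_z([-1.0, 0.0, 0.5])
```

Since this quantity enters $\mathcal{L}(\mathbf{x})$ with a minus sign, a larger posterior variance increases the bound through this term.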

We compute $\mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log q(c|\mathbf{x})]$ as:

$\begin{array}{ll} \mathit{E}_{q(\mathbf{z},c | \mathbf{x})}[\log q(c|\mathbf{x})] &= \int_{\mathbf{z}}\sum_{c=1}^Kq(\mathbf{z}|\mathbf{x})q(c|\mathbf{x}) \log{q(c|\mathbf{x})}d\mathbf{z} \\ &= \int_{\mathbf{z}}q(\mathbf{z}|\mathbf{x})\sum_{c=1}^K q(c|\mathbf{x}) \log{q(c|\mathbf{x})}d\mathbf{z} \\ &= \sum_{c=1}^Kq(c|\mathbf{x}) \log{q(c|\mathbf{x})} \end{array}$
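The two terms involving only $c$, namely $\sum_c q(c|\mathbf{x})\log \pi_c$ and $\sum_c q(c|\mathbf{x})\log q(c|\mathbf{x})$, are finite sums over the $K$ clusters. The sketch below computes both; the name `categorical_terms` is hypothetical, and skipping zero-probability entries (using the convention $0 \log 0 = 0$) is an implementation assumption.

```python
import math

def categorical_terms(gamma, pi):
    """Return (E_q[log p(c)], E_q[log q(c|x)]) for gamma = q(c|x) and prior pi."""
    e_log_p_c = sum(g * math.log(p) for g, p in zip(gamma, pi) if g > 0.0)
    e_log_q_c = sum(g * math.log(g) for g in gamma if g > 0.0)
    return e_log_p_c, e_log_q_c

# Hypothetical responsibilities and priors for K = 2
e_log_p_c, e_log_q_c = categorical_terms([0.7, 0.3], [0.5, 0.5])
difference = e_log_p_c - e_log_q_c  # equals -KL(q(c|x) || p(c))
```

The difference of the two values is the negative KL divergence between $q(c|\mathbf{x})$ and the prior $p(c)$, so it is never positive and is zero exactly when the two distributions coincide.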

As for $q(c|\mathbf{x})$ , according to Equation 10 ( $q(c|\mathbf{x}) = p(c|\mathbf{z})$ ), we estimate it from the sampled latent vectors as follows:

$q(c|\mathbf{x}) = p(c|\mathbf{z}) \approx \frac{1}{L}\sum_{l=1}^L\frac{p(c)p(\mathbf{z}^{(l)}|c)}{\sum_{c'=1}^Kp(c')p(\mathbf{z}^{(l)}|c')}$
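A minimal sketch of this computation follows. It evaluates $p(c)p(\mathbf{z}^{(l)}|c)$ in log space and normalizes with the log-sum-exp trick to avoid underflow, which is an implementation choice not stated in the text; the function names `log_gaussian_diag` and `responsibilities` are hypothetical.

```python
import math

def log_gaussian_diag(z, mean, var):
    """log N(z | mean, diag(var))."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (zj - m) ** 2 / v)
        for zj, m, v in zip(z, mean, var)
    )

def responsibilities(z_samples, pi, mu_c, var_c):
    """Average p(c | z^(l)) over the L samples to estimate q(c|x)."""
    K = len(pi)
    gamma = [0.0] * K
    for z in z_samples:
        log_joint = [
            math.log(pi[k]) + log_gaussian_diag(z, mu_c[k], var_c[k])
            for k in range(K)
        ]
        top = max(log_joint)  # subtract the maximum for numerical stability
        weights = [math.exp(v - top) for v in log_joint]
        norm = sum(weights)
        for k in range(K):
            gamma[k] += weights[k] / norm / len(z_samples)
    return gamma

# One sample midway between two symmetric, equally weighted clusters
gamma = responsibilities([[0.0]], [0.5, 0.5], [[-1.0], [1.0]], [[1.0], [1.0]])
```

In this symmetric example both clusters are equally likely, so `gamma` is `[0.5, 0.5]`; the returned values always sum to one.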

Using the SGVB estimator and the reparameterization trick, $\mathcal{L}(\mathbf{x})$ can be rewritten as:

$\begin{array}{ll} \mathcal{L}(\mathbf{x})=&\frac{1}{L}\sum_{l=1}^L\left[-\frac{D}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^D\left(\log (\sigma_x^{(l)})^2|_i + \frac{(\mathbf{x}|_i - \mu_x^{(l)}|_i)^2}{(\sigma_x^{(l)})^2|_i}\right)\right]\\ & - \frac{1}{2} \sum_{c=1}^K \gamma_c\sum_{j=1}^{J}(\log \sigma_c^2|_j + \frac{\tilde{\sigma}^2|_j}{\sigma_c^2|_j} + \frac{(\tilde{\mu}|_j - \mu_c|_j)^2}{\sigma_c^2|_j}) \\ & + \sum_{c=1}^K \gamma_c \log \frac{\pi_c}{\gamma_c} + \frac{1}{2}\sum_{j=1}^J(1 + \log \tilde{\sigma}^2|_j) \end{array}$

where $L$ is the number of Monte Carlo samples in the SGVB estimator, $D$ is the dimensionality of $\mathbf{x}$ and $\mu_x^{(l)}$ , $J$ is the dimensionality of $\mu_c,\sigma_c^2, \tilde{\mu}$ and $\tilde{\sigma}^2$ , $*|_j$ denotes the $j^{th}$ element of $*$ , $K$ is the number of clusters, $\pi_c$ is the prior probability of cluster $c$ , $\gamma_c$ denotes $q(c|\mathbf{x})$ , and $\mathbf{z}^{(l)}$ is the $l^{th}$ sample drawn from $q(\mathbf{z}|\mathbf{x})$ .
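Putting the pieces together, the sketch below evaluates $\mathcal{L}(\mathbf{x})$ for a single data point from precomputed quantities. It takes the first term to be the Gaussian log-density $\log p(\mathbf{x}|\mathbf{z}^{(l)})$ averaged over the $L$ samples; the $\frac{J}{2}\log(2\pi)$ constants from $\log p(\mathbf{z}|c)$ and $\log q(\mathbf{z}|\mathbf{x})$ cancel. The function name `elbo` and its argument layout are assumptions made for illustration, not the authors' implementation.

```python
import math

def elbo(x, decoder_outputs, gamma, pi, mu_t, log_var_t, mu_c, var_c):
    """Evaluate the lower bound L(x) term by term for one data point."""
    # Term 1: Monte Carlo estimate of E_q[log p(x|z)]
    recon = 0.0
    for mu_x, log_var_x in decoder_outputs:
        recon += sum(
            -0.5 * (math.log(2.0 * math.pi) + lv + (xi - m) ** 2 / math.exp(lv))
            for xi, m, lv in zip(x, mu_x, log_var_x)
        )
    recon /= len(decoder_outputs)
    # Term 2: mixture term, weighted by gamma_c = q(c|x)
    mix = 0.0
    for g, m_c, v_c in zip(gamma, mu_c, var_c):
        mix += g * sum(
            math.log(v) + math.exp(lt) / v + (mt - m) ** 2 / v
            for mt, lt, m, v in zip(mu_t, log_var_t, m_c, v_c)
        )
    # Term 3: sum_c gamma_c * log(pi_c / gamma_c), with 0 log 0 taken as 0
    cat = sum(g * math.log(p / g) for g, p in zip(gamma, pi) if g > 0.0)
    # Term 4: entropy-related term of q(z|x)
    ent = 0.5 * sum(1.0 + lt for lt in log_var_t)
    return recon - 0.5 * mix + cat + ent

# Sanity check with K = 1 and a standard normal prior (hypothetical values)
value = elbo(
    x=[0.0], decoder_outputs=[([0.0], [0.0])],
    gamma=[1.0], pi=[1.0],
    mu_t=[1.0], log_var_t=[0.0],
    mu_c=[[0.0]], var_c=[[1.0]],
)
```

With a single cluster and a standard normal prior, the last three terms collapse to the usual VAE penalty $-D_{KL}(q(\mathbf{z}|\mathbf{x}) || \mathcal{N}(0, \mathbf{I}))$, which equals $-\frac{1}{2}$ for the values used here.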