Combine VAE

Here we describe the generative process of VDGEC. With $K$ clusters, an observed sample $\mathbf{x} \in \mathbb{R}^D$ is generated by the following process:

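  1. choose a cluster $c$, i.e. $c \sim \mathrm{Cat}(\boldsymbol{\pi})$
  2. choose a latent vector $\mathbf{z}$, i.e. $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^2 \mathbf{I})$
  3. compute $[\boldsymbol{\mu}_x; \log \boldsymbol{\sigma}_x^2]$:

     $$[\boldsymbol{\mu}_x; \log \boldsymbol{\sigma}_x^2] = f(\mathbf{z}; \boldsymbol{\theta})$$

  4. choose a sample $\mathbf{x}$, i.e. $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_x, \boldsymbol{\sigma}_x^2 \mathbf{I})$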

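where $K$ is a predefined parameter, $\pi_k$ is the prior probability for cluster $k$, with $\boldsymbol{\pi} \in \mathbb{R}_+^K$ and $\sum_{k=1}^{K} \pi_k = 1$. $\mathrm{Cat}(\boldsymbol{\pi})$ is the categorical distribution parameterized by $\boldsymbol{\pi}$. $\boldsymbol{\mu}_c$ and $\boldsymbol{\sigma}_c^2$ are the mean and the variance of the Gaussian distribution corresponding to cluster $c$. $\mathbf{I}$ is an identity matrix, and $f(\mathbf{z}; \boldsymbol{\theta})$ is a neural network whose input is $\mathbf{z}$ and which is parameterized by $\boldsymbol{\theta}$. $\mathcal{N}(\boldsymbol{\mu}_x, \boldsymbol{\sigma}_x^2 \mathbf{I})$ is the multivariate Gaussian distribution parameterized by $\boldsymbol{\mu}_x$ and $\boldsymbol{\sigma}_x^2$.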
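The generative process can be sketched as ancestral sampling. The following is a minimal illustration, not the VDGEC implementation: the function name `sample_vdgec`, the toy sizes, and the linear map `W` standing in for the decoder network $f$ are all assumptions made for this example.

```python
import numpy as np

def sample_vdgec(pi, mu_c, sigma2_c, decoder, rng):
    """Draw one (c, z, x) triple by ancestral sampling (illustrative helper)."""
    K, J = mu_c.shape
    # Step 1: c ~ Cat(pi)
    c = rng.choice(K, p=pi)
    # Step 2: z ~ N(mu_c, sigma_c^2 I)
    z = mu_c[c] + np.sqrt(sigma2_c[c]) * rng.standard_normal(J)
    # Step 3: [mu_x; log sigma_x^2] = f(z; theta)
    mu_x, log_sigma2_x = decoder(z)
    # Step 4: x ~ N(mu_x, sigma_x^2 I)
    x = mu_x + np.exp(0.5 * log_sigma2_x) * rng.standard_normal(mu_x.shape)
    return c, z, x

# Hypothetical toy setup: K = 3 clusters, J = 2 latent dims, D = 4 observed dims.
rng = np.random.default_rng(0)
K, J, D = 3, 2, 4
pi = np.array([0.5, 0.3, 0.2])
mu_c = rng.normal(size=(K, J))
sigma2_c = np.ones((K, J))
W = rng.normal(size=(2 * D, J))  # assumed stand-in for the network f

def toy_decoder(z):
    out = W @ z
    return out[:D], out[D:]  # (mu_x, log sigma_x^2)

c, z, x = sample_vdgec(pi, mu_c, sigma2_c, toy_decoder, rng)
```

The decoder returns the log-variance rather than the variance, so the sampled variance stays positive without any explicit constraint.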

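According to the generative process above, the joint probability $p(\mathbf{x}, \mathbf{z}, c)$ can be factorized as:

$$p(\mathbf{x}, \mathbf{z}, c) = p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z} \mid c)\, p(c) \tag{5}$$

as $\mathbf{x}$ and $c$ are independent conditioned on $\mathbf{z}$.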

 

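To maximize the likelihood of the given data points based on the generative process, we bound the log-likelihood from below using Jensen's inequality:

$$\log p(\mathbf{x}) = \log \int_{\mathbf{z}} \sum_{c} p(\mathbf{x}, \mathbf{z}, c)\, d\mathbf{z} \geq \mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z}, c)}{q(\mathbf{z}, c \mid \mathbf{x})}\right] = \mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$$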

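where $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$ is the evidence lower bound, which can be written as:

$$\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) = \int_{\mathbf{z}} q(\mathbf{z} \mid \mathbf{x}) \log \frac{p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})}{q(\mathbf{z} \mid \mathbf{x})}\, d\mathbf{z} - \int_{\mathbf{z}} q(\mathbf{z} \mid \mathbf{x})\, D_{KL}\big(q(c \mid \mathbf{x}) \,\|\, p(c \mid \mathbf{z})\big)\, d\mathbf{z} \tag{7}$$

or as:

$$\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) = \mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\big[\log p(\mathbf{x} \mid \mathbf{z})\big] - D_{KL}\big(q(\mathbf{z}, c \mid \mathbf{x}) \,\|\, p(\mathbf{z}, c)\big) \tag{8}$$

where $q(\mathbf{z}, c \mid \mathbf{x})$ is the variational posterior used to approximate the true posterior $p(\mathbf{z}, c \mid \mathbf{x})$, and $q(\mathbf{z} \mid \mathbf{x})$ and $q(c \mid \mathbf{x})$ are its factors under the mean-field assumption.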

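We assume $q(\mathbf{z}, c \mid \mathbf{x})$ to be a mean-field distribution, which can be factorized as:

$$q(\mathbf{z}, c \mid \mathbf{x}) = q(\mathbf{z} \mid \mathbf{x})\, q(c \mid \mathbf{x}) \tag{9}$$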

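In Equation 7, the first term has no relationship with $c$ and the second term is non-negative.

Hence, in order to maximize $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$, $D_{KL}\big(q(c \mid \mathbf{x}) \,\|\, p(c \mid \mathbf{z})\big) \equiv 0$ should be satisfied, which means,

$$q(c \mid \mathbf{x}) = p(c \mid \mathbf{z}) \equiv \frac{p(c)\, p(\mathbf{z} \mid c)}{\sum_{c'=1}^{K} p(c')\, p(\mathbf{z} \mid c')} \tag{10}$$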

In Equation 8, the first term is the reconstruction term, which encourages VDGEC to explain the dataset well. The second term is the Kullback-Leibler divergence from the Gaussian mixture model prior $p(\mathbf{z}, c)$ to the variational posterior $q(\mathbf{z}, c \mid \mathbf{x})$, which regularizes the latent embedding $\mathbf{z}$ to lie on a mixture-of-Gaussians manifold.

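According to Equations 5 and 9, $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$ can be rewritten as:

$$\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) = \mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\big[\log p(\mathbf{x} \mid \mathbf{z})\big] + \mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\big[\log p(\mathbf{z} \mid c)\big] + \mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\big[\log p(c)\big] - \mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\big[\log q(\mathbf{z} \mid \mathbf{x})\big] - \mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\big[\log q(c \mid \mathbf{x})\big]$$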

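We compute $\mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}[\log p(\mathbf{x} \mid \mathbf{z})]$ as:

$$\mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\big[\log p(\mathbf{x} \mid \mathbf{z})\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p(\mathbf{x} \mid \mathbf{z}^{(l)}) = -\frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{D} \left[ \frac{1}{2} \log\left(2\pi\, \boldsymbol{\sigma}_x^{2(l)}|_i\right) + \frac{\left(x_i - \boldsymbol{\mu}_x^{(l)}|_i\right)^2}{2\, \boldsymbol{\sigma}_x^{2(l)}|_i} \right]$$

where

$$[\boldsymbol{\mu}_x^{(l)}; \log \boldsymbol{\sigma}_x^{2(l)}] = f(\mathbf{z}^{(l)}; \boldsymbol{\theta}), \qquad \mathbf{z}^{(l)} \sim q(\mathbf{z} \mid \mathbf{x})$$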

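We compute $\mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}[\log p(\mathbf{z} \mid c)]$ as:

$$\mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\big[\log p(\mathbf{z} \mid c)\big] = \sum_{c=1}^{K} q(c \mid \mathbf{x}) \int_{\mathbf{z}} q(\mathbf{z} \mid \mathbf{x}) \log p(\mathbf{z} \mid c)\, d\mathbf{z} = -\sum_{c=1}^{K} q(c \mid \mathbf{x}) \left[ \frac{J}{2} \log(2\pi) + \frac{1}{2} \sum_{j=1}^{J} \left( \log \boldsymbol{\sigma}_c^2|_j + \frac{\tilde{\boldsymbol{\sigma}}^2|_j}{\boldsymbol{\sigma}_c^2|_j} + \frac{\left(\tilde{\boldsymbol{\mu}}|_j - \boldsymbol{\mu}_c|_j\right)^2}{\boldsymbol{\sigma}_c^2|_j} \right) \right]$$

where $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\sigma}}^2$ are the mean and the variance of the Gaussian $q(\mathbf{z} \mid \mathbf{x})$.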
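The inner Gaussian integral in this term has a closed form, which can be checked numerically. The sketch below compares the closed-form value of $\mathbb{E}_{\mathbf{z} \sim \mathcal{N}(\tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\sigma}}^2 \mathbf{I})}[\log \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^2 \mathbf{I})]$ for a single cluster against a Monte Carlo estimate. The helper names and the numeric values are assumptions chosen only for illustration.

```python
import numpy as np

def expected_log_gauss(mu_q, sigma2_q, mu_p, sigma2_p):
    """Closed form of E_{z ~ N(mu_q, sigma2_q I)}[log N(z | mu_p, sigma2_p I)]."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi)
        + np.log(sigma2_p)
        + sigma2_q / sigma2_p
        + (mu_q - mu_p) ** 2 / sigma2_p
    )

def log_gauss(z, mu, sigma2):
    """Log-density of a diagonal Gaussian, evaluated row-wise."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + (z - mu) ** 2 / sigma2, axis=-1)

# Arbitrary illustrative parameters for q(z|x) and for one cluster.
rng = np.random.default_rng(0)
mu_q, sigma2_q = np.array([0.3, -1.0]), np.array([0.5, 2.0])
mu_p, sigma2_p = np.array([0.0, 1.0]), np.array([1.5, 0.7])

# Monte Carlo estimate using samples z ~ N(mu_q, sigma2_q I).
z = mu_q + np.sqrt(sigma2_q) * rng.standard_normal((200_000, 2))
mc_estimate = log_gauss(z, mu_p, sigma2_p).mean()
closed_form = expected_log_gauss(mu_q, sigma2_q, mu_p, sigma2_p)
```

Weighting the closed-form value of each cluster by $q(c \mid \mathbf{x})$ and summing over $c$ gives the full expectation.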

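We compute $\mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}[\log p(c)]$ as:

$$\mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\big[\log p(c)\big] = \sum_{c=1}^{K} q(c \mid \mathbf{x}) \log \pi_c$$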

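Following VAE, we use a neural network $g$ to model $q(\mathbf{z} \mid \mathbf{x})$:

$$[\tilde{\boldsymbol{\mu}}; \log \tilde{\boldsymbol{\sigma}}^2] = g(\mathbf{x}; \boldsymbol{\phi}), \qquad q(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{z}; \tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\sigma}}^2 \mathbf{I})$$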

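Then, we compute $\mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}[\log q(\mathbf{z} \mid \mathbf{x})]$ as:

$$\mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\big[\log q(\mathbf{z} \mid \mathbf{x})\big] = \int_{\mathbf{z}} q(\mathbf{z} \mid \mathbf{x}) \log q(\mathbf{z} \mid \mathbf{x})\, d\mathbf{z} = -\frac{J}{2} \log(2\pi) - \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \tilde{\boldsymbol{\sigma}}^2|_j\right)$$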

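We compute $\mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}[\log q(c \mid \mathbf{x})]$ as:

$$\mathbb{E}_{q(\mathbf{z}, c \mid \mathbf{x})}\big[\log q(c \mid \mathbf{x})\big] = \sum_{c=1}^{K} q(c \mid \mathbf{x}) \log q(c \mid \mathbf{x})$$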

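As for $q(c \mid \mathbf{x})$, according to Equation 10, we compute it as follows:

$$q(c \mid \mathbf{x}) = p(c \mid \mathbf{z}) = \frac{\pi_c\, \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^2 \mathbf{I})}{\sum_{c'=1}^{K} \pi_{c'}\, \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}_{c'}, \boldsymbol{\sigma}_{c'}^2 \mathbf{I})}$$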
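The cluster posterior $q(c \mid \mathbf{x}) = p(c \mid \mathbf{z})$ is a ratio of Gaussian mixture terms and is safest to evaluate in log space. The following sketch assumes diagonal covariances and a single latent vector $\mathbf{z}$, for example one sample from $q(\mathbf{z} \mid \mathbf{x})$. The function name `cluster_posterior` and the two-cluster values are hypothetical.

```python
import numpy as np

def cluster_posterior(z, pi, mu_c, sigma2_c):
    """gamma_c = pi_c N(z | mu_c, sigma_c^2 I) / sum_c' pi_c' N(z | mu_c', sigma_c'^2 I)."""
    # Log-density of z under each cluster's diagonal Gaussian, shape (K,).
    log_pdf = -0.5 * np.sum(
        np.log(2.0 * np.pi * sigma2_c) + (z - mu_c) ** 2 / sigma2_c, axis=1
    )
    log_joint = np.log(pi) + log_pdf
    # Subtract the maximum before exponentiating to avoid underflow.
    log_joint = log_joint - log_joint.max()
    gamma = np.exp(log_joint)
    return gamma / gamma.sum()

# Two equally likely clusters centred at (-2, 0) and (2, 0).
pi = np.array([0.5, 0.5])
mu_c = np.array([[-2.0, 0.0], [2.0, 0.0]])
sigma2_c = np.ones((2, 2))

gamma_near = cluster_posterior(np.array([2.0, 0.0]), pi, mu_c, sigma2_c)
gamma_mid = cluster_posterior(np.array([0.0, 0.0]), pi, mu_c, sigma2_c)
```

A point sitting on the second cluster centre is assigned almost entirely to that cluster, while the midpoint between the two centres is split evenly.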

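Using the SGVB estimator and the reparameterization trick, $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$ can be rewritten as:

$$\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) = -\frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{D} \left[ \frac{1}{2} \log\left(2\pi\, \boldsymbol{\sigma}_x^{2(l)}|_i\right) + \frac{\left(x_i - \boldsymbol{\mu}_x^{(l)}|_i\right)^2}{2\, \boldsymbol{\sigma}_x^{2(l)}|_i} \right] - \frac{1}{2} \sum_{c=1}^{K} \gamma_c \sum_{j=1}^{J} \left( \log \boldsymbol{\sigma}_c^2|_j + \frac{\tilde{\boldsymbol{\sigma}}^2|_j}{\boldsymbol{\sigma}_c^2|_j} + \frac{\left(\tilde{\boldsymbol{\mu}}|_j - \boldsymbol{\mu}_c|_j\right)^2}{\boldsymbol{\sigma}_c^2|_j} \right) + \sum_{c=1}^{K} \gamma_c \log \frac{\pi_c}{\gamma_c} + \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \tilde{\boldsymbol{\sigma}}^2|_j\right)$$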

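where $L$ is the batch size, i.e. the number of Monte Carlo samples drawn from $q(\mathbf{z} \mid \mathbf{x})$, $D$ is the dimensionality of $\mathbf{x}$ and $\boldsymbol{\mu}_x^{(l)}$, $J$ is the dimensionality of $\boldsymbol{\mu}_c$ and $\tilde{\boldsymbol{\mu}}$ (and of the corresponding variances), $x_i$ denotes the $i$-th element of $\mathbf{x}$ and $*|_j$ denotes the $j$-th element of $*$, $K$ is the number of clusters, $\pi_c$ is the prior probability of cluster $c$, and $\gamma_c$ denotes $q(c \mid \mathbf{x})$. Here $\mathbf{z}^{(l)} = \tilde{\boldsymbol{\mu}} + \tilde{\boldsymbol{\sigma}} \circ \boldsymbol{\epsilon}^{(l)}$, with $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, is the $l$-th sample.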
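The estimator can be sketched for a single data point as follows. This is a minimal NumPy illustration under stated assumptions, not the VDGEC training code: the encoder outputs `mu_t` and `log_sigma2_t` are passed in directly, `decoder` is any callable returning `(mu_x, log_sigma2_x)`, and $\gamma_c$ is taken as the average of $p(c \mid \mathbf{z}^{(l)})$ over the $L$ samples, which is one possible choice.

```python
import numpy as np

def elbo_sgvb(x, mu_t, log_sigma2_t, pi, mu_c, sigma2_c, decoder, rng, L=1):
    """SGVB estimate of the ELBO for one sample x with a Gaussian decoder."""
    sigma2_t = np.exp(log_sigma2_t)
    J = mu_t.shape[0]

    # Reparameterization trick: z^(l) = mu_t + sigma_t * eps^(l), eps ~ N(0, I).
    eps = rng.standard_normal((L, J))
    z = mu_t + np.sqrt(sigma2_t) * eps  # shape (L, J)

    # Reconstruction term: (1/L) sum_l log N(x | mu_x^(l), sigma_x^2(l) I).
    recon = 0.0
    for l in range(L):
        mu_x, log_sigma2_x = decoder(z[l])
        recon += -0.5 * np.sum(
            np.log(2.0 * np.pi) + log_sigma2_x + (x - mu_x) ** 2 / np.exp(log_sigma2_x)
        )
    recon /= L

    # gamma_c = q(c|x), here averaged over p(c | z^(l)) (an assumption of this sketch).
    log_pdf = -0.5 * np.sum(
        np.log(2.0 * np.pi * sigma2_c) + (z[:, None, :] - mu_c) ** 2 / sigma2_c, axis=2
    )  # shape (L, K)
    log_joint = np.log(pi) + log_pdf
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)
    gamma = np.exp(log_joint)
    gamma = (gamma / gamma.sum(axis=1, keepdims=True)).mean(axis=0)  # shape (K,)

    # Mixture prior term: -1/2 sum_c gamma_c sum_j (...).
    inner = np.sum(
        np.log(sigma2_c) + sigma2_t / sigma2_c + (mu_t - mu_c) ** 2 / sigma2_c, axis=1
    )
    prior_term = -0.5 * gamma @ inner

    # Categorical term: sum_c gamma_c log(pi_c / gamma_c), clipped to avoid log(0).
    safe_gamma = np.clip(gamma, 1e-12, None)
    cat_term = np.sum(gamma * (np.log(pi) - np.log(safe_gamma)))

    # Entropy of q(z|x) up to the constant that cancels: 1/2 sum_j (1 + log sigma_t^2).
    entropy_term = 0.5 * np.sum(1.0 + log_sigma2_t)

    return recon + prior_term + cat_term + entropy_term

# Hypothetical toy usage: K = 3, J = 2, D = 4, with a linear stand-in decoder.
rng = np.random.default_rng(0)
K, J, D = 3, 2, 4
W = rng.normal(size=(2 * D, J))
toy_decoder = lambda z: ((W @ z)[:D], (W @ z)[D:])
value = elbo_sgvb(
    x=rng.normal(size=D), mu_t=np.zeros(J), log_sigma2_t=np.zeros(J),
    pi=np.array([0.5, 0.3, 0.2]), mu_c=rng.normal(size=(K, J)),
    sigma2_c=np.ones((K, J)), decoder=toy_decoder, rng=rng, L=8,
)
```

With a single cluster whose parameters match $q(\mathbf{z} \mid \mathbf{x})$, the prior and entropy terms cancel and the estimate reduces to the reconstruction term, which makes a convenient sanity check.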