Here we describe the generative process of VDGEC. With $K$ clusters, an observed sample $x$ is generated by the following process:

1. Choose a cluster $c \sim \mathrm{Cat}(\pi)$;
2. Choose a latent embedding $z \sim \mathcal{N}(\mu_c, \sigma_c^{2} I)$;
3. Compute $[\mu_x; \log \sigma_x^{2}] = f(z; \theta)$;
4. Choose a sample $x \sim \mathcal{N}(\mu_x, \sigma_x^{2} I)$, i.e. $p(x \mid z) = \mathcal{N}(x \mid \mu_x, \sigma_x^{2} I)$,

where $K$ is a predefined parameter, $\pi_c$ is the prior probability for cluster $c$, $\pi \in \mathbb{R}_{+}^{K}$, $\sum_{c=1}^{K} \pi_c = 1$. $\mathrm{Cat}(\pi)$ is the categorical distribution parameterized by $\pi$. $\mu_c$ and $\sigma_c^{2}$ are the mean and the variance of the Gaussian distribution corresponding to cluster $c$. $I$ is an identity matrix, $f(z; \theta)$ is a neural network whose input is $z$ and is parameterized by $\theta$. $\mathcal{N}(\mu, \sigma^{2} I)$ is the multivariate Gaussian distribution parameterized by the mean $\mu$ and the covariance $\sigma^{2} I$.
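To make the generative process concrete, the following NumPy sketch samples data from it. The affine map standing in for the decoder $f(z; \theta)$, as well as the sizes $K$, $J$ (latent dimensionality) and $D$ (data dimensionality), are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

K, J, D = 3, 2, 5                        # clusters, latent dim, data dim (assumed)
pi = np.full(K, 1.0 / K)                 # prior cluster probabilities, sums to 1
mu_c = rng.normal(size=(K, J))           # per-cluster Gaussian means
sigma2_c = 0.1 * np.ones((K, J))         # per-cluster Gaussian variances

# Stand-in for the decoder network f(z; theta): a fixed affine map producing
# the mean and log-variance of p(x|z). A real model would use a deep network.
W_mu, W_lv = rng.normal(size=(D, J)), 0.1 * rng.normal(size=(D, J))

def f(z):
    return W_mu @ z, W_lv @ z - 2.0      # (mu_x, log sigma_x^2)

def sample_x():
    c = rng.choice(K, p=pi)                                # c ~ Cat(pi)
    z = rng.normal(mu_c[c], np.sqrt(sigma2_c[c]))          # z ~ N(mu_c, sigma_c^2 I)
    mu_x, log_sigma2_x = f(z)                              # decoder
    x = rng.normal(mu_x, np.sqrt(np.exp(log_sigma2_x)))    # x ~ N(mu_x, sigma_x^2 I)
    return c, z, x

c, z, x = sample_x()
print(c, z.shape, x.shape)
```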
According to the generative process above, the joint probability $p(x, z, c)$ can be factorized as:

$$p(x, z, c) = p(x \mid z)\, p(z \mid c)\, p(c),$$

as $x$ and $c$ are independent conditioned on $z$.
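In detail, the factorization follows from the chain rule together with this conditional independence; only quantities defined above are involved:

$$
\begin{aligned}
p(x, z, c) &= p(x \mid z, c)\, p(z \mid c)\, p(c) && \text{(chain rule)} \\
           &= p(x \mid z)\, p(z \mid c)\, p(c)   && \text{(since } x \perp c \mid z\text{)}.
\end{aligned}
$$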
To maximize the likelihood of the given data points based on the generative process, by using Jensen's inequality, the log-likelihood can be written as:

$$\log p(x) = \log \int_{z} \sum_{c=1}^{K} p(x, z, c)\, dz \;\ge\; E_{q(z, c \mid x)}\!\left[\log \frac{p(x, z, c)}{q(z, c \mid x)}\right] = \mathcal{L}_{\mathrm{ELBO}}(x),$$

where $\mathcal{L}_{\mathrm{ELBO}}(x)$ is the evidence lower bound, which can be written as:

$$\mathcal{L}_{\mathrm{ELBO}}(x) = E_{q(z, c \mid x)}\!\left[\log p(x, z, c) - \log q(z, c \mid x)\right],$$

or as:

$$\mathcal{L}_{\mathrm{ELBO}}(x) = E_{q(z, c \mid x)}\!\left[\log p(x \mid z)\right] - D_{KL}\!\left(q(z, c \mid x)\,\|\,p(z, c)\right),$$

where $q(z, c \mid x)$ is the variational posterior to approximate the true posterior $p(z, c \mid x)$.
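The inequality above can also be seen from the standard identity relating the log-likelihood, the bound, and the gap between the variational and the true posterior; it uses only quantities already defined:

$$\log p(x) = \mathcal{L}_{\mathrm{ELBO}}(x) + D_{KL}\!\left(q(z, c \mid x)\,\|\,p(z, c \mid x)\right) \;\ge\; \mathcal{L}_{\mathrm{ELBO}}(x),$$

since the Kullback-Leibler divergence is non-negative, with equality exactly when $q(z, c \mid x) = p(z, c \mid x)$.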
We assume $q(z, c \mid x)$ to be a mean-field distribution which can be factorized as:

$$q(z, c \mid x) = q(z \mid x)\, q(c \mid x).$$

Substituting this factorization and the factorization of $p(x, z, c)$ into the evidence lower bound, $\mathcal{L}_{\mathrm{ELBO}}(x)$ can be decomposed as:

$$\mathcal{L}_{\mathrm{ELBO}}(x) = E_{q(z \mid x)}\!\left[\log \frac{p(x \mid z)\, p(z)}{q(z \mid x)}\right] - E_{q(z \mid x)}\!\left[D_{KL}\!\left(q(c \mid x)\,\|\,p(c \mid z)\right)\right],$$

where $p(z) = \sum_{c=1}^{K} p(c)\, p(z \mid c)$ is the Gaussian mixture prior over $z$. In this decomposition, the first term has no relationship with $c$ and the second term is non-negative. Hence, in order to maximize $\mathcal{L}_{\mathrm{ELBO}}(x)$, $D_{KL}\!\left(q(c \mid x)\,\|\,p(c \mid z)\right) \equiv 0$ should be satisfied, which means

$$q(c \mid x) = p(c \mid z) = \frac{p(c)\, p(z \mid c)}{\sum_{c'=1}^{K} p(c')\, p(z \mid c')}.$$
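A small NumPy sketch of this posterior computation follows. The mixture parameters and the latent sample $z$ are toy placeholders, and the log-sum-exp shift is a numerical-stability detail added here, not something prescribed above.

```python
import numpy as np

def log_gaussian_diag(z, mu, sigma2):
    """Log density of a diagonal Gaussian N(z; mu, diag(sigma2))."""
    return -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (z - mu) ** 2 / sigma2, axis=-1)

def gamma(z, pi, mu_c, sigma2_c):
    """q(c|x) = p(c|z): responsibility of each mixture component for z."""
    log_p = np.log(pi) + log_gaussian_diag(z[None, :], mu_c, sigma2_c)   # shape (K,)
    log_p -= log_p.max()                     # log-sum-exp shift for stability
    p = np.exp(log_p)
    return p / p.sum()

# Toy example with K=3 clusters in a J=2 dimensional latent space (assumed sizes).
pi = np.array([0.5, 0.3, 0.2])
mu_c = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])
sigma2_c = np.ones((3, 2))
z = np.array([2.8, 3.1])
print(gamma(z, pi, mu_c, sigma2_c))          # puts most of the mass on cluster 1
```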
In the second form of $\mathcal{L}_{\mathrm{ELBO}}(x)$ above, the first term is the reconstruction term, which encourages VDGEC to explain the dataset well. The second term is the Kullback-Leibler divergence from the Gaussian mixture model prior $p(z, c)$ to the variational posterior $q(z, c \mid x)$, which regularizes the latent embedding to lie on a mixture-of-Gaussians manifold.
Using the factorizations of $p(x, z, c)$ and $q(z, c \mid x)$ above, $\mathcal{L}_{\mathrm{ELBO}}(x)$ can be rewritten as:

$$\mathcal{L}_{\mathrm{ELBO}}(x) = E_{q(z, c \mid x)}\!\left[\log p(x \mid z) + \log p(z \mid c) + \log p(c) - \log q(z \mid x) - \log q(c \mid x)\right].$$
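Each of the five expectations is evaluated below. Since $q(z, c \mid x) = q(z \mid x)\, q(c \mid x)$, every term reduces to the pattern

$$E_{q(z, c \mid x)}\!\left[h(z, c)\right] = \sum_{c=1}^{K} q(c \mid x)\, E_{q(z \mid x)}\!\left[h(z, c)\right]$$

for a generic integrand $h(z, c)$ (a placeholder symbol used only here), which is how the cluster-dependent terms below acquire the weights $q(c \mid x)$.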
We compute $E_{q(z, c \mid x)}\!\left[\log p(x \mid z)\right]$ as:

$$E_{q(z, c \mid x)}\!\left[\log p(x \mid z)\right] \simeq \frac{1}{L} \sum_{l=1}^{L} \log \mathcal{N}\!\left(x \mid \mu_x^{(l)}, \sigma_x^{2\,(l)} I\right) = -\frac{1}{2L} \sum_{l=1}^{L} \sum_{d=1}^{D} \left[\log 2\pi \sigma_{x \mid d}^{2\,(l)} + \frac{\left(x_d - \mu_{x \mid d}^{(l)}\right)^{2}}{\sigma_{x \mid d}^{2\,(l)}}\right],$$

where

$$\left[\mu_x^{(l)}; \log \sigma_x^{2\,(l)}\right] = f\!\left(z^{(l)}; \theta\right), \qquad z^{(l)} = \tilde{\mu} + \tilde{\sigma} \circ \epsilon^{(l)}, \qquad \epsilon^{(l)} \sim \mathcal{N}(0, I),$$

$\circ$ denotes element-wise multiplication, $L$ is the number of Monte Carlo samples, and $\tilde{\mu}$ and $\tilde{\sigma}^{2}$ are the mean and variance of $q(z \mid x)$ defined below.
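A NumPy sketch of this Monte Carlo estimate under the Gaussian decoder: the encoder outputs $(\tilde{\mu}, \tilde{\sigma}^{2})$ and the decoder $f$ are placeholder affine maps, and the sizes $D$, $J$, $L$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
D, J, L = 5, 2, 10                       # data dim, latent dim, MC samples (assumed)

# Placeholder encoder outputs for one data point x (a real model computes these with g(x; phi)).
x = rng.normal(size=D)
mu_tilde, sigma2_tilde = rng.normal(size=J), np.full(J, 0.5)

# Placeholder decoder f(z; theta): affine map to (mu_x, log sigma_x^2).
W_mu, W_lv = rng.normal(size=(D, J)), 0.1 * rng.normal(size=(D, J))

def f(z):
    return W_mu @ z, W_lv @ z - 2.0

def reconstruction_term(x, mu_tilde, sigma2_tilde, L):
    """Monte Carlo estimate of E_q[log p(x|z)] with the reparameterization trick."""
    total = 0.0
    for _ in range(L):
        eps = rng.normal(size=J)
        z = mu_tilde + np.sqrt(sigma2_tilde) * eps          # z = mu~ + sigma~ o eps
        mu_x, log_sigma2_x = f(z)
        sigma2_x = np.exp(log_sigma2_x)
        total += -0.5 * np.sum(np.log(2 * np.pi * sigma2_x)
                               + (x - mu_x) ** 2 / sigma2_x)
    return total / L

print(reconstruction_term(x, mu_tilde, sigma2_tilde, L))
```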
We compute $E_{q(z, c \mid x)}\!\left[\log p(z \mid c)\right]$ as:

$$E_{q(z, c \mid x)}\!\left[\log p(z \mid c)\right] = -\frac{1}{2} \sum_{c=1}^{K} \gamma_c \sum_{j=1}^{J} \left[\log 2\pi \sigma_{c \mid j}^{2} + \frac{\tilde{\sigma}_{j}^{2}}{\sigma_{c \mid j}^{2}} + \frac{\left(\tilde{\mu}_{j} - \mu_{c \mid j}\right)^{2}}{\sigma_{c \mid j}^{2}}\right],$$

where $\gamma_c$ denotes $q(c \mid x)$ for simplicity. We compute $E_{q(z, c \mid x)}\!\left[\log p(c)\right]$ as:

$$E_{q(z, c \mid x)}\!\left[\log p(c)\right] = \sum_{c=1}^{K} \gamma_c \log \pi_c.$$
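Both prior terms have the closed forms above; a small NumPy sketch with toy mixture parameters and encoder outputs (all values assumed for illustration) evaluates them:

```python
import numpy as np

def expected_log_p_z_given_c(gamma, mu_tilde, sigma2_tilde, mu_c, sigma2_c):
    """E_q[log p(z|c)]: Gaussian cross-entropy terms weighted by gamma_c."""
    per_cluster = -0.5 * np.sum(
        np.log(2 * np.pi * sigma2_c)
        + sigma2_tilde[None, :] / sigma2_c
        + (mu_tilde[None, :] - mu_c) ** 2 / sigma2_c,
        axis=1,
    )
    return np.sum(gamma * per_cluster)

def expected_log_p_c(gamma, pi):
    """E_q[log p(c)] = sum_c gamma_c * log pi_c."""
    return np.sum(gamma * np.log(pi))

# Toy values (K=3 clusters, J=2 latent dimensions, all assumed).
pi = np.array([0.5, 0.3, 0.2])
mu_c = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])
sigma2_c = np.ones((3, 2))
mu_tilde, sigma2_tilde = np.array([2.5, 2.9]), np.array([0.2, 0.2])
gamma = np.array([0.05, 0.9, 0.05])

print(expected_log_p_z_given_c(gamma, mu_tilde, sigma2_tilde, mu_c, sigma2_c))
print(expected_log_p_c(gamma, pi))
```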
Following VAE, we use a neural network $g(x; \phi)$ to model $q(z \mid x)$:

$$\left[\tilde{\mu}; \log \tilde{\sigma}^{2}\right] = g(x; \phi), \qquad q(z \mid x) = \mathcal{N}\!\left(z; \tilde{\mu}, \tilde{\sigma}^{2} I\right),$$

where $\phi$ is the parameter of the network $g$. Then, we compute $E_{q(z, c \mid x)}\!\left[\log q(z \mid x)\right]$ as:

$$E_{q(z, c \mid x)}\!\left[\log q(z \mid x)\right] = -\frac{1}{2} \sum_{j=1}^{J} \left(\log 2\pi + 1 + \log \tilde{\sigma}_{j}^{2}\right).$$

We compute $E_{q(z, c \mid x)}\!\left[\log q(c \mid x)\right]$ as:

$$E_{q(z, c \mid x)}\!\left[\log q(c \mid x)\right] = \sum_{c=1}^{K} \gamma_c \log \gamma_c.$$
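The two remaining expectations, the negative entropies of $q(z \mid x)$ and $q(c \mid x)$, also have the closed forms above; a short NumPy sketch with the same kind of placeholder quantities:

```python
import numpy as np

def expected_log_q_z_given_x(sigma2_tilde):
    """E_q[log q(z|x)] = -1/2 * sum_j (log(2*pi) + 1 + log sigma2~_j)."""
    return -0.5 * np.sum(np.log(2 * np.pi) + 1.0 + np.log(sigma2_tilde))

def expected_log_q_c_given_x(gamma):
    """E_q[log q(c|x)] = sum_c gamma_c * log gamma_c (clipped to avoid log 0)."""
    return np.sum(gamma * np.log(np.clip(gamma, 1e-12, 1.0)))

sigma2_tilde = np.array([0.2, 0.2])
gamma = np.array([0.05, 0.9, 0.05])
print(expected_log_q_z_given_x(sigma2_tilde))
print(expected_log_q_c_given_x(gamma))
```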
As for $\gamma_c$, according to the relation $q(c \mid x) = p(c \mid z)$ derived above, we compute it as follows:

$$\gamma_c = q(c \mid x) = p(c \mid z) = \frac{\pi_c\, \mathcal{N}\!\left(z; \mu_c, \sigma_c^{2} I\right)}{\sum_{c'=1}^{K} \pi_{c'}\, \mathcal{N}\!\left(z; \mu_{c'}, \sigma_{c'}^{2} I\right)}.$$
Using the SGVB estimator and the reparameterization trick, $\mathcal{L}_{\mathrm{ELBO}}$ over a mini-batch can be rewritten as:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{ELBO}} \simeq \frac{1}{N} \sum_{i=1}^{N} \Bigg\{
& -\frac{1}{2L} \sum_{l=1}^{L} \sum_{d=1}^{D} \left[\log 2\pi \sigma_{x \mid d}^{2\,(i,l)} + \frac{\big(x_{d}^{(i)} - \mu_{x \mid d}^{(i,l)}\big)^{2}}{\sigma_{x \mid d}^{2\,(i,l)}}\right] \\
& -\frac{1}{2} \sum_{c=1}^{K} \gamma_{c}^{(i)} \sum_{j=1}^{J} \left[\log \sigma_{c \mid j}^{2} + \frac{\tilde{\sigma}_{j}^{2\,(i)}}{\sigma_{c \mid j}^{2}} + \frac{\big(\tilde{\mu}_{j}^{(i)} - \mu_{c \mid j}\big)^{2}}{\sigma_{c \mid j}^{2}}\right] \\
& + \sum_{c=1}^{K} \gamma_{c}^{(i)} \log \frac{\pi_{c}}{\gamma_{c}^{(i)}}
+ \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \tilde{\sigma}_{j}^{2\,(i)}\right)
\Bigg\},
\end{aligned}
$$

where $N$ is the batch size, $D$ is the dimensionality of $x$ and $\mu_x$, $J$ is the dimensionality of $z$ and $\mu_c$, and $* \mid_{d}$ (resp. $* \mid_{j}$) denotes the $d$-th (resp. $j$-th) element of $*$; $K$ is the number of clusters, $\pi_c$ is the prior probability of cluster $c$, and $\gamma_c^{(i)}$ denotes $q(c \mid x^{(i)})$, where $x^{(i)}$ is the $i$-th sample in the mini-batch and the superscript $(i, l)$ marks quantities computed from the $l$-th reparameterized sample $z^{(i,l)} \sim q(z \mid x^{(i)})$.
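Putting the pieces together, the sketch below assembles the per-sample SGVB objective from the terms computed above. All networks are replaced by placeholder affine maps, the sizes are assumed, and evaluating $\gamma_c$ at the encoder mean $\tilde{\mu}$ is a simplification made only for this illustration; it is not presented as the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def elbo_single(x, mu_tilde, sigma2_tilde, pi, mu_c, sigma2_c, f, L=1):
    """Per-sample SGVB estimate of the objective above (Gaussian decoder)."""
    J = mu_tilde.shape[0]
    # gamma_c = q(c|x) = p(c|z), here evaluated at z = mu~ (an illustrative choice).
    log_p = np.log(pi) - 0.5 * np.sum(
        np.log(2 * np.pi * sigma2_c)
        + (mu_tilde[None, :] - mu_c) ** 2 / sigma2_c, axis=1)
    gamma = np.exp(log_p - log_p.max())
    gamma /= gamma.sum()

    # Reconstruction term: Monte Carlo with the reparameterization trick.
    rec = 0.0
    for _ in range(L):
        z = mu_tilde + np.sqrt(sigma2_tilde) * rng.normal(size=J)
        mu_x, log_sigma2_x = f(z)
        sigma2_x = np.exp(log_sigma2_x)
        rec += -0.5 * np.sum(np.log(2 * np.pi * sigma2_x) + (x - mu_x) ** 2 / sigma2_x)
    rec /= L

    # Cluster prior term, categorical term, and encoder entropy term.
    prior = -0.5 * np.sum(gamma * np.sum(
        np.log(sigma2_c) + sigma2_tilde[None, :] / sigma2_c
        + (mu_tilde[None, :] - mu_c) ** 2 / sigma2_c, axis=1))
    cat = np.sum(gamma * np.log(pi / np.clip(gamma, 1e-12, 1.0)))
    ent = 0.5 * np.sum(1.0 + np.log(sigma2_tilde))
    return rec + prior + cat + ent

# Toy usage (K=3, J=2, D=5 assumed); in training this would be averaged over a batch.
D, J, K = 5, 2, 3
pi = np.full(K, 1.0 / K)
mu_c, sigma2_c = rng.normal(size=(K, J)), np.ones((K, J))
W_mu, W_lv = rng.normal(size=(D, J)), 0.1 * rng.normal(size=(D, J))
f = lambda z: (W_mu @ z, W_lv @ z - 2.0)
x, mu_tilde, sigma2_tilde = rng.normal(size=D), rng.normal(size=J), np.full(J, 0.5)
print(elbo_single(x, mu_tilde, sigma2_tilde, pi, mu_c, sigma2_c, f, L=5))
```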