Data scientists often choose the uniform distribution or the normal distribution as the latent variable distribution when they build representative models of datasets. For example, the original studies of GANs [1] and VAEs [2] used the uniform distribution and the normal distribution, respectively.
Because the function approximated by a neural network is usually continuous, topological properties of the latent variable distribution, such as connectedness, carry over through the transformation from the latent variable space to the observable variable space. Suppose that the observed variables are distributed on a torus and that a network such as a GAN is trained with latent variables sampled from the normal distribution. The distribution projected by the trained network then cannot match the torus topologically, even if the residual error is small. As another example, suppose that the observable variables follow a mixture distribution whose clusters are separated from each other. A trained variational autoencoder can encode the features in the latent variable space with high precision, but the decoded distribution consists of connected clusters because the latent variable distribution is topologically equivalent to a ball. In both cases, the topology of the given dataset is not represented by the projection of the trained network.
In this short text, we study how this topological mismatch affects the training of autoencoders. As the autoencoder we use the SAE [4], which extends the WAE [3] by using the Sinkhorn algorithm.
Agents consist of encoder and decoder networks. The encoder networks transform observable variables into latent variables, and the decoder networks map the latent variables back to the represented observable variables.
In our case studies, we define these two networks as multilayer perceptrons with the following hyperparameters: the number of units nH, the number of layers nLayer and the activation function activation. The encoder and decoder networks share the same hyperparameters, whose values are defined in each case study.
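As a rough illustration of how nH, nLayer and activation could shape such a pair of networks, here is a minimal pure-Python sketch. It is not the implementation used in the case studies. The helper names init_mlp and forward, the weight initialization, and the reading of nLayer as the number of weight layers are all assumptions made for this example.

```python
import math
import random


def init_mlp(n_in, n_out, nH, nLayer, seed=0):
    """Build a list of (weights, biases) pairs for a multilayer perceptron.

    Assumption: nLayer counts weight layers, i.e. nLayer - 1 hidden layers
    of nH units followed by one linear output layer.
    """
    rng = random.Random(seed)
    sizes = [n_in] + [nH] * (nLayer - 1) + [n_out]
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(fan_in)  # hypothetical initialization scale
        W = [[rng.uniform(-scale, scale) for _ in range(fan_in)]
             for _ in range(fan_out)]
        b = [0.0] * fan_out
        params.append((W, b))
    return params


def forward(params, x, activation=math.tanh):
    """Apply the network to a single input vector x (a list of floats)."""
    h = list(x)
    for i, (W, b) in enumerate(params):
        h = [sum(w * v for w, v in zip(row, h)) + bias
             for row, bias in zip(W, b)]
        if i < len(params) - 1:  # the output layer stays linear
            h = [activation(v) for v in h]
    return h


# The encoder and decoder share nH, nLayer and activation.
nH, nLayer = 32, 3
encoder = init_mlp(n_in=3, n_out=3, nH=nH, nLayer=nLayer, seed=1)
decoder = init_mlp(n_in=3, n_out=3, nH=nH, nLayer=nLayer, seed=2)

y = [0.5, -0.2, 0.1]         # one observable sample
z = forward(encoder, y)      # its latent representation
y_hat = forward(decoder, z)  # the represented observable sample
```

Both maps are compositions of affine maps and tanh, so they are continuous. That continuity is what the argument in this text relies on.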
Environments generate datasets sampled from distributions of observable variables. The distributions are defined in each case study.
In all the case studies, we adopt the Sinkhorn AutoEncoder (SAE) objective [4] as the training criterion. The regularization parameter reg_param is defined in each case study.
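The sketch below shows one way an objective of this kind can be assembled: a reconstruction term plus a weighted entropy-regularized optimal transport cost, computed with Sinkhorn iterations, between encoded samples and samples from the reference latent distribution. It is a simplified illustration, not the objective code of [4]. The text does not say whether reg_param is the weight on the latent term or the entropic regularization strength. Here it is assumed to be the weight, and a separate hypothetical parameter epsilon controls the entropic regularization. The squared Euclidean ground cost is also an assumption.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def sinkhorn_cost(xs, ys, epsilon=0.5, n_iter=200):
    """Entropy-regularized transport cost between two uniform point clouds."""
    n, m = len(xs), len(ys)
    C = [[sq_dist(x, y) for y in ys] for x in xs]
    K = [[math.exp(-c / epsilon) for c in row] for row in C]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n))
             for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j; the cost is sum_ij P_ij * C_ij.
    return sum(u[i] * K[i][j] * v[j] * C[i][j]
               for i in range(n) for j in range(m))


def sae_loss(ys, y_hats, zs, prior_zs, reg_param=10.0, epsilon=0.5):
    """Reconstruction error plus reg_param times the latent discrepancy."""
    recon = sum(sq_dist(y, yh) for y, yh in zip(ys, y_hats)) / len(ys)
    return recon + reg_param * sinkhorn_cost(zs, prior_zs, epsilon=epsilon)


prior = [[0.0, 0.0], [1.0, 0.0]]        # reference latent samples
encoded_near = [[0.0, 0.0], [1.0, 0.0]]  # encoder output matching the prior
encoded_far = [[3.0, 0.0], [4.0, 0.0]]   # encoder output far from the prior

cost_near = sinkhorn_cost(encoded_near, prior)
cost_far = sinkhorn_cost(encoded_far, prior)
```

In this toy case cost_near is smaller than cost_far, so the latent term rewards encoders whose outputs cover the reference distribution. Note that the term only compares point clouds numerically and carries no explicit information about their topology.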
This case study builds representative models of a two-dimensional 1-torus by using an autoencoder whose latent variables are sampled from the two-dimensional uniform distribution. It shows an example of the consequences of a topological mismatch between the observable and latent variable distributions.
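To make this setup concrete, the following sketch samples a pair of distributions of this kind. It rests on assumptions that the text does not state: the two-dimensional 1-torus is taken to be an annulus in the plane, the latent distribution is taken to be uniform on a square, and the radii and the function names are hypothetical.

```python
import math
import random


def sample_annulus(n, r_in=0.5, r_out=1.0, seed=0):
    """Sample n points uniformly (in area) from an annulus in the plane."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        # Sampling r^2 uniformly makes the density uniform in area.
        r = math.sqrt(rng.uniform(r_in ** 2, r_out ** 2))
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


def sample_uniform_square(n, seed=0):
    """Sample n points uniformly from the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    return [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
            for _ in range(n)]


observed = sample_annulus(512, seed=1)                 # has a hole
latent_reference = sample_uniform_square(512, seed=2)  # has no hole
```

The observed samples avoid a disk around the origin, while the latent reference has no hole. That difference is the mismatch studied in this case.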
Models are trained with the hyperparameters shown in Table 3.1.1. Figure 3.1.1 (a) and (b) show the learning curves of the following training performances, respectively.
- Representative error, mean((Y-Yhat)^2), where Y and Yhat are the original observed variables and the represented ones, respectively.
- Discrepancy between the referenced distribution of the latent variables and the one projected by the trained encoder. Note that the discrepancy is measured by the absolute norm Wasserstein distance.
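The representative error can be written out as a small function. This is only a sketch. It assumes that Y and Yhat are lists of samples with the same shape and that the mean runs over all samples and all coordinates. The function name is chosen for this example.

```python
def representative_error(Y, Yhat):
    """Mean of the squared element-wise differences, mean((Y - Yhat)^2)."""
    total, count = 0.0, 0
    for y, y_hat in zip(Y, Yhat):
        for a, b in zip(y, y_hat):
            total += (a - b) ** 2
            count += 1
    return total / count


Y = [[0.0, 1.0], [1.0, 0.0]]     # original observed variables
Yhat = [[0.0, 0.5], [1.0, 0.0]]  # represented variables
error = representative_error(Y, Yhat)  # 0.25 / 4 = 0.0625
```

A small value only says that the samples are close on average. It says nothing about whether the image of the decoder has the same topology as the data.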
The learning curves tell us that the training has converged at the end of the training iterations.
Figure 3.1.2(a) (or Figure 3.1.2(b)) shows the images projected through the encoder (or decoder) of the trained model that has average performance among the models trained with nLayer=7. The left image is the input image of the observed (or latent) variables approximated by an analytical function, and the right image is obtained by projecting the input image through the trained encoder (or decoder). Here are our findings.
- The learning curves in Figure 3.1.1(a) and (b) tell us that the projected samples match the original samples well and that the distribution of the latent variables looks like the uniform distribution.
- Figure 3.1.2(a) shows that the hole in the latent variable image is fairly small and that the referenced latent variable distribution is almost covered by the projected one.
- Figure 3.1.2(b) shows that the decoder's projected image is topologically equivalent to a disk, even though the region around the hole is stretched.
The last two findings say that the encoder and decoder, as maps between the observable variables and the latent ones, cannot preserve the topological structure. This might cause practical problems. For example, suppose that you optimize a function defined on a 1-torus and that you parameterize the decision variables on the torus by the latent variables of the autoencoder. You might then find a solution at a point in the hole of the torus, which is of course infeasible, because there exists a certain area in the latent variable space that is mapped onto the hole of the torus.
Table 3.1.1. Hyperparameters
name | description | value |
---|---|---|
nEpoch | the number of epochs | 512 |
nBatch | the sample size of a single batch | 512 |
nH | the number of units of the encoder and decoder network | 512 |
nLayer | the number of layers of the encoder and decoder network | 3, 5 and 7 |
reg_param | the regularization parameter | 10 |
We move on to the next case study to see another type of topological mismatch: one distribution lies on a twisted surface in three-dimensional space, while the other lies on a surface that is not twisted. The autoencoders cannot reconcile this difference, since a twisted (or untwisted) image is mapped onto a twisted (or untwisted) image. We look at the consequences of training autoencoders under this topological mismatch.
Here are the specifications of our experiment.
The environment generates a dataset sampled randomly from the Möbius band. More precisely, the variables x, y and z in three-dimensional space are randomly distributed on the surface defined in site.
On the other hand, we define the agent so that the latent variables u, v and w follow the uniform random distribution over a ring as follows:
Note that the observable variables' distribution is twisted, while the latent variables' one is not.
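As an illustration of a twisted and an untwisted distribution of this kind, the sketch below samples a Möbius band and an untwisted band in three-dimensional space. The parameterizations are assumptions: the Möbius band uses a commonly used textbook form, which may differ from the surface referenced above, and the ring is assumed to be a short cylindrical band. Sampling the parameters uniformly is a simplification and is not exactly uniform in area on the Möbius band.

```python
import math
import random


def sample_mobius(n, half_width=0.5, seed=0):
    """Sample n points (x, y, z) on a Moebius band (assumed parameterization)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        t = rng.uniform(0.0, 2.0 * math.pi)       # angle around the band
        s = rng.uniform(-half_width, half_width)  # position across the band
        radius = 1.0 + s * math.cos(t / 2.0)      # half twist per full turn
        points.append((radius * math.cos(t),
                       radius * math.sin(t),
                       s * math.sin(t / 2.0)))
    return points


def sample_untwisted_band(n, half_width=0.5, seed=0):
    """Sample n points (u, v, w) on an untwisted band around the w axis."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        t = rng.uniform(0.0, 2.0 * math.pi)
        s = rng.uniform(-half_width, half_width)
        points.append((math.cos(t), math.sin(t), s))
    return points


observed = sample_mobius(512, seed=1)                  # twisted
latent_reference = sample_untwisted_band(512, seed=2)  # not twisted
```

In the first function the cross-section rotates by half a turn as t goes once around the band, which produces the twist. The second function keeps the cross-section fixed.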
We train agents with the hyperparameters in Table 3.2.1, and Figure 3.2.1 shows the learning curves of the two performances already introduced in case study #1. They tell us that the training has converged by the final epoch. Below we examine in detail one of the trained agents whose performance is close to the average.
Figure 3.2.2(a) shows how the trained encoder maps the observable variable image (the blue one on the left) to the latent variable image (the blue one on the right). The projected image in the latent variable space represents the referenced image (the gray one) well. In particular, the part mapped from the twisted part of the input image is pushed and piled up on the surface of the referenced image, because the encoder cannot untangle the twist of the Möbius band.
Figure 3.2.2(b) shows the referenced latent variable image (the red one on the left) and its image projected onto the observable variable space by the trained decoder (the red one on the right). As in the case of the encoder, the projected image looks like the referenced observable image (the gray one on the right). However, since the referenced latent image is not twisted, a part of the projected image is stretched and flipped in order to fit the twisted part of the Möbius band. This happens because the decoder has to preserve the topological structure.
Thus, however well autoencoders regenerate datasets of distributions with complex structures in a data-driven manner, they cannot represent the topological structure.
Table 3.2.1. Hyperparameters
name | description | value |
---|---|---|
nEpoch | the number of epochs | 512 |
nBatch | the sample size of a single batch | 512 |
nH | the number of units of the encoder and decoder network | 128 |
nLayer | the number of layers of the encoder and decoder network | 3 |
reg_param | the regularization parameter | 10 |
In the third example, we consider the difference between knots as an example of topological mismatch. We try to represent one type of knot by transforming another type of knot, and we see how this discrepancy affects the training of the autoencoders.
Here is the configuration of agents and environments:
- The environment randomly samples values from a trefoil knot in three-dimensional space.
- The latent variables of the agents are sampled from an unknot, namely a simple ring, in three-dimensional space.
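The two curves in this configuration can be sketched as follows. The trefoil parameterization is a commonly used one and is an assumption here, since the text does not give the exact formula. The unknot is taken to be the unit circle in the plane z = 0. Sampling the curve parameter uniformly is a simplification and is not uniform in arc length on the trefoil.

```python
import math
import random


def trefoil(t):
    """Point on a trefoil knot for parameter t (assumed parameterization)."""
    return (math.sin(t) + 2.0 * math.sin(2.0 * t),
            math.cos(t) - 2.0 * math.cos(2.0 * t),
            -math.sin(3.0 * t))


def unknot(t):
    """Point on the unit circle in the plane z = 0, i.e. a simple ring."""
    return (math.cos(t), math.sin(t), 0.0)


def sample_curve(curve, n, seed=0):
    """Sample n points from a closed curve with the parameter in [0, 2*pi)."""
    rng = random.Random(seed)
    return [curve(rng.uniform(0.0, 2.0 * math.pi)) for _ in range(n)]


observed = sample_curve(trefoil, 32, seed=1)         # environment samples
latent_reference = sample_curve(unknot, 32, seed=2)  # latent samples
```

Both curves are closed loops, so each is a continuous image of the other. They differ in how they are embedded in three-dimensional space, and a continuous deformation cannot turn one embedding into the other without the curve passing through itself.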
Table 3.3.1 shows the hyperparameter set for the training. Note that a small batch size is required in this training, probably because a smaller batch breaks more effectively the symmetry across the x-y plane that the observable variable distribution has. The training has saturated by the final epoch, as confirmed by the learning curves shown in Figure 3.3.1.
We select one of the trained agents whose performance is close to the average and analyze it.
- Figure 3.3.2(a) shows the transformation of the observable variable distribution onto the latent variable space by the trained encoder. Although the output image is close to the referenced latent variable distribution, it seemingly preserves the knot of the original image.
- Figure 3.3.2(b) shows the referenced latent variable image, that is, the simple circle, and its image projected by the trained decoder onto the observable variable space. The output closed loop approaches the original trefoil knot. However, it is hard to fit it completely because the projected image cannot form a new knot on its own.
In this way, even if the numerical evaluations of the error and of the discrepancy between distributions are small, the autoencoders are not capable of creating new knots. The topological discrepancy cannot be resolved by the autoencoders alone.
Table 3.3.1. Hyperparameters
name | description | value |
---|---|---|
nEpoch | the number of epochs | 512 |
nBatch | the sample size of a single batch | 32 |
nH | the number of units of the encoder and decoder network | 32 |
nLayer | the number of layers of the encoder and decoder network | 3 |
reg_param | the regularization parameter | 0.1 |
activation | activation function of agent | tanh |
- [1]: Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio: “Generative Adversarial Networks”, 2014; arXiv:1406.2661.
- [2]: Diederik P Kingma, Max Welling: “Auto-Encoding Variational Bayes”, 2013; arXiv:1312.6114.
- [3]: Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, Bernhard Schoelkopf: “Wasserstein Auto-Encoders”, 2017; arXiv:1711.01558.
- [4]: Giorgio Patrini, Rianne van den Berg, Patrick Forré, Marcello Carioni, Samarth Bhargav, Max Welling, Tim Genewein, Frank Nielsen: “Sinkhorn AutoEncoders”, 2018; arXiv:1810.01118.