Variational Autoencoders Explained in Detail

The model is composed of three sub-networks:
- Given $x$ (image), encode it into a distribution over the latent space – referred to as $Q(z|x)$ in the previous post.
- Given $z$ in latent space (code representation of an image), decode it into the image it represents – referred to as $f(z)$ in the previous post.
- Given $x$, classify its digit by mapping it to a layer of size 10 whose $i$-th value is the probability that the image shows the $i$-th digit.
The first two sub-networks are the vanilla VAE framework.
The third one is used as an auxiliary task, which will force some of the latent dimensions to encode the digit found in an image. Let me explain the motivation: in the previous post I explained that we don’t care what information each dimension of the latent space holds. The model can learn to encode whatever information it finds valuable for its task. Since we’re familiar with the dataset, we know the digit type should be important, so we want to help the model by providing it with this information. Moreover, we’ll use this information to generate images conditioned on the digit type, as I’ll explain later.
Given the digit type, we’ll represent it with a one-hot encoding, that is, a vector of size 10 with a 1 at the digit’s index and 0 everywhere else. These 10 numbers will be concatenated to the latent vector, so when decoding that vector into an image, the model can make use of the digit information.
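To make the concatenation concrete, here is a minimal NumPy sketch. The latent size of 8 and the helper name `one_hot` are my own assumptions for illustration, and the latent vector is just random noise standing in for an encoder output.

```python
import numpy as np

def one_hot(digit, num_classes=10):
    """Return a vector of zeros with a single 1 at index `digit`."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[digit] = 1.0
    return vec

latent_dim = 8  # assumed size, chosen only for illustration
rng = np.random.default_rng(0)

# Stand-in for the part of the latent vector produced by the encoder.
z = rng.standard_normal(latent_dim).astype(np.float32)

# Digit information for an image of the digit 3.
digit_vec = one_hot(3)

# The decoder sees both the free latent dimensions and the digit.
decoder_input = np.concatenate([z, digit_vec])
```

With these sizes, `decoder_input` has 18 entries: 8 free latent dimensions followed by the 10 digit dimensions.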
There are two ways to provide the model with a one-hot vector:
- Add it as an input to the model.
- Add it as a label that the model must predict on its own: we’ll add another sub-network that outputs a vector of size 10, trained with a cross-entropy loss against the expected one-hot vector.
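
As a rough illustration of the label option, the cross-entropy between a predicted distribution over digits and a one-hot target can be sketched as follows. The function names `softmax` and `cross_entropy` are hypothetical helpers of mine, not taken from any particular library, and the logits are made up.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(probs, one_hot_target, eps=1e-12):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return float(-np.sum(one_hot_target * np.log(probs + eps)))

# One-hot label for the digit 3.
target = np.zeros(10)
target[3] = 1.0

# A classifier that has learned nothing outputs equal scores for all digits,
# so its loss is ln(10), roughly 2.30.
uniform_loss = cross_entropy(softmax(np.zeros(10)), target)

# A classifier that strongly favors the right digit gets a much smaller loss.
confident_logits = np.zeros(10)
confident_logits[3] = 5.0
confident_loss = cross_entropy(softmax(confident_logits), target)
```

Because the target is one-hot, the sum collapses to the negative log of the probability assigned to the correct digit.
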
We’ll go with the second option. Why? Well, at test time we can use the model in two ways:
- Provide an image as input, and infer a latent vector.
- Provide a latent vector as input, and generate an image.
Since we want to support the first option too, we can’t provide the model with the digit as input, because we won’t know it at test time. Hence, the model must learn to predict it.
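The two test-time modes can be sketched with toy stand-ins for the three sub-networks. Everything here is an assumption made for illustration: the layer sizes, the single random (untrained) linear layer per sub-network, the diagonal Gaussian parameterization of $Q(z|x)$, and the function names. It is not the model's actual code, only the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
IMAGE_DIM, LATENT_DIM, NUM_DIGITS = 784, 8, 10  # assumed sizes

# Random, untrained weights standing in for the learned sub-networks.
W_mu = rng.standard_normal((IMAGE_DIM, LATENT_DIM)) * 0.01
W_logvar = rng.standard_normal((IMAGE_DIM, LATENT_DIM)) * 0.01
W_cls = rng.standard_normal((IMAGE_DIM, NUM_DIGITS)) * 0.01
W_dec = rng.standard_normal((LATENT_DIM + NUM_DIGITS, IMAGE_DIM)) * 0.01

def encode(x):
    """Q(z|x): mean and log-variance of a diagonal Gaussian (assumed form)."""
    return x @ W_mu, x @ W_logvar

def classify(x):
    """Auxiliary classifier: probabilities over the 10 digits."""
    logits = x @ W_cls
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def decode(z, digit_vec):
    """f(z): map the latent code plus digit vector to pixels in (0, 1)."""
    h = np.concatenate([z, digit_vec]) @ W_dec
    return 1.0 / (1.0 + np.exp(-h))

def infer_latent(x):
    """Mode 1: image in, latent vector out. The digit is predicted, not given."""
    mu, _ = encode(x)
    return np.concatenate([mu, classify(x)])

def generate(digit):
    """Mode 2: latent vector in, image out. Here we choose the digit ourselves."""
    z = rng.standard_normal(LATENT_DIM)
    return decode(z, np.eye(NUM_DIGITS)[digit])

x = rng.random(IMAGE_DIM)   # stand-in for a flattened 28x28 image
latent = infer_latent(x)    # no digit label needed
image = generate(7)         # an image conditioned on the digit 7
```

Note that `infer_latent` never receives a label, which is exactly why the digit has to be a prediction rather than an input.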
Now that we understand all the sub-networks composing the model, we can code them. The mathematical details behind the encoder and decoder can be found in the previous post.




