Autoencoders are feed-forward, non-recurrent neural networks that learn by unsupervised learning, also sometimes called self-supervised learning, since the input is treated as the target too. In this chapter, you will learn about and implement different variants of autoencoders and eventually learn how to stack autoencoders. We will also see how autoencoders can be used to reconstruct MNIST digits, and finally cover the steps involved in building a long short-term memory autoencoder to generate sentence vectors. This chapter includes the following topics:

- Vanilla autoencoders
- Sparse autoencoders
- Denoising autoencoders
- Convolutional autoencoders
- Stacked autoencoders
- Generating sentences using LSTM autoencoders

# Introduction to autoencoders

Autoencoders are a class of neural network that attempt to recreate the input as their target using back-propagation. An autoencoder consists of two parts: an encoder and a decoder. The encoder reads the input and compresses it to a compact representation, and the decoder reads the compact representation and recreates the input from it. In other words, the autoencoder tries to learn the identity function by minimizing the reconstruction error. Autoencoders have an inherent capability to learn a compact representation of data. They are at the center of deep belief networks and find applications in image reconstruction, clustering, machine translation, and much more.

You might think that implementing an identity function using deep neural networks is boring; however, the way in which this is done makes it interesting. The number of hidden units in the autoencoder is typically less than the number of input (and output) units. This forces the encoder to learn a compressed representation of the input, which the decoder reconstructs. If there is structure in the input data in the form of correlations between input features, then the autoencoder will discover some of these correlations, and end up learning a low-dimensional representation of the data similar to that learned using **principal component analysis** (**PCA**).

While PCA uses linear transformations, autoencoders on the other hand use non-linear transformations.
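As a point of reference for the linear case, the following sketch compresses data with PCA (computed from a singular value decomposition) and measures the reconstruction error. The toy data and the helper name `pca_reconstruct` are illustrative assumptions, not part of the chapter's code:

```python
import numpy as np

def pca_reconstruct(x, k):
    """Project rows of x onto the top-k principal components and map back."""
    mean = x.mean(axis=0)
    centered = x - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                         # shape (k, n_features)
    codes = centered @ components.T             # linear "encoder"
    reconstructed = codes @ components + mean   # linear "decoder"
    return codes, reconstructed

rng = np.random.default_rng(0)
# Toy data: 200 samples of 10 features that lie on a 3-dimensional subspace.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
x = latent @ mixing

codes, x_hat = pca_reconstruct(x, k=3)
mse = np.mean((x - x_hat) ** 2)
print(codes.shape, mse)
```

Because the toy data is exactly linear, three components reconstruct it almost perfectly. An autoencoder with non-linear activations is aimed at data whose structure a purely linear projection like this one cannot capture.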

Once the autoencoder is trained, we would typically just discard the decoder component and use the encoder component to generate compact representations of the input. Alternatively, we could use the encoder as a feature detector that generates a compact, semantically rich representation of our input and build a classifier by attaching a softmax classifier to the hidden layer.

The encoder and decoder components of an autoencoder can be implemented using either dense, convolutional, or recurrent networks, depending on the kind of data that is being modeled. For example, dense networks might be a good choice for autoencoders used to build **collaborative filtering** (**CF**) models where we learn a compressed model of user preferences based on actual sparse user ratings. Similarly, convolutional neural networks may be appropriate for the use case described in the article *iSee: Using Deep Learning to Remove Eyeglasses from Faces*, by M. Runfeldt. Recurrent networks, on the other hand, are a good choice for autoencoders working on text data, such as deep patient and skip-thought vectors.

We can think of autoencoders as consisting of two cascaded networks. The first network is an encoder; it takes the input *x* and encodes it using a transformation *h* to an encoded signal *y*, that is:

*y = h(x)*

The second network uses the encoded signal *y* as its input and performs another transformation *f* to get a reconstructed signal *r*, that is:

*r = f(y) = f(h(x))*

We define the error, *e*, as the difference between the original input *x* and the reconstructed signal *r*, that is, *e = x - r*. The network then learns by minimizing a loss function of this error (for example, **mean squared error** (**MSE**)), and the error is propagated backward to the hidden layers, as in the case of MLPs.
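The notation above can be traced with a small NumPy sketch. The layer sizes, the random (untrained) weights, and the choice of ReLU for *h* are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_input, n_hidden = 8, 3   # hypothetical sizes, with n_hidden < n_input

# Randomly initialised, untrained parameters (illustration only).
w_enc = rng.normal(scale=0.1, size=(n_input, n_hidden))
w_dec = rng.normal(scale=0.1, size=(n_hidden, n_input))

def h(x):
    """Encoder: y = h(x), here a dense layer followed by ReLU."""
    return np.maximum(0.0, x @ w_enc)

def f(y):
    """Decoder: r = f(y), here a plain dense layer."""
    return y @ w_dec

x = rng.random(size=(5, n_input))   # a batch of 5 inputs
y = h(x)                            # encoded signal
r = f(y)                            # reconstructed signal
e = x - r                           # error
mse = np.mean(e ** 2)               # loss that training would minimise
print(y.shape, r.shape, mse)
```

Training consists of adjusting `w_enc` and `w_dec` so that `mse` decreases; the rest of the chapter delegates that to TensorFlow.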

Depending upon the actual dimensions of the encoded layer with respect to the input, the loss function, and constraints, there are various types of autoencoders: Variational autoencoders, Sparse autoencoders, Denoising autoencoders, and Convolutional autoencoders.

Autoencoders can also be stacked by successively stacking encoders that compress their input to smaller and smaller representations, then stacking decoders in the opposite sequence. Stacked autoencoders have greater expressive power and the successive layers of representations capture a hierarchical grouping of the input, similar to the convolution and pooling operations in convolutional neural networks.

Stacked autoencoders used to be trained layer by layer. For example, in the network shown next, we would first train layer **X** to reconstruct layer **X'** using the hidden layer **H1** (ignoring **H2**). We would then train the layer **H1** to reconstruct layer **H1'** using the hidden layer **H2**. Finally, we would stack all the layers together in the configuration shown and fine-tune it to reconstruct **X'** from **X**. With better activation and regularization functions nowadays, however, it is quite common to train these networks in totality:

Figure 1: Visualisation of stacked autoencoders

In this chapter, we will learn about these variations in autoencoders and implement them using TensorFlow 2.0.

# Vanilla autoencoders

The Vanilla autoencoder, as proposed by Hinton in his 2006 paper *Reducing the Dimensionality of Data with Neural Networks*, consists of one hidden layer only. The number of neurons in the hidden layer is less than the number of neurons in the input (or output) layer.

This produces a bottleneck effect in the flow of information in the network. The hidden layer in between is also called the "bottleneck layer." Learning in the autoencoder consists of developing a compact representation of the input signal at the hidden layer so that the output layer can faithfully reproduce the original input.

In the following diagram, you can see the architecture of the Vanilla autoencoder:

Figure 2: Architecture of the Vanilla autoencoder, visualized

Let us try to build a Vanilla autoencoder. While in the paper Hinton used it for dimension reduction, in the code to follow we will use autoencoders for image reconstruction. We will train the autoencoder on the MNIST database and will use it to reconstruct the test images. In the code, we will use the TensorFlow Keras `Layer` class to build our own encoder and decoder layers, so first let's learn a little about the `Layer` class.

## TensorFlow Keras layers ‒ defining custom layers

TensorFlow provides an easy way to define your own custom layer, either from scratch or as a composition of existing layers. The TensorFlow Keras `layers` package defines a `Layer` class. We can make our own layer by simply making it a child class of the `Layer` class. It is necessary to define the dimensions of the output while defining the layer. The input dimensions are optional; if you do not define them, they are inferred automatically from the data. To build our own layer we will need to implement three methods:

- `__init__()`: Here, you define all input-independent initializations.
- `build()`: Here, we define the shapes of input tensors and can perform the remaining initializations if required. In our example, since we are not explicitly defining input shapes, we need not define the `build()` method.
- `call()`: This is where the forward computation is performed.
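For contrast with the chapter's layers, which skip `build()`, here is a minimal sketch of a layer that implements all three methods. The class name `SimpleDense` and its sizes are hypothetical; the weights are created in `build()` once the input shape is known:

```python
import tensorflow as tf

class SimpleDense(tf.keras.layers.Layer):
    """Hypothetical example layer: a dense transform with weights made in build()."""

    def __init__(self, units):
        super().__init__()
        # Input-independent state only.
        self.units = units

    def build(self, input_shape):
        # Runs once, the first time the layer is called; the last
        # dimension of the input is known at this point.
        self.kernel = self.add_weight(
            name="kernel",
            shape=(input_shape[-1], self.units),
            initializer="glorot_uniform",
            trainable=True,
        )
        self.bias = self.add_weight(
            name="bias", shape=(self.units,), initializer="zeros", trainable=True
        )

    def call(self, inputs):
        # Forward computation.
        return tf.nn.relu(tf.matmul(inputs, self.kernel) + self.bias)

layer = SimpleDense(units=4)
outputs = layer(tf.ones((2, 6)))   # build() is triggered here
print(outputs.shape, layer.kernel.shape)
```

When a custom layer only wraps existing Keras layers, as the encoder and decoder in this chapter do, those inner layers create their own weights, which is why `build()` can be omitted.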

Using the `tensorflow.keras.layers` module we now define the encoder and decoder layers. First let's start with the encoder layer. We import `tensorflow.keras` as `K`, and create an `Encoder` class. The `Encoder` takes in the input and generates the hidden or the bottleneck layer as the output:

```
import tensorflow as tf
import tensorflow.keras as K

class Encoder(K.layers.Layer):
    def __init__(self, hidden_dim):
        super(Encoder, self).__init__()
        # Single dense layer that produces the bottleneck representation
        self.hidden_layer = K.layers.Dense(units=hidden_dim, activation=tf.nn.relu)

    def call(self, input_features):
        activation = self.hidden_layer(input_features)
        return activation
```

Next we define the `Decoder` class; this class takes in the output from the `Encoder` and then passes it through a fully connected neural network. The aim is to be able to reconstruct the input to the `Encoder`:

```
class Decoder(K.layers.Layer):
    def __init__(self, hidden_dim, original_dim):
        super(Decoder, self).__init__()
        # Dense layer that maps the bottleneck back to the input size
        self.output_layer = K.layers.Dense(units=original_dim, activation=tf.nn.relu)

    def call(self, encoded):
        activation = self.output_layer(encoded)
        return activation
```

Now that we have both encoder and decoder defined, we use the `tensorflow.keras.Model` object to build the autoencoder model. You can see in the following code that in the `__init__()` function we instantiate the encoder and decoder objects, and in the `call()` method we define the signal flow. Also notice the member list `self.loss` initialized in `__init__()`:

```
class Autoencoder(K.Model):
    def __init__(self, hidden_dim, original_dim):
        super(Autoencoder, self).__init__()
        self.loss = []
        self.encoder = Encoder(hidden_dim=hidden_dim)
        self.decoder = Decoder(hidden_dim=hidden_dim, original_dim=original_dim)

    def call(self, input_features):
        encoded = self.encoder(input_features)
        reconstructed = self.decoder(encoded)
        return reconstructed
```

In the next section we will use the autoencoder that we defined here to reconstruct handwritten digits.

## Reconstructing handwritten digits using an autoencoder

Now that we have our autoencoder model with its encoder and decoder layers ready, let us try to reconstruct handwritten digits. The complete code is available in the GitHub repo of the chapter in the notebook `VanillaAutoencoder.ipynb`. The code will require the NumPy, TensorFlow, and Matplotlib modules:

```
import numpy as np
import tensorflow as tf
import tensorflow.keras as K
import matplotlib.pyplot as plt
```

Before starting with the actual implementation, let's also define some hyperparameters. If you play around with them, you will notice that even though the architecture of your model remains the same, there is a significant change in model performance. Hyperparameter tuning (refer to *Chapter 1*, *Neural Network Foundations with TensorFlow 2.0,* for more details) is one of the important steps in deep learning. For reproducibility, we set the seeds for random calculation:

```
np.random.seed(11)
tf.random.set_seed(11)
batch_size = 256
max_epochs = 50
learning_rate = 1e-3
momentum = 8e-1
hidden_dim = 128
original_dim = 784
```

For training data, we are using the MNIST dataset available in the TensorFlow datasets. We normalize the data so that pixel values lie in the range [0, 1]; this is achieved by simply dividing each pixel element by 255. We then reshape each image from a 2D array of 28 × 28 pixels to a 1D vector of 784 values. We employ `from_tensor_slices` to generate slices of tensors. Also note that here we are not using one-hot encoded labels; this is the case because we are not using labels to train the network. Autoencoders learn via unsupervised learning:

```
(x_train, _), (x_test, _) = K.datasets.mnist.load_data()
x_train = x_train / 255.
x_test = x_test / 255.
x_train = x_train.astype(np.float32)
x_test = x_test.astype(np.float32)
x_train = np.reshape(x_train, (x_train.shape[0], 784))
x_test = np.reshape(x_test, (x_test.shape[0], 784))
training_dataset = tf.data.Dataset.from_tensor_slices(x_train).batch(batch_size)
```

Now we instantiate our autoencoder model object and define the loss and optimizer to be used for training. Observe the loss carefully; it is simply the mean of the squared difference between the original image and the reconstructed image. You may find that the term *reconstruction loss* is also used to describe it in many books and papers:

```
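autoencoder = Autoencoder(hidden_dim=hidden_dim, original_dim=original_dim)
# The optimizer is given 1e-2 directly, not the learning_rate defined earlier
opt = tf.keras.optimizers.Adam(learning_rate=1e-2)

def loss(preds, real):
    # Mean squared error between the reconstruction and the original
    return tf.reduce_mean(tf.square(tf.subtract(preds, real)))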
```

Instead of using the built-in training loop, we will define custom training for our custom autoencoder model. We use `tf.GradientTape` to record the forward computation so that the gradients can be calculated from it, and then explicitly apply the gradients to all the trainable variables of our model:

```
def train(loss, model, opt, original):
    # Record the forward pass so that gradients can be computed from it
    with tf.GradientTape() as tape:
        preds = model(original)
        reconstruction_error = loss(preds, original)
    gradients = tape.gradient(reconstruction_error, model.trainable_variables)
    gradient_variables = zip(gradients, model.trainable_variables)
    opt.apply_gradients(gradient_variables)
    return reconstruction_error
```

The preceding `train()` function will be invoked in a training loop, with the dataset fed to the model in batches:

```
def train_loop(model, opt, loss, dataset, epochs=20):
    for epoch in range(epochs):
        epoch_loss = 0
        for step, batch_features in enumerate(dataset):
            loss_values = train(loss, model, opt, batch_features)
            epoch_loss += loss_values
        # Store the summed batch losses for this epoch
        model.loss.append(epoch_loss)
        print('Epoch {}/{}. Loss: {}'.format(epoch + 1, epochs, epoch_loss.numpy()))
```
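To see in isolation what `tf.GradientTape` and `apply_gradients()` do inside `train()`, here is a minimal sketch of a single update step on a tiny problem. The one-unit `Dense` layer, the input, the target, and the SGD learning rate are all assumptions chosen so that the arithmetic can be followed by hand:

```python
import tensorflow as tf

# One dense unit with both weights initialised to 1 and no bias.
layer = tf.keras.layers.Dense(1, use_bias=False, kernel_initializer="ones")
opt = tf.keras.optimizers.SGD(learning_rate=0.05)
x = tf.constant([[1.0, 2.0]])
target = tf.constant([[0.0]])

def mse(pred, real):
    return tf.reduce_mean(tf.square(pred - real))

with tf.GradientTape() as tape:
    loss_before = mse(layer(x), target)   # forward pass is recorded by the tape

# Gradients are computed from the recording, then applied explicitly.
grads = tape.gradient(loss_before, layer.trainable_variables)
opt.apply_gradients(zip(grads, layer.trainable_variables))

loss_after = mse(layer(x), target)
print(float(loss_before), float(loss_after))
```

The prediction starts at 1 + 2 = 3, so the loss is 9. The gradient with respect to the weights is 2 · 3 · [1, 2] = [6, 12], the weights become [0.7, 0.4], and the loss drops to roughly 2.25.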

Let us now train our autoencoder:

```
train_loop(autoencoder, opt, loss, training_dataset, epochs=max_epochs)
```

The training graph is shown as follows. We can see that the loss decreases as the network learns and that, after 50 epochs, it is almost constant. This means that further increasing the number of epochs will not be useful. If we want to improve our training further, we should change hyperparameters such as the learning rate and `batch_size`:

```
plt.plot(range(max_epochs), autoencoder.loss)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()
```

In the following figure, you can see the original (top) and reconstructed (bottom) images; they are slightly blurred, but accurate:

```
number = 10  # how many digits we will display
plt.figure(figsize=(20, 4))
for index in range(number):
    # display original
    ax = plt.subplot(2, number, index + 1)
    plt.imshow(x_test[index].reshape(28, 28), cmap='gray')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    # display reconstruction
    ax = plt.subplot(2, number, index + 1 + number)
    plt.imshow(autoencoder(x_test)[index].numpy().reshape(28, 28), cmap='gray')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
```

It is interesting to note that in the preceding code we reduced the dimensions of the input from 784 to 128 and our network could still reconstruct the original image. This should give you an idea of the power of the autoencoder for dimensionality reduction. One advantage of autoencoders over PCA for dimensionality reduction is that while PCA can only represent linear transformations, we can use non-linear activation functions in autoencoders, thus introducing non-linearities in our encodings:

Figure 3: Comparison between the result of a PCA

The preceding figure is reproduced from the Hinton paper *Reducing the dimensionality of data with Neural Networks*. It compares the result of a PCA (A) with that of stacked autoencoders with architecture consisting of 784-1000-500-250-2.

You can see that the colored dots on the right are nicely separated; thus, stacked autoencoders give much better results than PCA. Now that you are familiar with Vanilla autoencoders, let us see different variants of autoencoders and their implementation details.

# Sparse autoencoder

The autoencoder we covered in the previous section works more like an identity network; it simply reconstructs the input. The emphasis is to reconstruct the image at the pixel level, and the only constraint is the number of units in the bottleneck layer. While it is interesting, pixel-level reconstruction does not ensure that the network will learn abstract features from the dataset. We can ensure that a network learns abstract features from the dataset by adding further constraints.

In Sparse autoencoders, a sparsity penalty term is added to the reconstruction error. This tries to ensure that fewer units in the bottleneck layer will fire at any given time. We can include the sparsity penalty within the encoder layer itself. In the following code, you can see that the `Dense` layer of the `Encoder` now has an additional parameter, `activity_regularizer`:

```
class SparseEncoder(K.layers.Layer):
    def __init__(self, hidden_dim):
        super(SparseEncoder, self).__init__()
        # L1 penalty on the layer's output encourages sparse activations
        self.hidden_layer = K.layers.Dense(
            units=hidden_dim,
            activation=tf.nn.relu,
            activity_regularizer=K.regularizers.l1(10e-5))

    def call(self, input_features):
        activation = self.hidden_layer(input_features)
        return activation
```

The activity regularizer tries to reduce the layer output (refer to *Chapter 1*, *Neural Network Foundations with TensorFlow 2.0*). It will reduce both the weights and the bias of the fully connected layer to ensure that the output is as small as it can be. TensorFlow supports three types of `activity_regularizer`:

- `l1`: Here the activity is computed as the sum of absolute values
- `l2`: The activity here is calculated as the sum of the squared values
- `l1_l2`: This includes both L1 and L2 terms
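The three penalties can be written out in a few lines of NumPy. This is only a sketch of the formulas listed above, using a made-up activation matrix and a coefficient of `1e-4` (the same value as the `10e-5` used in `SparseEncoder`); it does not reproduce any additional scaling, such as normalization by batch size, that the library may apply internally:

```python
import numpy as np

def l1_activity(a, l1=1e-4):
    """L1 penalty: coefficient times the sum of absolute activations."""
    return l1 * np.sum(np.abs(a))

def l2_activity(a, l2=1e-4):
    """L2 penalty: coefficient times the sum of squared activations."""
    return l2 * np.sum(np.square(a))

def l1_l2_activity(a, l1=1e-4, l2=1e-4):
    """Combined penalty: the L1 and L2 terms added together."""
    return l1_activity(a, l1) + l2_activity(a, l2)

# Hypothetical hidden-layer output for a batch of two inputs.
activations = np.array([[0.0, 2.0, 0.0],
                        [1.0, 0.0, 3.0]])

print(l1_activity(activations))     # 1e-4 * (2 + 1 + 3)
print(l2_activity(activations))     # 1e-4 * (4 + 1 + 9)
print(l1_l2_activity(activations))  # sum of the two
```

Since the penalty grows with every non-zero activation, minimizing it alongside the reconstruction error pushes most hidden units toward zero for any given input.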

Keeping the rest of the code the same, and just changing the encoder, you can get the Sparse autoencoder from the Vanilla autoencoder. The complete code for the Sparse autoencoder is in the Jupyter Notebook `SparseAutoencoder.ipynb`.

Alternatively, you can explicitly add a regularization term for sparsity in the loss function. To do so you will need to implement the regularization for the sparsity term as a function. If *m* is the total number of input patterns, then we can define a quantity *ρ_hat* (you can check the mathematical details in Andrew Ng's lecture here: https://web.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf), which measures the net activity (how many times on average it fires) for each hidden layer unit. The basic idea is to constrain *ρ_hat* so that it is equal to the sparsity parameter *ρ*. This results in adding a regularization term for sparsity in the loss function so that now the loss function becomes:

*loss = Mean squared error + Regularization for sparsity parameter*

This regularization term will penalize the network if *ρ_hat* deviates from *ρ*. One standard way to do this is to use **Kullback-Leibler** (**KL**) divergence (you can learn more about KL divergence from this interesting lecture: https://www.stat.cmu.edu/~cshalizi/754/2006/notes/lecture-28.pdf) between *ρ* and *ρ_hat*.

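Let's explore the KL divergence, *D_KL*, a little more. It is a non-symmetric measure of the difference between two distributions, in our case, *ρ* and *ρ_hat*. When *ρ* and *ρ_hat* are equal, the divergence is zero; otherwise, it increases monotonically as *ρ_hat* diverges from *ρ*. Mathematically, for a hidden unit *j* with average activation *ρ_hat_j*, it is expressed as:

$$D_{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}$$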

You add this term, summed over the hidden units, to the loss to include the sparsity penalty. You will need to fix a constant value for the sparsity parameter *ρ* and compute *ρ_hat* using the encoder output.
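A minimal NumPy sketch of this penalty follows. It assumes that the hidden activations lie in (0, 1), as they would with a sigmoid activation, so that each unit's mean activation can be treated as a firing probability. The function name, the default `rho=0.05`, and the clipping constant `eps` are illustrative choices, not values taken from the chapter:

```python
import numpy as np

def kl_sparsity_penalty(hidden_activations, rho=0.05, eps=1e-10):
    """Sum over hidden units of KL(rho || rho_hat_j).

    hidden_activations: array of shape (batch, n_hidden), values in (0, 1).
    """
    # rho_hat_j: average activation of hidden unit j over the batch
    rho_hat = np.mean(hidden_activations, axis=0)
    # Keep the logarithms finite
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)
    kl = (rho * np.log(rho / rho_hat)
          + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return np.sum(kl)

# Units that fire at exactly the target rate incur no penalty...
on_target = np.full((4, 3), 0.05)
# ...while units that are active half of the time are penalised.
too_active = np.full((4, 3), 0.5)

penalty_on_target = kl_sparsity_penalty(on_target)
penalty_too_active = kl_sparsity_penalty(too_active)
print(penalty_on_target, penalty_too_active)
```

In training, this value would be multiplied by a weighting coefficient (a further hyperparameter) and added to the mean squared error.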

The compact representation of the inputs is stored in weights. Let us visualize the weights learned by the network. Following are the weights of the encoder layer for the standard and Sparse autoencoder respectively.

We can see that in the standard autoencoder (a) many hidden units have very large weights (brighter), suggesting that they are overworked, while all the hidden units of the Sparse autoencoder (b) learn the input representation almost equally, and we see a more even color distribution:

Figure 4: Encoder weight matrix for (a) Standard Autoencoder and (b) Sparse Autoencoder

# Denoising autoencoders

The two autoencoders that we have covered in the previous sections are examples of undercomplete autoencoders, because the hidden layer in them has lower dimensionality as compared to the input (output) layer. Denoising autoencoders belong to the class of overcomplete autoencoders, because they work better when the dimensions of the hidden layer are more than the input layer.

A denoising autoencoder learns from a corrupted (noisy) input; it feeds its encoder network the noisy input, and then the reconstructed image from the decoder is compared with the original input. The idea is that this will help the network learn how to denoise an input. It will no longer just make pixel-wise comparisons, but in order to denoise it will learn the information of neighboring pixels as well.

A Denoising autoencoder has two main differences from other autoencoders: first, `n_hidden`, the number of hidden units in the bottleneck layer, is greater than the number of units in the input layer, `m`; that is, `n_hidden` > `m`. Second, the input to the encoder is corrupted input. To do this we add a noise term to both the test and training images:

```
noise = np.random.normal(loc=0.5, scale=0.5, size=x_train.shape)
x_train_noisy = x_train + noise
noise = np.random.normal(loc=0.5, scale=0.5, size=x_test.shape)
x_test_noisy = x_test + noise
x_train_noisy = np.clip(x_train_noisy, 0., 1.)
x_test_noisy = np.clip(x_test_noisy, 0., 1.)
```

## Clearing images using a Denoising autoencoder

Let us use the Denoising autoencoder to clear the handwritten MNIST digits.

- We start with importing the required modules:

  ```
  import numpy as np
  import tensorflow as tf
  import tensorflow.keras as K
  import matplotlib.pyplot as plt
  ```

- Next we define the hyperparameters for our model:

  ```
  np.random.seed(11)
  tf.random.set_seed(11)
  batch_size = 256
  max_epochs = 50
  learning_rate = 1e-3
  momentum = 8e-1
  hidden_dim = 128
  original_dim = 784
  ```

- We read in the MNIST dataset, normalize it, and introduce noise in it:

  ```
  (x_train, _), (x_test, _) = K.datasets.mnist.load_data()
  x_train = x_train / 255.
  x_test = x_test / 255.
  x_train = x_train.astype(np.float32)
  x_test = x_test.astype(np.float32)
  x_train = np.reshape(x_train, (x_train.shape[0], 784))
  x_test = np.reshape(x_test, (x_test.shape[0], 784))
  # Generate corrupted MNIST images by adding noise with normal dist
  # centered at 0.5 and std=0.5
  noise = np.random.normal(loc=0.5, scale=0.5, size=x_train.shape)
  x_train_noisy = x_train + noise
  noise = np.random.normal(loc=0.5, scale=0.5, size=x_test.shape)
  x_test_noisy = x_test + noise
  ```

- We use the same encoder, decoder, and autoencoder classes as defined in the *Vanilla autoencoders* section:

  ```
  # Encoder
  class Encoder(K.layers.Layer):
      def __init__(self, hidden_dim):
          super(Encoder, self).__init__()
          self.hidden_layer = K.layers.Dense(units=hidden_dim, activation=tf.nn.relu)

      def call(self, input_features):
          activation = self.hidden_layer(input_features)
          return activation

  # Decoder
  class Decoder(K.layers.Layer):
      def __init__(self, hidden_dim, original_dim):
          super(Decoder, self).__init__()
          self.output_layer = K.layers.Dense(units=original_dim, activation=tf.nn.relu)

      def call(self, encoded):
          activation = self.output_layer(encoded)
          return activation

  class Autoencoder(K.Model):
      def __init__(self, hidden_dim, original_dim):
          super(Autoencoder, self).__init__()
          self.loss = []
          self.encoder = Encoder(hidden_dim=hidden_dim)
          self.decoder = Decoder(hidden_dim=hidden_dim, original_dim=original_dim)

      def call(self, input_features):
          encoded = self.encoder(input_features)
          reconstructed = self.decoder(encoded)
          return reconstructed
  ```

- Next we create the model and define the loss and optimizer to be used. Notice that this time, instead of writing the custom training loop, we are using the easier Keras built-in `compile()` and `fit()` methods:

  ```
  model = Autoencoder(hidden_dim=hidden_dim, original_dim=original_dim)
  model.compile(loss='mse', optimizer='adam')
  loss = model.fit(x_train_noisy,
                   x_train,
                   validation_data=(x_test_noisy, x_test),
                   epochs=max_epochs,
                   batch_size=batch_size)
  ```

- Now let's plot the training loss:

  ```
  plt.plot(range(max_epochs), loss.history['loss'])
  plt.xlabel('Epochs')
  plt.ylabel('Loss')
  plt.show()
  ```

- And finally, let's see our model in action. The top row shows the input noisy image and the bottom row shows cleaned images produced from our trained Denoising autoencoder:

  ```
  number = 10  # how many digits we will display
  plt.figure(figsize=(20, 4))
  for index in range(number):
      # display original
      ax = plt.subplot(2, number, index + 1)
      plt.imshow(x_test_noisy[index].reshape(28, 28), cmap='gray')
      ax.get_xaxis().set_visible(False)
      ax.get_yaxis().set_visible(False)
      # display reconstruction
      ax = plt.subplot(2, number, index + 1 + number)
      plt.imshow(model(x_test_noisy)[index].numpy().reshape(28, 28), cmap='gray')
      ax.get_xaxis().set_visible(False)
      ax.get_yaxis().set_visible(False)
  plt.show()
  ```

An impressive reconstruction of images from noisy images, I'm sure you'll agree. You can access the code in the notebook `DenoisingAutoencoder.ipynb` if you want to play around with it.

# Stacked autoencoder

Until now we have restricted ourselves to autoencoders with only one hidden layer. We can build Deep autoencoders by stacking many layers of both encoder and decoder; such an autoencoder is called a Stacked autoencoder. The features extracted by one encoder are passed on to the next encoder as input. The stacked autoencoder can be trained as a whole network with an aim to minimize the reconstruction error. Or each individual encoder/decoder network can first be pretrained using the unsupervised method you learned earlier, and then the complete network can be fine-tuned. When the deep autoencoder network is a convolutional network, we call it a **Convolutional Autoencoder**. Let us implement a convolutional autoencoder in TensorFlow 2.0 next.

## Convolutional autoencoder for removing noise from images

In the previous section we reconstructed handwritten digits from noisy input images. We used a fully connected network as the encoder and decoder for the work. However, we know that for images, a convolutional network can give better results, so in this section we will use a convolutional network for both the encoder and decoder. To get better results we will use multiple convolutional layers in both the encoder and decoder networks; that is, we will make stacks of convolutional layers (along with max pooling or upsampling layers). We will also be training the entire autoencoder as a single entity.

- We import all the required modules; also, for convenience, we import specific layers from `tensorflow.keras.layers`:

  ```
  import numpy as np
  import tensorflow as tf
  import tensorflow.keras as K
  import matplotlib.pyplot as plt
  from tensorflow.keras.layers import Dense, Conv2D, MaxPooling2D, UpSampling2D
  ```

- We specify our hyperparameters. If you look carefully, the list is slightly different; as compared to earlier autoencoder implementations, instead of learning rate and momentum, this time we are concerned with filters of the convolutional layer:

  ```
  np.random.seed(11)
  tf.random.set_seed(11)
  batch_size = 128
  max_epochs = 50
  filters = [32, 32, 16]
  ```

- In the next step, we read in the data and preprocess it. Again, you may observe slight variation from the previous code, especially in the way we are adding noise and then limiting the range to [0, 1]. We are doing so because in this case, instead of the mean squared error loss, we will be using binary cross entropy loss and the final output of the decoder will pass through sigmoid activation, restricting it to [0, 1]:

  ```
  (x_train, _), (x_test, _) = K.datasets.mnist.load_data()
  x_train = x_train / 255.
  x_test = x_test / 255.
  x_train = np.reshape(x_train, (len(x_train), 28, 28, 1))
  x_test = np.reshape(x_test, (len(x_test), 28, 28, 1))
  noise = 0.5
  x_train_noisy = x_train + noise * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)
  x_test_noisy = x_test + noise * np.random.normal(loc=0.0, scale=1.0, size=x_test.shape)
  x_train_noisy = np.clip(x_train_noisy, 0, 1)
  x_test_noisy = np.clip(x_test_noisy, 0, 1)
  x_train_noisy = x_train_noisy.astype('float32')
  x_test_noisy = x_test_noisy.astype('float32')
  #print(x_test_noisy[1].dtype)
  ```

- Let us now define our encoder. The encoder consists of three convolutional layers, each followed by a max pooling layer. Since we are using the MNIST dataset the shape of the input image is 28 × 28 (single channel) and the output image is of size 4 × 4 (and since the last convolutional layer has 16 filters, the image has 16 channels):

  ```
  class Encoder(K.layers.Layer):
      def __init__(self, filters):
          super(Encoder, self).__init__()
          self.conv1 = Conv2D(filters=filters[0], kernel_size=3, strides=1, activation='relu', padding='same')
          self.conv2 = Conv2D(filters=filters[1], kernel_size=3, strides=1, activation='relu', padding='same')
          self.conv3 = Conv2D(filters=filters[2], kernel_size=3, strides=1, activation='relu', padding='same')
          self.pool = MaxPooling2D((2, 2), padding='same')

      def call(self, input_features):
          x = self.conv1(input_features)
          #print("Ex1", x.shape)
          x = self.pool(x)
          #print("Ex2", x.shape)
          x = self.conv2(x)
          x = self.pool(x)
          x = self.conv3(x)
          x = self.pool(x)
          return x
  ```

- Next comes the decoder. It is the exact opposite of the encoder in design, and instead of max pooling we are using upsampling to increase the size back. Notice the commented print statements: you can use them to understand how the shape gets modified after each step. Also notice that both encoder and decoder are still classes based on the TensorFlow Keras `Layer` class, but now they have multiple layers inside them. So now you know how to build a complex custom layer:

  ```
  class Decoder(K.layers.Layer):
      def __init__(self, filters):
          super(Decoder, self).__init__()
          self.conv1 = Conv2D(filters=filters[2], kernel_size=3, strides=1, activation='relu', padding='same')
          self.conv2 = Conv2D(filters=filters[1], kernel_size=3, strides=1, activation='relu', padding='same')
          self.conv3 = Conv2D(filters=filters[0], kernel_size=3, strides=1, activation='relu', padding='valid')
          self.conv4 = Conv2D(1, 3, 1, activation='sigmoid', padding='same')
          self.upsample = UpSampling2D((2, 2))

      def call(self, encoded):
          x = self.conv1(encoded)
          #print("dx1", x.shape)
          x = self.upsample(x)
          #print("dx2", x.shape)
          x = self.conv2(x)
          x = self.upsample(x)
          x = self.conv3(x)
          x = self.upsample(x)
          return self.conv4(x)
  ```

- We combine the encoder and decoder to make an autoencoder model. This remains exactly the same as before:

  ```
  class Autoencoder(K.Model):
      def __init__(self, filters):
          super(Autoencoder, self).__init__()
          self.encoder = Encoder(filters)
          self.decoder = Decoder(filters)

      def call(self, input_features):
          #print(input_features.shape)
          encoded = self.encoder(input_features)
          #print(encoded.shape)
          reconstructed = self.decoder(encoded)
          #print(reconstructed.shape)
          return reconstructed
  ```

- Now we instantiate our model, then specify the binary cross entropy as the loss function and Adam as the optimizer in the `compile()` method. Then, we fit the model to the training dataset:

  ```
  model = Autoencoder(filters)
  model.compile(loss='binary_crossentropy', optimizer='adam')
  loss = model.fit(x_train_noisy,
                   x_train,
                   validation_data=(x_test_noisy, x_test),
                   epochs=max_epochs,
                   batch_size=batch_size)
  ```

- You can see the loss curve as the model is trained; in 50 epochs the loss was reduced to 0.0988:

  ```
  plt.plot(range(max_epochs), loss.history['loss'])
  plt.xlabel('Epochs')
  plt.ylabel('Loss')
  plt.show()
  ```

- And finally, you can see the wonderful reconstructed images from the noisy input images:

  ```
  number = 10  # how many digits we will display
  plt.figure(figsize=(20, 4))
  for index in range(number):
      # display original
      ax = plt.subplot(2, number, index + 1)
      plt.imshow(x_test_noisy[index].reshape(28, 28), cmap='gray')
      ax.get_xaxis().set_visible(False)
      ax.get_yaxis().set_visible(False)
      # display reconstruction
      ax = plt.subplot(2, number, index + 1 + number)
      plt.imshow(tf.reshape(model(x_test_noisy)[index], (28, 28)), cmap='gray')
      ax.get_xaxis().set_visible(False)
      ax.get_yaxis().set_visible(False)
  plt.show()
  ```

You can see that the images are much clearer and sharper relative to the previous autoencoders we have covered in this chapter. The code for this section is available in the Jupyter notebook, `ConvolutionAutoencoder.ipynb`.
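The spatial sizes quoted in the encoder and decoder steps above can be checked with a few lines of plain Python. This sketch only does the shape bookkeeping; it assumes the standard size rules for stride-1 convolutions and for 2 × 2 pooling with `padding='same'`, and the helper names are hypothetical:

```python
import math

def same_pool(n, pool=2):
    """Pooling with padding='same': output size is ceil(n / pool)."""
    return math.ceil(n / pool)

def conv(n, kernel=3, padding="same"):
    """Stride-1 convolution: 'same' keeps the size, 'valid' shrinks it by kernel - 1."""
    return n if padding == "same" else n - kernel + 1

# Encoder: convolution ('same') followed by pooling, three times.
size = 28
encoder_sizes = [size]
for _ in range(3):
    size = same_pool(conv(size))
    encoder_sizes.append(size)

# Decoder: convolution followed by 2x upsampling, three times;
# the third convolution uses 'valid' padding.
decoder_sizes = [size]
for padding in ("same", "same", "valid"):
    size = conv(size, padding=padding) * 2
    decoder_sizes.append(size)

print(encoder_sizes)  # [28, 14, 7, 4]
print(decoder_sizes)  # [4, 8, 16, 28]
```

The `'valid'` padding in the decoder's third convolution is what turns 16 into 14, so that the final upsampling lands exactly on 28 rather than 32.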

## Keras autoencoder example ‒ sentence vectors

In this example, we will build and train an LSTM-based autoencoder to generate sentence vectors for documents in the Reuters-21578 corpus (https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection). We have already seen in *Chapter 7*, *Word Embeddings*, how to represent a word using word embeddings to create vectors that represent the word's meaning in the context of other words it appears with. Here, we will see how to build similar vectors for sentences. Sentences are sequences of words, so a sentence vector represents the meaning of a sentence.

The easiest way to build a sentence vector is to add up the word vectors and divide by the number of words. However, this treats the sentence as a bag of words and does not take the order of words into account. Thus, the sentences *The dog bit the man* and *The man bit the dog* would be treated as identical in this scenario. LSTMs are designed to work with sequence input and do take the order of words into consideration, thus providing a better and more natural representation of the sentence.
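To make the bag-of-words limitation concrete, here is a minimal sketch that averages word vectors for the two example sentences. The three-dimensional vectors are made-up toy values (an assumption for illustration, not real embeddings), and `average_sentence_vector` is a hypothetical helper:

```python
import numpy as np

# Toy 3-dimensional word vectors; the values are invented for illustration.
toy_vectors = {
    "the": np.array([0.1, 0.0, 0.2]),
    "dog": np.array([0.9, 0.3, 0.1]),
    "bit": np.array([0.2, 0.8, 0.5]),
    "man": np.array([0.4, 0.1, 0.9]),
}

def average_sentence_vector(sentence):
    """Average the word vectors of a lowercased, whitespace-tokenized sentence."""
    words = sentence.lower().split()
    return np.mean([toy_vectors[w] for w in words], axis=0)

v1 = average_sentence_vector("The dog bit the man")
v2 = average_sentence_vector("The man bit the dog")

# Both sentences contain the same words, so their averages coincide.
print(np.allclose(v1, v2))
```

Because averaging discards word order, the two vectors are equal up to floating-point rounding, even though the sentences mean different things.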

First, we import the necessary libraries:

```
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import RepeatVector
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import sequence
from scipy.stats import describe
import collections
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
from time import gmtime, strftime
from tensorflow.keras.callbacks import TensorBoard
import re
# Needed to run only once
nltk.download('punkt')
```

The data is provided as a set of SGML files. The helper code to convert the SGML files to `text.tsv`, which is based on a Scikit-learn example (https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html), is available in the GitHub repository in the file named `parse.py`. We will use the data from this file and first convert each block of text into a list of sentences, one sentence per line. Also, each word in the sentence is normalized as it is added. The normalization involves replacing every number with the digit 9, then converting the word to lowercase. In the same code, we also calculate the word frequencies. The result is the word frequency table, `word_freqs`:

```
DATA_DIR = "data"

def is_number(n):
    # strip common numeric punctuation, then check that only digits remain
    temp = re.sub(r"[.,/-]", "", n)
    return temp.isdigit()

# parsing sentences and building vocabulary
word_freqs = collections.Counter()
sents = []
sent_lens = []
with open(os.path.join(DATA_DIR, "text.tsv"), "r") as ftext:
    for line in ftext:
        docid, text = line.strip().split("\t")
        for sent in nltk.sent_tokenize(text):
            for word in nltk.word_tokenize(sent):
                if is_number(word):
                    word = "9"
                word = word.lower()
                word_freqs[word] += 1
            sents.append(sent)
            # note: len(sent) is the sentence length in characters
            sent_lens.append(len(sent))
```
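As a quick check of the normalization rule described above, the following sketch applies it to a single made-up sentence. It uses plain whitespace splitting instead of NLTK tokenization (an assumption, so that it runs without any downloads), and `normalize` and `toy_freqs` are hypothetical names:

```python
import collections
import re

def is_number(token):
    # Strip common numeric punctuation, then check that only digits remain.
    return re.sub(r"[.,/-]", "", token).isdigit()

def normalize(token):
    # Collapse every number to "9"; lowercase everything else.
    return "9" if is_number(token) else token.lower()

toy_freqs = collections.Counter()
for token in "Shares rose 3.5 pct to 1,200 dlrs , shares fell".split():
    toy_freqs[normalize(token)] += 1

print(toy_freqs.most_common(3))
```

Both `3.5` and `1,200` are counted under the single token `9`, and `Shares` and `shares` are merged, which keeps the vocabulary small.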

Let us use the preceding generated arrays to get some information about the corpus that will help us figure out good values for our constants for our LSTM network:

```
print("Total number of sentences are: {:d} ".format(len(sents)))
print ("Sentence distribution min {:d}, max {:d} , mean {:3f}, median {:3f}".format(np.min(sent_lens), np.max(sent_lens), np.mean(sent_lens), np.median(sent_lens)))
print("Vocab size (full) {:d}".format(len(word_freqs)))
```

This gives us the following information about the corpus:

**Total number of sentences are: 131545**
**Sentence distribution min 1, max 2434 , mean 120.525052, median 115.000000**
**Vocab size (full) 50743**

Based on this information, we set the following constants for our LSTM model. We choose our `VOCAB_SIZE` as `5000`; that is, our vocabulary covers the most frequent 5,000 words, which account for over 93% of the words used in the corpus. The remaining words are treated as **out of vocabulary** (**OOV**) and replaced with the token `UNK`. At prediction time, any word that the model hasn't seen will also be assigned the token `UNK`. `SEQUENCE_LEN` is set to 50, approximately twice the median sentence length in words (the `sent_lens` statistics printed above are measured in characters, since they come from `len(sent)`), and indeed, approximately 110,000 of our 131,000 sentences are shorter than this setting. Sentences that are shorter than `SEQUENCE_LEN` will be padded with a special `PAD` token, and those that are longer will be truncated to fit the limit:

```
VOCAB_SIZE = 5000
SEQUENCE_LEN = 50
```
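A coverage figure of this kind can be estimated directly from a frequency table. The sketch below uses a hypothetical helper, `vocabulary_coverage`, and toy counts that are invented for illustration; with the real data one might call it as `vocabulary_coverage(word_freqs, VOCAB_SIZE - 2)`:

```python
import collections

def vocabulary_coverage(word_freqs, vocab_size):
    """Fraction of all token occurrences covered by the vocab_size most frequent words."""
    total = sum(word_freqs.values())
    covered = sum(count for _, count in word_freqs.most_common(vocab_size))
    return covered / total

# Toy frequency table (invented counts, 100 tokens in total).
toy_counts = collections.Counter(
    {"the": 50, "of": 30, "said": 15, "pct": 4, "zloty": 1}
)
coverage = vocabulary_coverage(toy_counts, 3)
print(coverage)
```

Here the three most frequent words account for 95 of the 100 token occurrences, so the remaining two words would be mapped to `UNK`.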

Since the input to our LSTM will be numeric, we need to build lookup tables that go back and forth between words and word IDs. Since we limit our vocabulary size to 5,000 and we have to add the two pseudo-words `PAD` and `UNK`, our lookup table contains entries for the most frequently occurring 4,998 words plus `PAD` and `UNK`:

```
word2id = {}
word2id["PAD"] = 0
word2id["UNK"] = 1
for v, (k, _) in enumerate(word_freqs.most_common(VOCAB_SIZE - 2)):
    word2id[k] = v + 2
id2word = {v:k for k, v in word2id.items()}
```

The input to our network is a sequence of words, where each word is represented by a vector. Simplistically, we could just use a one-hot encoding for each word, but that makes the input data very large. So, we encode each word using its 50-dimensional GloVe embeddings.

The embedding is generated into a matrix of shape (`VOCAB_SIZE`, `EMBED_SIZE`) where each row represents the GloVe embedding for a word in our vocabulary. The `PAD` and `UNK` rows (`0` and `1` respectively) are populated with zeros and random uniform values respectively:

```
EMBED_SIZE = 50

def lookup_word2id(word):
    # fall back to the UNK id for out-of-vocabulary words
    try:
        return word2id[word]
    except KeyError:
        return word2id["UNK"]

def load_glove_vectors(glove_file, word2id, embed_size):
    embedding = np.zeros((len(word2id), embed_size))
    # read as text so that words are str and match the keys of word2id
    with open(glove_file, "r", encoding="utf-8") as fglove:
        for line in fglove:
            cols = line.strip().split()
            word = cols[0]
            if embed_size == 0:
                embed_size = len(cols) - 1
            if word in word2id:
                vec = np.array([float(v) for v in cols[1:]])
                embedding[lookup_word2id(word)] = vec
    embedding[word2id["PAD"]] = np.zeros((embed_size))
    embedding[word2id["UNK"]] = np.random.uniform(-1, 1, embed_size)
    return embedding
```
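The GloVe text format stores one word per line followed by its vector components. The following sketch fills a small embedding matrix from an in-memory stand-in for such a file; the three-dimensional vectors and the names `toy_glove`, `toy_word2id`, and `toy_embedding` are assumptions made for illustration:

```python
import io
import numpy as np

# In-memory stand-in for a GloVe file: "<word> <v1> <v2> <v3>" per line.
toy_glove = io.StringIO("the 0.1 0.2 0.3\ndog 0.4 0.5 0.6\ncat 0.7 0.8 0.9\n")
toy_word2id = {"PAD": 0, "UNK": 1, "the": 2, "dog": 3, "zloty": 4}

toy_embedding = np.zeros((len(toy_word2id), 3))
for line in toy_glove:
    cols = line.strip().split()
    word, values = cols[0], cols[1:]
    if word in toy_word2id:  # skip GloVe words outside our vocabulary
        toy_embedding[toy_word2id[word]] = [float(v) for v in values]

print(toy_embedding)
```

The word `cat` is ignored because it is not in the vocabulary, while `zloty` keeps its all-zero row because it has no GloVe vector.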

Next, we use these functions to generate embeddings:

```
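# normalize tokens the same way as when building word_freqs, so that
# lookups match the vocabulary instead of falling back to UNK
def normalize(word):
    return "9" if is_number(word) else word.lower()
sent_wids = [[lookup_word2id(normalize(w)) for w in nltk.word_tokenize(s)] for s in sents]
sent_wids = sequence.pad_sequences(sent_wids, SEQUENCE_LEN)
# load glove vectors into weight matrix
embeddings = load_glove_vectors(os.path.join(DATA_DIR, "glove.6B.{:d}d.txt".format(EMBED_SIZE)), word2id, EMBED_SIZE)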
```
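With its default arguments, `pad_sequences` pads and truncates at the front of each sequence. The sketch below imitates that behavior in plain NumPy so the effect is easy to inspect; `encode_and_pad` and the toy vocabulary are hypothetical and stand in for the real lookup table:

```python
import numpy as np

toy_word2id = {"PAD": 0, "UNK": 1, "the": 2, "dog": 3, "bit": 4, "man": 5}

def encode_and_pad(sentence, word2id, seq_len):
    """Map words to IDs (UNK for unknown words), then pad or truncate at the front."""
    ids = [word2id.get(w, word2id["UNK"]) for w in sentence.lower().split()]
    ids = ids[-seq_len:]  # keep only the last seq_len tokens
    padding = [word2id["PAD"]] * (seq_len - len(ids))
    return np.array(padding + ids)

short_ids = encode_and_pad("The dog bit the postman", toy_word2id, 8)
long_ids = encode_and_pad("the dog bit the man the dog bit the man", toy_word2id, 4)
print(short_ids)
print(long_ids)
```

The unknown word `postman` maps to the `UNK` ID, the short sentence is left-padded with the `PAD` ID, and the long sentence keeps only its last four tokens.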

Our autoencoder model takes a sequence of GloVe word vectors and learns to produce another sequence that is similar to the input sequence. The encoder LSTM compresses the sequence into a fixed-size context vector, which the decoder LSTM uses to reconstruct the original sequence. A schematic of the network is shown here:

Figure 5: Visualisation of the LSTM network

Because the input is quite large, we will use a generator to produce each batch of input. Our generator produces batches of tensors of shape (`BATCH_SIZE`, `SEQUENCE_LEN`, `EMBED_SIZE`). Here `BATCH_SIZE` is `64`, and since we are using 50-dimensional GloVe vectors, `EMBED_SIZE` is `50`. We shuffle the sentences at the beginning of each epoch and return batches of 64 sentences. Each sentence is represented as a sequence of GloVe word vectors. If a word in the vocabulary does not have a corresponding GloVe embedding, it is represented by a zero vector. We construct two instances of the generator, one for training data and one for test data, consisting of 70% and 30% of the original dataset respectively:

```
BATCH_SIZE = 64

def sentence_generator(X, embeddings, batch_size):
    while True:
        # loop once per epoch
        num_recs = X.shape[0]
        indices = np.random.permutation(np.arange(num_recs))
        num_batches = num_recs // batch_size
        for bid in range(num_batches):
            sids = indices[bid * batch_size : (bid + 1) * batch_size]
            Xbatch = embeddings[X[sids, :]]
            yield Xbatch, Xbatch

train_size = 0.7
Xtrain, Xtest = train_test_split(sent_wids, train_size=train_size)
train_gen = sentence_generator(Xtrain, embeddings, BATCH_SIZE)
test_gen = sentence_generator(Xtest, embeddings, BATCH_SIZE)
```
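The expression `embeddings[X[sids, :]]` relies on NumPy integer-array indexing: indexing a (vocabulary, embedding) matrix with a (batch, sequence) array of word IDs yields a (batch, sequence, embedding) tensor. A small sketch with toy sizes (all names and dimensions here are illustrative assumptions) shows the resulting shape:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(10, 4))   # 10 words, 4-dimensional vectors
toy_X = rng.integers(0, 10, size=(6, 5))    # 6 sentences of 5 word IDs each
batch_ids = np.array([0, 3])                # sentences selected for one batch

# Each word ID is replaced by the corresponding row of the embedding matrix.
toy_batch = toy_embeddings[toy_X[batch_ids, :]]
print(toy_batch.shape)
```

The generator yields the same tensor as both input and target, which is what makes the model an autoencoder.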

Now we are ready to define the autoencoder. As we have shown in the diagram, it is composed of an encoder LSTM and a decoder LSTM. The encoder LSTM reads a tensor of shape (`BATCH_SIZE`, `SEQUENCE_LEN`, `EMBED_SIZE`) representing a batch of sentences. Each sentence is represented as a padded fixed-length sequence of words of size `SEQUENCE_LEN`. Each word is represented as a 50-dimensional GloVe vector. The output dimension of the encoder LSTM is a hyperparameter `LATENT_SIZE`, which is the size of the sentence vector that will come from the encoder part of the trained autoencoder later. The vector space of dimensionality `LATENT_SIZE` represents the latent space that encodes the meaning of the sentence. The output of the LSTM is a vector of size (`LATENT_SIZE`) for each sentence, so for the batch the shape of the output tensor is (`BATCH_SIZE`, `LATENT_SIZE`). This is now fed to a `RepeatVector` layer, which replicates this across the entire sequence; that is, the output tensor from this layer has the shape (`BATCH_SIZE`, `SEQUENCE_LEN`, `LATENT_SIZE`). This tensor is now fed into the decoder LSTM, whose output dimension is `EMBED_SIZE`, so the output tensor has shape (`BATCH_SIZE`, `SEQUENCE_LEN`, `EMBED_SIZE`), that is, the same shape as the input tensor.
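The replication step can be pictured without Keras. The following NumPy sketch is an illustration of the shape change only, not the layer's implementation; it repeats each sentence vector along a new time axis using small toy sizes chosen for readability:

```python
import numpy as np

batch_size, sequence_len, latent_size = 2, 5, 3   # toy sizes, not the chapter's values

# Stand-in for the encoder output: one latent vector per sentence.
encoded = np.arange(batch_size * latent_size, dtype=float).reshape(batch_size, latent_size)

# Insert a time axis and copy the latent vector to every time step.
repeated = np.repeat(encoded[:, np.newaxis, :], sequence_len, axis=1)
print(encoded.shape, repeated.shape)
```

Every time step of the decoder therefore receives the same sentence vector, and the decoder has to unfold the sequence from that single summary.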

We compile this model with the SGD optimizer and the MSE loss function. The reason we use MSE is that we want to reconstruct a sentence that has a similar meaning, that is, something that is close to the original sentence in the embedding space of dimension `EMBED_SIZE`:

```
inputs = Input(shape=(SEQUENCE_LEN, EMBED_SIZE), name="input")
encoded = Bidirectional(LSTM(LATENT_SIZE), merge_mode="sum", name="encoder_lstm")(inputs)
decoded = RepeatVector(SEQUENCE_LEN, name="repeater")(encoded)
decoded = Bidirectional(LSTM(EMBED_SIZE, return_sequences=True), merge_mode="sum", name="decoder_lstm")(decoded)
autoencoder = Model(inputs, decoded)
```

We define the loss function as mean squared error and choose the SGD optimizer:

```
autoencoder.compile(optimizer="sgd", loss="mse")
```

We train the autoencoder for 20 epochs using the following code. 20 epochs were chosen because the MSE loss converges within this time:

```
NUM_EPOCHS = 20
num_train_steps = len(Xtrain) // BATCH_SIZE
num_test_steps = len(Xtest) // BATCH_SIZE
# in TensorFlow 2, fit() accepts Python generators; fit_generator() is deprecated
history = autoencoder.fit(train_gen,
                          steps_per_epoch=num_train_steps,
                          epochs=NUM_EPOCHS,
                          validation_data=test_gen,
                          validation_steps=num_test_steps)
```

The results of the training are shown as follows. As you can see, the training MSE reduces from 0.1161 to 0.0824 and the validation MSE reduces from 0.1097 to 0.0820:

Since we are feeding in a matrix of embeddings, the output will also be a matrix of word embeddings. Since the embedding space is continuous and our vocabulary is discrete, not every output embedding will correspond to a word. The best we can do is to find a word that is closest to the output embedding in order to reconstruct the original text. This is a bit cumbersome, so we will evaluate our autoencoder in a different way.
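For completeness, the nearest-word lookup mentioned above could be sketched as follows. This is a minimal illustration using cosine similarity against every row of the embedding matrix; the function `nearest_words`, the two-dimensional toy matrix, and the toy vocabulary are all assumptions made for the example:

```python
import numpy as np

def nearest_words(output_vectors, embedding_matrix, id2word):
    """Return, for each output vector, the vocabulary word with the highest cosine similarity."""
    eps = 1e-8  # avoids division by zero for all-zero rows such as PAD
    emb = embedding_matrix / (np.linalg.norm(embedding_matrix, axis=1, keepdims=True) + eps)
    out = output_vectors / (np.linalg.norm(output_vectors, axis=1, keepdims=True) + eps)
    best_ids = np.argmax(out @ emb.T, axis=1)
    return [id2word[int(i)] for i in best_ids]

toy_matrix = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_id2word = {0: "PAD", 1: "east", 2: "north", 3: "northeast"}
toy_outputs = np.array([[0.9, 0.1], [0.1, 2.0], [3.0, 2.9]])

decoded = nearest_words(toy_outputs, toy_matrix, toy_id2word)
print(decoded)
```

With a 5,000-word vocabulary and 50 time steps per sentence, this lookup has to be repeated for every position of every sentence, which is part of why it is cumbersome as an evaluation method.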

Since the objective of the autoencoder is to produce a good latent representation, we compare the latent vectors produced from the encoder using the original input versus the output of the autoencoder.

First, we extract the encoder component into its own network:

```
encoder = Model(autoencoder.input, autoencoder.get_layer("encoder_lstm").output)
```

Then we run the autoencoder on the test set to return the predicted embeddings. We then send both the input embedding and the predicted embedding through the encoder to produce sentence vectors from each and compare the two vectors using *cosine* similarity. Cosine similarities close to 1 indicate high similarity and those close to 0 indicate low similarity. The following code runs against a random subset of 500 test sentences and prints some sample values of cosine similarities between the sentence vectors generated from the source embedding and the corresponding target embedding produced by the autoencoder:

```
def compute_cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x, 2) * np.linalg.norm(y, 2))

k = 500
cosims = np.zeros((k))
i = 0
for bid in range(num_test_steps):
    # next(gen) is the Python 3 way to pull a batch from a generator
    xtest, ytest = next(test_gen)
    ytest_ = autoencoder.predict(xtest)
    Xvec = encoder.predict(xtest)
    Yvec = encoder.predict(ytest_)
    for rid in range(Xvec.shape[0]):
        if i >= k:
            break
        cosims[i] = compute_cosine_similarity(Xvec[rid], Yvec[rid])
        if i <= 10:
            print(cosims[i])
        i += 1
    if i >= k:
        break
```
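The per-row loop can also be written in vectorized form. The sketch below is an alternative formulation rather than the chapter's code: the hypothetical helper `rowwise_cosine_similarity` computes the cosine similarity between corresponding rows of two arrays, and is checked on toy vectors whose similarities are easy to verify by hand:

```python
import numpy as np

def rowwise_cosine_similarity(A, B):
    """Cosine similarity between corresponding rows of A and B (same shape)."""
    dots = np.sum(A * B, axis=1)
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return dots / norms

A = np.array([[1.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
B = np.array([[2.0, 0.0], [1.0, -1.0], [0.0, 3.0]])
sims = rowwise_cosine_similarity(A, B)
print(sims)
```

The first pair points in the same direction (similarity 1), while the other two pairs are orthogonal (similarity 0). Applied to `Xvec` and `Yvec`, such a function would return one similarity per sentence in the batch.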

The first 11 values of cosine similarities (indices 0 through 10, as printed by the code) are shown as follows. As we can see, the vectors seem to be quite similar:

**0.984686553478241**
**0.9815746545791626**
**0.9793671369552612**
**0.9805112481117249**
**0.9630994200706482**
**0.9790557622909546**
**0.9893233180046082**
**0.9869443774223328**
**0.9665998220443726**
**0.9893233180046082**
**0.9829331040382385**

A histogram of the distribution of cosine similarities for the sentence vectors from the first 500 sentences in the test set is shown below. As before, it confirms that the sentence vectors generated from the input and output of the autoencoder are very similar, showing that the resulting sentence vector is a good representation of the sentence:
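As a rough sketch of how such a histogram could be computed, the following code bins the sample similarity values printed earlier (rounded to four decimals) with `np.histogram`. The bin range and the number of bins are arbitrary choices made for this illustration; on the full `cosims` array, the same bins could be passed to `plt.hist` to draw the plot:

```python
import numpy as np

# The sample cosine similarities printed above, rounded to four decimals.
sample_cosims = np.array([
    0.9847, 0.9816, 0.9794, 0.9805, 0.9631, 0.9791,
    0.9893, 0.9869, 0.9666, 0.9893, 0.9829,
])

# Four equal-width bins between 0.96 and 1.0 (an arbitrary choice).
counts, edges = np.histogram(sample_cosims, bins=4, range=(0.96, 1.0))
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print("{:.2f}-{:.2f}: {:d}".format(left, right, int(count)))
```

Most of the sample values fall between 0.98 and 0.99, which is consistent with the observation that the two sets of sentence vectors are very similar.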

# Summary

In this chapter we've had an extensive look at a new generation of deep learning models: autoencoders. We started with the Vanilla autoencoder, and then moved on to its variants: Sparse autoencoders, Denoising autoencoders, Stacked autoencoders, and Convolutional autoencoders. We used the autoencoders to reconstruct images, and we also demonstrated how they can be used to clean noise from an image. Finally, the chapter demonstrated how autoencoders can be used to generate sentence vectors. The autoencoders learned through unsupervised learning. In the next chapter we will delve deeper into some other unsupervised learning-based deep learning models.

# References

- Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. *Learning Internal Representations by Error Propagation*. No. ICS-8506. California Univ San Diego La Jolla Inst for Cognitive Science, 1985 (http://www.cs.toronto.edu/~fritz/absps/pdp8.pdf).
- Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. *Reducing the dimensionality of data with neural networks*. Science 313.5786 (2006): 504-507. (https://www.semanticscholar.org/paper/Reducing-the-dimensionality-of-data-with-neural-Hinton-Salakhutdinov/46eb79e5eec8a4e2b2f5652b66441e8a4c921c3e)
- Masci, Jonathan, et al. *Stacked convolutional auto-encoders for hierarchical feature extraction*. Artificial Neural Networks and Machine Learning–ICANN 2011 (2011): 52-59. (https://www.semanticscholar.org/paper/Reducing-the-dimensionality-of-data-with-neural-Hinton-Salakhutdinov/46eb79e5eec8a4e2b2f5652b66441e8a4c921c3e)
- Japkowicz, Nathalie, Catherine Myers, and Mark Gluck. *A novelty detection approach to classification*. IJCAI. Vol. 1. 1995. (https://www.ijcai.org/Proceedings/95-1/Papers/068.pdf)
- *AutoRec: Autoencoders Meet Collaborative Filtering*, by S. Sedhain, Proceedings of the 24th International Conference on World Wide Web, ACM, 2015.
- *Wide & Deep Learning for Recommender Systems*, by H. Cheng, Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, ACM, 2016.
- *Using Deep Learning to Remove Eyeglasses from Faces*, by M. Runfeldt.
- *Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records*, by R. Miotto, Scientific Reports 6, 2016.
- *Skip-Thought Vectors*, by R. Kiros, Advances in Neural Information Processing Systems, 2015.
- http://web.engr.illinois.edu/~hanj/cs412/bk3/KL-divergence.pdf
- https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
- https://cs.stanford.edu/people/karpathy/convnetjs/demo/autoencoder.html
- http://blackecho.github.io/blog/machine-learning/2016/02/29/denoising-autoencoder-tensorflow.html