Simple MNIST Autoencoder in TensorFlow

Recently I’ve been playing around a bit with TensorFlow. Even though my past research hasn’t used a lot of deep learning, it’s a valuable tool to know how to use. To get to know the basics, I’m trying to implement a few simple models myself. The benefit of implementing it yourself is of course that it’s much easier to play with the code and extend it! In this post, I will present my TensorFlow implementation of Andrej Karpathy’s MNIST Autoencoder, originally written in ConvNetJS. You can find the code for this post on GitHub.

An autoencoder is a neural network that consists of two parts: an encoder and a decoder. The encoder network encodes the original data to a (typically) low-dimensional representation, whereas the decoder network converts this representation back to the original feature space. The idea is that the low-dimensional representation forms a non-linear dimensionality reduction of the original data. Here’s a graphical illustration of the network structure:

Illustration of the network structure of the autoencoder. In reality the size of the layers is larger than shown here: the blue layers have 50 nodes in the actual network and the input and output layers have 784 nodes.
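To make the encoder/decoder split concrete before the TensorFlow version, here is a rough NumPy sketch of how data flows through the network. It uses random, untrained weights and a fake batch, so it only demonstrates the shapes. The helper names `dense`, `encode`, and `decode` are made up for this illustration and do not appear in the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, n_in, n_out):
    # Hypothetical helper: a fully connected layer with fresh random
    # (untrained) weights, only meant to show the shapes involved.
    W = rng.normal(scale=0.1, size=(n_in, n_out))
    b = np.full(n_out, 0.1)
    return x @ W + b

def encode(x):
    # 784 -> 50 -> 50 -> 2
    h = np.tanh(dense(x, 784, 50))
    h = np.tanh(dense(h, 50, 50))
    return dense(h, 50, 2)

def decode(z):
    # 2 -> 50 -> 50 -> 784
    h = np.tanh(dense(z, 2, 50))
    h = np.tanh(dense(h, 50, 50))
    return dense(h, 50, 784)

x = rng.random((4, 784))      # fake batch of 4 flattened 28 x 28 images
z = encode(x)                 # low-dimensional representation
x_hat = decode(z)             # reconstruction in the original feature space
print(z.shape, x_hat.shape)   # (4, 2) (4, 784)
```

Every image in the batch is squeezed through just two numbers before being expanded back to 784 values.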

Okay, let’s get to the code. As mentioned, I’ll be reconstructing Andrej Karpathy’s structure, which consists of 2 fully connected layers with 50 neurons, then a layer with just 2 neurons, and then again 2 layers with 50 neurons before a final read-out layer. Since we’ll use a lot of fully connected layers, we’ll make some quick utility functions to make our life easier. The functions for the weights and biases are taken from the TensorFlow MNIST tutorial. First, for the weight matrix we use:

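import tensorflow as tf  # TensorFlow 1.x API

def weight_variable(shape):
    # small random initial weights to break the symmetry between neurons
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)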

and for the biases:

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

For the fully connected layer, we’ll make use of the fact that the MNIST data is monochrome, so we don’t have to care about the color channels. This means that we can define our fully connected layers simply as follows:

def fc_layer(prev, input_size, output_size):
    # affine map: prev has shape [batch, input_size], result [batch, output_size]
    W = weight_variable([input_size, output_size])
    b = bias_variable([output_size])
    return tf.matmul(prev, W) + b

With these utilities in place, we’re ready to build our model. Recall that the MNIST images are 28 x 28 = 784 pixels. Here we consider the input to the model to be a single vector of length 784. The model can then be captured through the following function:

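def autoencoder(x):
    # encoder: 784 -> 50 -> 50
    l1 = tf.nn.tanh(fc_layer(x, 28*28, 50))
    l2 = tf.nn.tanh(fc_layer(l1, 50, 50))
    # latent layer: 2 linear units, no activation
    l3 = fc_layer(l2, 50, 2)
    # decoder: 50 -> 50 -> 784
    l4 = tf.nn.tanh(fc_layer(l3, 2, 50))
    l5 = tf.nn.tanh(fc_layer(l4, 50, 50))
    out = fc_layer(l5, 50, 28*28)
    # mean squared reconstruction error over all pixels in the batch
    loss = tf.reduce_mean(tf.squared_difference(x, out))
    return loss, out, l3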

This function clearly shows the symmetry in the model through the parameters to the fc_layer function. In the end, we compute the mean squared error between the input and the output image and use that as the loss function.
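As a small worked example of the loss, here is the same mean squared error in plain Python on a made-up batch of two "images" with two pixels each. The function name and the numbers are purely illustrative. The point is that the squared differences are averaged over every pixel of every image in the batch.

```python
def mean_squared_error(x, out):
    # Average of the squared element-wise differences over the whole batch.
    total, count = 0.0, 0
    for x_row, out_row in zip(x, out):
        for a, b in zip(x_row, out_row):
            total += (a - b) ** 2
            count += 1
    return total / count

x = [[0.0, 1.0], [0.5, 0.5]]     # made-up inputs
out = [[0.0, 0.5], [0.5, 1.0]]   # made-up reconstructions
print(mean_squared_error(x, out))  # 0.125
```

Two of the four pixels are off by 0.5, so the loss is (0.25 + 0.25) / 4 = 0.125.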

Before turning to the code for training the model, I’ll present some code for using TensorBoard with this model. I wanted to be able to visualize the input, output, and latent layers in the model for a batch of input images. To do this nicely, I used the form_image_grid function from the Magenta package, with the following helper function:

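def layer_grid_summary(name, var, image_dims):
    # flatten every item in the batch to a vector of prod(image_dims) values
    prod = np.prod(image_dims)
    # tile the batch into a single GRID_ROWS x GRID_COLS image with 1 channel
    grid = form_image_grid(tf.reshape(var, [BATCH_SIZE, prod]), [GRID_ROWS,
        GRID_COLS], image_dims, 1)
    return tf.summary.image(name, grid)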

This creates a nice tiled image of GRID_ROWS x GRID_COLS as a single tensor, which we can use in a Summary. For the sake of presentation, we can wrap the TensorBoard related stuff in a function:
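To give an idea of what such a tiling does, here is a minimal NumPy sketch of the reshape-and-transpose trick for single-channel images. This is not Magenta's implementation, and `tile_images` is a hypothetical name. It only illustrates how a batch of flattened images can be laid out as one big image.

```python
import numpy as np

def tile_images(batch, grid_rows, grid_cols, image_dims):
    # batch: array of shape [grid_rows * grid_cols, height * width],
    # one flattened single-channel image per row.
    h, w = image_dims
    grid = batch.reshape(grid_rows, grid_cols, h, w)
    # Reorder to [grid_rows, h, grid_cols, w] so that pixel rows of images
    # in the same grid row end up next to each other.
    grid = grid.transpose(0, 2, 1, 3)
    return grid.reshape(grid_rows * h, grid_cols * w)

batch = np.arange(4 * 6).reshape(4, 6)   # 4 fake "images" of 2 x 3 pixels
tiled = tile_images(batch, 2, 2, (2, 3))
print(tiled.shape)   # (4, 6)
print(tiled[0])      # first pixel row of image 0 followed by that of image 1
```

The number of images in the batch has to equal the number of grid cells, otherwise the first reshape fails.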

def create_summaries(loss, x, latent, output):
    writer = tf.summary.FileWriter("./logs")
    tf.summary.scalar("Loss", loss)
    layer_grid_summary("Input", x, [28, 28])
    layer_grid_summary("Encoder", latent, [2, 1])
    layer_grid_summary("Output", output, [28, 28])
    return writer, tf.summary.merge_all()

By combining all summary statements in a single operation using tf.summary.merge_all(), we can use some simple code to create all summaries simultaneously. Finally, we have the following function to pull everything together and run the training:

def main():
    # load the data
    mnist = input_data.read_data_sets("./MNIST_data")
    # initialize inputs
    x = tf.placeholder(tf.float32, shape=[None, 28*28])
    # build the model
    loss, output, latent = autoencoder(x)
    # initialize optimizer
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
    # initialize summaries
    writer, summary_op = create_summaries(loss, x, latent, output)

    # run the training loop
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in range(10000):
            batch = mnist.train.next_batch(BATCH_SIZE)
            feed = {x : batch[0]}
            if i % 500 == 0:
                summary, train_loss = sess.run([summary_op, loss],
                    feed_dict=feed)
                print("Step: %d. Loss: %g" % (i, train_loss))

                writer.add_summary(summary, i)
                writer.flush()

            train_step.run(feed_dict=feed)

That’s the full code for the MNIST autoencoder. There are plenty of things to play with here, such as the network architecture, activation functions, the minimizer, training steps, etc. Below I’ll take a brief look at some of the results.

We can compare the input images to the autoencoder with the output images to see how accurate the encoding/decoding becomes during training. Remember, in the architecture above we only have 2 latent neurons, so in a way we’re trying to compress images with 28 x 28 = 784 bytes of information down to 2 bytes of information. This is not entirely true, because the latent neurons are real-valued and not restricted to the byte range [0, 255]. It would be more accurate to say that the autoencoder is a nonlinear feature transformation that maps a 784-dimensional space down to a 2-dimensional space.

Here is an animation that shows the evolution over time of some input images and the corresponding output images of the network.

Animation of the input and output layer of the network over time.

As you can see, there’s a bit of a grey offset in the images. This is because the last layer of the network doesn’t clip the pixel values to the valid range. I’m sure there are several ways to get rid of this offset; for instance, you could apply a sigmoid over the last layer to squash every output pixel into a fixed range. I found that adding a ReLU over the read-out layer can help remove this offset:

out = tf.nn.relu(fc_layer(l5, 50, 28*28))

With this modification, the animation looks as follows:

Animation of the input and output layer of the network with a ReLU over the read-out layer.
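To see what the ReLU does to individual read-out values, here is a tiny plain-Python illustration with made-up numbers (the `relu` helper is just for this example).

```python
def relu(values):
    # Element-wise max(0, v): negatives become 0, everything else is unchanged.
    return [max(0.0, v) for v in values]

raw_pixels = [-0.2, 0.0, 0.4, 1.3]   # made-up read-out values
print(relu(raw_pixels))  # [0.0, 0.0, 0.4, 1.3]
```

Note that the ReLU only cuts off values below zero. Outputs above the maximum pixel intensity still pass through unchanged.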

From this figure it is evident that the network actually converges pretty quickly and then spends quite some time tweaking the few incorrect outcomes. Notice that the encoding/decoding of the images removes a lot of nuance from the images: for instance the 2 1 pair on the second row has a lot of detail on the left that does not appear on the right.

To look at how the latent space maps images from different digits, we can feed some test set images through the network and record the values of the latent neurons. Here is a scatter plot of this latent space for the first 1000 images from the test set:

Plot of the latent space for the first 1000 digits of the test dataset.

It can be seen that the latent space for the digit 1 is quite well defined, as well as that for 0 and 6. However, the mappings strongly overlap for the digits 4 and 9 as well as for 3 and 8. Note that this mapping is not necessarily a clustering of the data points: there is nothing in the network structure that requires the images of the same digit to lie in the same area of the latent space. The only requirement is that the encoding is such that the decoder network can reconstruct the input image accurately. For an exploration of clustering the MNIST dataset, see Chris Olah’s blog.

Finally, we can take a point in the latent space and see the image that the decoder network constructs from it. Below is an animation where we walk around a circle in the latent space and show the corresponding output image. This animation shows how digits are transformed into their neighboring digits in the latent space.

Animation of a path through the latent space and the corresponding output images. Note that the mapping here is slightly different than in the previous figure, because this is generated from a different run.
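A path like this can be generated with a few lines of Python. The sketch below only produces the latent coordinates. The radius, center, and number of points are hypothetical values chosen for the example. Each point would then be fed to the decoder half of the network in place of the latent layer l3.

```python
import math

def circle_path(n_points, radius=1.0, center=(0.0, 0.0)):
    # Evenly spaced (z1, z2) latent coordinates on a circle.
    points = []
    for k in range(n_points):
        angle = 2.0 * math.pi * k / n_points
        points.append((center[0] + radius * math.cos(angle),
                       center[1] + radius * math.sin(angle)))
    return points

path = circle_path(8, radius=2.0)
print(len(path), path[0])  # 8 (2.0, 0.0)
```

Because the last point is one step short of a full turn, looping the resulting frames gives a seamless animation.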

That’s it! The full code can be found on GitHub. I hope you learned something from this post. I know I did!