Recently I’ve been playing around a bit with TensorFlow. Even though my past research hasn’t used a lot of deep learning, it’s a valuable tool to know how to use. To get to know the basics, I’m trying to implement a few simple models myself. The benefit of implementing them yourself is of course that it’s much easier to play with the code and extend it! In this post, I will present my TensorFlow implementation of Andrej Karpathy’s MNIST Autoencoder, originally written in ConvNetJS. You can find the code for this post on GitHub.

An autoencoder is a neural network that consists of two parts: an encoder and a decoder. The encoder network encodes the original data to a (typically) low-dimensional representation, whereas the decoder network converts this representation back to the original feature space. The idea is that the low-dimensional representation forms a non-linear dimensionality reduction of the original data. Here’s a graphical illustration of the network structure:

Okay, let’s get to the code. As mentioned, I’ll be reconstructing Andrej Karpathy’s structure, which consists of 2 fully connected layers with 50 neurons, then a layer with just 2 neurons, and then again 2 layers with 50 neurons before a final read-out layer. Since we’ll use a lot of fully connected layers, we’ll make some quick utility functions to make our life easier. The functions for the weights and biases are taken from the TensorFlow MNIST tutorial. First, for the weight matrix we use:

```
import tensorflow as tf

def weight_variable(shape):
    # Small random weights drawn from a truncated normal distribution
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)
```

and for the biases:

```
def bias_variable(shape):
    # Slightly positive constant bias
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
```

For the fully connected layer, we’ll make use of the fact that the MNIST data is monochrome, so we don’t have to care about the color channels. This means that we can define our fully connected layers simply as follows:

```
def fc_layer(prev, input_size, output_size):
    # Affine map only; any activation is applied by the caller
    W = weight_variable([input_size, output_size])
    b = bias_variable([output_size])
    return tf.matmul(prev, W) + b
```
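To make the shapes concrete, here is a small NumPy sketch of the same computation outside TensorFlow. It is only an illustration: the random values and the batch size of 4 are arbitrary assumptions, and `fc_layer_np` is a hypothetical name that is not part of the code for this post.

```python
import numpy as np

def fc_layer_np(prev, W, b):
    # prev: [batch, input_size], W: [input_size, output_size], b: [output_size]
    # The bias is broadcast across the batch dimension.
    return prev @ W + b

rng = np.random.default_rng(0)
x = rng.random((4, 28 * 28))                    # 4 flattened images
W = rng.normal(scale=0.1, size=(28 * 28, 50))   # weights of a 784 -> 50 layer
b = np.full(50, 0.1)                            # constant bias, as in bias_variable

h = fc_layer_np(x, W, b)
print(h.shape)  # (4, 50)
```

Every image in the batch is a row, so a whole batch passes through the layer with a single matrix product.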

With these utilities in place, we’re ready to build our model. Recall that the MNIST images are `28 x 28 = 784` pixels. Here we consider the input to the model to be a single vector of length 784. The model can then be captured through the following function:

```
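def autoencoder(x):
    # Encoder: 784 -> 50 -> 50 -> 2 (no activation on the latent layer)
    l1 = tf.nn.tanh(fc_layer(x, 28*28, 50))
    l2 = tf.nn.tanh(fc_layer(l1, 50, 50))
    l3 = fc_layer(l2, 50, 2)
    # Decoder: 2 -> 50 -> 50 -> 784
    l4 = tf.nn.tanh(fc_layer(l3, 2, 50))
    l5 = tf.nn.tanh(fc_layer(l4, 50, 50))
    out = fc_layer(l5, 50, 28*28)
    # Mean squared reconstruction error between input and output
    loss = tf.reduce_mean(tf.squared_difference(x, out))
    return loss, out, l3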
```

This function clearly shows the symmetry in the model through the parameters to the `fc_layer` function. In the end, we compute the mean squared error between the input and the output image and use that as the loss function.
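As a quick sanity check on the architecture, the layer sizes can be listed explicitly and the number of trainable parameters counted. This is a plain-Python sketch; the `layer_sizes` list is simply read off the `fc_layer` calls in the model function, and the variable names are my own.

```python
# Layer widths read off the fc_layer calls: input, encoder, latent, decoder, output.
layer_sizes = [28 * 28, 50, 50, 2, 50, 50, 28 * 28]

# Each fully connected layer has input_size * output_size weights
# plus output_size biases.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

print(params_per_layer)       # [39250, 2550, 102, 150, 2550, 39984]
print(sum(params_per_layer))  # 84586
```

The list of layer sizes reads the same forwards and backwards, which is the symmetry referred to here. The weight counts are symmetric as well; only the bias counts differ between the encoder and decoder halves.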

Before turning to the code for training the model, I’ll present some code for using TensorBoard with this model. I wanted to be able to visualize the input, output, and latent layers in the model for a batch of input images. To do this nicely, I used the `form_image_grid` function from the Magenta package, with the following helper function:

```
import numpy as np

def layer_grid_summary(name, var, image_dims):
    # form_image_grid (from Magenta) tiles a batch of images into one image
    prod = np.prod(image_dims)
    grid = form_image_grid(tf.reshape(var, [BATCH_SIZE, prod]),
                           [GRID_ROWS, GRID_COLS], image_dims, 1)
    return tf.summary.image(name, grid)
```

This creates a nice tiled image of `GRID_ROWS x GRID_COLS` images as a single tensor, which we can use in a Summary. For the sake of presentation, we can wrap the TensorBoard-related code in a function:
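To illustrate what such a tiling does, here is a NumPy sketch that arranges a batch of flattened images into one grid. It is not Magenta's implementation, just the same idea under the assumption that the batch holds exactly `rows * cols` single-channel images; `tile_images` is a hypothetical helper.

```python
import numpy as np

def tile_images(batch, grid_shape, image_shape):
    rows, cols = grid_shape
    height, width = image_shape
    # [rows * cols, height * width] -> [rows, cols, height, width]
    grid = batch.reshape(rows, cols, height, width)
    # Reorder the axes to [rows, height, cols, width]
    grid = grid.transpose(0, 2, 1, 3)
    # Merge the axes into one [rows * height, cols * width] image
    return grid.reshape(rows * height, cols * width)

# Six tiny 2x3 "images", each filled with its own index
batch = np.repeat(np.arange(6), 2 * 3).reshape(6, 2 * 3)
grid = tile_images(batch, grid_shape=(2, 3), image_shape=(2, 3))
print(grid.shape)  # (4, 9)
print(grid[0])     # [0 0 0 1 1 1 2 2 2]
```

The first row of the result contains the top pixel rows of images 0, 1, and 2 side by side, which is what a tiled summary image looks like.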

```
def create_summaries(loss, x, latent, output):
    writer = tf.summary.FileWriter("./logs")
    tf.summary.scalar("Loss", loss)
    layer_grid_summary("Input", x, [28, 28])
    layer_grid_summary("Encoder", latent, [2, 1])
    layer_grid_summary("Output", output, [28, 28])
    return writer, tf.summary.merge_all()
```

By combining all summary statements in a single operation using `tf.summary.merge_all()`, we can use some simple code to create all summaries simultaneously. Finally, we have the following function to pull everything together and run the training:

```
def main():
    # load the data
    mnist = input_data.read_data_sets("./MNIST_data")
    # initialize inputs
    x = tf.placeholder(tf.float32, shape=[None, 28*28])
    # build the model
    loss, output, latent = autoencoder(x)
    # initialize optimizer
    train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
    # initialize summaries
    writer, summary_op = create_summaries(loss, x, latent, output)
    # run the training loop
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in range(10000):
            batch = mnist.train.next_batch(BATCH_SIZE)
            feed = {x : batch[0]}
            if i % 500 == 0:
                summary, train_loss = sess.run([summary_op, loss],
                                               feed_dict=feed)
                print("Step: %d. Loss: %g" % (i, train_loss))
                writer.add_summary(summary, i)
                writer.flush()
            train_step.run(feed_dict=feed)
```
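The snippets above use `BATCH_SIZE`, `GRID_ROWS`, and `GRID_COLS` without defining them in this post. A minimal sketch of the constraint they have to satisfy is shown below; the specific values are assumptions chosen for illustration, not necessarily the ones used for the results shown here.

```python
# Hypothetical values; any choice works as long as the grid holds the batch.
BATCH_SIZE = 50
GRID_ROWS = 5
GRID_COLS = 10

# layer_grid_summary reshapes each tensor to [BATCH_SIZE, height * width] and
# tiles it into a GRID_ROWS x GRID_COLS grid, so the two counts must agree.
assert BATCH_SIZE == GRID_ROWS * GRID_COLS, "grid must hold exactly one batch"

# Size in pixels of the tiled 28 x 28 input and output summaries.
print(GRID_ROWS * 28, GRID_COLS * 28)  # 140 280
```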

That’s the full code for the MNIST autoencoder. There are plenty of things to play with here, such as the network architecture, activation functions, the minimizer, training steps, etc. Below I’ll take a brief look at some of the results.

We can compare the input images to the autoencoder with the output images to see how accurate the encoding/decoding becomes during training. Remember, in the architecture above we only have 2 latent neurons, so in a way we’re trying to encode the images with `28 x 28 = 784` bytes of information down to 2 bytes of information. This is not entirely true because the latent neurons are not clipped to the range `[0, 255]`. It would be more accurate to say that the autoencoder is a nonlinear feature transformation that maps a 784-dimensional space down to a 2-dimensional space.

Here is an animation that shows the evolution over time of some input images and the corresponding output images of the network.

As you can see, there’s a bit of a grey offset in the images. This is because the last layer of the network doesn’t clip the pixel values to the range `[0, 255]`. I’m sure there are several ways to get rid of this offset. For instance, you could apply a softmax over the last layer and multiply with 255. I found that adding a ReLU over the read-out layer can help remove it:

```
out = tf.nn.relu(fc_layer(l5, 50, 28*28))
```
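As a side illustration of this modification, the following NumPy sketch shows what a ReLU does to a read-out vector: it removes negative values but does not cap large ones. The activation values are made up, and `relu` here is a plain NumPy stand-in for `tf.nn.relu`.

```python
import numpy as np

def relu(values):
    # Negative activations are set to zero; positive ones pass through unchanged.
    return np.maximum(values, 0.0)

readout = np.array([-0.3, -0.05, 0.0, 0.4, 1.2])  # made-up read-out activations
print(relu(readout))  # negatives become 0, while 1.2 is left as it is
```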

With this modification, the animation looks as follows:

From this figure it is evident that the network actually converges pretty quickly and then spends quite some time tweaking the few incorrect outcomes. Notice that the encoding/decoding of the images removes a lot of nuance from the images: for instance the `2 1` pair on the second row has a lot of detail on the left that does not appear on the right.

To look at how the latent space maps images from different digits, we can feed some test set images through the network and record the values of the latent neurons. Here is a scatter plot of this latent space for the first 1000 images from the test set:

It can be seen that the latent space for the digit 1 is quite well defined, as well as that for 0 and 6. However, the mappings strongly overlap for the digits 4 and 9 as well as for 3 and 8. Note that this mapping is not necessarily a clustering of the data points: there is nothing in the network structure that *requires* the images of the same digit to lie in the same area of the latent space. The only requirement is that the encoding is such that the decoder network can reconstruct the input image accurately. For an exploration of *clustering* the MNIST dataset, see Chris Olah’s blog.
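The data behind such a scatter plot could be organized roughly as follows. This is a sketch with stand-in random data, since the real latent values come from running test images through the trained encoder; `group_by_digit` is a hypothetical helper, and integer digit labels are assumed.

```python
import numpy as np

def group_by_digit(latent, labels):
    # latent: [n, 2] latent activations, labels: [n] integer digit labels
    return {int(d): latent[labels == d] for d in np.unique(labels)}

# Stand-in data; in practice these would be the latent layer outputs and the
# corresponding labels for the first 1000 test images.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 2))
labels = rng.integers(0, 10, size=1000)

groups = group_by_digit(latent, labels)
for digit, points in groups.items():
    # Each group could be drawn in its own color in a scatter plot.
    print(digit, len(points), points.mean(axis=0))
```

Plotting each group separately makes overlaps such as those between 4 and 9 visible as regions where two colors mix.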

Finally, we can take a point in the latent space and see the image that the decoder network constructs from it. Below is an animation where we walk around a circle in the latent space and show the corresponding output image. This animation shows how digits are transformed into their neighboring digits in the latent space.
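The latent points for such a walk can be generated with a few lines of plain Python. The center, radius, and number of points below are hypothetical; sensible values depend on where the trained encoder places the digits, and `circle_points` is a name introduced here for illustration.

```python
import math

def circle_points(center, radius, num_points):
    # Evenly spaced points on a circle, starting at angle 0
    cx, cy = center
    points = []
    for k in range(num_points):
        angle = 2.0 * math.pi * k / num_points
        points.append((cx + radius * math.cos(angle),
                       cy + radius * math.sin(angle)))
    return points

path = circle_points(center=(0.0, 0.0), radius=1.0, num_points=8)
print(path[0])  # (1.0, 0.0)
```

Each point would then be passed through the decoder half of the network to produce one frame, for instance by feeding it in place of the latent tensor.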

That’s it! The full code can be found on GitHub. I hope you learned something from this post; I know I did!