
Handwritten Digit Recognition with Convolutional Neural Networks

This is a brief presentation of my analysis; my full IPython notebook contains the complete code.

Introduction

There's nothing more frustrating than waiting in front of an ATM during your lunch break to deposit a check. Many bank patrons today deposit checks with a phone app that reads the check through the camera. But just how accurate are these apps? Sadly, we don't have that information available, but I would consider anything under 1% error to be excellent. Luckily for us, we can build a neural network that mimics what these apps do. In this project, I will develop a deep learning model that achieves near state-of-the-art performance on the MNIST handwritten digit dataset, using Keras with TensorFlow.

Methods

To teach our machine to make predictions with neural networks, we are going to use deep learning with TensorFlow. Deep learning is a field of machine learning that uses algorithms inspired by how neurons function in the human brain. TensorFlow is a machine learning framework that Google created to design, build, and train deep learning models. The name "TensorFlow" comes from the way neural networks perform operations on multidimensional data arrays, or tensors. It's a flow of tensors, just like the human brain has a flow of signals between neurons!

We will also use Keras, which runs on top of TensorFlow. Keras was developed with a focus on enabling fast experimentation.

Data

We will be using the MNIST dataset, which was constructed from scanned document datasets available from the National Institute of Standards and Technology (NIST). The images were size-normalized and centered; each is a 28x28 square (784 pixels). 60,000 images are used to train the model and 10,000 are used to test it. Excellent results achieve a prediction error under 1%; state-of-the-art results reach approximately 0.2% error, which can be achieved with large convolutional neural networks.
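As a sketch, the dataset can be loaded through Keras's built-in `mnist` loader, which downloads the data on first use; the variable names here are my own:

```python
# Load MNIST through Keras's built-in loader (downloads ~11 MB on first use).
from tensorflow.keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

print(X_train.shape)  # (60000, 28, 28) -- 60,000 training images
print(X_test.shape)   # (10000, 28, 28) -- 10,000 test images
```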

Baseline Model with Multilayer Perceptrons

We start with a baseline model to compare against the convolutional neural network we will build later. For a multilayer perceptron model, we flatten each 28 by 28 pixel image into a single 784-length vector. We then rescale the grayscale values from 0-255 down to 0-1 (normalization) to make things easier on our neural network. Finally, we one-hot encode the digit categories 0-9 into a binary matrix. Our baseline neural network structure is as follows:

Visible Layer (784 Inputs) >> Hidden Layer (784 Neurons) >> Output Layer (10 Outputs)
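The preprocessing steps and baseline structure above can be sketched as follows; the activation functions and optimizer (ReLU, softmax, Adam) are assumptions based on common practice for this dataset rather than details from the post:

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.utils import to_categorical

(X_train, y_train), (X_test, y_test) = mnist.load_data()

num_pixels = 28 * 28  # 784

# Flatten each 28x28 image into a single 784-length vector.
X_train = X_train.reshape((X_train.shape[0], num_pixels)).astype("float32")
X_test = X_test.reshape((X_test.shape[0], num_pixels)).astype("float32")

# Normalize grayscale values from 0-255 down to 0-1.
X_train /= 255.0
X_test /= 255.0

# One-hot encode the digit labels 0-9 into a binary matrix.
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
num_classes = y_test.shape[1]  # 10

# Visible layer (784 inputs) >> hidden layer (784 neurons) >> output (10).
model = Sequential([
    Input(shape=(num_pixels,)),
    Dense(num_pixels, activation="relu"),
    Dense(num_classes, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```

Calling `model.fit(X_train, y_train, validation_data=(X_test, y_test))` then trains on the 60,000 samples and validates on the 10,000 held-out samples.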

Performance

With training on 60,000 samples and validation on 10,000 samples, we get a baseline error of 1.76%.

Simple Convolutional Neural Network

As expected, we achieved around 1-2% error, which is great. However, we can do better. Here, we take advantage of Keras's support for convolutional neural networks. We will use all aspects of a modern CNN implementation, including convolutional layers, pooling layers, and dropout layers.

Here are our changes to the baseline model:

  1. We add a convolutional layer with 32 feature maps, each of size 5 x 5. This is also our input layer, which expects the raw images as input.

  2. We then add a max pooling layer with a pool size of 2 x 2.

  3. We randomly drop out 20% of our neurons to reduce overfitting.

  4. We then flatten our data.

  5. We add a fully connected layer of 128 neurons with a rectifier activation function, as before.

  6. Finally, we use 10 neurons for the 10 prediction classes with a softmax activation function to output a probability-like prediction for each class.

Our new neural network structure is as follows:

Visible Layer (1x28x28 Inputs) >> Convolutional Layer (32 maps, 5x5) >> Max Pooling Layer (2x2) >> Dropout Layer (20%) >> Flatten Layer >> Hidden Layer (128 Neurons) >> Output Layer (10 Outputs)
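A sketch of this architecture in Keras follows. Note the structure above lists the input as 1x28x28 (channels-first ordering); recent TensorFlow defaults to channels-last, so this sketch uses 28x28x1, and the optimizer choice is my assumption:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, Dropout,
                                     Flatten, Dense)

model = Sequential([
    Input(shape=(28, 28, 1)),               # one grayscale channel
    Conv2D(32, (5, 5), activation="relu"),  # 32 feature maps, 5x5 each
    MaxPooling2D(pool_size=(2, 2)),         # 2x2 max pooling
    Dropout(0.2),                           # randomly drop 20% of activations
    Flatten(),
    Dense(128, activation="relu"),          # rectifier activation
    Dense(10, activation="softmax"),        # probability-like class outputs
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```

The images must accordingly be reshaped to (60000, 28, 28, 1) and normalized before training, rather than flattened as in the baseline.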

Performance

With training on 60,000 samples and validation on 10,000 samples, we get a CNN error of 1.07%.

Larger Convolutional Neural Network

Here we achieved around 1% error, which is excellent. However, we can get closer to state-of-the-art results. To do this, we deepen and widen our neural network.

Our new neural network structure is as follows:

Visible Layer (1x28x28 Inputs) >> Convolutional Layer (30 maps, 5x5) >> Max Pooling Layer (2x2) >> Convolutional Layer (15 maps, 3x3) >> Max Pooling Layer (2x2) >> Dropout Layer (20%) >> Flatten Layer >> Hidden Layer (128 Neurons) >> Hidden Layer (50 Neurons) >> Output Layer (10 Outputs)
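A sketch of the larger architecture, again assuming channels-last input and the Adam optimizer; note that a Flatten layer is needed between the dropout and the first fully connected layer:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, Dropout,
                                     Flatten, Dense)

model = Sequential([
    Input(shape=(28, 28, 1)),
    Conv2D(30, (5, 5), activation="relu"),  # 30 feature maps, 5x5
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(15, (3, 3), activation="relu"),  # 15 feature maps, 3x3
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.2),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(50, activation="relu"),
    Dense(10, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```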

Performance

With training on 60,000 samples and validation on 10,000 samples, we get a CNN error of 0.74%.

Conclusion

By taking advantage of a larger convolutional neural network with Keras, we were able to go from 1-2% prediction error to less than 1%: near state-of-the-art results! However, even with this model there are further improvements we could make with image augmentation and a much more powerful GPU.

Possible Improvements

First, we start with the baseline images again; then we apply image augmentation to the dataset.

Baseline

This is our original image.

Feature Standardization

Just as we standardize scalar features, we can standardize pixel values across the whole dataset. The result is that different images come out slightly darkened or lightened.

ZCA Whitening

Here, we reduce the redundancy between pixels in order to highlight the structure of the images. ZCA whitening is similar in spirit to principal component analysis, but it keeps the result in pixel space so the whitened images still look like digits.

Random Rotations

Different people write at different angles. Here, we randomly rotate images by up to 90 degrees.

Random Shifts

Sometimes digits won't be exactly centered. Here, we randomly shift digits so they are slightly off-center.
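All four augmentations above can be sketched with Keras's `ImageDataGenerator` (deprecated in newer Keras releases but still available through `tf.keras`); fitting the statistics on a 1,000-image subset here is purely my shortcut to keep the whitening computation quick:

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.preprocessing.image import ImageDataGenerator

(X_train, y_train), _ = mnist.load_data()
# Add a channel dimension: (60000, 28, 28) -> (60000, 28, 28, 1).
X_train = X_train.reshape((X_train.shape[0], 28, 28, 1)).astype("float32")

datagen = ImageDataGenerator(
    featurewise_center=True,             # feature standardization:
    featurewise_std_normalization=True,  #   subtract mean, divide by std
    zca_whitening=True,                  # de-correlate neighboring pixels
    rotation_range=90,                   # random rotations up to 90 degrees
    width_shift_range=0.1,               # random horizontal shifts
    height_shift_range=0.1,              # random vertical shifts
)
# Compute the mean/std/whitening statistics (subset used for speed).
datagen.fit(X_train[:1000])

# Draw one augmented batch of nine images to inspect.
batch_X, batch_y = next(datagen.flow(X_train, y_train, batch_size=9))
```

In practice, a model would be trained on such batches with `model.fit(datagen.flow(X_train, y_train, batch_size=200), ...)` instead of on the raw arrays.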
