Introduction To Convolutional Neural Networks

In the previous article we talked about simple neural networks such as the perceptron. This article follows up with convolutional neural networks. We will introduce them and their biological counterpart, and then describe their main parts, such as convolution and maxpooling.


The relative position of individual pixels in an image is important. A single pixel is merely a color and means nothing on its own, but many of them together form patterns and whole images. In other words, the spatial relationship between pixels carries most of the information in a picture.

Let's say we want to detect a human face in an image. A simple neural network would assign every pixel to one neuron in the input layer. But what does that mean? It means that we do not keep the spatial information of the pixels. We split the image into individual pixel values and feed each one to its own input neuron.

But in the case of face recognition you have parts like eyes. Eyes are complex objects composed of several parts: the pupil, iris, sclera and even the eyelids. Every eye has them. Would you be able to detect an eye from only one pixel, or from only one of its parts? Probably not. Only the whole, arranged in a specific order, makes sense.

If you train a neural network on an image of an eye, it will only work if the eye is in exactly the same position in the image every time. When you move, scale or rotate the eye, the network will most likely fail to predict the correct output.
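To make the position problem concrete, here is a minimal sketch in plain Python. The two 4x4 images and the helper `flatten` are made up for illustration. It shows that the same small shape, moved to another corner, activates a completely different set of input neurons once the image is flattened.

```python
def flatten(image):
    """Turn a 2-D list of pixel rows into one flat list, row by row."""
    return [pixel for row in image for pixel in row]

# Hypothetical 4x4 grayscale image with a bright 2x2 blob in the top-left corner.
blob_top_left = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# The same blob moved to the bottom-right corner.
blob_bottom_right = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Indices of the input neurons that receive a non-zero value.
active_a = [i for i, v in enumerate(flatten(blob_top_left)) if v]
active_b = [i for i, v in enumerate(flatten(blob_bottom_right)) if v]

print(active_a)  # [0, 1, 4, 5]
print(active_b)  # [10, 11, 14, 15]
```

The two index lists share no element, so whatever the network learned about the first image tells it nothing about the second.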

We need some way to look for specific patterns instead of individual pixels. And that is what convolutional neural networks do.

Biological inspiration

Convolutional neural networks are inspired by the visual cortex: the part of the brain shaped by evolution to process the visual information we get from our eyes. Starting in the 1950s, David Hubel and Torsten Wiesel experimented with so-called receptive fields. They discovered that some neurons are activated only by a specific shape (and its orientation), color or motion. In the video below, they showed specific shapes to cats and monitored their neural activity (the popping sounds indicate activated neurons).

Convolutional neural networks

A convolutional neural network learns kernels (sometimes called filters) that detect specific features (in other words, shapes). These can be simple features like edges or complex ones like an eye. Just like in an ordinary neural network, you can stack convolutions into layers, where filters in deeper layers are able to detect more complex features.

Convolutional layers

In the figure above you see a representation of the layers and filters in a network for detecting faces. The first layer can detect basic edges. The second layer combines the features from the previous layer, so it is able to detect more complex shapes like an eye, nose or mouth. The third and last layer can detect whole faces.

A convolutional neural network consists mainly of four parts: convolution, activation, maxpooling and a fully-connected layer, which are described in separate sections below.


In the case of image recognition, the input is an image transformed into a matrix of pixel values. If you have a grayscale image (like those in MNIST), each pixel has only one value, whereas a color image has three values per pixel (RGB).
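As a small illustration of that representation, the sketch below builds a tiny grayscale image and a tiny color image as nested Python lists. The pixel values and the helper `split_channels` are invented for this example; real code would normally load an image with a library instead.

```python
# Hypothetical 2x2 grayscale image: one intensity value (0-255) per pixel.
gray = [
    [0, 128],
    [255, 64],
]

# Hypothetical 2x2 color image: three values (R, G, B) per pixel.
color = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

def split_channels(image):
    """Split an RGB image into three single-channel matrices."""
    return [[[pixel[c] for pixel in row] for row in image] for c in range(3)]

red, green, blue = split_channels(color)
print(red)    # [[255, 0], [0, 255]]
print(green)  # [[0, 255], [0, 255]]
print(blue)   # [[0, 0], [255, 255]]
```

Each channel has the same shape as the grayscale image, which is why a color image is often pictured as three stacked layers.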

During learning, the convolutional network will choose appropriate kernels. All you need to do is define their number (more filters can detect more features, but increase the time required for convolution) and size (usually 3x3 or 5x5).

Then you take the kernels one by one and slide each over the input image, moving it by some stride value. At each stop of the kernel, you multiply the kernel element by element with the part of the input image it covers (this is not a true matrix multiplication) and sum the products. The summed output is written into a new matrix called a feature map.
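The sliding procedure can be sketched in a few lines of plain Python. The function name `convolve2d`, the 5x5 image and the 3x3 kernel below are illustrative assumptions, not values from any particular network, and the kernel is applied without flipping, as is common in neural network code.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    k = len(kernel)  # assumes a square kernel
    rows, cols = len(image), len(image[0])
    feature_map = []
    for top in range(0, rows - k + 1, stride):
        out_row = []
        for left in range(0, cols - k + 1, stride):
            # Multiply element by element, then sum the products.
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[top + i][left + j] * kernel[i][j]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map

image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

print(convolve2d(image, kernel))            # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
print(convolve2d(image, kernel, stride=2))  # [[4, 4], [2, 4]]
```

A larger stride skips positions, so the feature map gets smaller and the convolution gets cheaper.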


In the picture above, convolution is done with a kernel of size 3x3 and a stride of 1. This image has only one color channel, thus there is only one layer. In the case of a color image you would have three layers. The kernel then has one 3x3 slice per layer, and the three per-layer results are usually summed into a single feature map for that kernel.

The original input image is painted yellow and the 0s in white are added artificially. This is called padding. Without it, a 3x3 kernel would fit into only 3 positions along each side of a 5x5 image (as in the example). This means that the feature map would be only 3x3 pixels large, and after every further convolution it would get even smaller. Sometimes you want to keep the original size so you can apply more operations.
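The effect of padding on the size of the feature map can be checked with a short sketch. The helpers `pad` and `output_size` are hypothetical names; the size formula assumes a square input, a square kernel and the same padding on every side.

```python
def pad(image, amount=1, value=0):
    """Surround `image` with `amount` rows and columns of `value`."""
    width = len(image[0]) + 2 * amount
    top = [[value] * width for _ in range(amount)]
    bottom = [[value] * width for _ in range(amount)]
    middle = [[value] * amount + list(row) + [value] * amount for row in image]
    return top + middle + bottom

def output_size(n, k, padding=0, stride=1):
    """Side length of the feature map for an n x n input and a k x k kernel."""
    return (n + 2 * padding - k) // stride + 1

print(pad([[1, 2], [3, 4]]))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]

print(output_size(5, 3))             # 3: the feature map shrinks
print(output_size(5, 3, padding=1))  # 5: the original size is kept
```

With a 3x3 kernel and a stride of 1, one ring of zeros is exactly enough to keep a 5x5 input at 5x5.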


You have to pass the output of the convolution through some activation function. A widely used and proven activation is ReLU, which is described in the previous post and reduces the time required to train the network.
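As a reminder, ReLU simply replaces every negative value with zero and leaves the rest unchanged. A minimal sketch applied to a feature map (the helper names and the sample values are made up):

```python
def relu(x):
    """Rectified linear unit: negative values become 0."""
    return max(0, x)

def relu_map(feature_map):
    """Apply ReLU to every value of a 2-D feature map."""
    return [[relu(v) for v in row] for row in feature_map]

print(relu_map([[-3, 2], [0, -1]]))  # [[0, 2], [0, 0]]
```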


Another important part of a convolutional neural network is maxpooling. All it does is reduce the image size. You set the size of the maxpooling window, usually to 2x2, and then slide it over the image; this time you do not want the stops to overlap. For every stop, you keep only the pixel with the highest value. In the example image below you can see maxpooling at work; each stop is painted in a different color.


As you can see, 2x2 maxpooling reduces the number of pixels by 75 %, which significantly reduces the computation needed further in the network. It can also help against overfitting, because small details are usually not important for predicting the correct output.
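Maxpooling is simple enough to sketch directly. The function `max_pool` and the 4x4 feature map below are illustrative assumptions; the window moves by its own size so that the stops never overlap.

```python
def max_pool(feature_map, size=2):
    """Keep the largest value of every non-overlapping size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, rows - size + 1, size):
        out_row = []
        for left in range(0, cols - size + 1, size):
            window = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            out_row.append(max(window))
        pooled.append(out_row)
    return pooled

feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

print(max_pool(feature_map))  # [[6, 8], [3, 4]]
```

The 16 input values shrink to 4, which is the 75 % reduction mentioned above.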

Fully-connected layer

The purpose of the convolutional layers is just to extract features from an image (or other data); they cannot predict the output on their own. This problem is solved by adding one or two fully-connected layers (made of perceptrons) at the end of the convolutional neural network. These layers are trained to take the output of the convolution and predict the correct output.

The output of the convolution is a 3-dimensional matrix, because you have multiple feature maps. Therefore you flatten them into a single vector, so that every value becomes one input to the fully-connected layer.
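Flattening and one fully-connected layer can be sketched as follows. The feature maps, weights and biases are hand-picked for illustration; in a real network the weights would be learned during training, and an activation would usually follow.

```python
def flatten(feature_maps):
    """Turn a list of 2-D feature maps into one flat list of values."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def dense(inputs, weights, biases):
    """One fully-connected layer: every output neuron sees every input."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]

# Two hypothetical 2x2 feature maps, i.e. a 2x2x2 output of the convolution.
feature_maps = [
    [[6, 8], [3, 4]],
    [[1, 0], [2, 5]],
]
inputs = flatten(feature_maps)
print(inputs)  # [6, 8, 3, 4, 1, 0, 2, 5]

# Hypothetical weights for two output neurons (one row per neuron).
weights = [
    [1, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 0, 0, 0],
]
biases = [0, -1]
print(dense(inputs, weights, biases))  # [11, 10]
```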

Convolutional neural network

In the image above you see a representation of a very simple convolutional neural network. You convolve the input image with 3 kernels, which creates 3 feature maps that are then maxpooled. At the end you flatten the output into a fully-connected layer, which is trained to predict the output.
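The whole pipeline fits into one short forward pass. This is only a sketch under stated assumptions: the 4x4 image, the three kernels and the output weights are chosen by hand rather than learned, there is no padding, and the stride is fixed at 1.

```python
def convolve2d(image, kernel):
    """Unpadded convolution with stride 1."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(size)
        ]
        for r in range(size)
    ]

def relu_map(fmap):
    return [[max(0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]

def forward(image, kernels, weights, biases):
    """Convolution -> ReLU -> maxpooling -> flatten -> fully-connected layer."""
    maps = [max_pool(relu_map(convolve2d(image, k))) for k in kernels]
    flat = [v for m in maps for row in m for v in row]
    return [sum(w * x for w, x in zip(ws, flat)) + b for ws, b in zip(weights, biases)]

# Hypothetical 4x4 image containing a vertical edge.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernels = [
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],    # responds to vertical edges
    [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],    # responds to horizontal edges
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],       # sums overall brightness
]
# Hand-picked output weights: neuron 0 votes "vertical", neuron 1 "horizontal".
weights = [[1, -1, 0], [-1, 1, 0]]
biases = [0, 0]

print(forward(image, kernels, weights, biases))  # [3, -3]
```

Feeding the same network an image with a horizontal edge instead flips the two scores, because a different kernel responds.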

Real world applications

Convolutional neural networks are commonly used in real-world situations. For example, Facebook uses them to recognize faces in photos with 97.35 % accuracy, Google uses them, among other things, to let you search your photos by their content or the people in them, and self-driving cars use them to interpret the road ahead.

Another recent and interesting example is DeepMind's AlphaGo, a program built on convolutional neural networks that were trained first on games played by human experts and later with reinforcement learning by playing against itself. This approach proved highly successful, and the program beat Lee Sedol, one of the best Go players in the world, in 4 out of 5 games in March 2016. Once again, a computer defeated a human.

Lee Sedol after losing the third game out of five.