A brief, easy-to-follow introduction to artificial neural networks. I describe the biological inspiration, artificial neurons and their activation functions, and the multilayer perceptron architecture. I also mention training methods and ways to prevent overfitting.

## Biological neuron

Every human brain contains almost 100 billion neurons, the basic units of our nervous system. Everything we do, from solving mathematical equations or making sarcastic remarks to basic life functions like the beating of the heart, is managed by our neural network.

However, the brain is not the most powerful computer in the world just because of this extreme number of neurons. Each of them connects to roughly 10,000 other neurons through synapses. This creates one of the most fascinating and complex networks in the universe, where each neuron talks to its neighbours.

Neurons communicate using electrical and chemical signals passed through synapses. When the incoming signal exceeds a threshold, the neuron fires a signal to other neurons. Otherwise it stays off.

During the conscious and unconscious learning that goes on our whole life, the synapses are adjusted to our needs. If your brain triggers some specific synapses often, they get stronger and therefore more important. Otherwise they weaken.

## Artificial neuron

An artificial neuron imitates a biological neuron in many ways. An artificial neuron (from now on just neuron) receives one or more inputs (similar to the electrical signals in a biological neuron), each with a weight that the input is multiplied by. The weight imitates synapse strength and expresses which inputs are important and which are not.

The neuron then sums all the inputs multiplied by their corresponding weights (in other words, it computes a dot product), and you get the potential of the neuron. Now you put the potential through an activation function, which determines the output. That output can in turn be the input of other neurons.
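To make the two steps concrete, here is a minimal sketch of a single neuron in plain Python. The function names, the example numbers and the simple threshold activation are my own choices for illustration, not part of any library.

```python
def neuron_potential(inputs, weights):
    """Weighted sum (dot product) of the inputs and their weights."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


def step(potential, threshold=0.0):
    """Toy activation: fire (1.0) above the threshold, stay off (0.0) otherwise."""
    return 1.0 if potential > threshold else 0.0


# Hypothetical example values: three inputs with three weights.
inputs = [1.0, 0.5, -2.0]
weights = [0.5, 1.0, 0.25]

potential = neuron_potential(inputs, weights)  # 0.5 + 0.5 - 0.5
output = step(potential)
```

The threshold activation mirrors the "fires or stays off" behaviour of the biological neuron. In practice you would swap it for a smoother function such as the ones below.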

The most commonly used activation functions are sigmoid and ReLU.

### Sigmoid function

As much as I would like to exclude math from this post, I cannot when describing functions because, well, they are mathematical functions. The sigmoid function is defined as follows:

$$f(x) = \frac{1}{1+e^{-x}}$$

Here $x$ is the potential of the neuron I described earlier and $f$ denotes the function. The output always lies between 0 and 1. The graph is a smooth S-shaped curve that flattens out toward 0 for very negative inputs and toward 1 for very positive ones.
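If you prefer code to formulas, the sigmoid can be sketched in a few lines of Python. The branch for negative inputs is my own addition: it computes the same value but avoids an overflow in `math.exp` for very negative potentials.

```python
import math


def sigmoid(x):
    """Sigmoid activation: f(x) = 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Algebraically identical form that stays safe for very negative x.
    z = math.exp(x)
    return z / (1.0 + z)


middle = sigmoid(0.0)   # exactly halfway between 0 and 1
low = sigmoid(-5.0)     # close to 0
high = sigmoid(5.0)     # close to 1
```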

### ReLU function

Rectified Linear Unit, or just ReLU, is much simpler than it sounds. All it does is threshold the input at 0. So, if your potential ($x$) is above 0, it just sends it forward unchanged. Otherwise the output is 0. We would write this function as:

$$f(x) = \max(0, x)$$

The graph looks like a line that is bent (rectified) at 0: flat for negative inputs and rising with slope 1 for positive ones.

ReLU works in a similar manner to a biological neuron. It is either off (the output is 0) or it fires a signal. It is also computationally faster than sigmoid, because it does not need an exponential, only a comparison.

But there is a problem with ReLUs: they can die. A ReLU neuron dies when it ends up outputting 0 for practically every input. Its weights then stop being updated, and if this happens to a large part of the network, the network stops working properly. You can reduce the risk by setting the network parameters carefully or by using ReLU variants such as Noisy ReLUs, Leaky ReLUs and ELUs. Leaky ReLUs and ELUs, for example, return a small negative value instead of exactly 0 for negative inputs.
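In code, ReLU really is a one-liner. This is a minimal sketch in plain Python; the function name is mine.

```python
def relu(x):
    """ReLU activation: pass positive potentials through, clamp the rest to 0."""
    return max(0.0, x)


off = relu(-3.0)     # negative potential, the neuron stays off
firing = relu(2.5)   # positive potential is sent forward unchanged
```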
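As an illustration of one of these variants, here is a sketch of a Leaky ReLU. The slope of 0.01 for negative inputs is a common default but still an assumption on my part; treat it as a tunable parameter.

```python
def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: like ReLU, but negative inputs keep a small slope."""
    return x if x > 0 else negative_slope * x


positive = leaky_relu(2.0)    # unchanged, same as ReLU
negative = leaky_relu(-2.0)   # small negative value instead of exactly 0
```

Because the output for negative potentials is not flat, the neuron still receives a small learning signal there and is less likely to die.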

## Architecture

Just like in our brain, you have to connect neurons together to form an architecture. Unfortunately, we are not able to construct anything on par with the brain. Yet.

However, we are able to create networks that can successfully solve a specific task (like image detection or speech recognition). The simplest architecture is called the multilayer perceptron and consists only of the simple neurons I described earlier. They are arranged in layers, and every neuron is connected to every neuron in the following layer (but not to neurons inside its own layer).

The input layer receives input from outside. For example, it could be an image. In this case, there would be one neuron in the input layer for every pixel of the image. For MNIST, a widely known dataset of handwritten digits, that would be 784 neurons, because the images are 28x28 pixels each.

The term hidden layer only means that the layer is neither the input nor the output layer. The deeper the layers go, the more abstract the features they can work with. Because we now have relatively high computational capacity, we can use more layers that “work out” the output themselves, so we do not have to prepare the data as carefully as in other machine learning approaches.

The output layer can include one or more neurons, depending on whether it is a classification or a regression problem. In classification, you will have as many output neurons as you have possible classes. In this case, you put the softmax activation function on the output layer. It squashes the outputs into values between 0 and 1 that sum to 1, so you can read each output as the probability the network assigns to that class. For example, if the neuron that stands for class A has the value 0.8, the network estimates an 80% probability that the input belongs to class A.
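Here is a rough sketch of how an input travels through such fully connected layers. The layer sizes (784 inputs, one hidden layer of 16 neurons, 10 outputs for the ten digits), the weight range and all function names are assumptions for illustration, and I use sigmoid everywhere to keep it short.

```python
import math
import random


def sigmoid(x):
    """Numerically safe sigmoid activation."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dense_layer(inputs, weights, activation):
    """weights[j] holds the weights feeding neuron j of this layer."""
    return [
        activation(sum(x * w for x, w in zip(inputs, neuron_weights)))
        for neuron_weights in weights
    ]


def random_weights(n_inputs, n_neurons, rng):
    """One list of n_inputs random weights per neuron."""
    return [
        [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
        for _ in range(n_neurons)
    ]


def forward(inputs, layers):
    """Feed the inputs through every layer in order."""
    activations = inputs
    for weights in layers:
        activations = dense_layer(activations, weights, sigmoid)
    return activations


rng = random.Random(0)          # fixed seed so the sketch is repeatable
layer_sizes = [784, 16, 10]     # input, hidden, output (hypothetical sizes)
layers = [
    random_weights(n_in, n_out, rng)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]

image = [rng.random() for _ in range(784)]  # stand-in for a 28x28 image
outputs = forward(image, layers)            # one value per output neuron
```

With random weights the outputs are meaningless, which is exactly why the network has to be trained.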
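A minimal softmax sketch in plain Python follows. Subtracting the largest value before exponentiating is a standard trick I added to avoid overflow; it does not change the result. The example potentials are made up.

```python
import math


def softmax(values):
    """Turn a list of potentials into probabilities that sum to 1."""
    shift = max(values)  # for numerical stability only
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical potentials of three output neurons (classes A, B, C).
probabilities = softmax([2.0, 0.5, -1.0])
predicted_class = probabilities.index(max(probabilities))
```

The neuron with the highest potential gets the highest probability, so `predicted_class` is simply the index of the largest output.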

## Training the network

Right now, if you put image data through the network, it would not do anything useful. The goal is to adjust the weights in the network so that it predicts the correct output. The process of adjusting the weights is called learning, and its goal is to lower the prediction error.

The kind you will probably use is supervised learning. In this case you have a training dataset that contains input data (images of digits), each labeled with the correct output. We propagate the input data through the network and then adjust the weights so that the prediction moves closer to the correct label. As a result, the network tends to do very well on images it has already seen. We repeat this process for every image in the training dataset, but we adjust the weights after small batches of images (so-called minibatches) rather than after each single one. Propagating the input forward and then passing the error backwards through the network to work out how each weight should change is called backpropagation.

The initial weights are chosen randomly. We calculate the error using a loss function (e.g. MSE or cross-entropy). To adjust the weights we need an optimization algorithm such as gradient descent, which works closely with the loss function. The goal is to find the weight values that make the loss as small as possible.

In the end we test our network on a test dataset. This dataset must contain only new data because, as I mentioned earlier, the network has an unfair advantage on data it was trained on and can look far more accurate there than it really is. We call the error on new data the generalization error, and it is very important for evaluating our network. A network that has high accuracy on training data and poor accuracy on new data is useless.
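Splitting a training set into minibatches can be sketched like this. The helper name and the batch size are hypothetical, and I leave out the shuffling that is usually done before each pass over the data.

```python
def minibatches(samples, batch_size):
    """Yield consecutive chunks of at most batch_size samples."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Ten stand-in training samples, processed four at a time.
training_samples = list(range(10))
batches = list(minibatches(training_samples, batch_size=4))
```

The weights would be adjusted once per batch, so here three times per pass over the data. The last batch is smaller when the dataset size is not a multiple of the batch size.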
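To show how a loss function and gradient descent work together, here is a deliberately tiny sketch: a single neuron with one weight and no activation function, trained with MSE on made-up data that follows $y = 2x$. The data, the learning rate and the number of steps are assumptions chosen so the example converges quickly.

```python
import random


def mse(w, data):
    """Mean squared error of the prediction w * x against the labels y."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def mse_gradient(w, data):
    """Derivative of the MSE with respect to the weight w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


# Made-up training data: every label is exactly twice the input.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

rng = random.Random(0)
w = rng.uniform(-1.0, 1.0)      # random initial weight
initial_loss = mse(w, data)

learning_rate = 0.05            # hypothetical step size
for _ in range(100):
    # Move the weight a small step against the gradient of the loss.
    w -= learning_rate * mse_gradient(w, data)

final_loss = mse(w, data)
```

After the loop, `w` ends up very close to 2, the value that makes the loss smallest. A real network does the same thing for many weights at once, with backpropagation supplying the gradients.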

### Overfitting

Overfitting is an omnipresent problem of neural networks. It means that your network is overly specialized on the training data, so it fails on new data and is extremely fragile when the input changes.

To prevent this we set aside another dataset called the validation dataset. During training we check that reducing the error on the training dataset also reduces the error on the validation dataset. If the error on the validation dataset does not decrease for some time, we stop the learning. This is called early stopping.

Another method to prevent overfitting is dropout. It turns off a random subset of neurons during training, so that other neurons have to take their place and learn to predict the output properly. Individual neurons are not able to overly specialize, and the overall network becomes more robust and resistant to small and unimportant changes in the input (i.e. noise).
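The stopping rule can be sketched as a small function that watches the validation error after each pass over the training data (an epoch). The `patience` parameter, the function name and the error values are hypothetical; "for some time" is simply expressed as a number of epochs without improvement.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        if error < best:
            best = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # The error kept improving often enough, so training ran to the end.
    return len(validation_errors) - 1


# Made-up validation errors: they improve until epoch 3, then get worse.
errors = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.57, 0.60, 0.65]
stop_epoch = early_stopping_epoch(errors, patience=3)
```

In this example the best error is reached at epoch 3 and training stops three epochs later. In practice you would usually also keep the weights from the best epoch.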
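Here is a minimal sketch of dropout applied to the outputs of one layer. Note one addition that is not described above: the surviving outputs are scaled up by `1 / keep_probability` (often called inverted dropout), so the expected total signal stays the same. The function name and the drop probability of 0.5 are assumptions.

```python
import random


def dropout(activations, drop_probability, rng):
    """Zero each activation with the given probability; scale the survivors."""
    if not 0.0 <= drop_probability < 1.0:
        raise ValueError("drop_probability must be in [0, 1)")
    keep_probability = 1.0 - drop_probability
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
layer_outputs = [1.0] * 1000           # stand-in activations of one layer
dropped = dropout(layer_outputs, drop_probability=0.5, rng=rng)
```

Dropout is only applied during training. When you use the trained network for prediction, all neurons stay on.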