In this blog, we will be talking about neural networks (NNs), also called Artificial Neural Networks (ANNs). ANNs imitate the behavior of neurons in the human brain to solve complex data problems.
These technologies solve problems in Image Recognition (Computer Vision), Speech Recognition, Pattern recognition, and Natural Language Processing (NLP), to name a few.
In this article, you will learn the basics of ANNs and get an in-depth look at how neural networks operate.
BUILDING BLOCKS: NEURONS
First, each input is multiplied by its respective weight: x1 → x1 * w1, x2 → x2 * w2
Then, all the weighted inputs are summed together with a bias b: (x1 * w1) + (x2 * w2) + b
Finally, the sum is passed through an activation function: y = f(x1 * w1 + x2 * w2 + b). A commonly used activation function is the sigmoid function.
The sigmoid function, f(x) = 1 / (1 + e^(-x)), has a range of (0, 1), which means it can only output numbers between 0 and 1. You can think of it as compressing (−∞, +∞) to (0, 1) – big negative numbers become ~0, and big positive numbers become ~1.
The process of passing inputs forward to get an output is known as feedforward.
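A single neuron's feedforward step can be sketched in a few lines of Python. The weights and bias below are arbitrary placeholder values chosen for illustration, not values from the article:

```python
import math

def sigmoid(x):
    # Compresses (-inf, +inf) into (0, 1)
    return 1 / (1 + math.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        # Weight each input, sum with the bias, then apply the activation
        total = sum(w * x for w, x in zip(self.weights, inputs)) + self.bias
        return sigmoid(total)

n = Neuron(weights=[0, 1], bias=4)
print(n.feedforward([2, 3]))  # sigmoid(0*2 + 1*3 + 4) = sigmoid(7) ≈ 0.999
```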
COMBINING NEURONS INTO A NEURAL NETWORK
This network has 2 inputs (x1 and x2), which together form the input layer (a layer is a column of neurons); a hidden layer with 2 neurons (h1 and h2); and an output layer with 1 neuron (o1). Notice that the inputs for o1 are the outputs from h1 and h2 – that’s what makes this a network. A network can have multiple hidden layers (this one has only one), and the number of neurons in each layer can differ – for example, a first hidden layer with 2 neurons and a second hidden layer with 100 neurons.
Choosing the number of neurons and layers is a matter of trial and error: you generally start with some number of layers and neurons, then adjust them to see whether the results improve.
The basic idea stays the same: feed the input(s) forward through the neurons in the network to get the output(s) at the end. For simplicity, we’ll keep using the network pictured above for the rest of this blog.
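The full feedforward pass through the pictured network can be sketched as follows. All weights and biases here are toy placeholder values, not anything from the article:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        total = sum(w * x for w, x in zip(self.weights, inputs)) + self.bias
        return sigmoid(total)

class SimpleNetwork:
    # 2 inputs -> hidden layer (h1, h2) -> output neuron o1
    def __init__(self):
        # Every neuron gets the same toy weights [0, 1] and bias 0
        self.h1 = Neuron([0, 1], 0)
        self.h2 = Neuron([0, 1], 0)
        self.o1 = Neuron([0, 1], 0)

    def feedforward(self, x):
        out_h1 = self.h1.feedforward(x)
        out_h2 = self.h2.feedforward(x)
        # o1's inputs are the outputs of h1 and h2
        return self.o1.feedforward([out_h1, out_h2])

net = SimpleNetwork()
print(net.feedforward([2, 3]))  # ≈ 0.7216
```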
TRAINING A NEURAL NETWORK
We will introduce training with an example: let’s train our network to predict someone’s gender given their weight and height.
For that, we will need a dataset:
We’ll represent Male with a 0 and Female with a 1 when predicting, as our neural network can only generate numbers.
You must be wondering what our neural network will look like in this example:
Before we train our network, we first need a way to quantify how “good” it’s doing so that it can try to do “better”. That’s what the loss is.
We’ll use the mean squared error (MSE) loss: MSE = (1/n) * Σ (ytrue − ypred)², where the sum runs over all n samples.
Let’s break this down:
- n is the number of samples, which is 4 (Alice, Bob, Charlie, Diana).
- y represents the variable being predicted, which is Gender.
- ytrue is the true value of the variable (the “correct answer”). For example, ytrue for Alice would be 1 (Female).
- ypred is the predicted value of the variable. It’s whatever our network outputs.
Our loss function is simply taking the average over all squared errors (hence the name mean squared error). The better our predictions are, the lower our loss will be.
Better predictions = Lower loss.
Training a network = trying to minimize its loss.
Let’s say our network always outputs 0 – in other words, it’s confident all humans are Male. What would our loss be?
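We can check this with a short sketch. Only Alice’s label (1) is given above, so the labels for Bob, Charlie, and Diana below are assumed values for illustration:

```python
def mse_loss(y_true, y_pred):
    # Average of the squared differences between truth and prediction
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Alice = 1 (given above); Bob, Charlie, Diana assumed to be 0, 0, 1
y_true = [1, 0, 0, 1]
y_pred = [0, 0, 0, 0]  # the network always outputs 0 ("everyone is Male")
print(mse_loss(y_true, y_pred))  # 0.5
```

With those assumed labels, two of the four predictions are wrong by exactly 1, so the loss is (1 + 0 + 0 + 1) / 4 = 0.5.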
We now have a clear goal: minimize the loss of the neural network. We know we can change the network’s weights and biases to influence its predictions, but how do we do so in a way that decreases loss?
Let’s label each weight and bias in our network:
Imagine we wanted to tweak w1. How would loss L change if we changed w1? That is a question math and vector calculus can answer. You can read about it on Wikipedia in detail, but this blog will just explain the concepts behind it.
Basically, we calculate the partial derivative ∂L/∂w1. How do we calculate it?
Using backpropagation and the Chain Rule of partial derivatives, we can break it into pieces we can compute directly: ∂L/∂w1 = (∂L/∂ypred) · (∂ypred/∂h1) · (∂h1/∂w1). We want to change our weights and biases in response to the loss, and how much to change each one depends on its partial derivative. It tells us how much impact w1 has on the loss: if the partial derivative is positive, the current value of w1 is contributing to a larger L, and the larger the derivative, the bigger that impact – so we would want to decrease w1 to lower the loss L.
How do we do that? Let’s say we have our partial derivative and now want to update our weights in a way that decreases the loss. This is where Gradient Descent comes in.
STOCHASTIC GRADIENT DESCENT
We’ll use an optimization algorithm called stochastic gradient descent (SGD) that tells us how to change our weights and biases to minimize loss. It’s basically just this update equation: w1 ← w1 − η · ∂L/∂w1
η is a constant called the learning rate that controls how fast we train. All we’re doing is subtracting η · ∂L/∂w1 from the current value of the weight. If the partial derivative is positive, w1 will decrease, which makes L decrease, and vice versa.
If we do this for every weight and bias in the network, the loss will slowly decrease and our network will improve.
Our training process will look like this:
- Choose one sample from our dataset. This is what makes it stochastic gradient descent – we only operate on one sample at a time.
- Calculate all the partial derivatives of the loss with respect to the weights and biases.
- Use the update equation to update each weight and bias.
- Go back to step 1.
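The whole loop can be sketched end to end. Two caveats: the (weight, height) features and labels below are made up for illustration (the article’s dataset table is not reproduced here), and the partial derivatives are approximated with finite differences rather than the backpropagation formulas, just to keep the sketch short:

```python
import math
import random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def feedforward(params, x):
    # params = [w1..w6, b1, b2, b3]; same 2-2-1 network as above
    w1, w2, w3, w4, w5, w6, b1, b2, b3 = params
    h1 = sigmoid(w1 * x[0] + w2 * x[1] + b1)
    h2 = sigmoid(w3 * x[0] + w4 * x[1] + b2)
    return sigmoid(w5 * h1 + w6 * h2 + b3)

def loss_on(params, x, y_true):
    return (y_true - feedforward(params, x)) ** 2

# Made-up, mean-shifted (weight, height) features; label 1 = Female
data = [([-2, -1], 1), ([25, 6], 0), ([17, 4], 0), ([-15, -6], 1)]

random.seed(0)
params = [random.uniform(-1, 1) for _ in range(9)]
eta, eps = 0.1, 1e-5

for step in range(1000):
    x, y = random.choice(data)      # stochastic: one sample at a time
    grads = []
    for i in range(len(params)):    # finite-difference partial derivatives
        bumped = params[:]
        bumped[i] += eps
        grads.append((loss_on(bumped, x, y) - loss_on(params, x, y)) / eps)
    # SGD update for every weight and bias
    params = [p - eta * g for p, g in zip(params, grads)]

total = sum(loss_on(params, x, y) for x, y in data) / len(data)
print(round(total, 3))
```

In practice you would compute the gradients with the backpropagation formulas instead of finite differences; the update rule and the loop structure stay exactly the same.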
Once the loss has stopped decreasing, our neural network is trained and we are ready to get some gender predictions from it.
For example, if we have the following people, with their respective weight and height, and we want to use our trained neural network to predict their gender.
First, we input 133 and 65 for Alice and get 0.8 as output. We were expecting 0 for Male and 1 for Female, but our neural network gave us 0.8 because of the sigmoid function in the output neuron, which, as mentioned in the neuron intro, can only produce values between 0 and 1.
We can simply round the output: values up to 0.5 become 0, and values above 0.5 become 1. In this case we get a 1, which is Female – so our model seems to be doing well.
A little note about the network’s output and prediction confidence: an output near 1 means the network is very confident the given inputs belong to a Female. As the output moves away from 1 while staying above 0.5, the network is less confident in that Female prediction. The same logic applies in reverse for the Male prediction, with outputs near 0.