Forward Pass and Backpropagation in Deep Learning: DNNs and CNNs

Before delving into backpropagation, it helps to first build intuition about the forward pass through a neural network.

The Forward Pass

The Forward Pass in Deep Neural Networks

A deep neural network is built from one or more nodes (neurons), each of which is fed values and produces a single output value. In Figure 1 below, x1, x2 and x3 are all input values that are fed into a single neuron, Node.

Figure 1

The resulting output value is as follows:

1. As x1, x2 and x3 are fed through Node, each of them is multiplied by a respective weight. This weight can start out as any value. The goal of deep learning is to keep adjusting these weights as more values are passed through the node until suitable weight values are attained.
2. The Xs multiplied by the Ws are then summed up as follows: Z = (x1 * w1) + (x2 * w2) + (x3 * w3)

3. The sum of products, Z, is then fed to a function f(Z) called an activation function. In its simplest sense, it is a non-linear transformation of Z. To demonstrate, consider the sigmoid function, f(Z) = 1 / (1 + e^(-Z)): when fed the value Z, it returns another value that is a non-linear transformation of Z. This allows the neural network to learn non-linear patterns in data, such as images and text.

4. The output of the sigmoid, a in Figure 1, is called the activation value; this value can be passed into numerous other neurons in the following layer. A neuron can also have multiple inputs and outputs. The activations are calculated in the same fashion in Figure 2.

Figure 2
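The steps above can be sketched in a few lines of Python. This is only a minimal illustration: the input and weight values are made up (they are not the ones in Figure 1), and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_forward(inputs, weights):
    """Return (z, a): the weighted sum of the inputs and its sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights))
    return z, sigmoid(z)


# Hypothetical values for x1..x3 and w1..w3 (assumed for illustration only).
xs = [1.0, 2.0, 3.0]
ws = [0.5, -0.25, 0.1]

z, a = neuron_forward(xs, ws)
print("Z =", z)
print("activation a =", a)
```

The activation a would then be passed on as an input to the neurons of the next layer, each of which repeats the same multiply, sum and activate steps with its own weights.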

Forward pass in convolutional neural networks is depicted visually here.

The Loss

The goal of the forward pass is to generate a prediction and compare it against the ground truth. The goal of machine learning and deep learning, after all, is to generate predictions that match real ground truth values. These predictions can be a binary prediction of whether it will rain tomorrow or not (assigned values 1 and 0 respectively), or whether an object in an image is a car or a pedestrian.

Figure 3

The difference between the neural network's prediction and the ground truth value, which we already have stored from data we've collected, is called the loss (or sometimes the error, intuitively enough). If our predictions are far off from the ground truth value, the loss is large. The closer the prediction is to the ground truth value, the lower the loss. The goal of machine learning and deep learning is to minimize this loss. Alternatively, the goal is to maximize the negative value of the loss, since minimizing the loss is the same as maximizing the negative of the loss: the same weights achieve both.

argmax(-loss) == argmin(loss)
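The equivalence between minimizing the loss and maximizing its negative can be checked numerically. The sketch below assumes a hypothetical one-weight model and a squared-error loss, neither of which comes from the figures above; it simply searches a handful of candidate weights both ways.

```python
# Hypothetical one-weight model: prediction = w * x, with a squared-error loss.
x, y_true = 2.0, 6.0
candidates = [i * 0.5 for i in range(11)]  # w in 0.0, 0.5, ..., 5.0


def loss(w):
    """Squared difference between the ground truth and the prediction w * x."""
    return (y_true - w * x) ** 2


# The weight that minimizes the loss ...
best_by_min = min(candidates, key=loss)
# ... is the same weight that maximizes the negative of the loss.
best_by_max = max(candidates, key=lambda w: -loss(w))

print(best_by_min, best_by_max)
```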

This loss is then used to adjust the weights (Ws in the figures above) in a process called backpropagation, which we'll discuss next. Once those are adjusted, we're ready to pass in a new set of data (Xs in the figures above) in a new forward pass. The resulting activations and ultimately the output/prediction value will differ due to the adjusted weights, as well as the different input values. However, it is expected that the loss resulting from the second forward pass will be less than that resulting from the first forward pass, meaning the neural network is making predictions that are closer to the ground truth values.

Forward Pass in Convolutional Neural Networks

The above approach to processing input values using deep learning does not work well for image data. Image data takes the form of a matrix of values that represent the pixels along the width and height of an image. There are also multiple matrices stacked on top of each other, each matrix representing one of 3 colors: red, green and blue. This article details digital image channel information further.

Figure 4

Convolution requires an image channel to be passed through a filter. This filter is itself a collection of weights, like those we've seen above leading up to each neuron node.

Figure 5

Such weights are what help extract the important features. In the case of upscaling, the features concern the ability to expand a single pixel into 4 pixels while maintaining the same look of the image. To clarify, the orange matrix is a filter of dimensions 3 by 3 whose weights are multiplied element-wise by the original image (green) in a sliding window fashion. Each element in the weight filter is multiplied by the corresponding element in the sliding window of the original image. In the top left cell of the pink matrix (output image) in Figure 5, the value 4 is calculated as follows:

The sum of the products is the value of the new pixel in the output image. We can also pass this value through an activation function like the sigmoid activation function discussed above, resulting in an activation value of 0.98:

This is repeated by sliding the white window over by a stride of 1 on the original image (green), which, when multiplied by the weights, results in the value 3 on the output image (pink), and so forth until the output image is completely filled. To determine the size of the output image and how convolution works when the input image has multiple channels, see this article.
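The sliding-window calculation can be sketched in plain Python. The 5 by 5 image and 3 by 3 filter below are assumed values chosen so that the first two output pixels come out as 4 and 3, matching the numbers quoted in the text; they are not read from Figure 5, and the function names are hypothetical. Note that, as is common in deep learning, the filter is applied without flipping it.

```python
import math


def sigmoid(z):
    """Sigmoid activation, as discussed for the single neuron."""
    return 1.0 / (1.0 + math.exp(-z))


def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and sum element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Assumed single-channel image (green) and weight filter (orange).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)  # output image (pink)
for row in feature_map:
    print(row)

# Optional activation applied to the top-left output pixel.
print(round(sigmoid(feature_map[0][0]), 2))
```

A 5 by 5 input with a 3 by 3 filter and a stride of 1 yields a 3 by 3 output, since the window fits in three positions along each axis.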

The output image is then compared to the ground truth image, and the pixel-by-pixel difference is the loss value, as in the case of deep neural networks above. The loss function could be something other than the simple difference. In Figure 6 below, the loss function is the mean squared error. Convolutional neural networks could also use non-linear activation functions, as in the case of the deep neural networks. In such a case, the activation is applied individually to each pixel of the output image (pink). Finally, the output image doesn't have to be the final image. Instead, we could pass the output image through a new filter (orange) to produce yet another output image. To see an example of this, the article series on the ESRGAN demonstrates how an input image is convolved many times to generate a higher resolution image.

Figure 6
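As a rough illustration of a mean squared error loss between an output image and its ground truth, consider the sketch below. The two small 2 by 2 images are made-up values, not the ones in Figure 6.

```python
def mean_squared_error(predicted, target):
    """Average of the squared pixel-by-pixel differences between two images."""
    squared_diffs = [
        (p - t) ** 2
        for predicted_row, target_row in zip(predicted, target)
        for p, t in zip(predicted_row, target_row)
    ]
    return sum(squared_diffs) / len(squared_diffs)


# Hypothetical output image and ground truth image.
predicted_image = [[4, 3], [2, 4]]
ground_truth = [[5, 3], [2, 2]]

mse = mean_squared_error(predicted_image, ground_truth)
print(mse)
```

Squaring the differences keeps positive and negative errors from cancelling each other out, and penalizes large errors more than small ones.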

Backpropagation

Backpropagation seeks to adjust the weights in a neural network, be they in fully connected layers or convolutional layers, so that the following prediction is closer to the ground truth value. Ultimately, the two variables affecting predictions from one test data point to another are:

the value of the data point, or
the weights in the neural network

Other factors also affect the prediction but are generally fixed, like the activation functions or the loss functions that do not change throughout training (other values could be tweaked, throughout training a neural network, but this tweaking is usually controlled and done according to the developer's judgement or through a schedule).

That said, the first point, the value of the input data point (be it a financial or geographical data point), changes from one observation to the next, and the only control the developer has over it is to collect more diverse or better-quality data points, or to augment the input data using fixed algorithms such as image transformations or scaling of financial figures. The weights, on the other hand, are the focus of training neural networks, as their values are adjusted throughout training through the process of backpropagation. The weights in essence represent an estimate of the aggregate features learned during training across all possible input data. In other words, any input data point transformed by the weights in a neural network should yield an accurate prediction.

Figure 7

Backpropagation is the process of adjusting neural network weights based on the loss. Given the loss value, how much is the loss value affected by neuron node 1? The weight at neuron node 1 is adjusted based on that value. The adjustment to the weights in a neural network is then based on each individual weight's contribution to the final loss value. Put differently, the adjustment to a weight i is based on the rate of change of the loss relative to the change in weight i. This is the very definition of a derivative. Hence the derivative of the final loss value with respect to weight i tells us in which direction, and by how much, weight i should change to reduce its contribution to the loss (in realistic terms, there will always be a loss since we cannot reproduce real world data precisely, hence weight i's contribution to the final loss is to be minimized rather than eliminated completely). As a side note, deep learning frameworks like Torch and Tensorflow record the operations applied to each weight during the forward pass above, which is what allows them to calculate the derivative of the final loss with respect to each weight during the backward pass.

Backpropagation in Deep Neural Networks

For deep neural networks, values are accumulated at nodes as the sum of products of weights and values (be they the initial inputs, x1 to xi, or the intermediate activations, as in Figure 3). A derivative (also called the gradient) is calculated for the output of each node (Z in Figure 8 below). Technically, Z first passes through an activation (represented by the circle with the sloped line), hence the derivative of the activation is pertinent. Using an actual activation function like the Sigmoid or ReLU would increase the complexity of this tutorial, so for now we'll use a very simple activation: f(x) = x; no transformation is done, and whatever value the activation receives, it spits out. It's still an activation, so it serves our purposes for understanding backpropagation while simplifying the demonstration.

The loss value, l, is the difference between the predicted output (y1 hat) and the actual output (y1). The change in the value l, as stated earlier, is directly affected by the change in the intermediate outputs (a, a1, a2, a3 in Figure 3), which are affected by the change in the weights (w1, w2, w3) and the inputs (x1, x2, x3). The most pertinent values are the weights themselves since we do not control the values of the inputs and intermediate outputs (x1,...,xi; a1,...,ai respectively). Hence the change in intermediate outputs, generalized as Z, with respect to the change in weights, generalized as w, and the change in loss, l, with respect to the change in intermediate outputs, Z, are multiplied to give us the derivative at weight i. This multiplication process is called the chain rule, which is attributed to the German mathematician Gottfried W. Leibniz.

Furthermore, the change in loss is actually measured against individual weights, e.g., the effect of the change in weight w1 on the loss, l. This is referred to as a partial derivative and is indicated by the symbol ∂.
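Written out in LaTeX, the chain rule described above takes the following form. This is a reconstruction from the prose rather than a copy of the original figure, and it assumes the simple activation f(Z) = Z, so no separate activation term appears.

```latex
% Gradient of the loss l with respect to weight w_i:
% (change in loss w.r.t. Z) times (change in Z w.r.t. w_i)
\frac{\partial l}{\partial w_i}
  = \frac{\partial l}{\partial Z} \cdot \frac{\partial Z}{\partial w_i}
```

With a non-trivial activation a = f(Z), an extra factor for the derivative of the activation with respect to Z would sit between the two terms.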

Figure 8

Applying this to a realistic example, say stock prices. If we wanted to predict the value of a stock price for a certain day, say day t, we'd need to use information from the preceding days (day t-1, t-2, t-3, etc.) to predict the price on day t.

Suppose we collect data on the stock price and the underlying company's market share on day t-1, and earnings for the preceding year (earnings at day t less the number of days since the beginning of the year, i.e., earnings at the end of last year). Let those be our input values. We wish to predict the stock price at day t. The gradient at weight 1 or weight 2 will equal:

Where:

Figure 9

Let's assign values to our inputs (note that the weights w1 and w2 start out at arbitrary initial values, here 0.2 and 0.5):

x1 (market share as of day t - 1 in $ millions) = 20
x2 (earnings as of the end of last year in $ millions) = 2
w1 = 0.2
w2 = 0.5
z (intermediate output) = 0.2 x 20 + 0.5 x 2 = 5
activation f(x) = x: f(5) = 5
y1 hat (predicted value of stock at time t) = $5
y1 (actual value of stock at time t) = $6
l (loss) = 6 - 5 = $1

Normally, when training a deep neural network, multiple observations are passed into the network at once (an observation is also called an instance; in convolutional neural networks, an observation/instance is a single image, so multiple observations imply multiple images passed into the neural network at once). The above is a single observation, so let's add another. The 2nd observation takes the market share from time t - 2, and likewise earnings as of the end of last year (note that the weights are the same as before):

x1 (market share as of day t - 2 in $ millions) = 15
x2 (earnings as of the end of last year in $ millions) = 2
z (intermediate output) = 0.2 x 15 + 0.5 x 2 = 4
activation f(x) = x: f(4) = 4
y1 hat (predicted value of stock at time t) = $4
y1 (actual value of stock at time t) = $6
l (loss) = 6 - 4 = $2

The loss function is the difference between the predicted and actual value, but when multiple observations are used during training, we take the average loss.

average loss = (loss from observation 1 + loss from observation 2) ÷ 2

So, the average loss from the above observations is simply: (1 + 2) ÷ 2 = 1.5
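The forward pass and average loss for the two observations can be reproduced with a short script. It uses the inputs, weights and actual price listed above, with the identity activation f(x) = x; the variable and function names are hypothetical.

```python
# Weights shared by both observations.
weights = [0.2, 0.5]

# Each observation: ([x1, x2], actual stock price y1).
observations = [
    ([20.0, 2.0], 6.0),
    ([15.0, 2.0], 6.0),
]


def predict(inputs, weights):
    """Weighted sum Z passed through the identity activation f(Z) = Z."""
    return sum(x * w for x, w in zip(inputs, weights))


# Loss per observation: actual value minus predicted value.
losses = [actual - predict(inputs, weights) for inputs, actual in observations]
average_loss = sum(losses) / len(losses)

print(losses)
print(average_loss)
```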

We can now proceed to calculate the gradients (derivatives) as follows:

Observation (instance) 1 results in the following partial derivatives of the loss:

Observation (instance) 2 results in the following partial derivatives of the loss:

Note that in our examples, we're using a linear activation that takes input Z and spits out the same value Z (i.e., f(x) = x, f(Z) = Z). The derivative of this activation is 1, so it drops out of the chain rule, and because Z is a weighted sum of the inputs, the partial derivative of Z with respect to weight 1 or weight 2 is simply the corresponding input value, x1 or x2. At last, we can adjust weight 1 based on the two observations as follows: new w1 = w1 - α * (∂l1 + ∂l2) ÷ 2

Where,

α is the learning rate between values 0 and 1 (since we don't want to affect the weights too drastically)
∂l1 is the rate of change of loss from observation 1 (partial to w1)
∂l2 is the rate of change of loss from observation 2 (partial to w1)

Hence, the new weight 1 is the existing weight 1 minus the average rate of change (gradient) of the loss across our 2 observations with regard to weight 1, scaled by the alpha factor. The same logic is applied for the updated weight 2.
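To make the update concrete, here is a minimal sketch of one gradient step on the two observations. It assumes the loss l = y1 - y1 hat and the identity activation used in this example, so that ∂l/∂Z = -1 and ∂Z/∂wi = xi. The learning rate of 0.001 is a hypothetical choice (the text only says it lies between 0 and 1), kept small because a plain signed difference can overshoot past zero.

```python
# Each observation: ([x1, x2], actual stock price y1).
observations = [
    ([20.0, 2.0], 6.0),
    ([15.0, 2.0], 6.0),
]
weights = [0.2, 0.5]
learning_rate = 0.001  # alpha: assumed value, not given in the text


def average_loss(current_weights):
    """Mean of (actual - predicted) using the identity activation f(Z) = Z."""
    losses = []
    for inputs, actual in observations:
        z = sum(x * w for x, w in zip(inputs, current_weights))
        losses.append(actual - z)
    return sum(losses) / len(losses)


def loss_gradients(inputs):
    """Chain rule for l = y - Z: dl/dZ = -1 and dZ/dw_i = x_i."""
    dl_dz = -1.0
    return [dl_dz * x for x in inputs]


# Average each weight's gradient over the observations.
per_observation = [loss_gradients(inputs) for inputs, _ in observations]
average_gradients = [sum(column) / len(column) for column in zip(*per_observation)]

# Gradient step: new weight = old weight - alpha * average gradient.
new_weights = [w - learning_rate * g for w, g in zip(weights, average_gradients)]

print(average_gradients)
print(new_weights)
print(average_loss(weights), average_loss(new_weights))
```

Both gradients are negative, so subtracting them nudges the weights upward, which raises the predictions toward the actual price of $6 and lowers the average loss on the next forward pass.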

Backpropagation in Convolutional Neural Networks

The following features a number of diagrams and equations from Pavithra Solai's popular Medium article on backpropagation: https://pavisj.medium.com/convolutions-and-backpropagations-46026a8f5d2c.

Backpropagation in convolutional neural networks is slightly more involved than that of deep neural networks, so you are highly advised to review the first part of the article discussing the forward pass and the article on convolutions.

Backpropagation in convolutional neural networks involves calculating the partial derivatives (gradients) of each pixel of the output feature map with respect to the weights (represented by the filter), just as in deep neural networks. It then requires calculating the partial derivative of the loss (difference between the ground truth image and predicted image) with respect to the output feature map. The product of the two terms represents the influence of weight i in the filter on the loss.

Consider the following simple convolutional operation involving a single channel 3 by 3 input feature map, a 2 by 2 weight filter, and a 2 by 2 output feature map.

Figure 10

Let L represent the loss calculated from the forward pass, O be the output feature map, and F be the filter. Further, let ∂L be the change in the loss due to the current forward pass, ∂F be the change in the weights in the filter of concern, and ∂O be the change in the output feature map from the current forward pass. The rate of change of the loss, L, relative to the rate of change of weights in filter F can be represented as:

This can be represented by indexes of gradients for each weight in filter F, where i is the i-th element in filter F (elements F11, F12, F21, F22), and where k indexes the elements of the output feature map (M elements in total: O11, O12, O21, O22):

Hence, change in loss value L (a single value) relative to the change in filter weight i, is the sum of products of (1) the change in loss w.r.t. the change in output element k, (2) the change in output element k w.r.t. the change in filter value i, Fi, of interest, for a total of M products, that are summed up (k=1,...,M); for a total of 4 filter weights.
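In LaTeX, the two relations described above can be written as follows. These are reconstructed from the prose, using the same symbols (L, O, F, i, k and M); they are not copied from the original figures.

```latex
% Overall relation: loss gradient w.r.t. the filter
\frac{\partial L}{\partial F}
  = \frac{\partial L}{\partial O} \cdot \frac{\partial O}{\partial F}

% Per filter weight F_i, summed over the M output elements O_k
\frac{\partial L}{\partial F_i}
  = \sum_{k=1}^{M} \frac{\partial L}{\partial O_k} \cdot \frac{\partial O_k}{\partial F_i}
```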

What is evident is that the gradient of the loss w.r.t. each filter weight (F11, F12, F21, F22) is affected by the gradient of every output (O11, O12, O21, O22) w.r.t. the filter weight i (i.e., the ∂O/∂F terms) and the gradient of the loss, L, w.r.t every output element.

The ∂O/∂F term can be replaced by the input value X since X * ∂F = ∂O, i.e., the change in the weight multiplied by the input is indeed the change in the output, naturally, since nothing else influences the value of the output except the input value X and filter value F.

We can now adjust the filter weights using the ∂L/∂F terms above.
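The filter gradients for the 3 by 3 input and 2 by 2 filter can be sketched as below. All numbers are hypothetical: the input X, the filter F, the upstream gradients ∂L/∂O and the learning rate are assumed purely for illustration. The sketch computes each ∂L/∂F term as the sum over output elements of ∂L/∂O times the input pixel that the filter weight touched, and then checks that this is the same as sliding ∂L/∂O over X like a filter.

```python
def slide(x, k):
    """Slide k over x (stride 1, no padding), summing element-wise products."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(x[r + i][c + j] * k[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def filter_gradient(x, dL_dO, fh, fw):
    """dL/dF[i][j] = sum over outputs (r, c) of dL/dO[r][c] * X[r + i][c + j]."""
    out_h, out_w = len(dL_dO), len(dL_dO[0])
    return [
        [
            sum(dL_dO[r][c] * x[r + i][c + j] for r in range(out_h) for c in range(out_w))
            for j in range(fw)
        ]
        for i in range(fh)
    ]


# Hypothetical input feature map, filter and upstream gradients dL/dO.
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
F = [[1.0, 0.0], [0.0, 1.0]]
dL_dO = [[1, 2], [3, 4]]

dL_dF = filter_gradient(X, dL_dO, 2, 2)
print(dL_dF)

# The same gradients result from sliding dL/dO over X.
print(slide(X, dL_dO) == dL_dF)

# Gradient step on the filter weights with an assumed learning rate.
learning_rate = 0.01
F_new = [[F[i][j] - learning_rate * dL_dF[i][j] for j in range(2)] for i in range(2)]
print(F_new)
```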

In this simple example, there is a single convolution. If there are multiple convolutional layers, we'll need to calculate the ∂O/∂X term for the current convolutional layer so that the layer before it can use it to calculate the ∂L/∂F terms for that layer. Put differently, the current inputs to the convolutional layer were the result of a previous convolution which had its own filter; we cannot calculate the updated filter weights for that layer unless we get the gradients for the inputs to the current convolutional layer, i.e., how the inputs to the current convolutional layer affect outputs O11, O12, O21, O22.

But we're not done just yet. Recall the ∂O/∂F term that is critical to calculating the ∂L/∂F terms. We'll need to generate the equivalent term for the current inputs (X11, X12,..., X33) for the benefit of the layer before it. Figure 11 demonstrates this point. Consider the case where the input feature map of Xs was not the original input feature map but instead just an intermediate output feature map from a previous convolution. The original input feature map is instead represented by pixel values U.

Figure 11

The question that arises is this: if there was more than one convolution, we've only handled backpropagation through the current convolution indicated by the black rectangle. Convolution layers before it have filters that also need their weights adjusted. That requires knowing ∂X/∂F-1, where the feature map of Xs is the new output feature map instead of O, and F-1 is the filter of concern instead of F. As is apparent, a gradient from the current convolution is also required for the previous convolution; namely, ∂L/∂X is required to calculate ∂L/∂F-1. We now have to solve for ∂O/∂X.

Figure 12

For every element of X (X11, X12,..., X33), the change in loss w.r.t. the change in X can be represented as a sum of products of (1) the change in loss w.r.t. the change in output k and (2) the change in output k w.r.t. the change in the element of X of concern, resulting in M products (k=1,...,M) that are summed up.

Specifically, w.r.t elements of X (X11, X12,..., X33), only certain elements of the filter affect the change in loss w.r.t the change in the element of X for which we want the gradient. For example, X11 is an element of the feature map X that is only ever passed once through a filter and only interacts with filter element F11 as in Figure 13.

Figure 13

The following details the gradients for each feature map element of X:

A concise way to go about calculating the above values is to, first, calculate the ∂L/∂O terms as we've done before, then flip the filter F (rotate it by 180 degrees), and finally perform a full convolution (one in which ∂L/∂O is zero-padded) of ∂L/∂O with the flipped filter. This should result in a 3 by 3 matrix of gradients ∂L/∂X:

Figure 14
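The flipped-filter shortcut can be checked against the element-by-element definition with a small sketch. The filter F and the upstream gradients ∂L/∂O below are hypothetical values, and the function names are made up for this illustration. The first function accumulates, for every output element, the filter weight that connected it to each input pixel; the second zero-pads ∂L/∂O and slides the 180-degree-rotated filter over it.

```python
def input_gradient_direct(dL_dO, f, in_h, in_w):
    """Accumulate dL/dX[r+i][c+j] += dL/dO[r][c] * F[i][j] over all outputs."""
    grad = [[0] * in_w for _ in range(in_h)]
    for r in range(len(dL_dO)):
        for c in range(len(dL_dO[0])):
            for i in range(len(f)):
                for j in range(len(f[0])):
                    grad[r + i][c + j] += dL_dO[r][c] * f[i][j]
    return grad


def input_gradient_full_conv(dL_dO, f):
    """Zero-pad dL/dO and slide the 180-degree-rotated filter over it."""
    fh, fw = len(f), len(f[0])
    flipped = [row[::-1] for row in f[::-1]]
    pad_h, pad_w = fh - 1, fw - 1
    width = len(dL_dO[0]) + 2 * pad_w
    padded = (
        [[0] * width for _ in range(pad_h)]
        + [[0] * pad_w + list(row) + [0] * pad_w for row in dL_dO]
        + [[0] * width for _ in range(pad_h)]
    )
    out_h = len(padded) - fh + 1
    out_w = width - fw + 1
    return [
        [
            sum(padded[m + a][n + b] * flipped[a][b] for a in range(fh) for b in range(fw))
            for n in range(out_w)
        ]
        for m in range(out_h)
    ]


# Hypothetical 2 by 2 filter and upstream gradients dL/dO.
F = [[1, 0], [2, -1]]
dL_dO = [[1, 2], [3, 4]]

direct = input_gradient_direct(dL_dO, F, 3, 3)
shortcut = input_gradient_full_conv(dL_dO, F)
for row in shortcut:
    print(row)
print(direct == shortcut)
```

In this sketch the top-left gradient, for X11, is just ∂L/∂O11 times F11, which agrees with the observation that X11 only ever interacts with filter element F11.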

We're now ready to calculate ∂L/∂F-1 and update the weights of filter F-1. It is now a matter of rinsing and repeating.

If you got through all that, consider yourself special: you now know details about the process of the forward pass and backpropagation that many experienced industry practitioners do not understand quite as thoroughly. There are further elements in the forward pass and the backpropagation process that were not covered, such as batch normalization. Such aspects are left out here to maintain focus on the most important elements of backpropagation. Future blog posts will tackle other aspects of the forward pass and backpropagation.
