Convolutions
- Mohamed Benaicha
- Nov 28, 2023
- 4 min read
Neural Networks
A neural network is a network of transformations that input data passes through to yield a desired output. Let's take the example of a simple image comprising 1 pixel and 1 channel, a simple neural network that has a single neuron, and a desired output, which is another image comprising a single pixel and a single channel:

Figure 1
A pixel of value 32 must be multiplied by some value w, called the weight, to achieve the target value of 82. This value, for the image in question (i.e., the image that is a single pixel with a value of 32), is simply 2.5625. But the goal is to be able to map multiple pixels to multiple values, which is why multiple weights and neurons are required. The new neural network looks something like this:

Figure 2
We now have to solve for the values of 9 weights so that, whatever pixels are passed in, the weights can produce the target upscaled image.
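As a minimal sketch of what this means (using NumPy, with made-up pixel and target values, and assuming for simplicity one weight per pixel as in Figure 2), the snippet below solves for the single weight of Figure 1 and then for nine such weights at once:

```python
import numpy as np

# Single-pixel case from Figure 1: find the weight w that maps 32 to 82.
pixel, target = 32.0, 82.0
w = target / pixel           # 2.5625
print(pixel * w)             # 82.0

# Nine-pixel case (hypothetical values), one weight per pixel as in Figure 2.
pixels  = np.array([32.0, 10.0, 55.0, 200.0, 17.0, 90.0, 64.0, 128.0, 5.0])
targets = np.array([82.0, 25.0, 140.0, 510.0, 43.0, 230.0, 164.0, 327.0, 13.0])
weights = targets / pixels   # the values a training procedure would have to find
print(np.allclose(pixels * weights, targets))   # True
```

In a real network the weights are not solved for directly like this; they are learned gradually during training, but the goal is the same.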
In the case of upscaling, an input image of 4 by 4 pixels must produce an output image of 8 by 8 or 16 by 16 pixels, hence requiring a lot more weights to be learned. The ESRGAN that is used as the model for the CG Texture Upscaler contains upwards of 16 million such weights. The network part of neural networks becomes more manifest as the task increases in complexity.

Figure 3
Convolutions
We've seen how simple input data in the form of pixel values is multiplied by weights, with the results output as the new pixel values. The CG Texture Upscaler also relies heavily on convolutional layers. The initial RGB channels are transformed so that the most important features of the image are extracted. These features are then sent through the neural network as depicted in the diagrams above.
Briefly, convolution requires an image channel to be passed through a filter. This filter is itself a collection of weights like those we've seen above. Such weights are what help extract the important features. In the case of upscaling, the features concern the ability to expand a single pixel into 4 pixels while maintaining the same look of the image. To clarify, the orange matrix in Figure 4 is a filter of dimensions 3 by 3 whose entries are weights that are multiplied by the original image (green) in a sliding-window fashion. The output image (pink) is the new feature map.



Figure 4
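To make the sliding-window idea concrete, here is a minimal NumPy sketch of the operation (the input and filter values are made up; a 5 by 5 input and a 3 by 3 filter are assumed, matching the layout of Figure 4):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image; each output pixel is a sum of products.
    k = kernel.shape[0]
    n_out = (image.shape[0] - k) // stride + 1
    out = np.zeros((n_out, n_out))
    for i in range(n_out):
        for j in range(n_out):
            window = image[i*stride:i*stride + k, j*stride:j*stride + k]
            out[i, j] = np.sum(window * kernel)
    return out

# A 5 by 5 input channel (green in Figure 4) and a 3 by 3 filter (orange); values made up.
image  = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])
print(convolve2d(image, kernel))   # the 3 by 3 feature map (pink in Figure 4)
```

Each value in the resulting feature map is simply the sum of the 9 products of the filter weights with the pixels currently under the filter.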
Feature Map Dimensions
To get the dimensions of the output feature maps, we need to know:
The dimensions of the input feature map (n_in, green in Figure 4), including width, height, and the number of input feature maps (there's only 1 in Figure 4)
The number of filters (a.k.a. kernels) that are intended for use (only 1 in Figure 4, i.e., the sliding orange box) and the size k of each filter
The padding (p) and stride (s) to use. The padding adds blank pixels around the image so that the sliding box (the filter, orange in Figure 4) is able to move according to the stride, picking up the exact number of pixels without overshooting the edge of the image.

Figure 5
The output feature map dimensions can be calculated using the following:
n_out = ((n_in + 2 x p - k) ÷ s) + 1
The padding and stride sizes are generally fixed according to a default rule. The kernel size k is also something we define. For the example in Figure 5,
n_out = ((3 + 2 x 1 - 2) ÷ 1) + 1 = 4
The output feature map is then 4 by 4 pixels. The output feature map is larger than the input feature map due to the padding that was added. We can output multiple feature maps by increasing the number of filters (not k; k is the dimensions of each filter). Having 3 filters results in three 4-by-4 feature maps.
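As a quick check, a hypothetical helper that evaluates the formula above and applies it to the Figure 5 numbers might look like this:

```python
def feature_map_size(n_in, k, p=1, s=1):
    # n_out = ((n_in + 2 x p - k) / s) + 1
    return (n_in + 2 * p - k) // s + 1

# Figure 5: a 3 by 3 input, 2 by 2 kernel, padding 1, stride 1.
print(feature_map_size(n_in=3, k=2, p=1, s=1))   # 4, i.e. a 4 by 4 feature map
```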
Multiple Filters
This process is repeated with other filters that we create. The training process determines the values in the filters. Each filter learns to pick out certain features and generate a new feature map. Generally, each filter outputs 1 feature map: 64 filters should output 64 feature maps. Below, 3 filters produce 3 feature maps, or 3 channels. These could be the RGB channels that we write out as the new image, or they could be the base set of channels to be processed through another set of filters. This entire process is called convolution and is done many times in large neural networks such as the ESRGAN used in the CG Texture Upscaler.

Figure 6
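The sketch below illustrates the one-feature-map-per-filter behaviour using SciPy's correlate2d, which performs the same sliding-window sum of products as above (the input and filter values here are random placeholders):

```python
import numpy as np
from scipy.signal import correlate2d

# One 6 by 6 input channel and three 3 by 3 filters (placeholder values).
image   = np.random.rand(6, 6)
filters = [np.random.rand(3, 3) for _ in range(3)]

# Each filter is slid over the same input and yields its own feature map.
feature_maps = [correlate2d(image, f, mode='valid') for f in filters]
print(len(feature_maps), feature_maps[0].shape)   # 3 feature maps, each 4 by 4
```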
Multiple Input Feature Maps
For multiple input feature maps (3 in Figure 7) - not the same as Figure 6, which has only 1 input feature map, 3 filters, and 3 output feature maps - there is 1 filter per input feature map (3 in the figure below), yet together they still produce only a single output feature map. If another output feature map is desired, another set of 3 filters is required. In general, for each additional output feature map, another set of filters matching the number of input feature maps is used. As with the single filter for a single feature map in Figure 4, the sum of products is the resulting new value of the pixel. For Figure 7, that would be 9 values per input feature map and 9 values per filter across 3 filters, for a total of 243 products (9 x 9 x 3).


Figure 7
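A small sketch of that combination step, again with placeholder values: three input feature maps, one filter per input feature map, and the per-filter results summed element-wise into a single output feature map.

```python
import numpy as np
from scipy.signal import correlate2d

# Three 6 by 6 input feature maps and one set of three 3 by 3 filters (placeholder values).
inputs  = [np.random.rand(6, 6) for _ in range(3)]
filters = [np.random.rand(3, 3) for _ in range(3)]

# Each filter is applied to its own input feature map; the three results are then
# summed element-wise to form a single output feature map.
output = sum(correlate2d(x, f, mode='valid') for x, f in zip(inputs, filters))
print(output.shape)   # (4, 4): one output feature map

# A second output feature map would require a second set of three filters.
```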