
A Deep Dive into the ESRGAN (Part 1: the Discriminator)

  • Writer: Mohamed Benaicha
  • Nov 28, 2023
  • 7 min read

Background


The ESRGAN is a neural network that uses convolutions to learn to create a higher resolution version of a lower resolution image. It does this using two distinct architectures: a generator and a discriminator.


When learning (training), the ESRGAN uses the generator architecture to produce a higher resolution image. The generator's image is then passed to the discriminator architecture. The discriminator has to determine whether the image it receives is a valid upscale of the original image or not. If the discriminator deems the image fit, the generator has learned to produce good upscaled images. If the discriminator determines that the image is a poor upscale of the original lower resolution image, the generator adjusts itself by adjusting the values of its convolutional kernels (discussed here) to produce better upscales.


Figure 1


The discriminator is quite intuitive. Important to understanding the architecture of the discriminator is understanding how convolutions work, which is discussed succinctly here under the section on convolutions. What's important to know is that when an image is convolved, the original dimensions and the number of channels resulting from the convolution may or may not change depending on the user's choice. The channels resulting from the convolution are more appropriately referred to as feature maps. In the case of the example in Figure 2, the convolution resulted in 6 feature maps, each with width and height dimensions of 512 by 512.

Figure 2


Convolutions
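
To make this concrete, here is a minimal sketch (in PyTorch, an assumption on my part since the article shows no code) of the convolution from Figure 2: a 512 by 512 image with 3 channels is convolved into 6 feature maps of the same width and height. A 3 by 3 kernel with a padding of 1 is assumed, which keeps the spatial dimensions unchanged.

import torch
import torch.nn as nn

# One 512x512 RGB image: batch of 1, 3 channels, 512x512 pixels
image = torch.randn(1, 3, 512, 512)

# Convolve the 3 input channels into 6 feature maps; kernel size 3 with
# padding 1 leaves the width and height unchanged
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3, padding=1)

feature_maps = conv(image)
print(feature_maps.shape)   # torch.Size([1, 6, 512, 512])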




Max Pooling


Another important concept is the idea of pooling. It simply refers to downscaling an image (i.e., feature map) using a certain approach. An example is max pooling, where the maximum of a 2 by 2 patch of pixels is taken and set as the pixel of the new post-pooling feature map, as demonstrated in Figure 3. The resulting number of feature maps stays the same since pooling is applied to each map separately in this case, but it could technically be applied across multiple feature maps, where the pixel on the new feature map is created by taking a 2 by 2 patch from 3 feature maps for example, i.e., 2 by 2 by 3 (or 12) total pixels are pooled into a single pixel having the maximum value out of all of them.



Figure 3


This can be expressed as a formula as follows (written here for a 2 by 2 window moved with a stride of 2):

h(l)[x, y] = max over i, j of h(l-1)[2x + i, 2y + j]

Where

  • x, y represent the current pixel coordinates on the output map (pixel 0,0, which has a value of 8)

  • h(l) represents the value of the current pixel on layer l (the current max pooling layer whose value we're trying to determine, named Output in Figure 3)

  • i, j represent the range of the horizontal and vertical slider positions (in Figure 3, i = j = 0,..,1 inclusive)

  • h(l-1) refers to the values of the previous layer from which we're pooling (named Input in Figure 3)


Hence, the pixel at the current max pooling layer (l) at pixel 0,0 (which contains the value 8) is the max of the previous layer's (l-1) pixels (0 to 1), (0 to 1) (which contain the values 7, 3, 8, 7).


An alternative to max pooling could be average pooling where the values in the slider are averaged with equal weights into a single value.
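
Here is a small PyTorch sketch of 2 by 2 max pooling and average pooling. The top-left patch of the input uses the values from Figure 3 (7, 3, 8, 7); the remaining values are made up for illustration.

import torch
import torch.nn as nn

# A single 4x4 feature map (shape: batch of 1, 1 channel, 4x4 pixels)
x = torch.tensor([[[[7., 3., 1., 2.],
                    [8., 7., 4., 0.],
                    [2., 5., 6., 1.],
                    [9., 4., 3., 2.]]]])

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(x))   # top-left output pixel is max(7, 3, 8, 7) = 8
print(avg_pool(x))   # top-left output pixel is (7 + 3 + 8 + 7) / 4 = 6.25

In both cases the number of feature maps is unchanged; only the width and height are halved.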


The Discriminator

The discriminator is tasked with producing a prediction as to whether the image produced by the generator is good or not. In the case of the ESRGAN, the discriminator is tasked with predicting the difference between the generator-upscaled image and the original higher resolution image from our training data. This difference is called a loss. The lower this loss, the better, since it means the generator is upscaling images that look close to the authentic higher-resolution image. This is discussed further in Part 1 of this article series.



The Discriminator Architecture: the VGGNet


The VGGNet is a name for a variety of deep convolutional neural network architectures that share similar features. The CG Texture Upscaler uses an architecture similar to the VGG19. The VGG19 is a common architecture that has served as the basis for many architectures that came after it.


Figure 4



The VGG19 uses convolutions to gradually shrink an image's dimensions while increasing the number of feature maps. The features contained in an image's width and height are essentially extracted into a great number of feature maps. These feature maps become numerous and small in dimensions, capturing very specific features of an image, as demonstrated by the orange block in Figure 4 above, under section 7.


  • Image input (section 1): the VGG19 expects images of 224x224 or 256x256 with 3 channels.


The following is a list of the convolutions that such an image goes through, assuming the image's height and width are 224 by 224 (Figure 4, section 1). The convolutional inputs and outputs are better explained in this article under the section titled The Convolutional Layers.


The Convolutional Layers

  • Conv 1 (section 2): takes in feature maps with w,h and channel dimensions 224, 224, 3 (i.e. the original image) and outputs feature maps with dimensions 224, 224, 64.


The way we arrive at these dimensions for the output feature maps is as follows. The number of feature maps, 64, is simply something we define. The pixel dimensions (224x224) are based on the following formula:

output size = ((n + 2p - k) / s) + 1

where n (the input size) is 224, k (the kernel size) = 3, p (the padding) = 1, and s (the stride) = 1, resulting in ((224 + 2 x 1 - 3)/1) + 1 = 224.
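
As a quick sanity check, the formula can be computed directly (a short Python sketch; the function name is mine):

def conv_output_size(n, k, p, s):
    # ((input size + 2 * padding - kernel size) / stride) + 1
    return (n + 2 * p - k) // s + 1

print(conv_output_size(224, k=3, p=1, s=1))   # 224, so the spatial size is preserved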

  • Conv 2 (section 2): takes in feature maps with dimensions 224,224,64 and outputs feature maps with dimensions 224,224,128.

  • Max pooling 1 (section 3): takes in feature maps with dimensions 224,224,128 and outputs feature maps with dimensions 112,112,128 (recall that it is simply downscaling each feature map, not affecting the number of feature maps).

  • Conv 3 (section 3): takes in feature maps with dimensions 112,112,128 and outputs feature maps with dimensions 112,112,128.

  • Conv 4 (section 3): takes in feature maps with dimensions 112,112,128 and outputs feature maps with dimensions 112,112,256.

  • Max pooling 2 (section 4): takes in feature maps with dimensions 112,112,256 and outputs feature maps with dimensions 56,56,256.

  • Conv 5 (section 4): takes in feature maps with dimensions 56,56,256 and outputs feature maps with dimensions 56,56,256.

  • Conv 6 (section 4): takes in feature maps with dimensions 56,56,256 and outputs feature maps with dimensions 56,56,256.

  • Conv 7 (section 4): takes in feature maps with dimensions 56,56,256 and outputs feature maps with dimensions 56,56,256.

  • Conv 8 (section 4): takes in feature maps with dimensions 56,56,256 and outputs feature maps with dimensions 56,56,512.

  • Max pooling 3 (section 5): takes in feature maps with dimensions 56,56,512 and outputs feature maps with dimensions 28,28,512.

  • Conv 9 (section 5): takes in feature maps with dimensions 28,28,512 and outputs feature maps with dimensions 28,28,512.

  • Conv 10 (section 5): takes in feature maps with dimensions 28,28,512 and outputs feature maps with dimensions 28,28,512.

  • Conv 11 (section 5): takes in feature maps with dimensions 28,28,512 and outputs feature maps with dimensions 28,28,512.

  • Conv 12 (section 5): takes in feature maps with dimensions 28,28,512 and outputs feature maps with dimensions 28,28,512.

  • Max pooling 4 (section 6): takes in feature maps with dimensions 28,28,512 and outputs feature maps with dimensions 14,14,512.

  • Conv 13 (section 6): takes in feature maps with dimensions 14,14,512 and outputs feature maps with dimensions 14,14,512.

  • Conv 14 (section 6): takes in feature maps with dimensions 14,14,512 and outputs feature maps with dimensions 14,14,512.

  • Conv 15 (section 6): takes in feature maps with dimensions 14,14,512 and outputs feature maps with dimensions 14,14,512.

  • Conv 16 (section 6): takes in feature maps with dimensions 14,14,512 and outputs feature maps with dimensions 14,14,512.

  • Max pooling 5 (section 7): takes in feature maps with dimensions 14,14,512 and outputs feature maps with dimensions 7,7,512.
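
The convolutional stack listed above can be sketched in PyTorch roughly as follows. This is my own approximation of the article's VGG19-style layout: the 3 by 3 kernels, padding of 1 and ReLU activations are assumptions, and only the input/output dimensions are taken from the list above.

import torch
import torch.nn as nn

def conv_relu(in_ch, out_ch):
    # 3x3 convolution with padding 1 (keeps width and height) followed by a ReLU
    return [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True)]

features = nn.Sequential(
    *conv_relu(3, 64), *conv_relu(64, 128), nn.MaxPool2d(2),       # Conv 1-2, max pooling 1
    *conv_relu(128, 128), *conv_relu(128, 256), nn.MaxPool2d(2),   # Conv 3-4, max pooling 2
    *conv_relu(256, 256), *conv_relu(256, 256),
    *conv_relu(256, 256), *conv_relu(256, 512), nn.MaxPool2d(2),   # Conv 5-8, max pooling 3
    *conv_relu(512, 512), *conv_relu(512, 512),
    *conv_relu(512, 512), *conv_relu(512, 512), nn.MaxPool2d(2),   # Conv 9-12, max pooling 4
    *conv_relu(512, 512), *conv_relu(512, 512),
    *conv_relu(512, 512), *conv_relu(512, 512), nn.MaxPool2d(2),   # Conv 13-16, max pooling 5
)

x = torch.randn(1, 3, 224, 224)   # a 224x224 image with 3 channels
print(features(x).shape)          # torch.Size([1, 512, 7, 7])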

The Fully Connected Linear Layers


Following max pooling layer 5 in section 7 is a series of fully connected layers. This entails flattening the resulting feature maps, which have dimensions 7,7,512, into a single array of pixels totalling 7 x 7 x 512 = 25,088 elements.


The process of flattening the feature maps is simple: take the first row of a feature map and align it against the second row on the horizontal plane, and continue doing this until all rows form a single long row. Then go on to the second feature map and add its rows to the existing long row, as demonstrated in Figure 5 below.



Figure 5
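
In code, the flattening step is a single reshape (PyTorch sketch):

import torch

feature_maps = torch.randn(1, 512, 7, 7)          # batch of 1, 512 feature maps of 7x7
flat = torch.flatten(feature_maps, start_dim=1)   # align all rows of all maps into one long array
print(flat.shape)                                 # torch.Size([1, 25088]) = 7 x 7 x 512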


The fully connected layer looks like this: every one of the 25,088 pixels is connected to every one of the 4,096 output nodes, as demonstrated in Figure 6 below.


Figure 6


The 25,088 pixels are each multiplied by a weight, w, and the result is the sum of all the products, i.e., x_1 x w_1 + x_2 x w_2 + ... + x_i x w_i, where i is 25,088; this gives just the single value of output node one, and the same is done for each of the remaining 4,095 output nodes. The sum of products (a single value) is then passed through an activation (the appendix to this article discusses activations), as demonstrated in Figure 7 below.



Figure 7

  1. Fully connected layer 1: the final feature maps resulting from max pooling layer 5 (Figure 4, section 7) are flattened and passed through a fully connected layer. The inputs equate to 25,088 elements and the outputs to 4,096 elements.

  2. Fully connected layer 2: takes in 4,096 elements from Fully connected layer 1 and outputs 4,096 elements.

  3. Fully connected layer 3: takes in 4,096 elements from Fully connected layer 2 and outputs 1,000 elements.

  4. Fully connected layer 4: takes in 1,000 elements from Fully connected layer 3 and outputs 1,000 elements.
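
Below is a minimal PyTorch sketch of this fully connected head, following the sizes listed above. Each nn.Linear computes exactly the weighted sum described earlier; the ReLU activations between layers are an assumption on my part.

import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(25088, 4096), nn.ReLU(inplace=True),   # fully connected layer 1
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),    # fully connected layer 2
    nn.Linear(4096, 1000), nn.ReLU(inplace=True),    # fully connected layer 3
    nn.Linear(1000, 1000),                           # fully connected layer 4
)

flat = torch.randn(1, 25088)    # the flattened feature maps from Figure 5
print(classifier(flat).shape)   # torch.Size([1, 1000])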

And just like that, we now have an output from the discriminator.


In the ESRGAN, during training, this entire process is run twice, as mentioned:

  • Once for the original high resolution image that is part of our training data

  • Another for the high resolution image generated by the generator

The difference in outputs (1,000 nodes each) is compared and produces a loss. The lower this loss, the better, since a lower loss entails that the original high resolution image and the high resolution image generated by our generator are similar, indicating high quality upscales. If this loss is large, the generator will continue to adjust its weights (the values of its kernels) to produce higher quality upscales. The discriminator will continue to adjust its weights as well (the weights in the convolutional kernels as well as the fully connected layers).
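
A hedged sketch of that comparison follows: here the two discriminator outputs are compared with a mean squared error, which is used purely for illustration and is not necessarily the exact loss formulation the ESRGAN uses.

import torch
import torch.nn.functional as F

out_real = torch.randn(1, 1000)   # discriminator output for the original high resolution image
out_fake = torch.randn(1, 1000)   # discriminator output for the generator's upscale

# The smaller this value, the more similar the two images look to the discriminator
loss = F.mse_loss(out_fake, out_real)
print(loss.item())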


During deployment, the discriminator isn't used at all and only the generator is used to produce the high quality upscales since the generator has already been trained and will not adjust its weights further.



Appendix: Activations


Another important aspect of deep learning is the activation function. The activation function is a simple non-linear function that is applied to every pixel in the resulting feature maps. An example is the sigmoid activation that processes a pixel as follows:

f(x) = 1 / (1 + e^(-x))

f(x) is the resulting value of the pixel on the feature map. x is the input value of the pixel beforehand. If the value of the pixel is 3, applying the activation function to it transforms its value to become roughly 0.95. Activations transform pixels in a way that is not linear, allowing the neural network to learn features in a non-linear fashion. Having no activation function creates a neural network that can only learn features about an image linearly, leading to poor results. Note that activations do not change the shape of the feature maps at all but simply change the value of each pixel on each feature map.
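
For example, applying the sigmoid element-wise to a small feature map in PyTorch:

import torch

feature_map = torch.tensor([[3.0, 0.0],
                            [-3.0, 0.05]])

# The shape is unchanged; every value is squashed into the range (0, 1)
print(torch.sigmoid(feature_map))
# tensor([[0.9526, 0.5000],
#         [0.0474, 0.5125]])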


