
A Deep Dive into the ESRGAN (Part 2: the Generator)

  • Writer: Mohamed Benaicha
  • Nov 27, 2023
  • 6 min read

Updated: Nov 28, 2023


The generator's architecture in an ESRGAN takes a low-resolution image and outputs a higher-resolution image. It does this through convolutions, with activations applied where indicated. The architecture involves a lot of concatenation and addition of feature maps, which extracts meaningful feature-map information, as demonstrated by trials (see the results from the CG Texture Upscaler, which uses the ESRGAN as its AI model to upscale computer graphics textures).


ESRGAN Architecture


The ESRGAN is made up of mainly four types of elements, or smaller architectures: the convolutional layer, the basic block, the dense block and the upsampling layer.

  • It uses the typical convolutional layer 4 times.

  • It uses a basic block 23 times; each basic block uses 3 dense blocks.

  • Each dense block uses 5 convolutions.

  • An upsampling block is used once for the 2x upscale model and twice for the 4x upscale model.

(For the time being, I won't go into where the activations are applied, since that is not a fundamental part of the architecture; besides, activations are simply applied to each feature map and do not alter the shapes of the layers and blocks mentioned.)
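
To make this inventory concrete, here is a minimal PyTorch-style sketch of the components (the module names are my own, and nn.Identity temporarily stands in for the basic block, which is detailed later in this post):

```python
import torch.nn as nn

# Rough inventory of the generator's components; every convolution in the
# ESRGAN uses kernel size 3, stride 1 and padding 1 (discussed below).
generator_parts = nn.ModuleDict({
    "conv1": nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),       # Conv 1 of 4
    "basic_blocks": nn.Sequential(*[nn.Identity() for _ in range(23)]),  # 23 basic blocks
    "conv2": nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),      # Conv 2
    "upsample": nn.Upsample(scale_factor=2, mode="nearest"),             # used twice for 4x
    "conv3": nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),      # Conv 3
    "conv4": nn.Conv2d(64, 3, kernel_size=3, stride=1, padding=1),       # Conv 4 -> RGB
})
```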


Figure 2


A Small Catch


There's a small catch in the ESRGAN when upscaling an image. The image intended for upscaling can be upsampled once to attain a 2x upscale, or twice to attain a 4x upscale.


The approach that I used, and that I've seen used elsewhere, is to always upsample twice. This always results in an image that is 4x its original scale. To allow for a 2x upscale, the initial image is instead downscaled once. This is all done right at the beginning, before the image goes into the network.
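
In code, this trick amounts to a single interpolation applied before the network runs. Here's a minimal sketch (the function name and the choice of bicubic resampling are my own assumptions):

```python
import torch
import torch.nn.functional as F

def prepare_input(img: torch.Tensor, target_scale: int) -> torch.Tensor:
    """Pre-shrink a (N, C, H, W) image when only a 2x upscale is wanted,
    since the network itself always upsamples twice (4x total): 0.5 * 4 = 2."""
    if target_scale == 2:
        # hypothetical choice of bicubic; any decent resampling filter works
        img = F.interpolate(img, scale_factor=0.5,
                            mode="bicubic", align_corners=False)
    return img  # for target_scale == 4, the image passes through untouched
```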



Figure 3


We can now move on to detail what goes on in the actual generator. For the remainder of this tutorial, we'll assume we have an input image of dimensions 1024, 1024, 3 (height, width, channels) and that we're upscaling it to 4x its original resolution, resulting in 4096, 4096, 3 (height, width, channels).



The Convolutional Layers



Figure 4


As mentioned previously, a convolutional layer takes the input image (with dimensions w, h and channels c) and transforms it into a new image: a set of feature maps with a new set of dimensions and a new channel (feature map) count.


Conv 1 expects an image of dimensions w, h and 3 channels. The exact dimensions don't matter; they simply have to be a multiple of 2 (256 is a valid width, 255 is not) and greater than 0. Conv 1 outputs 64 feature maps, i.e., the initial 3 channels are converted into a total of 64 new feature maps. Each feature map has a height and a width that can be calculated using this formula:


( ( w + ( 2 x p ) - k ) / s ) + 1


The padding size (p) and stride size (s) are not critical to understanding the ESRGAN architecture; they are generally fixed according to a default rule. The kernel size (k), however, is familiar to us, as it defines the filters referred to earlier whose weights have to be learned (detailed here).


Hence, if the input dimensions of the image to Conv 1 are 1024 by 1024, and Conv 1's k, p and s values are 3, 1 and 1, each of the resulting 64 feature maps will have dimensions:


( ( 1024 + ( 2 x 1 ) - 3 ) / (1) ) + 1 = 1024
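
The same arithmetic can be checked with a throwaway Python helper (not part of the model itself):

```python
def conv_output_size(w: int, k: int, p: int, s: int) -> int:
    # ((w + 2p - k) / s) + 1
    return (w + 2 * p - k) // s + 1

# k=3, p=1, s=1 leaves the spatial dimensions unchanged
assert conv_output_size(1024, k=3, p=1, s=1) == 1024
```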


Figure 5


Every convolutional layer in the ESRGAN has k, p and s values equal to 3, 1 and 1.


Conv 2, 3 and 4 follow the same pattern and differ only in their resulting feature maps and dimensions, based on their inputs:

  • Conv 2: receives 64 feature maps from the final basic block, together of dimensions 1024,1024,64, and outputs feature maps of the same dimensions. This will become clear as we detail the dense and basic blocks.

  • Conv 3: receives the 64 feature maps produced by upsampling the output of Conv 2; for our 4x example these have dimensions 4096,4096,64, and Conv 3's output keeps those dimensions.

  • Conv 4: receives 64 feature maps from Conv 3 but outputs only 3 feature maps (i.e., using only 3 filters) representing the RGB color maps of the final image, each channel with dimensions 4096,4096.
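
Declared in PyTorch, the four convolutions might look like the following (the variable names are mine, and a small dummy image stands in for our 1024 by 1024 example to keep the shape check cheap):

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)   # RGB image -> 64 maps
conv2 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)  # after the final basic block
conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)  # after upsampling
conv4 = nn.Conv2d(64, 3, kernel_size=3, stride=1, padding=1)   # 64 maps -> RGB image

x = torch.randn(1, 3, 256, 256)             # small stand-in input
print(conv1(x).shape)                       # torch.Size([1, 64, 256, 256])
print(conv4(conv3(conv2(conv1(x)))).shape)  # torch.Size([1, 3, 256, 256])
```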


The Basic Blocks


Since each basic block is composed of 3 dense blocks, the discussion of basic blocks is deferred until after the dense blocks have been covered.


The Dense Blocks




Figure 6


Each dense block is composed of 5 convolutional layers in total. Carrying on with our example of an image with dimensions 1024 by 1024:


The initial input to the dense block (Figure 6, A) is 1024,1024,64, coming from Conv 1 (Figure 4).


Referencing Figure 6:

  • The input is convolved 5 times in succession (each convolution has k, p and s values of 3, 1 and 1).

  • Convolution 1 outputs feature maps with dimensions 1024, 1024, 32. This output is passed on to Convolutions 2, 3, 4 and 5 (Figure 6, B, C, D, E). At each point, the input to the following convolution is formed by concatenating the output of the previous convolution with everything that preceded it. For example, at point B, the output of convolution layer 1 (1024,1024,32) is concatenated with the input to convolution layer 1, meaning the feature maps are stacked as in Figure 7 below. An activation function called ReLU (like the sigmoid discussed earlier) is then applied to each pixel of each feature map of the output.

  • Convolution 2 (Figure 6, B) takes in feature maps with dimensions 1024, 1024, 96 (the input to Convolution 1 concatenated with the output of Convolution 1) and outputs feature maps with dimensions 1024, 1024, 32. A ReLU activation is then applied to each pixel of each feature map of the output.

  • Convolution 3 (Figure 6, C) takes in feature maps with dimensions 1024, 1024, 128 (the input to Convolution 1 concatenated with the outputs of Convolutions 1 and 2) and outputs feature maps with dimensions 1024, 1024, 32. A ReLU activation is then applied to each pixel of each feature map of the output.

  • Convolution 4 (Figure 6, D) takes in feature maps with dimensions 1024, 1024, 160 (the input to Convolution 1 concatenated with the outputs of Convolutions 1, 2 and 3) and outputs feature maps with dimensions 1024, 1024, 32. A ReLU activation is then applied to each pixel of each feature map of the output.

  • Convolution 5 (Figure 6, E) takes in feature maps with dimensions 1024, 1024, 192 (the input to Convolution 1 concatenated with the outputs of Convolutions 1, 2, 3 and 4) and outputs feature maps with dimensions 1024, 1024, 64.

Figure 7


The output of the dense block is thus the same shape as its input. There are three dense blocks per basic block, each with the same input and output dimensions, in which the above-mentioned convolutions take place.
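
Putting the five convolutions and the concatenations together, a dense block can be sketched in PyTorch as follows. One caveat: this post describes the activation as ReLU, while the official ESRGAN implementation uses LeakyReLU with a slope of 0.2; the shapes are identical either way.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Five convolutions with dense concatenation, as in Figure 6."""
    def __init__(self, channels: int = 64, growth: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, growth, 3, 1, 1)                 # 64  -> 32
        self.conv2 = nn.Conv2d(channels + growth, growth, 3, 1, 1)        # 96  -> 32
        self.conv3 = nn.Conv2d(channels + 2 * growth, growth, 3, 1, 1)    # 128 -> 32
        self.conv4 = nn.Conv2d(channels + 3 * growth, growth, 3, 1, 1)    # 160 -> 32
        self.conv5 = nn.Conv2d(channels + 4 * growth, channels, 3, 1, 1)  # 192 -> 64
        self.act = nn.ReLU(inplace=True)  # LeakyReLU(0.2) in the official code

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c1 = self.act(self.conv1(x))
        c2 = self.act(self.conv2(torch.cat((x, c1), dim=1)))
        c3 = self.act(self.conv3(torch.cat((x, c1, c2), dim=1)))
        c4 = self.act(self.conv4(torch.cat((x, c1, c2, c3), dim=1)))
        return self.conv5(torch.cat((x, c1, c2, c3, c4), dim=1))  # same shape as x
```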


Back to the Basic Blocks



Figure 8


There are 23 basic blocks, and each basic block contains a series of 3 dense blocks. As mentioned above, the input to and output from each basic block have the same dimensions, 1024,1024,64. Each dense block's output is added (not concatenated!) to that dense block's input, as indicated by the + symbols at points A, B and C. This pointwise addition is a pixel-by-pixel addition, which poses no problem since the inputs and outputs of each dense block have the same dimensions.

  • The output of dense block 1 (A) is multiplied by a factor (between 0 and 1) and then added back to its input, resulting in output dimensions 1024,1024,64, but with each pixel having new values.

  • The output of dense block 2 (B) is multiplied by a factor (between 0 and 1) and then added back to its input, resulting in output dimensions 1024,1024,64, but with each pixel having new values.

  • The output of dense block 3 (C) is multiplied by a factor (between 0 and 1) and then added back to its input, resulting in output dimensions 1024,1024,64, but with each pixel having new values.


Figure 9


Since the input and output dimensions of each convolutional block are equal, so too are the input and output dimensions of each basic block, i.e., 1024, 1024, 64 for our example image. We're now ready to pass the output of the final basic block into Conv 2 (Figure 4), which yields output dimensions 1024,1024,64.
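
In code, a basic block simply chains three of the DenseBlock modules sketched earlier, each wrapped in a scaled residual addition (0.2 is the scaling factor used in the official ESRGAN implementation; the official basic block also adds one more scaled residual around the whole block, which is omitted here to match the description above):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Three dense blocks with scaled residual additions (Figure 8)."""
    def __init__(self, channels: int = 64, beta: float = 0.2):
        super().__init__()
        self.db1 = DenseBlock(channels)  # DenseBlock from the previous sketch
        self.db2 = DenseBlock(channels)
        self.db3 = DenseBlock(channels)
        self.beta = beta  # the "factor between 0 and 1" mentioned above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.beta * self.db1(x)  # point A in Figure 8
        x = x + self.beta * self.db2(x)  # point B
        x = x + self.beta * self.db3(x)  # point C; shape is unchanged throughout
        return x
```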


We're now ready to upsample.


Upsampling


Upsampling simply takes an image and increases its resolution without affecting the number of feature maps. In the case of the ESRGAN, the 2x upsample occurs twice. The resulting feature maps after each upsample layer are as follows:

  • Upsample 1 takes a 1024,1024,64 input and outputs 2048,2048,64

  • Upsample 2 takes a 2048,2048,64 input and outputs 4096,4096,64

The final result is that the feature maps now have dimensions 4096, 4096 and still number 64 in total.
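
A minimal way to express this in PyTorch is plain nearest-neighbour interpolation; the official implementation pairs each interpolation with a convolution and activation, but the interpolation alone is what changes the dimensions. A smaller tensor stands in for the real feature maps to keep the demo light on memory:

```python
import torch
import torch.nn.functional as F

feature_maps = torch.randn(1, 64, 256, 256)  # stand-in for (1, 64, 1024, 1024)
up1 = F.interpolate(feature_maps, scale_factor=2, mode="nearest")
up2 = F.interpolate(up1, scale_factor=2, mode="nearest")
print(up1.shape, up2.shape)  # the channel count (64) is untouched
```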


We can now pass the feature maps into the final convolutional layers, 3 and 4.

  • Conv 3 (Figure 4) takes in feature maps with dimensions 4096,4096,64 and outputs feature maps with the same dimensions.

  • Conv 4 (Figure 4) takes in feature maps with dimensions 4096,4096,64 and outputs the feature maps representing the final upscaled image, with dimensions 4096,4096,3.
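
Tying the walkthrough together, here is a minimal end-to-end sketch of the generator's forward pass, built from the pieces above. It follows this post's description; note that the official implementation additionally adds Conv 1's output back after Conv 2 as a long skip connection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    def __init__(self, num_blocks: int = 23):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, 1, 1)
        self.blocks = nn.Sequential(*[BasicBlock() for _ in range(num_blocks)])
        self.conv2 = nn.Conv2d(64, 64, 3, 1, 1)
        self.conv3 = nn.Conv2d(64, 64, 3, 1, 1)
        self.conv4 = nn.Conv2d(64, 3, 3, 1, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.conv1(x)                   # (N, 64, 1024, 1024)
        feats = self.conv2(self.blocks(feats))  # (N, 64, 1024, 1024)
        feats = F.interpolate(feats, scale_factor=2, mode="nearest")  # -> 2048
        feats = F.interpolate(feats, scale_factor=2, mode="nearest")  # -> 4096
        return self.conv4(self.conv3(feats))    # (N, 3, 4096, 4096)
```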

Conclusion


During training, the output image is passed to the discriminator to determine the quality of the upscale. During inference, when the model is put into production and actually used to create upscaled images, the output image is the final product of the upscaling.
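
As a final sketch, inference with the Generator class above boils down to a single forward pass (the dummy input keeps the example cheap to run):

```python
import torch

model = Generator().eval()               # the sketch class from the previous section
with torch.no_grad():                    # no gradients needed at inference time
    lr_image = torch.rand(1, 3, 64, 64)  # dummy low-resolution input
    sr_image = model(lr_image)           # (1, 3, 256, 256): 4x upscaled
print(sr_image.shape)
```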




