The Reproduction of “Automated Pavement Crack Segmentation Using U-Net-Based Convolutional Neural Network”

Apr 16, 2021

Authors: Jeroen Dekker & Aakaash Radhoe

In this blog, we try to reproduce the results of Tables 1 & 2 of the original paper. We did this from scratch, since the authors of the paper shared neither code nor data. We therefore had to search for the datasets (CFD and Crack500) online and found them at: https://drive.google.com/drive/folders/1y9SxmmFVh0xdQR-wdchUmnScuWMJ5_O-. We built the U-Net-based convolutional network ourselves, with the help of the PyTorch documentation. We chose to work with PyTorch for this reproducibility project since this library already had documentation on the U-Net and ResNet34. Our code is available at: https://github.com/aakaashradhoe/CS4240-DL

The goal of the paper

Source: Lau, S. L., Chong, E. K., Yang, X., & Wang, X. (2020). Automated pavement crack segmentation using u-net-based convolutional neural network. IEEE Access, 8, 114892–114899.

The goal of the paper is to detect cracks in images. This is done with a U-Net-based convolutional network with a ResNet34 encoder that is pre-trained on the ImageNet dataset. The loss function used in the paper is the Dice loss. The output of the network is first passed through a sigmoid to squash it to values between 0 and 1, and is then passed to the loss function, which is used to optimize the weights of the layers. The network is split into three layer groups, each with its own learning rate. The learning rates of the layer groups have the ratio 1/9 : 1/3 : 1. The neural network is trained on images of progressively increasing size: 128 x 128, 256 x 256, and 320 x 320. These sizes are obtained by resizing and cropping the images. In the data pre-processing step, the authors also applied image augmentations randomly to each image. Three kinds of augmentations are performed: random rotations between 0° and 360°, random flips along the horizontal and vertical axes, and random changes in lighting.
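To make the loss concrete, here is a minimal sketch of a soft Dice loss preceded by a sigmoid, written in plain Python for one image flattened to a list of pixels. The function names and the small smoothing constant eps are our own assumptions for illustration. The paper does not give its exact formulation, so this should not be read as the authors' implementation.

```python
import math

def sigmoid(x):
    """Map a raw network output (logit) to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dice_loss(logits, mask, eps=1e-7):
    """Soft Dice loss for one image given as flat lists.

    logits: raw network outputs, one per pixel
    mask:   ground-truth labels, 1 for crack and 0 for background
    eps:    small constant (our assumption) that avoids division by zero
    """
    probs = [sigmoid(z) for z in logits]
    intersection = sum(p * m for p, m in zip(probs, mask))
    total = sum(probs) + sum(mask)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)

# A confident, correct prediction gives a loss close to 0 ...
good = dice_loss([10.0, 10.0, -10.0, -10.0], [1, 1, 0, 0])
# ... while a confident, wrong prediction gives a loss close to 1.
bad = dice_loss([-10.0, -10.0, 10.0, 10.0], [1, 1, 0, 0])
```

Because the loss only looks at the overlap between prediction and mask, it is not dominated by the large number of background pixels, which is useful when cracks cover only a small part of the image.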

Original results of the paper

The results of the paper we want to reproduce are Tables 1 & 2, which show the precision, recall, and F1 score on the test images. In these tables the authors already compare their results with those of other papers, and it can be concluded that their method outperformed the others on almost all scores. We will compare our results with these tables to see how well our model performed on the datasets.

Source: Lau, S. L., Chong, E. K., Yang, X., & Wang, X. (2020). Automated pavement crack segmentation using u-net-based convolutional neural network. IEEE Access, 8, 114892–114899.

The Model

We started by figuring out how to work with the ResNet code. We studied its structure to understand how the layers were built and which PyTorch modules were applied. This is important since the U-shaped ResNet34 network requires skip connections from each level of the ResNet encoder to the corresponding level of the decoder. Searching the internet yielded no solutions to enable these skip connections with the ResNet code, so we chose to first implement the entire decoder inside the ResNet code in order to continue working on other important parts of the reproduction project. Later we learned of a more elegant solution but were unable to implement it due to time constraints.

We edited two parts of the ResNet code from https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py. Before we did this we needed to add the SCSE module. We couldn’t find the SCSE module code mentioned in the source paper, so we used the SCSE implementation from the nyoki-mtl/pytorch-segmentation repository on GitHub.

Both edits are in the ResNet class. We first defined the PyTorch modules for the decoder layers in the __init__ function. Each module in each layer has its own initialized weights, including the batch normalization layers and the SCSE layers. The added code in this part is illustrated below.

Several important things should be noted here. Firstly, we initialized the transpose convolution weights with the Kaiming normal distribution function. The paper did not indicate whether the authors used Kaiming normal or Kaiming uniform. Secondly, it was unclear whether other layers would also need such initialization. Since this was not mentioned, we assumed it was not necessary. Thirdly, we removed the weight initialization for the average pool and linear layer, as was mentioned in the paper. Lastly, we added the three layer groups that are needed for the learning rate cycler used later during training.

Besides the __init__ function, we also edited the forward function of the ResNet class. Here we added the decoder and the skip connections and removed the average pool and linear layer. The resulting forward function is shown below.

The arrows indicate the additional edits in the encoder that allow us to apply the skip connections. These skip connections go through a 1x1 convolution and are subsequently concatenated channel-wise with the output of the previous layer in the decoder. It is important to note that the final output of the decoder goes through a sigmoid function to get values between 0 and 1, so that the output represents the probability of the pixel being a crack.

Dataset, batching & transforming

The datasets used for the reproduction are CFD and CRACK500. They contain images of cracks together with their ground truth (segmented version), see image above. CFD includes 72 train images and 46 test images, and CRACK500 is listed with 3792 training images and 2248 test images. For CRACK500, these numbers did not match what we found: the total of 3792 train images includes the ground truth (segmented version) of the training images, and the same applies to the test set. So for the CRACK500 dataset, the authors of the paper only had 1896 train images and 1124 test images, which is probably a mistake in the paper. Where the CFD dataset contains full-sized 480x320 images, the CRACK500 dataset contains crops of the original images. There were a total of 1400 original images from which the crops were extracted. Due to time constraints and problems with limited RAM, we were unable to run the CRACK500 dataset.

The images are loaded by a data loader custom-written for this reproduction project, named import_and_format_data in the code. After loading the dataset, the images had to be split into a training set and a test set. This is done randomly in the custom alt_Datasorter function, which creates the train and test sets. The paper didn’t mention any batch sizes, so we decided to use batches of five.
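The random split and the batching can be sketched as follows. This is a simplified stand-in and not the alt_Datasorter function from our notebook. The name split_and_batch, the fixed seed, and the use of integers in place of image/mask pairs are assumptions made for illustration.

```python
import random

def split_and_batch(samples, n_train, batch_size=5, seed=0):
    """Randomly split samples into train and test sets and cut both into batches.

    samples:    any sequence of items, e.g. (image, mask) pairs
    n_train:    number of items that go into the training set
    batch_size: five, as chosen in this reproduction
    seed:       fixed seed (our assumption) so the split is repeatable
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    train, test = shuffled[:n_train], shuffled[n_train:]

    def batches(items):
        # The last batch may be smaller than batch_size.
        return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

    return batches(train), batches(test)

# CFD: 118 images in total, 72 for training and 46 for testing.
train_batches, test_batches = split_and_batch(range(118), n_train=72)
```

With 72 training images and a batch size of five, the last training batch holds only two images, so any code that assumes a fixed batch size has to handle that case.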

During training, just before the images in the batch are used as input for the forward pass, the data is processed with the same steps as described in the paper. Transformations are applied to artificially create a bigger dataset by randomly applying rotations, flips and changes in hue and contrast. We created a custom Transform function that handles the following preprocessing steps during each training iteration: it rotates the image and its mask by a random angle between 0 and 360 degrees, randomly flips them horizontally, randomly flips them vertically and, only for the image, randomly adjusts the contrast and image balance by 0.05. After these steps the images are also resized during each training iteration to 128x128, 256x256 or 320x320, depending on the epoch. Until epoch 15 the image size is kept at 128x128, then it changes to 256x256 until epoch 30, and after that it is 320x320 until epoch 45. Here we assume the authors changed the size of the images at the same points, at ⅓ and ⅔ of all epochs, as mentioned later in the paper when they trained the network for 90 epochs to investigate the effect of the resizing.
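The epoch-dependent resizing and the random augmentation parameters can be sketched as below. The functions target_size and sample_augmentation are hypothetical names, not the Transform function from our notebook. The flip probability of 0.5 and the reading of "0.05" as a contrast factor between 0.95 and 1.05 are assumptions, since the paper does not spell them out. The key point is that one set of geometric parameters is drawn per sample and applied to both the image and its mask.

```python
import random

SIZES = (128, 256, 320)  # progressive image sizes from the paper

def target_size(epoch, total_epochs=45):
    """Return the square image size for a 0-based epoch index.

    The size changes at 1/3 and 2/3 of all epochs (our assumption).
    """
    stage = min(epoch * len(SIZES) // total_epochs, len(SIZES) - 1)
    return SIZES[stage]

def sample_augmentation(rng):
    """Draw one set of augmentation parameters for an image/mask pair."""
    return {
        "angle": rng.uniform(0.0, 360.0),            # image and mask
        "hflip": rng.random() < 0.5,                 # image and mask
        "vflip": rng.random() < 0.5,                 # image and mask
        "contrast": 1.0 + rng.uniform(-0.05, 0.05),  # image only
    }

params = sample_augmentation(random.Random(0))
sizes_per_epoch = [target_size(e) for e in range(45)]
```

Drawing the parameters once and reusing them for the mask keeps the ground truth aligned with the image. Applying independent random rotations to image and mask would silently corrupt the labels.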

Training the model

With the data loaded and pre-processed, we can start training our model. This is done in the last block of code in the notebook. In the block above it we initialized the hyperparameters and the values used by our custom learning rate scheduler. In the training loop block, we initialized the optimizer, which is AdamW. The authors of the paper used the default hyperparameters for this optimizer. For the custom learning rate (LR) scheduler we first set a base LR. Following the paper, the LR then increases until we reach 40% of the epochs, and from there we let it decrease linearly to zero. The paper states that, using a cyclic scheduler, the LR should approach near zero at the last epoch, whereas a standard cyclic scheduler would converge to the base LR. So, we made our own scheduler. Although the authors are very clear on this part, they did not mention which base learning rate they used to achieve the results we wanted to reproduce. The layers were also divided into three layer groups, of which the first was frozen for the first 15 epochs. In the training loop, we used the Dice loss function to compare the predicted output with the mask. After training, we set the model to eval mode and test it.
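A schedule of this shape, combined with the 1/9 : 1/3 : 1 ratio between the layer groups, can be sketched as below. The names lr_factor, group_lrs and peak_lr are hypothetical, and so are the linear warm-up and the starting fraction start_frac. The paper does not state these details, and this sketch is not a copy of the scheduler in our notebook. Freezing the first layer group for the first 15 epochs is left out.

```python
GROUP_RATIOS = (1 / 9, 1 / 3, 1.0)  # learning-rate ratio of the three layer groups

def lr_factor(epoch, total_epochs=45, warmup_frac=0.4, start_frac=0.1):
    """Multiplier applied to the peak learning rate at a given epoch.

    Rises linearly from start_frac to 1 during the first 40% of the epochs,
    then falls linearly to 0 at the last epoch. start_frac and the linear
    warm-up are assumptions made for this sketch.
    """
    peak_epoch = warmup_frac * total_epochs
    if epoch < peak_epoch:
        return start_frac + (1.0 - start_frac) * epoch / peak_epoch
    return max(0.0, (total_epochs - epoch) / (total_epochs - peak_epoch))

def group_lrs(epoch, peak_lr, **kwargs):
    """Learning rate of each layer group at a given epoch."""
    factor = lr_factor(epoch, **kwargs)
    return [peak_lr * ratio * factor for ratio in GROUP_RATIOS]

# peak_lr is a placeholder value: the paper does not report the one it used.
schedule = [group_lrs(epoch, peak_lr=1e-3) for epoch in range(46)]
```

In PyTorch, a multiplier function like lr_factor could be handed to torch.optim.lr_scheduler.LambdaLR, with one parameter group per layer group so that each group keeps its own ratio.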

Results

In this part, we show the results of the model we trained and compare them with the original results from the paper. To calculate how well the model performs, we used the same measures as the paper. We wrote functions to calculate the precision, recall and F1-score. The paper states that a True Positive (TP) is any pixel within a distance of 2 of a crack pixel in the mask. Which distance measure they used was not defined, and the paper they referenced was also not very clear on this. It is very likely they meant the Manhattan (city-block) distance, which we refer to as Hamming distance in our code. But since this was unclear we wrote two implementations: one with a diamond shape, using the Manhattan distance, and one in which a diagonal step also counts as distance 1, which gives a square. This lack of clarity made it difficult to reproduce the results. We chose to implement both and vary the learning rates to get more insight into which one they could have used. The Manhattan distance seemed the most obvious choice, so we ran this one more often.
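The difference between the two neighbourhood shapes can be shown with a small sketch that works on sets of (row, col) crack pixels. The function names are hypothetical and this is not the code from our notebook. How false negatives are counted is also an assumption, since the paper only describes true positives: here a mask pixel counts as a false negative if no predicted pixel lies within the tolerance.

```python
def within_tolerance(pixel, targets, tol=2, shape="diamond"):
    """True if any target pixel lies within tol of pixel.

    shape="diamond": a diagonal step counts as 2 (Manhattan distance)
    shape="square":  a diagonal step counts as 1 (Chebyshev distance)
    """
    r, c = pixel
    for tr, tc in targets:
        dr, dc = abs(r - tr), abs(c - tc)
        dist = dr + dc if shape == "diamond" else max(dr, dc)
        if dist <= tol:
            return True
    return False

def scores(pred, mask, tol=2, shape="diamond"):
    """Return (precision, recall, f1) for sets of (row, col) crack pixels."""
    tp = sum(within_tolerance(p, mask, tol, shape) for p in pred)
    fp = len(pred) - tp
    # Assumption: a mask pixel with no prediction nearby is a false negative.
    fn = sum(not within_tolerance(m, pred, tol, shape) for m in mask)
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# One predicted pixel two diagonal steps away from the only crack pixel.
mask = {(0, 0)}
pred = {(2, 2)}
diamond = scores(pred, mask, shape="diamond")  # distance 4, outside the tolerance
square = scores(pred, mask, shape="square")    # distance 2, inside the tolerance
```

In this toy case the same prediction scores 0 with the diamond and 1 with the square, which illustrates why the undefined distance measure matters when comparing against the numbers in the paper.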

Discussion

When we compare our results with the results from Table 1 of the paper, we can conclude that the reproduction for the CFD dataset was unsuccessful. Our model does not perform the same as theirs. Our test scores show that the precision of our model is very close to that of the paper, but that the recall is way off (even compared with the other papers), which lowers the F1-score. A recall that is too low means that too many pixels were identified as non-crack when they were actually crack. This looks like underfitting, but when we compared the predicted images with the masks this didn’t really show. Changing the base learning rate didn’t significantly increase the recall either, something you would expect to see if the model were overfitting (which sometimes happens with a high learning rate). The low recall is highly unlikely to be caused only by the random sorting of data into the train and test sets or by the random transformations on the data.

Three factors seem to be the most likely reasons for this failure to reproduce. The first and most likely reason is that something went wrong in our implementation. This could be in how we evaluate the accuracy, how we deal with borders, how we transform, resize and crop the images, or simply something wrong in the forward function.

The second reason is the uncertainty arising from unclear parts of the method in the paper and from the omission of the learning rate used for its results. The authors of the paper were unclear on how they implemented the accuracy function, on the exact meaning of some transformations such as image balance, on when they resized the data (if they even did that for the results in Table 1), and on which kind of weight initialization they used. The omission of the base learning rate for the reported results is especially peculiar.

The last and least likely reason is inappropriate, wrongly or incompletely installed packages, or bugs in PyTorch. On one of the devices we used, the code would not run: when we froze the first layer group before training, Python returned an error implying that a local variable could not be found. We were unable to find the root cause of this problem since it didn’t happen on the other device. Inexperience in computer science in combination with limited time rendered us unable to appropriately address this problem.

Conclusion

Reproduction of this paper was difficult. The paper appears very clear and concise, but it was too concise. Very important aspects were mentioned only briefly, which mostly left us on our own to figure things out. All in all, we did manage to create a pavement crack segmentation algorithm but were unable to reproduce the recall reported in the paper. Despite this, reproducing this paper gave us a lot of room to explore the functionalities and limitations of PyTorch and a lot more insight into the actual application of deep learning algorithms. For example, the hours spent looking through the documentation on optimizers, only to find out that we had to write the learning rate scheduler ourselves, gave us a lot of time to learn about optimizers and learning rate manipulation schemes. It also confronted us with the fact that there is far more to a deep learning paper than it seems on the surface: it is not simply building a network. More importantly, it showed us what is important to include if you want to make a paper reproducible. By experiencing the difficulties of reproducing a paper, we can learn to be more precise and complete in our own work, so that people in the future may be able to reproduce our own deep learning algorithms.