A Machine Learning Model for Improving Building Detection in Informal Areas: A Case Study of Greater Cairo

Building detection in Ashwa’iyyat is a fundamental yet challenging problem, mainly because it requires the correct recovery of building footprints from images with high object density and scene complexity. A classification model is proposed that integrates spectral, height and textural features. It was developed for the automatic detection of rectangular buildings, irregularly shaped buildings, very small buildings, and buildings that are close to each other but not adjoined. It is intended to improve the precision with which buildings are classified using the scikit-learn Python library and QGIS. WorldView-2 and Spot-5 imagery were combined using three image fusion techniques. The Grey-Level Co-occurrence Matrix (GLCM) was applied to determine which attributes are important in detecting and extracting buildings. A Normalized Digital Surface Model (nDSM) was also generated at 0.5-m resolution. The results demonstrated that when textural features of colour images were introduced as classifier input, the overall accuracy improved in most cases. The results show that the proposed model was more accurate and efficient than the state-of-the-art methods and can be used effectively to extract the boundaries of small buildings. The use of a classifier ensemble is recommended for the extraction of buildings.


Introduction
Informal settlement is an umbrella concept that covers unsafe areas and unplanned areas. Within the Egyptian context, such settlements are known as 'Ashwa'iyyat'. Unsafe areas are identified by the presence of a life-threatening hazard, whereas unplanned areas are identified by noncompliance with planning and construction laws [1]. These areas suffer from problems such as narrow streets and air pollution.
As a result, the Egyptian government developed a national strategy for improving Ashwa'iyyat, followed by an action plan to establish new towns in the desert toward "Egypt Without Ashwa'iyyat" by 2030 to achieve Sustainable Development Goal 11: "Make cities inclusive, safe, resilient, and sustainable" [2]. Building boundaries are traditionally drawn using photogrammetric stereo plotters and hand digitization from digital photos in stereo vision. However, this is a time-consuming and exhausting operation that requires trained personnel and expensive equipment. Automated methods for extracting buildings therefore have considerable potential and value [3]. Many complex factors influence building extraction from satellite images, such as scene complexity, building inconstancy, and image resolution, and they affect the overall accuracy of building extraction and detection [4]. The main challenge of this study is to provide a precise and highly accurate framework for building detection, which is essential for a variety of applications [5], including urban planning and management and infrastructure development.
The fundamental problem with building extraction methods is that the building class is confused with other object classes such as shadows, vegetation, and the ground. Other misclassification issues include labelling a non-building as a building and confusing trees with shadows. These issues, which arise when a single dataset and a single approach are used, reduce the accuracy of the classification process. As a result, various methods and strategies have been proposed to address the complexity of the classification process [6]. Another challenge is that a building's roof may be made of several surface materials with varying reflective properties.
These issues make automatic building detection a complex computer vision task that is prone to errors. Therefore, many algorithms and strategies have been developed to solve these problems [7].
To extract buildings, researchers have proposed various approaches, including image-based [4,8-11], LiDAR-based [12,13] and data fusion-based methods [14,15]. Although these approaches have shown promise in real-world applications, highly complicated scenes remain difficult, for example where surrounding objects such as trees obstruct the buildings. To overcome these challenges, we propose a framework that integrates multi-source image fusion technology with a digital surface model (DSM) using machine learning (ML) techniques for building detection.
Sohn and Dowman [16] suggested an algorithmic method for extracting buildings from IKONOS images in densely populated areas. They used large detached buildings in their investigation without performing any accuracy analysis or modelling of the structural features. Support vector machines (SVMs), ensemble classifiers (e.g. random forest (RF)), and deep learning algorithms are the most commonly used techniques in the remote sensing community [17]. Belgiu and Drǎguţ [18] employed the RF classifier to successfully map urban buildings. In addition, single-date MODIS data have been used to categorise urban impervious surfaces [19].
Several studies [20,21] have investigated the potential of RF classification to improve urban classification from LiDAR data. When several input datasets could be used to improve classification, it is critical to include only relevant datasets in order to reduce the computational burden without sacrificing accuracy [22]. In this context, an RF classifier has been used to assess the contribution of each data source to the results [22,23].
SVM algorithms are kernel-based learning algorithms used in various applications [17]. Sigmoid, polynomial and radial basis functions are examples of kernels used to build different SVMs [24]. The performance of an SVM depends on the appropriate selection of a kernel function [25]. Accordingly, another goal of this research is to compare the RF and SVM classification algorithms.
Multi-source image fusion is the process of combining images from multiple sources into a single image that is more useful than any of the input images. It is of major importance to photogrammetry tasks in computer vision [26][27][28][29]. A detailed review can be found in [30].
Recently, a few investigations have addressed multi-source image fusion classification using machine learning. The authors of [18,31] developed a pixel-wise classification technique called machine multi-level fusion network, and the authors of [19,32] developed a fusion-based approach called 'SubFus' that can integrate remotely sensed data and ancillary datasets for land cover classification.
The current study developed a model to integrate multi-source image fusion technology with the normalized digital surface model (nDSM) and texture data using ML techniques for building detection and extraction. The grey-level co-occurrence matrix (GLCM) was used to determine which attributes are important in detecting and extracting buildings. WorldView-2 and Spot-5 images were fused using three image fusion techniques: modified-IHS, wavelet-PCA and wavelet-IHS.
To classify the dataset into buildings and non-buildings, two ML techniques, SVMs and RF, were used. To ensure the reliability of the results, the dataset was divided into 70% for training and 30% for validation. The main objectives of this study are to develop an ML model for building detection and to determine which attributes are important in detecting and extracting buildings. The performance of RF is compared with that of SVMs. The results revealed that our approach outperforms other peer methods for building detection, indicating the effectiveness of our method.

Field of Study and Data Sources
The study area (Fig. 1) is part of Greater Cairo, one of the biggest urban areas in the Middle East and Africa. The study area is located at approximately 29°54′50″N, 31°16′14″E, and its area is approximately 30 km². The experimental tests were carried out in residential urban blocks selected in Greater Cairo. The data include a representative scene with a mix of low- and high-storey buildings, a wide range of rooftop structures and small buildings. The following data sources are used.
- Cloud-free WorldView-2 panchromatic stereo imagery (Tab. 1), dated 12 Jan 2018, with 0.5-m resolution. The overlap of the images is 90%. WorldView stereo scenes are supplied with a rational polynomial coefficient (RPC) sensor model derived from orbit and attitude information.
- Thirty-two DGPS ground control points and sixteen check points obtained in the field with 10-cm accuracy in X, Y and Z. The locations of the points were selected using a stratified random method to represent the various land cover classes of the area.

Figure 2 illustrates the building detection framework, which is described in the following sections.

Image Orientation
The LPS environment [33] was used to orient the stereo images and to produce the Digital Surface Model (DSM). The process flow chart is illustrated in Figure 3.

Creating Block (.blk) Files
The process in LPS starts with creating a block project file that defines the geometric model as an RPC model. The RPC file includes the third-degree polynomial coefficients that relate image coordinates to the corresponding ground coordinates. The UTM projection and the WGS 84 datum were assigned as the horizontal and vertical reference systems of the block project.
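For reference, the general form of an RPC (rational function) model can be written as follows. The symbols are notation introduced here and do not appear in the original: $(r_n, c_n)$ are normalised image row and column, $(X_n, Y_n, Z_n)$ are normalised ground coordinates, and $a^{(k)}_{ijl}$ are the coefficients supplied in the RPC file.

$$r_n = \frac{P_1(X_n, Y_n, Z_n)}{P_2(X_n, Y_n, Z_n)}, \qquad c_n = \frac{P_3(X_n, Y_n, Z_n)}{P_4(X_n, Y_n, Z_n)}, \qquad P_k(X, Y, Z) = \sum_{\substack{i, j, l \ge 0 \\ i + j + l \le 3}} a^{(k)}_{ijl}\, X^{i} Y^{j} Z^{l}$$

Each third-degree polynomial $P_k$ has 20 coefficients.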

Setting up Internal and External Orientation
The interior and exterior orientations of the stereo pairs are recovered by extracting information from the RPC file. Interior orientation defines the internal geometry of the sensor at the time of image capture, while exterior orientation defines the position and angular orientation of the sensor that captured the image [34].

Adding GCP Points and Automated Tie Point Generation
Automatic GCP and tie point collection was used to establish the relative orientation that links the images and to obtain a stable block. It matches a point in one image with its corresponding point in the other (stereomate) image using a mathematical model (the rational function model). During image matching in LPS, a link window is established between the reference image and the neighbouring overlapping image. Because most buildings in very high-resolution images appear similar, a larger moving window should be specified for more robust matching results. In total, 150 tie points, whose ground coordinates are not known, were generated in the overlapping area. Four ground points were added to the images using the classical point measurement tool. The aerotriangulation errors were 0.15 pixels with GCPs and 0.731 pixels without GCPs.

DSM Generation
The triangulation process was run after adding the GCPs and tie points in order to check their accuracy. Thirty independent check points were used for DSM evaluation. An RMS error of 0.1 m was achieved using the RPC model.

Image Ortho-rectification
The images were rectified using the DSMs with a pixel size of 0.5 m and nearest neighbour resampling. The thirty DGPS check points were used to assess the orthoimage, resulting in RMS errors of X = 0.66, Y = 0.49 and RMST = 0.82 using RPC only.
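Assuming RMST denotes the total planimetric RMS error obtained from the X and Y components (the text does not define it explicitly), the reported values are mutually consistent:

$$\mathrm{RMS}_T = \sqrt{\mathrm{RMS}_X^{2} + \mathrm{RMS}_Y^{2}} = \sqrt{0.66^{2} + 0.49^{2}} = \sqrt{0.6757} \approx 0.82$$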

Rectification of Spot-5 Image
The Spot-5 image was rectified using 16 well-distributed GCPs obtained via DGPS. The image was projected to the UTM coordinate system using a second-order polynomial, with nearest neighbour as the resampling method. The precision was tested using 16 evenly distributed GCPs. For the second-order polynomial, the RMS error of the check points was RMST = 0.321. The Spot-5 image was co-registered to the WorldView-2 image and resampled to a pixel size of 0.5 m using nearest neighbour resampling.
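The second-order polynomial rectification described above can be sketched as a least-squares fit. This is a minimal illustration, not the software workflow used in the study: the function names, the synthetic control points and the normalised coordinate range are all assumptions made here.

```python
import numpy as np


def design_matrix(x, y):
    """Second-order polynomial terms: 1, x, y, x^2, x*y, y^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])


def fit_poly2(src_xy, dst_xy):
    """Least-squares fit of a second-order polynomial mapping src -> dst.

    Returns a (6, 2) array: one column of coefficients per output axis.
    At least 6 control points are needed; 16 gives redundancy.
    """
    src_xy = np.asarray(src_xy, dtype=float)
    dst_xy = np.asarray(dst_xy, dtype=float)
    A = design_matrix(src_xy[:, 0], src_xy[:, 1])
    coeffs, *_ = np.linalg.lstsq(A, dst_xy, rcond=None)
    return coeffs


def apply_poly2(coeffs, xy):
    """Transform points with previously fitted coefficients."""
    xy = np.asarray(xy, dtype=float)
    return design_matrix(xy[:, 0], xy[:, 1]) @ coeffs


def total_rms(pred, ref):
    """Total RMS error: sqrt(mean(dx^2 + dy^2))."""
    d = np.asarray(pred, dtype=float) - np.asarray(ref, dtype=float)
    return float(np.sqrt(np.mean(np.sum(d**2, axis=1))))


# Synthetic example (hypothetical data): 16 control points in normalised
# image coordinates and a known second-order mapping to map coordinates.
rng = np.random.default_rng(0)
gcp_img = rng.uniform(-1.0, 1.0, size=(16, 2))
true_coeffs = np.array([
    [500.0, 200.0],
    [250.0, 3.0],
    [-4.0, 250.0],
    [0.5, 0.0],
    [0.0, 0.8],
    [-0.3, 0.2],
])
gcp_map = design_matrix(gcp_img[:, 0], gcp_img[:, 1]) @ true_coeffs

coeffs = fit_poly2(gcp_img, gcp_map)
rms = total_rms(apply_poly2(coeffs, gcp_img), gcp_map)
print(f"total RMS on control points: {rms:.3e}")
```

In practice the RMS would be reported on independent check points rather than on the points used for the fit.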

Image Fusion
Three image fusion techniques (modified-IHS, wavelet-PCA and wavelet-IHS) were applied to the Spot-5 and WorldView-2 images. Colour preservation and spatial improvement were statistically evaluated for all fused images.

Spectral Evaluation
In general, the quality of an image fusion technique can be assessed by comparing the fused image with a reference image. This comparison can be done both visually and statistically [35].
The quality evaluation parameters used for image fusion are the mean, standard deviation (Std), correlation coefficient and bias of mean [36][37][38].
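As a rough illustration, the four quality parameters can be computed per band as below. The paper does not give its exact formulas, so two choices here are assumptions: the bias of mean is taken as the relative difference between the original and fused means, and the standard deviation is the population form. The function name and the toy arrays are hypothetical.

```python
import numpy as np


def fusion_quality(original, fused):
    """Spectral quality metrics of a fused band against the original band."""
    o = np.asarray(original, dtype=float).ravel()
    f = np.asarray(fused, dtype=float).ravel()
    return {
        "mean": float(f.mean()),
        "std": float(f.std()),  # population standard deviation (assumption)
        "correlation": float(np.corrcoef(o, f)[0, 1]),
        # Relative bias of mean (assumed definition): 0 means no bias.
        "bias_of_mean": float((o.mean() - f.mean()) / o.mean()),
    }


# Toy example: the fused band is the original shifted down by one grey level.
original = np.array([[10, 20], [30, 40]])
fused = original - 1
metrics = fusion_quality(original, fused)
print(metrics)
```

A constant shift leaves the correlation at 1 while producing a non-zero bias, which is why the metrics are usually read together rather than individually.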

Textural Feature Analysis Derived from Remote Sensing Data
Texture measures such as the co-occurrence matrix have shown their potential for enhancing classification performance in urban environments [29,39].
In our study, the GLCM was used to derive textural features using a small window of 3 × 3 pixels. ENVI was used to extract the GLCM. The computed texture features were mean, variance, homogeneity, correlation, contrast, second moment, dissimilarity and entropy [40].
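The eight texture measures can be illustrated for a single 3 × 3 window with plain NumPy. This is a sketch under stated assumptions and not ENVI's implementation: the horizontal offset, the symmetric normalised matrix, the two grey levels and the function names are choices made here for illustration.

```python
import numpy as np


def glcm(patch, levels, dr=0, dc=1):
    """Symmetric, normalised GLCM of an integer patch for offset (dr, dc)."""
    patch = np.asarray(patch)
    P = np.zeros((levels, levels), dtype=float)
    rows, cols = patch.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = patch[r, c], patch[r2, c2]
                P[i, j] += 1  # count the pair in both directions
                P[j, i] += 1  # so that the matrix is symmetric
    return P / P.sum()


def texture_features(P):
    """Haralick-style texture measures from a normalised symmetric GLCM."""
    i, j = np.indices(P.shape)
    mean = float((i * P).sum())
    variance = float((((i - mean) ** 2) * P).sum())
    nonzero = P[P > 0]
    features = {
        "mean": mean,
        "variance": variance,
        "homogeneity": float((P / (1.0 + (i - j) ** 2)).sum()),
        "contrast": float((((i - j) ** 2) * P).sum()),
        "dissimilarity": float((np.abs(i - j) * P).sum()),
        "entropy": float(-(nonzero * np.log(nonzero)).sum()),
        "second_moment": float((P**2).sum()),
    }
    # Correlation is undefined for a constant window; report 0 in that case.
    features["correlation"] = (
        float((((i - mean) * (j - mean)) * P).sum() / variance)
        if variance > 0
        else 0.0
    )
    return features


# A 3 x 3 window quantised to two grey levels (hypothetical values).
window = np.array([[0, 0, 1],
                   [0, 0, 1],
                   [0, 1, 1]])
P = glcm(window, levels=2)
features = texture_features(P)
print(features)
```

Applied as a moving window over every pixel, each measure yields one texture band that can be stacked with the spectral bands.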

Classification Using Machine Learning Algorithms
Various classification strategies are used for image classification in building detection, such as maximum likelihood and support vector machines [41]. In the current research, classification was performed using two machine learning techniques: RF and SVMs. The task is a binary classification problem that divides the image into two categories: buildings and non-buildings. The classes are saved as regions of interest (ROIs) in a shapefile. Each ROI has one target value (i.e. the class label) and several attributes (i.e. the features or observed variables). These techniques train a classifier using the ROIs and then use the relationships discovered during training to classify the remaining pixels. The classification aims to produce a model that predicts the target values of the test data based solely on the test data attributes. All tests were conducted using the same training sets. The training data consist of 623 and 1,833 training pixels for buildings and non-buildings, respectively. All pixels are distributed randomly. Figure 5 shows the distribution of pixels over the image. For each class, 70% of the pixels were used for training and 30% for validation. These files were used as predictor variables in the classification model. Each classification scheme was run twice, once with RF and once with SVMs.
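The per-class 70/30 division can be sketched with scikit-learn's `train_test_split` and its `stratify` option. The sketch assumes that the 623 and 1,833 labelled pixels form the pool that is split; the feature values are random placeholders and the number of features is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

n_buildings, n_non_buildings = 623, 1833
n_features = 4  # hypothetical number of predictor layers

# Placeholder feature matrix: one row per labelled pixel.
X = rng.normal(size=(n_buildings + n_non_buildings, n_features))
# Labels: 1 = building, 0 = non-building.
y = np.concatenate([
    np.ones(n_buildings, dtype=int),
    np.zeros(n_non_buildings, dtype=int),
])

# stratify=y keeps the building / non-building ratio the same in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(len(y_train), len(y_val))
print(round(y_train.mean(), 3), round(y_val.mean(), 3))
```

Stratifying matters here because the classes are imbalanced, with roughly one building pixel for every three non-building pixels.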

Classification Using Random Forest
RF is a meta-estimator that fits a number of decision tree classifiers on different sub-samples of the dataset and then averages them to enhance predictive accuracy and control over-fitting. In the scikit-learn implementation, the classifiers are combined by averaging their probabilistic predictions [42]. Cross-validation with a grid search was used to select the parameters that best generalised the data, namely the best number of trees and the best number of maximum features. The most important parameters are n_estimators and max_features. n_estimators is the number of trees; a larger number generally gives better results, at the cost of longer computation and with diminishing returns beyond a certain number of trees.
max_features is the size of the random subset of features considered when splitting a node. Either all input features or a random subset of size max_features is used to find the optimal split.
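A cross-validated search over `n_estimators` and `max_features` might look like the following. It is a sketch only: the synthetic data from `make_classification` stands in for the pixel features, and the grid values, the 5-fold setting and the random seeds are assumptions, not the values used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the labelled pixels (hypothetical data).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Candidate values for the two parameters discussed in the text.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", None],  # None means all features at each split
}

rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
rf_search.fit(X_train, y_train)

rf_val_accuracy = rf_search.score(X_val, y_val)
print(rf_search.best_params_, round(rf_val_accuracy, 3))
```

After fitting, `rf_search.best_estimator_.feature_importances_` gives a per-feature importance ranking, which is one way to judge the contribution of each input layer.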

Classification Using SVMs
SVMs are a set of powerful models for regression, classification and outlier detection [40]. An SVM with a Gaussian radial basis function (RBF) kernel was used because it has proved effective in remote sensing applications [43]. Cross-validation was used to find the optimal parameters C and gamma, which were then used to train the model. Finally, the model was evaluated on the validation data.
Parameter C is the penalty parameter of the error term. A high C aims to classify all training examples correctly, at the cost of a more complex decision surface. The selected C was the value that gave the best results on the hold-out data.
The parameter gamma is the kernel coefficient. It controls how far the influence of a single training example reaches: with a low value, points far from the decision boundary are also taken into account, whereas with a high value only nearby points are considered. The selected gamma was the value that gave the best results on the hold-out data.
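The search for C and gamma can be sketched in the same way as for RF. The synthetic data, the grid values and the feature standardisation step are assumptions made here; the paper does not state whether the inputs were scaled.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the labelled pixels (hypothetical data).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Scaling is an assumption: RBF kernels are sensitive to feature ranges.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Logarithmic grid over the two RBF parameters (assumed values).
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

svm_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
svm_search.fit(X_train, y_train)

svm_val_accuracy = svm_search.score(X_val, y_val)
print(svm_search.best_params_, round(svm_val_accuracy, 3))
```

Placing the scaler inside the pipeline means it is refitted on each training fold, so the validation folds do not leak into the scaling statistics.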
The radial basis function (RBF) kernel K(x, y) is given by:

$$K(x, y) = \exp\left(-\gamma \lVert x - y \rVert^{2}\right) \qquad (1)$$

Evaluation of the Classification
The sklearn.metrics module implements several score and utility functions for evaluating classification performance, including the kappa index, overall accuracy (OA) and the F1 measure. OA is the number of correctly classified pixels divided by the total number of pixels [44]. The kappa index is a discrete multivariate analysis tool for assessing the accuracy of classification maps. The kappa coefficient is derived from the error matrix and is used to evaluate how well the classification performs in comparison with the reference data. The kappa statistic indicates whether a classification based on remotely sensed imagery is better than a random one.
The kappa coefficient of agreement is based on the difference between the actual agreement (the main diagonal total of the matrix) and the chance agreement (derived from the row and column totals).
Because it takes into account all elements of the confusion matrix, the kappa coefficient was recommended. The kappa can be defined as:

$$K = \frac{\text{observed accuracy} - \text{chance agreement}}{1 - \text{chance agreement}} \qquad (2)$$

and it is computed as:

$$K = \frac{N \sum_{i=1}^{r} X_{ii} - \sum_{i=1}^{r} X_{it} X_{ti}}{N^{2} - \sum_{i=1}^{r} X_{it} X_{ti}} \qquad (3)$$

where: $r$ is the number of rows in the error matrix, $X_{ii}$ is the number of observations in row $i$ and column $i$ on the major diagonal, $X_{it}$ is the total of observations in row $i$ (shown as the marginal total of the matrix), $X_{ti}$ is the total of observations in column $i$ (shown as the marginal total at the bottom of the matrix), and $N$ is the total number of observations included in the matrix [45].
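The computation of overall accuracy and kappa from an error matrix can be sketched in plain Python. The function names and the 2 × 2 example matrix are hypothetical; with scikit-learn, `sklearn.metrics.cohen_kappa_score` computes the same statistic directly from label vectors.

```python
def overall_accuracy(matrix):
    """Correctly classified observations divided by all observations."""
    n = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    return diagonal / n


def kappa(matrix):
    """Kappa coefficient from a square error (confusion) matrix."""
    r = len(matrix)
    n = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(r))
    row_totals = [sum(matrix[i]) for i in range(r)]
    col_totals = [sum(matrix[i][j] for i in range(r)) for j in range(r)]
    # Sum over classes of (row total * column total): the chance term.
    chance = sum(row_totals[i] * col_totals[i] for i in range(r))
    return (n * diagonal - chance) / (n * n - chance)


# Hypothetical error matrix: rows = reference, columns = classified,
# class order = [buildings, non-buildings].
error_matrix = [
    [45, 5],
    [10, 40],
]

oa = overall_accuracy(error_matrix)
k = kappa(error_matrix)
print(oa, k)
```

For this matrix the overall accuracy is 0.85 while kappa is 0.70, which shows how kappa discounts the agreement expected by chance.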
The use of F1 (a combination of precision and recall) is a main difference between traditional remote sensing (RS) and deep learning RS [46]. In most cases, neither precision nor recall alone can provide a comprehensive assessment of classification performance. Precision is the proportion of identified positives that are actually positive, whereas recall is the proportion of actual positives that are correctly identified. The F1-score takes into account both the user's accuracy (UA, i.e. precision) and the producer's accuracy (PA, i.e. recall) [47]:

$$F_1 = \frac{2 \times \mathrm{UA} \times \mathrm{PA}}{\mathrm{UA} + \mathrm{PA}} \qquad (4)$$
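Precision, recall and F1 for the building class can be computed from three counts, as in this minimal sketch. The function name and the counts are hypothetical and are chosen only to illustrate the arithmetic.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision (UA), recall (PA) and F1 for the positive class.

    tp: building pixels classified as buildings
    fp: non-building pixels classified as buildings
    fn: building pixels classified as non-buildings
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the building class.
precision, recall, f1 = precision_recall_f1(tp=45, fp=10, fn=5)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is preferred over either measure alone.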

Results and Discussion
In this research, a new building detection model was designed and developed with the aim of reducing processing time and space complexity and of giving a better recognition rate than existing techniques.
The Spot-5 image was co-registered to the WorldView-2 image and resampled to a pixel size of 0.5 m using nearest neighbour resampling. A DEM was generated using a ground filtering technique, and the nDSM was then calculated by subtracting the DEM from the DSM. Thirty check points, free of trees, were used for DSM evaluation. The resulting RMS error was 0.1 m using RPC only.
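The nDSM step amounts to a per-pixel subtraction, sketched below on a toy grid. The function name and the elevation values are hypothetical, and clamping small negative differences to zero is an assumption added here to absorb noise in the two surfaces; the paper only states that the DEM is subtracted from the DSM.

```python
import numpy as np


def compute_ndsm(dsm, dem, min_height=0.0):
    """Height above ground: DSM minus DEM, clamped at min_height."""
    dsm = np.asarray(dsm, dtype=float)
    dem = np.asarray(dem, dtype=float)
    if dsm.shape != dem.shape:
        raise ValueError("DSM and DEM must be co-registered on the same grid")
    # Clamping is an assumption: negative heights are treated as ground.
    return np.maximum(dsm - dem, min_height)


# Toy 2 x 2 grids of elevations in metres (hypothetical values).
dsm = np.array([[35.0, 47.5],
                [30.2, 29.9]])
dem = np.array([[30.0, 30.5],
                [30.2, 30.0]])

ndsm = compute_ndsm(dsm, dem)
print(ndsm)
```

In the resulting layer, buildings and other elevated objects stand out as positive heights while ground pixels are close to zero.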

The Importance of Different Features
In this study, we hypothesized that attributes derived from the GLCM can be used to improve building detection. For GLCM construction and texture calculation, a small window of 3 × 3 pixels was used in ENVI 5.1. The results revealed that GLCM variance and contrast were the best Haralick textural features for discriminating buildings, and GLCM correlation was the least effective one.

An Assessment of Image Fusion
In this study, the multispectral Spot-5 image (10-m resolution) and the WorldView-2 panchromatic image (0.5-m resolution) were fused to improve the information content of the images and to increase the reliability of the results. The two images were co-registered and resampled to 0.5-m resolution.
The results show that image fusion enhanced the resolution. The three fused images gave better interpretation than the multispectral image. Several indices were used to assess the performance of fusion. In this study, the spectral quality of the fused images was assessed by the standard deviation, mean, correlation coefficient and bias of mean. Table 3 gives the quality metrics of the fused images. The mean of the original image was 58.415. The average means of all bands of the fused images using wavelet-PCA, modified-IHS and wavelet-IHS were 58.148, 58.070 and 58.070, respectively. The standard deviation of the original image was 16.199, and the average standard deviations of all bands of the fused images using wavelet-PCA, modified-IHS and wavelet-IHS were 16.111, 16.287 and 16.287, respectively. It is concluded that wavelet-PCA gave the least pixel errors and that all three fusion methods largely preserved the spectral quality of the original imagery. The correlation coefficient was computed for the three fusion methods. The average correlation coefficient of all methods was 1.00, which indicates that the fused image is similar to the original MS image and no spatial information is put in the fused image from the panchromatic image.

Comparison of Classification Results from Different Schemes
Classification was executed using RF and SVMs. Nine schemes were tested for image classification, and the impact of the variables was assessed. Two class labels were investigated: buildings and non-buildings. In the first step, Spot-5 and WorldView-2 were fused (e.g. using the wavelet-PCA method) and the resulting image was used as the predictor variable for the classification process (first scenario). In the second step, eight texture attributes were generated from the fused image using ENVI to form the GLCM image, and the generated attributes were used as predictor variables for the classification process (second scenario). In the third step, the nDSM was layer-stacked with the fused image to form a new image that was used as the predictor variable for the classification process (third scenario). Finally, the classifications were compared with each other based on the kappa index, overall accuracy and F1 measure. All these steps were repeated for each fusion technique, once using RF and once using SVMs. Tables 4 and 5 compare the trained model results for the different fusion methods using RF and SVM classification, respectively. Figures 6 and 7 give a visual comparison of the classification maps for the different fusion methods using RF and SVMs, respectively.

SVMs were insensitive to the distribution of the underlying data [17]. They were able to handle high dimensionality and limited training samples, and they generalized well even with small datasets [17]. A further benefit is that no prior knowledge of the underlying data is required. On the other hand, SVMs have a number of limitations [48]: they require a good kernel function, and the outcomes lack transparency. The advantages of RF were ease of use, fewer parameters and no need for intervention [49][50][51]. The main drawback of RF was that a large number of trees could make it too slow.
Our results indicated that RF may outperform SVMs for image classification, in agreement with [52,53].
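The layer stacking behind the three scenarios can be sketched as follows. The array sizes, the number of fused bands and the helper name are hypothetical, and whether the texture scenario also keeps the fused bands is an assumption made here for illustration.

```python
import numpy as np


def stack_features(*layers):
    """Stack 2-D or 3-D raster layers into an (n_pixels, n_features) matrix."""
    arrays = []
    for layer in layers:
        a = np.asarray(layer, dtype=float)
        if a.ndim == 2:  # single-band layer: add a band axis
            a = a[:, :, np.newaxis]
        arrays.append(a)
    cube = np.concatenate(arrays, axis=2)  # rows x cols x all bands
    return cube.reshape(-1, cube.shape[2])


rows, cols = 4, 5
rng = np.random.default_rng(0)

fused = rng.random((rows, cols, 3))    # fused image, 3 bands (hypothetical)
texture = rng.random((rows, cols, 8))  # eight GLCM attributes
ndsm = rng.random((rows, cols))        # single-band height layer

X_fused = stack_features(fused)                   # scenario 1
X_texture = stack_features(fused, texture)        # scenario 2 (assumed stack)
X_height = stack_features(fused, ndsm)            # scenario 3

print(X_fused.shape, X_texture.shape, X_height.shape)
```

Each row of the resulting matrix is one pixel, so the same matrix layout can be passed to either classifier and the predictions reshaped back to `(rows, cols)` to form a classification map.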
It is clear that when textural features of colour images were used as input to the classifiers, the overall accuracy, F1 measure and kappa improved in most instances. The next largest improvement came from integrating the nDSM as a layer with the fused image, compared with using the fused image only. The results demonstrated that the difficulty in classifying spectrally complex urban settings can be reduced by incorporating textural features.

Fig. 6. Classification maps of different schemes for urban detection using RF. The yellow colour represents buildings and the blue colour represents non-buildings.

Fig. 7. Classification maps of different schemes for urban detection using SVMs. The yellow colour represents buildings and the blue colour represents non-buildings.

Using the fused images only, the overall accuracy (OA) reached 0.944, 0.954 and 0.940 for wavelet-IHS, wavelet-PCA and modified-IHS, respectively, using RF classification, and 0.954, 0.965 and 0.944 using SVM classification.
Using textural features with the fused images improved the accuracy of classification: the OA reached 0.977, 0.972 and 0.977 for wavelet-IHS, wavelet-PCA and modified-IHS, respectively, using RF classification, and 0.968, 0.973 and 0.967 using SVM classification.
Using the fused image plus the nDSM gave an OA of 0.961, 0.967 and 0.958 for wavelet-IHS, wavelet-PCA and modified-IHS, respectively, using RF classification, and 0.965, 0.973 and 0.958 using SVM classification, which is also an improvement over using the fused image only.
Based on the visual evaluation of the results, we can state that almost all buildings were delineated successfully if they were larger than the expected minimum building area (30 and 50 m²).
In these experiments, the error was mainly caused by merchant ships, because the classification of ships depends on size and area, which may be close to those of a building. Based on area and size, the classifier mistakenly categorizes the ships as buildings.

Conclusion
The contribution of this research can be summarized as the development of a machine learning-based image fusion framework for building detection, together with its evaluation criteria. The model addressed the building detection problem associated with the irregular structure and closeness of buildings in urban areas. Three image fusion techniques were used to fuse the Spot-5 and WorldView-2 images. The three fused images gave better interpretation than the multispectral images. It is concluded that wavelet-PCA gave the least pixel errors and that all three fusion methods largely preserved the spectral quality of the original image.
To sum up, first, the data labelling approach has an impact on the classification results. Second, our outcomes outperformed other peer methods. Third, a model to integrate spectral, height and textural information was proposed. Textural features enhanced the accuracy of classification: the overall accuracy reached 0.977, 0.972 and 0.977 for wavelet-IHS, wavelet-PCA and modified-IHS, respectively, using random forest classification, and 0.968, 0.973 and 0.967 using support vector machine classification. Fourth, based on the results, random forest gave better results than support vector machines. Fifth, the results obtained in this study indicated that the developed procedure can be used effectively to delineate the boundaries of rectangular buildings, irregularly shaped buildings and very small buildings with reliable accuracy. Sixth, the use of a classifier ensemble is recommended for building extraction. Finally, the proposed method is valuable and effective for preventing and tracking unorganized urbanization and can be used with other images.
Future work includes applying the developed model to other images in different locations.