Detection of Outlier Observations in Piezometric Measurements: A Case Study in the Southern Region of Poland

One of the main modes of monitoring the geotechnical conditions of earth dams is piezometric measurement, which records the water level in an open piezometer or the water pressure in a closed piezometer. During piezometric measurements, various factors can disturb the readings; the resulting errors may be systematic, accidental, or coarse (obvious mistakes). Before measurements from open or closed piezometers are analyzed, outliers due to coarse errors should be detected and rejected, because such observations may significantly influence the result of the analysis and lead to an erroneous assessment and interpretation of the phenomenon studied. To do this, statistical tests must be applied so that a doubtful measurement can be accepted or rejected at the assumed significance level. This paper uses five statistical tests for identifying and rejecting outliers: the Q-Dixon test, the Grubbs test, the Hampel test, the Iglewicz and Hoaglin test, and the Rosner test. The aim of this article is to identify the most suitable test for periodic piezometric measurements. The scope of the study includes the analysis of piezometric measurements for the Czaniec Dam for the multi-year period 2017–2020.


Introduction
The construction of dams and the existence of the reservoirs they create are an essential part of society. Reservoirs make it possible to store water when it is in excess and to use it when it is scarce. Dams, on the other hand, make it possible to prevent floods by modifying the course of a flood wave and, to a large extent, lowering its peak. Hydroelectric dams operate under varying meteorological and hydrological conditions. Such facilities are constantly exposed to intense precipitation, floods, as well as landslides, lightning, and ice phenomena. The vulnerability of hydroelectric dams to damage or disaster increases with the time of operation. This concerns about 30% of the Polish damming facilities that have been operating for more than 50 years. Such a long period of operation, according to the assessment of the International Commission on Large Dams (ICOLD), results in a higher number of damage events and an increased probability of failure [1].
Embankment dams are the most common type of dam built today. They are constructed from natural earthen materials, usually local soil and rock [2]. In embankment dams, water seeps through the soil layers of the dam, and any change in this behavior may be an indication of emerging problems. Special attention should be paid to the safety of dams, since the number of disasters and major dam failures is steadily increasing [3]. In the case of earth and embankment dams, the most common cause of disasters was overtopping (31% of cases as the main cause and 18% as an additional cause), followed by internal erosion of the dam body (15% of cases as the main cause and 13% as an additional cause) and foundation defects, including settlement and slope instability (12% of cases as the main cause and 5% as an additional cause) [4][5][6][7][8][9]. According to ICOLD, in 15% of all cases, suffusion is the main cause of earth dam failure, while in 13% of cases this process is an additional cause [1].
The primary form of dam monitoring is by means of piezometric measurements [10]. These measurements make it possible to measure the level of the water table in an open piezometer or to measure the water pressure in a closed piezometer [11]. These measurements allow monitoring of the intensity of the seepage phenomenon through the dam [12,13]. An increasing (or decreasing) trend in piezometers can indicate the movement of fine particles in the body or subsoil of the structure, which over time can lead to a local exceedance of the permissible seepage gradients, leading to a situation that poses a threat to the safety of dam operation [14][15][16]. With systematic measurements, a possible catastrophe can be effectively prevented by activating warning or alarm systems, as well as planning the upgrade of the facility far in advance [17].
During piezometer measurements, some factors may disturb the results obtained. The resulting errors can be systematic, accidental, or coarse (obvious mistakes). Before such data are analyzed, coarse errors that may significantly affect the result and cause a false assessment or interpretation of the phenomenon studied should be detected and removed [40]. It is also worth mentioning the difference between an outlier and genuinely unusual geotechnical behavior. Geotechnical measurements are inevitably subject to various uncertainties [41][42][43][44][45]. The probability distributions of specific geotechnical parameters depend significantly on the quality of the measurements obtained, which are affected by measurement errors, changing measurement conditions (e.g., severe weather or icing), or other unknown environmental disturbances [46][47][48][49]. Estimated statistics of geotechnical parameters may be subject to high statistical uncertainty, and therefore it would be advisable not only to detect and remove outliers, but also to try to detect the components that give rise to unusual geotechnical behavior, for which additional in situ and/or laboratory tests may be necessary [50,51]. For this purpose, for example, a probabilistic outlier detection method can be applied to sparse multivariate data obtained during geotechnical investigations [52].
Diagnostics and monitoring of a structure make it possible to determine the technical condition of a hydrotechnical object, which is especially useful in the assessment of water dam structures [53]. This involves not only the adoption of a safety coefficient that guarantees the integrity and stability of the structure, but is also an essential component of assessing the risk of catastrophe caused by dam failure, where risk is understood as the product of the probability of dam failure and the human and material losses caused downstream of the structure by its sudden failure [54]. Hydrotechnical structures have large volumes and are exposed to continuous contact with water, usually surface water. The function of the water dam is of particular importance, since seepage has a strong influence on the object and the ground [55]. In addition to seepage, the condition of a dam is affected by contact with flowing water, which can result in erosion or siltation [56].
The principles on which the monitoring of hydraulic structures is based are contained in documents [57][58][59]. This paper uses five statistical tests for identifying and rejecting outliers: the Q-Dixon test, the Grubbs test, the Hampel test, the Iglewicz and Hoaglin test, and the Rosner test. The aim of this paper is to propose the most suitable test for determining outlier observations for periodic piezometric measurements. The scope of the study includes the analysis of piezometric measurements for the Czaniec Dam for the multi-year period 2017–2020.
Section 2 presents the scope of the work, which includes the analysis of piezometric measurements for the Czaniec Dam over the multi-year period 2017–2020, and describes all statistical tests used to detect outlier observations. Section 3 briefly describes the object studied, the Czaniec Dam. The area of the reservoir and its basic functions are defined. For the dam studied, its location and essential elements are presented, such as the height of the dam, the length and width of the crest, and the downstream and upstream slopes and their protection. Section 4 tabulates the number of outlier results for all the statistical tests performed and discusses them. Section 5 describes the main conclusions of the study.

Materials and Methods
For the Czaniec Dam, located in the Silesian Province, changes in water table measurements were analyzed for open piezometers within the front dam and side dams (64 piezometers in total), open piezometers at the dike (8 piezometers), and wells (4 piezometers), covering the period from 17.01.2017 to 23.12.2020. Measurements were almost always made twice a month, which resulted in 95 measurement results for a single piezometer during the analysis period. Piezometric data were provided by the Regional Water Management Board in Krakow.
In this study, five statistical tests were used to identify and reject outliers: the Q-Dixon test, the Grubbs test, as well as the Hampel test, the Iglewicz and Hoaglin test, and the Rosner test.
The Q-Dixon test is used to check whether a particular data set contains a result that is subject to coarse error. Whether the test can be applied depends on the size of the data set. Two variants of the test were used in this study: the N9 test for single outliers and the N13 test for pairs of outliers. The Q-Dixon test can be used to reject only a single outlier or a pair of outliers from a data set [60]. Note that before performing the test, the set of piezometric measurements should be arranged in a non-decreasing sequence. Table 1 shows the pairs of hypotheses tested in each variant of the test.
To reject the hypothesis of the absence of an outlier (variant N9) or a pair of outliers (variant N13), the value of the Q statistic is compared with the value read from the table of critical values Q n of the Q-Dixon test in the variant N9 or N13 at the significance level α [62].
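As a rough illustration of the procedure described above, the following Python sketch computes the Q statistic for both variants. It assumes the ratio forms usually associated with these labels in the Barnett and Lewis numbering (N9 as the ratio often written r11, N13 as r22); this correspondence is an assumption, not something stated in this section. The function name `dixon_q` and its parameters are hypothetical, and the critical value must still be read from the tables cited in the text.

```python
def dixon_q(values, variant="N9", tail="upper"):
    """Return the Dixon Q ratio for the upper or lower end of a sample.

    Assumed forms (not confirmed by the text):
      N9  (r11): (x(n) - x(n-1)) / (x(n) - x(2))
      N13 (r22): (x(n) - x(n-2)) / (x(n) - x(3))
    """
    x = sorted(values)  # non-decreasing order, as the test requires
    if tail == "lower":
        # Mirror the sample so the lower tail is handled like the upper one
        x = [-v for v in reversed(x)]
    elif tail != "upper":
        raise ValueError("tail must be 'upper' or 'lower'")

    if variant == "N9":
        if len(x) < 3:
            raise ValueError("N9 needs at least 3 observations")
        numerator, denominator = x[-1] - x[-2], x[-1] - x[1]
    elif variant == "N13":
        if len(x) < 4:
            raise ValueError("N13 needs at least 4 observations")
        numerator, denominator = x[-1] - x[-3], x[-1] - x[2]
    else:
        raise ValueError("variant must be 'N9' or 'N13'")

    if denominator == 0:
        raise ValueError("degenerate sample: denominator is zero")
    return numerator / denominator
```

The value returned would then be compared with the tabulated critical value for the chosen variant, sample size, and significance level; if Q exceeds it, the hypothesis of no outlier (or no outlier pair) is rejected.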
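Table 1. Pairs of hypotheses tested in each variant of the Q-Dixon test. Source: [61]
Variant N9, testing the largest value: null hypothesis ($H_0$): $x_{(n)}$ is not an outlier; alternative hypothesis ($H_1$): $x_{(n)}$ is an outlier.
Variant N9, testing the smallest value: null hypothesis ($H_0$): $x_{(1)}$ is not an outlier; alternative hypothesis ($H_1$): $x_{(1)}$ is an outlier.
Variant N13, testing the pair of largest values: null hypothesis ($H_0$): the pair ($x_{(n-1)}$, $x_{(n)}$) is not an outlier pair; alternative hypothesis ($H_1$): the pair ($x_{(n-1)}$, $x_{(n)}$) is a pair of outliers.
Variant N13, testing the pair of smallest values: null hypothesis ($H_0$): the pair ($x_{(1)}$, $x_{(2)}$) is not an outlier pair; alternative hypothesis ($H_1$): the pair ($x_{(1)}$, $x_{(2)}$) is a pair of outliers.

Before performing the Grubbs test, the set of experimental results should, as in the case of the Q-Dixon test, be ranked in a non-decreasing sequence. The coarse error may be the largest ($x_{\max}$) or the smallest ($x_{\min}$) value in the sample under analysis. Like the Q-Dixon test, the Grubbs test can detect only one outlier at a time, so it should be repeated until no further outliers are observed in the data set [63,64]. The value of the test statistic $G_p$ for the Grubbs test can be calculated using the formula:
$$G_p = \frac{|x_p - \bar{x}|}{s}$$
where: $x_p$ is the tested extreme value ($x_{\max}$ or $x_{\min}$), $\bar{x}$ is the mean value of the series of measurements tested, and $s$ is the standard deviation.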
The critical value of the Grubbs test statistic for the assumed significance level α can be calculated from the following formula [62]:
$$G_{kr} = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^{2}_{\alpha/(2n),\,n-2}}{n - 2 + t^{2}_{\alpha/(2n),\,n-2}}}$$
Thus, the value of $G_{kr}$ is calculated from the critical value of the Student's t distribution for a significance level of α/(2n) and a number of degrees of freedom equal to n − 2, where n is the number of piezometric measurements in the series.
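The Grubbs calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the sample standard deviation (n − 1 denominator) is assumed for s, and the Student's t critical value is passed in by the caller (from tables or a statistics library) rather than computed here.

```python
import math
import statistics


def grubbs_statistic(values):
    """Return (G, suspect): the largest studentized deviation and its value."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # assumption: sample standard deviation
    # The suspect is the observation farthest from the mean (x_max or x_min)
    suspect = max(values, key=lambda v: abs(v - mean))
    return abs(suspect - mean) / s, suspect


def grubbs_critical(n, t_crit):
    """Critical value G_kr from a Student's t critical value.

    t_crit is the critical value of t for significance level alpha/(2n)
    and n - 2 degrees of freedom, supplied by the caller.
    """
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))


def grubbs_is_outlier(values, t_crit):
    """Single pass of the test: True if the suspect exceeds G_kr."""
    g, suspect = grubbs_statistic(values)
    return g > grubbs_critical(len(values), t_crit), suspect
```

For example, for n = 10 and α = 0.05 the tabulated t value for α/(2n) = 0.0025 and 8 degrees of freedom is about 3.833, which gives a critical value of roughly 2.29. Because the test rejects one value at a time, the pass would be repeated on the reduced data set until no further outlier is found.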
Another test used is the Hampel test. An advantage of this test is its simplicity, and there is no limitation on the size of the set tested. The Hampel test is used to detect results in the analyzed data set that deviate significantly from the average values. This test also tends to generate a significant number of errors [60]. Inference about the nature of the observation under study is based on the results of calculations made with specific formulas. To perform the Hampel test, it is necessary to calculate the median value $M_e$, then the deviations $r_i$ from the median, their absolute values $|r_i|$, and the median of these absolute deviations.
The Iglewicz and Hoaglin test is based on the modified Z-score:
$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{MAD}$$
where: $\tilde{x}$ is the median of the piezometric data set and MAD is the median absolute deviation, calculated as:
$$MAD = \operatorname{median}\left(|x_i - \tilde{x}|\right)$$
where $|x|$ is the absolute value of $x$.
The authors of the test recommend that an observation for which the absolute value of $M_i$ is greater than 3.5 be considered an outlier.
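Both median-based tests can be sketched in a few lines of Python. The sketch below rests on assumptions that this section does not state: the Hampel rule is taken in its commonly quoted form, flagging an observation when its absolute deviation from the median exceeds 4.5 times the median of the absolute deviations, and the modified Z-score uses the usual constant 0.6745. The function names are hypothetical.

```python
import statistics


def hampel_outliers(values, k=4.5):
    """Flag values whose |r_i| exceeds k times the median of |r_i|.

    k = 4.5 is an assumed, commonly quoted threshold.
    """
    med = statistics.median(values)
    abs_dev = [abs(v - med) for v in values]
    med_abs_dev = statistics.median(abs_dev)
    return [v for v, d in zip(values, abs_dev) if d > k * med_abs_dev]


def modified_z_outliers(values, threshold=3.5):
    """Flag values whose modified Z-score |M_i| exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values equal the median: M_i is undefined
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]
```

Neither function needs the data to be sorted or normally distributed, which is part of the simplicity noted above; the zero-MAD guard marks the degenerate case in which the score cannot be computed.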
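The last statistical test used is the Rosner test [65]. In this test, a series of test statistics is computed: at each step, the measurement that is farthest from the mean is removed and the test statistic is recalculated according to the following equation [62]:
$$R_{i+1} = \frac{\left|x^{(i)} - \bar{x}^{(i)}\right|}{s^{(i)}}$$
where: $\bar{x}^{(i)}$ and $s^{(i)}$ are the mean and the standard deviation of the observations remaining after the $i$ most extreme values have been removed, and $x^{(i)}$ is the remaining value most distant from $\bar{x}^{(i)}$.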
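The iterative scheme described above can be sketched in Python as follows. This is an illustration under assumptions, not the procedure used in the study: the function names are hypothetical, the sample standard deviation is assumed, the critical values are supplied by the caller (one per step, from tables or a statistics library), and the decision rule is the one usually given for the generalized ESD test, in which the number of outliers is the largest step index whose statistic exceeds its critical value.

```python
import statistics


def rosner_statistics(values, max_outliers):
    """Return [(R, removed_value), ...] for each of the max_outliers steps."""
    if len(values) - max_outliers < 2:
        raise ValueError("too few observations for the requested number of steps")
    remaining = list(values)
    results = []
    for _ in range(max_outliers):
        mean = statistics.mean(remaining)
        s = statistics.stdev(remaining)  # assumption: sample standard deviation
        # Value farthest from the mean of the observations still in the set
        suspect = max(remaining, key=lambda v: abs(v - mean))
        results.append((abs(suspect - mean) / s, suspect))
        remaining.remove(suspect)
    return results


def rosner_outliers(values, max_outliers, critical_values):
    """Return the removed values up to the last step where R exceeds its critical value."""
    steps = rosner_statistics(values, max_outliers)
    count = 0
    for i, ((r, _), crit) in enumerate(zip(steps, critical_values), start=1):
        if r > crit:
            count = i  # keep the largest step index that is significant
    return [value for _, value in steps[:count]]
```

Taking the largest significant step, rather than stopping at the first non-significant one, is what lets the procedure report several outliers in a single run even when an earlier step is masked.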

Case Study
Czaniecki Reservoir is located in the municipality of Porąbka, Bielsko-Biała county, Silesian voivodeship (Fig. 1). Figure 2 presents a situation plan for the Czaniec Dam (source: [66]). The reservoir is located at km 28.8 of the Soła River. Its area is 43 ha. The main task of the Czaniecki reservoir is to equalize the daily flow of the Soła River, as well as to enable water intake for users downstream of the reservoir.
The Czaniec compensating reservoir is a dual-purpose reservoir because it works together with the retention reservoirs (Tresna and Porąbka) and the peak power plants located next to them. Optimal efficiency of the power plants during the highest daily energy demand requires full power operation. This is ensured by the Czaniec compensating reservoir, whose usable volume is equal to the volume of water used during peak demand. The second task of the reservoir is to discharge the accumulated water evenly into the river, so that the maximum operation of the power plants does not produce outflow waves in the watercourse that have the character of flood waves. The water levels in the equalization reservoir change depending on the operation of the power plant, and the highest levels are recorded at the end of operation during peak periods.
The main dam is divided by a dike into two parts, and its extensions are side dams. The right part of the main dam is 300 m long, while the left part, due to the emergency passage, consists of two sections with a total length of 248 m. The dam crest is 7 m wide and is located at an elevation of 299.50 m above sea level. The inclination of the downstream slope is 1 : 2.5, while that of the upstream slope is 1 : 2. In the right section, over a length of 15 m from the abutment, the crest was widened from 7 m to 12 m, which allows easy maneuvering of vehicles that deliver equipment to the weir. Both the dam crest and the downstream slope are exposed to destructive weathering. In turn, the upstream slope and sometimes the lower area of the downstream slope are exposed to wave action. These surfaces require appropriate reinforcement [8]. The upstream slope is protected along its entire length with a screen made of reinforced concrete slabs. The bottom edge of the reinforced concrete slabs on the upstream slope is fixed in an apron made of clay and located at an elevation of 294 m above sea level. Both the clay apron and the gravel cover are 0.5 m thick. The width of the apron is constant and equals 20 m, except for the section of the old Soła riverbed, where it increases to 25 m.
The drainage of the dam is made in the lower part of the vent layer in the form of a trapezoidal prism of stone and gravel. Drainage is an element that enables the intake and discharge of filtration water and groundwater from the protection zone, reducing the zone's waterlogging and the load caused by water pressure or filtration. The drain is equipped with a reverse filter, which prevents the seeping water from washing out the soil particles. The material used for drainage is characterized by high strength and resistance to frost. The dike was made in the dike, the crest of which has a width of 1 m and is located at an elevation of 299.00 m above sea level. The dike formed in this way will easily be washed out if the water level in the reservoir exceeds 299.00 m above sea level. To prevent this, the bench and spillway scarp were reinforced with 15 cm thick cobbles. The overflow drainage slope is covered with cobbles, on which a clay-covered screen was laid. The layers made in this way seal the whole body against the overflow. The downstream slope is made in the same way as the upstream slope of the main dam and is grassed over on a layer of humus. The overtopping body has a fixed sill to reduce the level of scour and is made of a steel wall with a reinforced concrete ring, a wooden palisade, and stone insulation. The river abutments are angular reinforced concrete retaining walls 6.5 m high. The 15 m wide overbank floodplain towards the reservoir has been protected with concrete slabs [66].
The Czaniec Dam, as well as other facilities connected with it, is equipped with control and measuring devices used to check the compliance of construction works with the design documentation. The equipment is distributed in a network, and the type and number of devices and the location of the measurement points must be selected in such a way that it is possible to trace the intensity of phenomena, probable safety hazards, and the durability of the structure.

Results and Discussion
The results obtained are summarized in Table 2. Furthermore, Figure 3 shows an example graph of water level changes in the P10B piezometer (with a clearly visible outlier) at the Czaniec Dam before identification and removal of outliers, while Figure 4 shows the graph of water level changes in the same piezometer after identification and removal of outliers.
Figure 5 shows a graph of changes in water level in piezometer P10A, for which the Hampel test revealed 20 outlier observations (Table 2). The graph shows that this piezometer reacts to changes of water level in the reservoir (seasonality), and it can be stated with confidence that the data set does not contain that many outliers. The large number of outliers detected by the Hampel test (for example, for piezometer P10A or P2) is related to the design of this test. When the data set contains many observations with the same or similar values, the deviations $r_i$ from the median are equal to 0 for most observations. The median of the set of absolute deviations is then also equal to 0, so any observation whose absolute deviation $|r_i|$ is greater than 0 is treated as an outlier.
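The degenerate case described above can be reproduced with a small Python sketch. The readings are hypothetical and are not taken from Table 2, and the threshold of 4.5 times the median of the absolute deviations is an assumed, commonly quoted form of the Hampel rule; the function name is also hypothetical.

```python
import statistics


def hampel_flags(values, k=4.5):
    """Return (median of |r_i|, flagged values) for the assumed Hampel rule."""
    med = statistics.median(values)
    abs_dev = [abs(v - med) for v in values]
    med_abs_dev = statistics.median(abs_dev)
    flagged = [v for v, d in zip(values, abs_dev) if d > k * med_abs_dev]
    return med_abs_dev, flagged


# Hypothetical water levels (m a.s.l.): six identical readings and three
# that differ by only a few centimetres
levels = [294.10] * 6 + [294.12, 294.15, 294.11]

med_abs_dev, flagged = hampel_flags(levels)
# More than half of the readings equal the median, so the median of the
# absolute deviations is 0 and every reading that differs from the median
# at all is flagged, however small the difference
```

With these readings the threshold collapses to zero, so all three slightly different values are flagged even though none of them looks like a coarse error.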
The Iglewicz and Hoaglin test behaves similarly and shows a similar number of outliers detected; therefore, it can be concluded that both the Hampel test and the Iglewicz and Hoaglin test are not suitable for detecting outliers in piezometric measurements as they are too rigid.
In Table 2, for the P10B piezometer, it is easy to see that the Q-Dixon N13 test detected two outliers, while all others detected only one. These results illustrate the important differences between the Q-Dixon test for pairs of observations (e.g., variant N13) and the test for a single observation (e.g., variant N9), which, due to the design of the test statistic, is unable to detect the pair of largest or smallest outliers. On the other hand, the test for a pair of observations may not be an effective tool when there is only one outlier in the data set. In summary, these results therefore confirm that in practical applications it is worth bearing in mind the relative advantages and disadvantages of both types of Q-Dixon test: the test for a single outlier and the test for a pair of outliers. An important drawback of the Q-Dixon test is that only a single outlier or a pair of outliers can be rejected each time from the analyzed data set. Therefore, the procedure of identification and elimination of coarse errors for a measurement data set devoid of a previously rejected observation must be performed until all outliers are eliminated, which is certainly a cumbersome and time-consuming task.
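The masking effect discussed above can be illustrated with a short Python sketch. It assumes the ratio forms usually associated with the N9 and N13 labels (r11 and r22 in the Barnett and Lewis numbering), which this section does not spell out; the helper names and the sample values are hypothetical.

```python
def q_n9_upper(values):
    """Assumed N9 (r11) ratio for the largest value."""
    x = sorted(values)
    return (x[-1] - x[-2]) / (x[-1] - x[1])


def q_n13_upper(values):
    """Assumed N13 (r22) ratio for the pair of largest values."""
    x = sorted(values)
    return (x[-1] - x[-3]) / (x[-1] - x[2])


# Hypothetical sample with two large values close to each other
sample = [1, 2, 3, 4, 5, 19, 20]

q_single = q_n9_upper(sample)   # 1/18, about 0.056: the second outlier masks the first
q_pair = q_n13_upper(sample)    # 15/17, about 0.882: the pair stands out clearly
```

Because the single-outlier ratio measures only the gap between the two largest values, a second outlier next to the first shrinks the numerator and hides both, whereas the pair variant skips over it.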
The Grubbs and Rosner tests showed the existence of a very similar number of outlier observations. Differences were observed for only three piezometers: P35D, P3, and P4. The Grubbs test, like the Q-Dixon test, can only detect one outlier at a time and should therefore be repeated until there are no further outliers in the data set. The Rosner test, better known as the Extreme Studentized Deviate test (ESD), is a modification of the Grubbs test. The Rosner test can be performed iteratively by analyzing the most deviant values in turn, whereas in the Grubbs test the number of questionable values must be determined a priori. Before performing the Rosner test, the measurement set should be arranged in a non-decreasing sequence, and it should be checked whether the analyzed data set has a normal distribution. The Doornik-Hansen or Shapiro-Wilk test can be used for this purpose. The application of the test requires the maximum number of outliers r in the test sample to be given. This test is applied to sample sizes of n ≥ 25 observations in which up to 10 outliers are recorded. The ability to detect up to 10 outlier observations at a time is undoubtedly a very important advantage of this test, because it shortens the time needed for the analysis. From the point of view of periodic piezometric measurements, the Rosner test therefore appears to be the most suitable test, as it is neither too rigid nor too flexible.
It is an undeniable fact that outliers that are clearly the result of undesirable performance should be removed. So, the question arises: what to do after removing an outlier from the data set? It is possible to replace missing data by the arithmetic mean of neighboring data in the corresponding cell [67]. It is worth remembering, however, that this procedure will reduce the spread of the population and make the observed distribution more leptokurtic and may increase the probability of making a Type I error (the error of rejecting a null hypothesis that is not in fact false). A more complicated multiple imputation method can also be used, which involves replacing outliers (or missing data) with possible values [68].
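The simpler of the two replacement strategies mentioned above can be sketched in Python as follows. This is a minimal illustration, assuming that removed outliers are marked as `None` and that "neighboring data" means the nearest valid reading on each side; the function name is hypothetical.

```python
def fill_with_neighbor_mean(series):
    """Replace None entries with the mean of the nearest valid neighbors."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        # Nearest valid reading before and after the gap, if any
        left = next((series[j] for j in range(i - 1, -1, -1)
                     if series[j] is not None), None)
        right = next((series[j] for j in range(i + 1, len(series))
                      if series[j] is not None), None)
        neighbors = [v for v in (left, right) if v is not None]
        if neighbors:
            filled[i] = sum(neighbors) / len(neighbors)
    return filled
```

Every filled value lies between its neighbors, which is exactly why this approach narrows the spread of the data, as noted above.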
Error assessment is a set of issues at the intersection of mathematics, metrology, and statistics that deals with the evaluation and analysis of measurement uncertainty. It encompasses the principles of elaboration and presentation of experimental results. The results of any measurement without error analysis, more specifically without the determination of measurement error (defined as the deviation between the measurement result and its true value), are, in fact, treated as indications only. Measurement error is an intrinsic factor of the measurement process that does not arise only from a mistake. In fact, each measurand is influenced by a large number of factors and variables, resulting in numerous sources of error, including imperfect senses, inaccuracy of measuring instruments and methods, and uncontrolled variability of environmental conditions. Some factors and variables cannot be fully controlled, and any of them can increase or decrease the result. Recognizing the sources of error can help reduce them, but it is important to realize that we can never eliminate all of them. The error in a single measurement cannot be calculated as the difference between the result of the measurement and the true value, since the true value is not known. It can only be estimated, or its components calculated. However, the procedure depends on the recognition of the interactions that affect the result of the measurement. Considering the types of interaction (accidental or systematic), measurement errors are divided into accidental, systematic, and coarse. Most often the errors are of an accidental nature, so the tools of mathematical statistics can be used to process the obtained results.
A single measurement result with an outlier error is usually an extreme value (minimum or maximum) of an ordered set of results. Coarse errors that can occur during piezometric measurements are caused by a number of factors, among which are: damage to the measuring equipment, change in measurement conditions (e.g., icing), incorrect numbering of points or accidental change in the order of two adjacent numbers, storage or preparation for analysis, improper use of the measuring equipment, mechanical damage to the measurement points, mistakes in reading or recording the readings of the measuring instrument, improper method of measurement (data collection) or improper entry of measurement data into the database.
Measurements that are considered questionable are among the most troublesome problems when performing any data analysis. Doubtful results are the result of a one-time influence of a disturbing cause that does not operate continuously and only affects certain measurements.
For measurement series that include the results of measurements made under repeatability conditions, such errors can be easily detected and identified. In case of periodic piezometric measurements, observations are recorded only once for each of the piezometers under study. Therefore, only independent measurements are available. Possible disturbances and changes during water level changes in piezometers can be noticed by comparing them with the image from previous measurement periods. As a rule, the person who performs piezometric calculations and analyses only receives the results of the measurements themselves, without providing additional information about their course. In such a situation, it is necessary to apply statistical tests by which it will be possible to accept or reject a doubtful measurement at the assumed significance level α.
The concept of error, as applied to scientific measurement, is closely related to the completely unavoidable uncertainty that is intrinsically linked to the essence of making a measurement using a given method. In this sense, errors do not characterize mistakes that can be avoided if greater care is taken in performing measurements. Therefore, one should strive to minimize the size of errors and find a way to estimate their magnitude.

Conclusions
All the statistical tests used in this research are used to identify outlier observations in measurement data sets. The purpose of this paper is to identify the test that would be most appropriate for periodically performed piezometric measurements. A very important part of a thorough analysis of measurement data is developing techniques to look for outliers and understand their impact on the analysis performed. Statistical tests based on sample mean and variance can be biased when outliers are present in the data set. Before analyzing piezometric data, it is important to reject coarse errors that, even for a single outlier observation, can significantly affect the result and cause a false assessment or interpretation of the phenomenon under study.
Considering all the statistical tests used in this paper to identify outlier observations, from the point of view of periodic piezometric measurements, it seems that the Rosner test is the most appropriate test. The test handles piezometric measurements well and is neither too rigid nor too flexible. A great and unquestionable advantage of the Rosner test is its short execution time, as it can detect up to 10 outliers at a time, whereas the other tests can only reject a single outlier (the Q-Dixon test in the N9 variant or the Grubbs test) or a pair of outliers (the Q-Dixon test in the N13 variant). Moreover, the Rosner test, unlike the Hampel test and the Iglewicz and Hoaglin test, also performs well when there are a large number of observations with the same or similar values in the measurement data set and does not indicate them as outliers.
Working effectively with outliers in datasets is certainly a difficult and tedious task. If an observation is found to be an outlier by the chosen statistical test, then each time the analyst should attempt to explain this phenomenon before excluding it from further analysis and decide whether to remove the observation. If no explanation can be found, then such an observation should be treated as extreme but valid and included in further analysis.