An Evaluation of Some Machine Learning Algorithms as Tools for Predicting Soil Characteristics Based on Their Spectral Response in the Vis‑NIR Range

Using the Land Use and Coverage Frame Survey (LUCAS) database of European soil surface layer properties, statistical and machine learning predictive models for several key soil characteristics (clay content, pH in CaCl2, concentration of organic carbon, calcium carbonates and nitrogen and exchange cations capacity) were compared on the basis of processing their spectral responses in the visible (Vis) and near‑infrared (NIR) parts. Standard methods of relationship modeling were used: stepwise regression, partial least squares regression and linear regression with input data obtained from principal components analysis. Using the inputs extracted by statistical algorithms various machine learning algorithms were used in the modeling. The usefulness of the models was analyzed by comparison with the values of the determination coefficients, the root mean square error and the distribution of residual values. The mean square error of estimation in the cross‑validation procedure for the stack model using the multilayer perceptron and the distributed random forest were as follows: for clay content – ca. 4.5%; for pH – ca. 0.35; for SOC – ca. 7.5 g/kg (0.75% by weight); for CaCO3 content – ca. 19 g/kg; for N content – ca. 0.50 g/kg; and for CEC – ca. 3.5 cmol(+)/kg.


Introduction
The need for large amounts of soil data, when confronted with high costs and time -consuming implementations, justifies the search for alternative testing or survey methods: namely, aerial or satellite level spectrum analyses, and laboratory spectrum analysis [1]. Spectral response, as a source of information on the chemical and physical properties of soils, has been studied since the 1970s [2][3][4][5]. Progress in this field has been driven by development of multispectral registration technology for the Earth's surface from the satellite level, which requires improving and refining methods of interpretation of spectral responses to extract useful information on the state of Earth surface. In parallel, another branch of research using the spectral response of soil samples determined under laboratory conditions is rapidly developing, whose aim is to indirectly determine their characteristics obtained by standard, referencing methods requiring complex, expensive and time -consuming preparation. A fair number of research findings have been published demonstrating the usefulness of this both spectral response approach for modeling soil characteristics or indicating significant limitations to it [6].
There is a relatively extensive body of literature addressing the use of soil spectral response in the visible and near infrared range (Vis -NIR) for predicting various soil properties: clay fraction content, organic carbon content, reaction, macroand micro -elements contents, and cation exchange capacity (CEC), among others [6]. The conclusions are varied, being generally positive with respect to farm scale trials [7], yet generally more cautious with respect to areas with more diverse soil formation conditions. These results, however, are very difficult to compare, since those studies used differing ranges of recorded spectral data, sampling steps, and number of samples. In some cases, the use of models based on principal component analysis (PCA) and partial least squares regression (PLSR) gives satisfactory results for the prediction of soil characteristics; in others, a very complex and computationally demanding procedure of transformation and selection of variables can provide very good results as well [8].
From numerous research works it is reasonable to conclude that when tasked with interpreting the spectral response in the NIR range for the indirect determination of soil variables, various ways of data preprocessing, extraction of useful information from the extensive measurement data, and modeling algorithms of the relationships sought are attempted. The widest range in data transformation and selection is characterized by the PARACUDA II engine [9]. Second place should go to the PLSR algorithm, which could also be used (next to PCA) to select input data from other regression models (e.g., M5, Cubist, MARSpline, random trees, random forests, and other statistical and machine learning methods). An intermediate solution between the two potential approaches would make full use of soil spectral response vectors, as allowed by deep learning models, mainly through convolutional networks. The Convolutional Neural Network (CNN) represents a model of deep machine learning [10]. Its main application is image analysis, but with some modifications it can be used for regression tasks, also done with 1D inputs. Using CNN has several advantages. First, it does not require selecting spectral data, because the model inputs can be the whole vector of absorbance or reflectance (or some transformation of them), without the need to extract information by means of an additional algorithm (i.e., PCA, PLSR, genetic algorithm). Second, CNN allows for the simultaneous prediction of several variables; this possibility also applies to other neural models in regression applications, such as multilayer perceptron -MLP or radial basis function -RBF, similarly to PLSR). Third, fragments of the data vector that significantly affect the values of the modeled traits are identified in the course of learning. Fourth, the principle underpinning CNNs is based on detecting, in images (CNNs are currently the basic tool for image analysis) or in data vectors, patterns differentiating objects (classification) or data values associated with vectors (regression). Thus, the CNN architecture, apart from the elementary arithmetic limitations of the size of processing layers, is very flexible. Yet this significantly increases uncertainty in the choice of its processing parameters. The CNN models used in practice are generally large in terms of their number of parameters, optimization time, and the risk of overfitting (despite applying techniques to reduce these trends). Understandably, they require testing different architectures, optimization algorithms, and sample sizes. An important feature of modeling is the uncertainty as to the construction of an algorithm suitable for solving the prediction problem. This particularly applies to machine learning (ML) algorithms, which are characterized by considerable freedom in the selection of the number of optimized parameters. Equally important may be the problem of multivariate data acquisition, represented in the case of the spectral response by large sequences recorded at points with a specific reflectance wavelength. Under such conditions, an experimental selection of the model architecture and the method of extraction from data useful in modeling are inevitable.
The aim of this current work was to experimentally assess the acquisition of a large set of spectral data obtained from laboratory testing and the application of different algorithms for predicting soil characteristics based on spectral responses of soil samples, as collected by the LUCAS project, which includes the testing of topsoil surface layers in more than 20 European Union (EU) countries [11,12].

Materials and Methods
The LUCAS database, made publicly available for research purposes by the European Soil Data Center, contains the results of more than 20,000 laboratory tests of soil surface layer samples from 23 EU countries [11]. In addition to data identifying sampling location, land use, soil type and topography, this database includes the values of 12 soil characterization variables (determined by methods considered as standard) and results of reflected spectrum recordings in the 400-2,500 nm range, in 0.5-nm steps. The data contained in the database were obtained in the same laboratory, according to uniform methodology. Hence, we may presume the factors differentiating the properties of these samples arise mainly from the spatial variability of soils.
The 17,216 mineral soil sample records collected in the LUCAS database were divided, without any soil -use differentiation, at random, into a training (teaching) part for the optimization of prediction models of 12,898 sites. For the model evaluation of independent data, it used 4,318 data sites. This data included the results of the determination of the surface properties of mineral soils' surface layer. For the statistical models, the whole training set was used, but for machine learning models, depending on the applied learning algorithm, an additional test set (15% of training set data) were separated from the training set, in order to apply the early stopping principle. In some of the experiments, the k -folds validation was used (indicated in the text accordingly).
Soil data included, inter alia, the following characteristics: clay, silt and sand content, reaction (pH in CaCl 2 and pH in H 2 O), organic carbon content (SOC), carbonate content (CaCO 3 ), N, P and K content and cation exchange capacity -CEC [11]. Considering these traits individually, their values had statistical distributions that differed from normal distributions (Tab. 1, Fig. 1): they are characterized by positive skewness -except for slightly negative skewness for pH -and by extremely high potassium and CaCO 3 contents, with a strong data concentration at the distributions' mean and a relatively large range of data lying distant from its median, especially for carbonates, phosphorus, and potassium contents. The soil properties listed in Table 1 were those used for modeling based on their spectral response; hereon called "soil variables".    Figure 1 shows boxplots of the eight selected soil variables. It indicates, in general, mostly asymmetrical distributions in the data for each. According to the usual interpretation, these boxplots indicate a considerable presence of outlier observations in the data. They are represented in the diagrams by points marked with red crosses, located at a distance of more than 1.5 interquartile distance (IQR), below the lower or above the upper quartile. According to usual practice, this would entail their exclusion from modeling; however, this would require the suction: samples with a clay fraction exceeding 53%, a silt and sand fraction content and a distribution of both pH-values would be fully within the acceptable variation, whereas samples with an SOC content greater than 62 g/kg (6.2% by weight), as well as samples with carbonate content greater than 45 g/kg (4.5% by weight), 4.35 g/kg N and CEC greater than 36 cmol(+)/kg, would be discarded. Presumably the presence of outliers in the modeled variables may come from measurement errors not excluded by chance, or represent the actual data distribution. Due to the diligence of data acquisition [11], it can be assumed that this soil data correctly reflects the distribution of each soil variable, while the removal of the indicated data would distort the actual picture of soil conditions' variation. Doing the latter would run against the aim of the experiment, which was to assess the possibility of obtaining a universal prediction model, which may be characterized by a specific, even significant, prediction error for the soil variables.
In the modeling tests, in addition to the spectral absorbance data of the samples in the range 500-2500 nm (matrix of absorbance of soil samples marked with an X symbol in this paper), pre -processed data were used to adapt them to the format applied in the spectral response analysis (refer to [6]): that is, first absorbance derivative (matrix dX), second absorbance derivative (matrix d2X), sample reflectance (matrix RefX), first absorbance derivative after Savitzky-Golay filtering (matrix dsgolayX, framelength of 11, order 4) and continuum removal [13] reflectance (matrix CRX). Transformations, based on absorbance values provided by ESDAC, were performed in the MATLAB environment [14]. Figure 2 shows the transformed spectral data vectors of the validation set extracted from LUCAS.
In testing models for their predictions of soil sample characteristics on the basis of spectral response in the Vis -NIR range, experiments were carried out using statistical algorithms, machine learning, and their combinations, as follows: 1. Linear and MLP models with inputs obtained by stepwise regression. Input variables of the model were JX matrix, horizontal concatency matrices X, dX, d2X, RefX, ds -golayX, and CRX (a total of 23,996 columns). The variables modeled were clay fraction content (Clay), pH (pH in CaCo 3 ), SOC content (SOC), CaCO 3 content (CaCO 3 ), N, P, K contents (respectively: N, P, K), and cation exchange capacity (CEC). Subsequently, separate models were developed by stepwise regression method for individual soil variables (forward regression, p -enter = 0.05, p -remove = 0.10). In the second stage, the variables extracted through stepwise regression were used as inputs in the machine learning models (i.e., individual MLP models for soil variables  [15], the elements of the JX matrix were divided into nine homogeneous groups. The algorithm groups multidimensional variables, by using the criterion of similarity of their vectors. For individual clusters (i.e., homogeneous groups) and modeled variables (Clay, pH, SOC, CaCO 3 , N, P, K, CEC.), separate machine learning models (MLP) were developed, whose inputs were components extracted through stepwise regression. The calculations were performed in the MATLAB [14] environment. 5. Cubist regression models with PLSR components as inputs. This included the following data processing steps: (1) for some soil variables (Clay, pH in CaCl 2 , SOC, CaCO 3 , N, CEC) and spectral data (X, dX, d2X, RefX, ds--golayX, CRX), 25 components of the partial least squares regression (6 × 25 component vectors per soil variable) were calculated); (2) the vectors from the previous step, for each soil variable separately, were combined to form an input vector of 150 input components; and (3) separate regression models were developed for the soil variables using the Cubist algorithm [16] implemented in the R statistical platform [17]. 6. Stacking regression: 0-level data -PLSR components of the spectral data matrices, 0-level models -MLP models, 1-level data -horizontal concatenation of MLP models prediction results, 1-level models -various machine learning algorithms. This approach [18,19], was obtained in four steps.
(2) For each of the matrices of the PLSR components created in step one, three MLPs having 5, 10, and 20 units in the hidden layer per soil variable were created (18 MLPs of the models in total per soil variable, composing a collection of 0-level models). (3) The horizontal concativity of 0-level prediction models in the matrix of predictions (108 estimates = 18 × 6 soil variable), were 108 elementary 1-level data, consisting of independent estimates of soil variables.
(4) Using a selection algorithm, called 'Boruta' [20], for variables relevant to variable prediction, were selected as inputs; specifically, the vector of combined predictions created input variables of the below machine learning models served as a 1-level model in the stacking regression, based on indication vectors of 108 weak regression algorithms (corresponding to the 0-level of the stacking model). The following algorithms, as 1-level model were used: (5a) variants of MLP models, with 3, 6, 7 and 10 units in their hidden layer, for which the model with the smallest RMSE values for test data was selected; (5b) on the basis of the same input variables, individual models for soil variables were developed according to Quinlan's M5 decision tree algorithm [16], by implementing M5P of the WEKA [21] calculation package; (5c) the M5 algorithm implementation named Cubist was used [17,22]; (5d) decision module Gradient Boosting Machine (GBM) random tree algorithm, run in the software platform H2O [23]; (5e) the Distributed Random Forest (DRF) decision module [23,24], also implemented in H2O platform. The Cubist and M5P packages are an implementation of Quinlan's M5 algorithm combines a conventional decision tree with the possibility of linear regression functions at the nodes. Data assigned to the nodes of a random tree are used to build a linear model instead of averaging [16]. GBM is a machine learning algorithm in which a set (random forest) of multiple weak regression (or classification) models is used to iteratively optimize the prediction. The algorithm is iterative incrementing the regression tree architecture taking into account the value of the residuals from the previous state of the model [25]. DRF is a machine learning algorithm that iteratively optimizes a random forest structure by generating trees that use data from random subsets of training data [24]. 7. Convolutional Neural Network (CNN) with a 1D input and a regression output consisting of six units corresponding to the eight soil variables examined (clay, pH in CaCl 2 , SOC, CaCO 3 , N, CEC). The CNN network architecture was used with a 1D input in the Keras environment (implemented in the R Studio as well as in Anaconda-Python language environment), with a TensorFlow library. Reflectance vectors and SNV-transformed absorbance vectors [13] ere used as inputs to the model. The CNN architecture consisted of four convolution 1D layers with a 50-75-unit wide filter (in first convolutional layer), and 20-units filters (subsequent layers), with, respectively 32, 64 128 and 256 of filters. The data from the convolution layers were then transferred to the MaxPooling layers, while the model output included the layers "flatten" (transformation of data matrix into vectors), "dropout" (removal of small -scale connections), and two "fully connected" layers, which are equivalent to an MLP network with an output of six linear units. For the processing layers (other than the output per se) the ReLU transfer function was used. When not able to determine, a priori, the appropriate architecture of the CNN network, many attempts were made regarding the size of evolutionary filters, their number, as well as the size of fully -connected layers. Three optimization algorithms were used: Adam, Adadelta, RMSprop, in addition to the size of the "batch" (batch) data and number of optimization cycles [26][27][28][29].
The following statistics were used to evaluate the data distributions and prediction results obtained:

Results
In the scientific literature on the use of NIR in soil research, the most frequently modelled feature is the organic carbon content (SOC). It can be assumed that the monotonic dependence of the spectral response and the feature identifies the spectrum range best for quantifying the feature value.
The Figure 3 graphs show the curves of Spearman's correlation coefficients values (rho) for the transformed absorbance vectors and organic carbon content (SOC) in the soil samples. (Note that the curves relate to the SOC variable and are only relevant for this relationship.) From their analysis, the following conclusions may be drawn: (1) For the range of SOC concentrations, it is not possible to distinguish that part of the spectrum where the spectral response would allow an error -free estimation of the SOC content, on this basis; hence, it can be assumed that the linear model may show a lower usefulness in this range. (2) Both absorbance and reflectance are relatively weak predicators of SOC: a flat course of their curves ρ (curves are mutually symmetrical with respect to rho = 0) suggests mainly the role of the ordinates of reflected or absorbed signals, as an indicator of SOC concentration. (3) Absorbance derivatives, filtered and unfiltered, and the removal of the continuum, jointly indicate the presence of spectrum fragments, for which the influence of SOC concentration -or factors correlated with it -upon the growth of reflected or absorbed signals is conveyed. (4) The curves' form may indicate the need for the simultaneous use of different transformations of absorbance and reflectance vectors as input information for soil variable prediction models. Given the size of the input vectors, it was necessary to extract the most useful information, best achieved by statistical algorithms: stepwise regression, partial least squares regression, and of principal components analysis.  Table 2 contains the statistics for all seven models of the eight soil variables whose input vector was the JX vector of combined data. The step regression model especially the comparison of RMSE errors from training and validation data, indicates the differentiation among models. Because differences in empirical distribution parameters (i.e., mean and standard deviation) of the training and validation data are very similar, we should presume they are differentiated by distributions in the spectral response as well as random interactions of other soil elements not included in the models. It should be noted that the step regression algorithm extracted from almost 24,000 variables (combined vectors of transformed reflectance values) several hundred (from 344 to 558) variables that significantly impacted the prediction of selected properties. Yet the evaluation of regression models varied. The error of clay fraction content estimation, both for the training and validation sets (4.8% and 5.7%, respectively), may be deemed acceptable, leading to a possible texture classification error of no more than one group.  Prediction of the reaction (pH in CaCl 2 ), could also be considered acceptable, but only for a rough estimate, while the SOC estimation error (8.7 and 10.2 g/kg respectively, against the median value ca. 19 g/kg) was relatively high, especially for soils with low organic carbon content (e.g., sandy soils). The high determination factor of the model for carbonate content, given the asymmetry and significant dispersion of this characteristic in the sample, generated a relatively high RMSE estimation error (ca. 30 g/kg, against the median value ca. 1), which may disqualify it for the vast majority of soils having low carbonate content. The nitrogen (N) content was estimated with relative accuracy analogous to SOC, probably due to the correlation of these two components. The concentrations of P and K were estimated with significant error, while some error in estimating CEC is acceptable for soils with a higher clay fraction.

Linear and MLP Models with Inputs Obtained by Stepwise Regression, PLSR and PCA (JX Datamatrix)
The decision on the number of components used in the PLSR model is arbitrary, although it may be determined a posteriori. Figure 4 illustrates the increase in the determination factor as the number of PLS components is increased. Most of these graphs show ca. 20 to 50 such components are sufficient to model output variables close to the maximum accuracy offered by this algorithm. The graphs also indicate that predicted N, P, and K values are relatively imprecise. Comparing the results (R 2 and RMSE) of prediction data validation between the step regression algorithm and PLSR suggests the latter (partial least squares regression) is the better tool for building a soil prediction model (though differences in prediction errors are nevertheless quite small). The construction of neural prediction models, in which inputs are explanatory variables selected in a step -by -step regression procedure, or components obtained in the partial least squares regression procedure, may improve prediction quality since this includes natural non -linearity of the machine learning algorithm architecture in the model. But this must be done iteratively, because, despite a significant reduction in the number of explanatory variables (down from >20,000 to several hundred), the number of input variables remains relatively high. It should be noted that in the case of architectures with non -local transfer functions, increasing the number of hidden layer units multiplies the number of parameters that must be optimized in the learning process, thereby elevating the risk of overfitting and greater validation error (decreasing the ability to generalize the model). For MLP models developed here for soil variables contained in the LUCAS database, with inputs from step regression or PLSR components, exceeding the size of the hidden layer by 2 or 3 units (values determined in a multiple -trial process) significantly increased their validation error. Generally, the validation errors of MLP were smaller than those of statistical models (step regression and PLSR), while mostly smaller for those models whose inputs were determined via step regression procedures.

Linear and MLP Models for Clustered Data
One way to increase the accuracy of the predictions worth considering is to group the vectors of the input fields according to their similarity, and then create separate spectral response libraries. Greater forecasting accuracy may result from clustering the inputs to such sets for which separate forecasting models can be constructed, based on input data that are more homogeneous than for the whole population. In clustering the JX vectors, the fragment covering the first derivative of absorbance was used, since the experiments showed this had the largest quantitative share of the set of inputs emerging from the stepwise regression process. Figure 5 shows the distribution of the vectors of the first absorbance derivatives on the SOM map. The starting point of the SOM clustering process is random, making the final clusters' layout on the map unalike in different attempts, though still maintaining the spatial relationships and mutual distances between the clusters. For each cluster, prediction models were created with inputs obtained in the stepwise regression procedure of JX vectors belonging to that cluster; in each, the prediction model was the MLP algorithm, trained by Bayesian regularization in three repetitions, having 2, 3 and 4 units in the hidden layer. The model with the lowest MSE value for test data was then selected. Using this latter model, the values of soil variables in the validation set were predicted. Figure 6 shows the distribution of mean square error elements from prediction models for soil variables in SOM clusters. Highlighted by bold values are those clusters in which RMSE values of the validation set are smaller than those obtained on the basis of MLP models without division into clusters, with inputs obtained in a stepwise regression procedure and PLSR. In addition, the validation for all soil variables did not give better results than those in Table 2. For several soil variables, in some clusters the RMSEs were smaller than those of the overall models. Yet, apart from cases of strong local relationships between the spectral response and a soil variable, there is also the phenomenon of clusters featuring very weak predictive models but very low dispersion of a soil variable. Such circumstance arose in the case of the third cluster and SOC variable (Fig. 7), where the variables' dispersion is quite small, generating a low RMSE value with a low R 2 = 0.52. The bottom panel depicts all points together in a single graphic. Each panel above it shows RSMP data values for a given cluster

Cubist Regression Models with PLSR Components as Inputs
Committees of classification and regression models provide better quality prediction in many cases. Nevertheless, the models themselves can generate relatively large forecasting errors. Combining their estimates, either using separate data or coming from different types of models, can improve the ability to better generalize. A key issue here is choosing the way the team infers this. The classic solution is to create what is called a stack system. This consists of many models with different properties (differing inputs, architectures, and ways of optimization) and a decision algorithm, optimized in relation to the same outputs, providing a solution based on the indications of 1-level models [18]. Table 3 contains statistics on validations of six stacked models whose inputs were the vector of combined (concatected) PLS components obtained from spectral data X, dX, d2X, RefX, dsgolayX, and CRX along with relevant soil variables from the PLSR procedures. Qualitatively, these models were characterized by higher RMSE values for the validation set than those models created solely on the basis of PLS components extracted from the whole JX vector (included in Table 2). The presented models, however, are not a strict realization of the idea of stack regression, in that its inputs are not estimates of modeled variables but rather are values of PLS coefficients. Table 3. Statistics of Cubist models whose inputs were connected by the first 25 PLSR coefficients from vectors to X, dX, d2X, RefX, CRX, ds -golayX outputs (soil variables). Cubist algorithm employed the "boosting" option, and predictions with tuning took into account the averaging of five nearest -neighbors

Stacking Regression
It can be assumed that the inclusion of nonlinear models (e.g. MLP or other machine learning algorithms) will improve the prediction quality of the stack regression algorithm. Moreover, concatenating the predictions of all 0-level models (for all soil variables) and thus creating a 1-level data may improve the quality of the prediction by taking into account the interrelationships between soil variables. The 1-level model serving as a decision module remains an open issue. Table 4 contains the statistics of a stacked regression model with various inference modules, in which inputs were estimations of soil variables made by the MLP model set. For a given soil variable, differences between the element values of the root mean square error were relatively small: 5.02-5.38 for clay content, 0.39-0.41 for pH, 8.5-9.4 for SOC, 26.4-29.6 for CaCO 3 , 0.58-0.61 for N, 3.9-4.2 for CEC; however, the relationships between the values of determination factors and RPIQ were similar. Machine -learning models are, in part, random, and decisions made by their designer are subjective and devoid of objective premises for using a particular architecture. It can only be stated that among the examined models, a specific architecture has the best properties in terms of a specific evaluation criterion (Fig. 8). From the testing, in terms of R 2 and RMSE, the model with the DRF decision module is best, although differences among models' predictions were relatively small. Quality indicators of models are their RPIQ values and observed distribution parameters of residual values (leads). Relatively high RPIQ values were recorded for the prediction of clay content (RPIQ = 3.77), pH (RPIQ = 6.47), and CEC (RPIQ = 2.92), whereas that SOC model had a relatively low value (RPIQ = 2.19), likewise for N (RPIQ = 2.03). For CaCO 3 content, this statistic was at its lowest (RPIQ = 0.77). Statistics for the remaining models (i.e., using validation data) indicated a very strong concentration of values in their distributions, as evinced by their corresponding RIQR (interquartile ranges of the residuals) and kurtosis parameters. Kurtosis, a statistic characterized by high inherent variance [30], is not considered a significant indicator of a distribution's shape, but in this case, it concerns sets of equal numbers and similar forms of distributions (Tab. 5). It should be emphasized that a high kurtosis value for a residual distribution, with similar values of RMSE and R 2 , indirectly shows it has more extreme spacing. Table 6 lists the ranks according to Spearman's rho and Kendall's tau between the values of predictions from applied models and observations. This revealed a significant difference between ranks statistics in comparison with the determination coefficients of some models, especially those used to estimate the CaCO 3 concentration in soil. The very high R 2 -values are mainly due to very high variance of modeled values' distribution, which may be of lesser importance for non -parametric statistical tests. An important criterion by which to evaluate a model is the relation of its estimation error to actual values of given variables. A greater tolerance for errors associated with larger values of modeled variables may be justified, in certain cases, by practical aspects of model use. The graphs in Figure 9 show boxplots of the model's residuals in arbitrarily determined classes of values of the observed variables. Except for the pH model, the remaining distributions indicate an increase in residuals with higher values of observed variables. For smaller values of the variables, their estimation error is less than at larger real values. For example, when analyzing the CaCO 3 model validation residuals of 0-100 g/kg (0-10%), its 'local' determination coefficient was 0.66 and the 'local' RMSE value was 9.9 (0.99%), set against an RMSE for the whole variation range of ca. 27 (2.7%), while the local value of the inter -quartile spacing of residuals RIQR = 1.58. It is thus easy to see that higher residuals values at higher variable values are due to non -linearity in the relationship between modeled and observed values.
Selecting a fixed validation set, especially when there is high variability of input values, entails a certain weakening of the model, associated with omitting a certain portion of input data's variability constituting the fixed validation set. The procedure of model implementation, after determining its suboptimal architecture, was based here on the use of all available data, supported by cross -validation statistics, which incorporates all available data in subsequent optimization cycles (i.e., k -folds validation). Statistics of the final models -their training set error statistics and cross--validation averages -developed in this way are presented in Table 7. From these Figure 10 follows that a prediction algorithm trained on a full set of data, due to its flexibility, would provide much better forecasting results. This makes it difficult to differentiate data and discern the influence of non -observable factors interfering with model outputs by modifying the spectral response. It should be stressed, however, that all algorithms and data pre -processing methods used here provide predictions having similar statistical characteristics.

Convolutional Neural Network (CNN)
The results of CNN modeling were comparable to predictions made using the previously described stack models with different 1-level models. Prediction statistics for training and validation data, with two versions of 1D input data, namely the reflectance and absorbance vectors transformed in the SNV procedure [31] are given in Table 8. Tests may use an architecture similar to that described in [10], with specific modifications regarding the number of processing layers, optimization algorithm, validation method, and the number of examples that constitute a single "batch" of network training.
In this testing, relatively worse results were obtained with inputs in the form of reflectance vectors than with the transformation of absorbance vectors by the SNV algorithm. The use of models with multiple outputs inevitably complicates the problem of completing their optimization. According to Table 8 the optimalowest values of validation errors of particular soil features -occurred in different optimization periods (from 400 to 1,600 epochs).
Continuing the optimization, especially with smaller 'batch' values, the number of examples taken into account at the same time in a single optimization step, resulted in an over -adjustment of the model: its systematic correction of training errors with an unchanged validation error.
One criterion of robust multidimensional models is the maintenance of quantitative relationships between the modeled variables. Table 9 presents the Pearson linear correlation coefficients calculated for soil data as determined by reference methods and obtained from the predictions for validation data. Obs The coefficients observed in the population and those calculated for predictions between soil variables are generally similar in term of the rankings of their values.

Discussion and Conclusion
The idea of using the spectral response in the Vis -NIR range as a source of information on soil properties has an extensive literature, with varied conclusions having been drawn on this approach's suitability and utility for characterizing soil states. The digitization of soil environment documentation places high demands on the number of soil properties determinations. Cheap and quick methods of obtaining reliable soil data are of great importance. For many years, indirect remote sensing tools have been used to obtain information about the state of the components of the environment. The use of indirect methods in soil research in the laboratory cannot be underestimated due to the importance of determining the characteristics of the soil profile, impossible from the aerial or satellite level.
A literature review on the possibilities and limitations of this methodology [4,6,32] cited the extreme values of statistics (R 2 and RPD) of 'spectral response -soil property' relationship models. According to that investigation, in studies by various authors, models' determination coefficients were obtained that ranged from 0.01 to 0.99 for clay fraction content, from 0.66 to 0.96 for SOC content, from 0.68 to 0.98 for N content, from 0.22 to 0.88 for pH, and from 0.07 to 0.93 for CEC. Similarly, large differences were found between their model's RPD values. Practically then, the models can range being from completely useless to close to ideal when it comes to relying on data -based algorithms. The reasons for this great variation are obvious: namely, the number of samples and variation in soil properties, the methodology for extracting useful information from the spectral data, the model design, the validation method used, among other objective and subjective factors. Such studies typically use numerous transformations of spectral response vectors, differing dimensional reduction algorithms, and different inference models and thus it would be difficult to uncover a regression algorithm that would be omitted by them. Additionally, using concepts of so -called "auxiliary variables" is often presented -such as the type of land use, geographical location, soil texture -which, in addition to the spectral response, may be used to improve the prediction of other characteristics. In sum, lacking a dominant and widely recognized algorithm of inference based on the spectral response of soils, different concepts of soil formation ostensibly must compete with each other.
The basic modeling methodology that combines dimensional reduction and extraction of spectral response fragments relevant for modeling is dimensional reduction by PCA and PLSR [5,32], especially for small datasets which, due to their low soil variability, provide acceptable values for prediction errors. For years, the results of various machine learning methods have been published in combination with using such statistical methods (PCA or PLSR) for extracting input variables. This concept is perhaps best known as the PARACUDA II procedure [9,33], which combines multiple sampling of a data set, dimensional reduction, and sets of prediction models. This framework is characterized by a high quality of prediction (RMSE: 0.17% in SOC prediction, 5.4 cmol/kg CEC, 5.8% for CaCO 3 content), although it applies to relatively small sized data sets (sample size of ca. 100).
For ca. 30 years, deep processing has been the focus of interest in image analysis, offering an alternative to so -called "shallow" models. The use of CNN models for regression and classification tasks is now becoming common. Among other things, the problem of the number of training sets necessary to obtain useful results using CNNs, which usually entail large processing structures, is being resolved. According to [10], the CNN model can improve prediction quality with an increase in the number of learning examples, while for shallow models, an increase in their number of learning sets beyond a certain limit is not accompanied by improved predictions. According to that study [10], a set of ca. 8000 examples (data from Brazil) used to optimize the CNN network gives better forecasting results than either the PLSR or Cubist algorithm. Specifically, the RMSE value of clay fraction content validation data was 6.7% for the CNN model (for the PLSR and Cubist: 7.3% and 6.9%, respectively), the CEC value for CNN was 1.3 cmol/kg (for PLSR and Cubist: 1.7 cmol/kg), and organic matter content for CNN was 3.8 g/kg (for PLSR and Cubist: 5.0 g/kg and 4.8 g/kg, respectively). These values, however, cannot be compared with estimates of errors for other data, since they depend on the distribution and variability of the modeled soil variables.
Since its initial release, the LUCAS database has been widely used to develop various models to predict soil properties. One of the first studies using it was a publication on soil organic carbon modeling [34], in which the SVM model (RMSE = 8.9 g/kg) gave the best result for mineral soils (no use distinguished). In the models developed for particular types of land use, the results were not alike: for arable land, grassland, and forest areas the RMSE values obtained were respectively 4.9 g/kg, 9.3 g/kg and 15.0 g/kg (SVM and Cubist models); the addition of an auxiliary variable to the input variables (content of sandy or clay fraction in the sample) significantly improved their prediction. In the case of mineral soils (again, without distinction of use), this addition reduced the RMSE to 7.3 g/kg. In another study [35], LUCAS data were analyzed for modeling based on spectral response, SOC, N and clay fraction content. The modeling algorithm included the extraction of variables via PLSR (the number of factors considered experimentally ranged from 42 to 78). The corresponding RMSE values for the SOC model, for arable land, permanent grassland, and forests were respectively 6.0 g/kg, 10.9 g/kg, and 13.8 g/kg. Likewise, the clay fraction of those habitats was modeled with RMSE (in the same order) of 5.5%, 6.2%, and 5.4%, with corresponding values of 0.42 g/kg, 0.82 g/kg and 0.74 g/kg for nitrogen content.
Advanced models of machine learning (an MLP network, Boltzmann's restricted machine, and CNN) were used by [35][36][37], in their modeling based on the LUCAS soil database. Those modeling results are difficult to interpret as the authors provided RMSE values for standardized variables. For five modeled variables (sandy fraction content, pH, SOC, CaCO 3 , and P content), the mean RMSE value was 0.42 (CNN and a combination of Boltzmann's machine and the convolutional network), which corresponds to an RPD = 2.36. When recalculated to comply with LUCAS data (as declared by the authors), the RMSE values for a hybrid model (limited Boltzmann machine and convolutional network) were 7.4 g/kg for SOC, 0.41 for pH, and 25.2 for CaCO 3 content.
Work by Liu et al. [36] reported on the practical use of a pre -trained CNN network's properties. The construction of this network, to model the clay fraction content on the basis of its spectral response, was also based on LUCAS data for mineral soils. The trained network was first used to predict the clay fraction content in organic samples, after refining it on the basis of a small number of organic samples (RMSE = 7.07%). This was then used to estimate the clay content of soils base on multispectral imaging, after fine tuning the pre -treated network on a small number of samples (RMSE = 8.62).
Spectral data and spectral data in combination with the auxiliary predicators of the LUCAS collection were also used in the advanced three -level MKL (Multiple Kernel Learning) model for SOC prediction [38]. The prediction variable was logSOC, the decimal logarithm of the value (SOC+1), for which the R 2 = 0.86 and RMSE = 0.13. Much smaller error characterized the model when an auxiliary variable was incorporated (RMSE = 0.1). It should be noted that, in the current study, the conversion of RMSE value for the stack model with DRF output (Tab. 7) to the logSOC scale indicates that it is ca. 0.115, which is slightly less than that (without the auxiliary variable) reported by Tsakiridis et al. [38].
Recently, Tsakiridis et al. [39] published the results of their research on the usefulness of 1D multi -channel 1D CNN's for the prediction of clay fraction content, SOC and N content based on the Vis -NIR spectral response using LUCAS database. The results presented in this study (RMSE for clay content about 4.8%, SOC about 10.96 and N about 0.66) are similar to those presented in our study. This applies to CNN models, while the stacked regression model has even slightly better characteristics.
Against the background of the results available in the literature, the applied regression models presented in the paper, in relation to the LUCAS database, can be considered comparable with the best prediction algorithms. The square roots mean square error estimation in the cross -validation procedure were as follows: for clay content, ca. 4.4%; for pH, ca. 0.36; for SOC, ca. 7.5 g/kg (0.75% by weight); for CaCO 3 content, ca. 19 g/kg; for N content, ca. 0.50 g/kg; and for CEC, ca. 3.7 cmol(+)/kg. These results do not compete with laboratory tests, but are sufficiently accurate to dense point observations of soil condition for digital mapping purposes. Unsatisfactory low CaCO 3 prediction accuracy. It should be noted, however, that the prediction error of this component is positively correlated with its content in the soil and is relatively small at its low concentration. In Polish soils (except for rare cases of soils rich in Ca), exceeding of CaCO 3 concentration 20-30 g/kg is observed in limited cases.
The LUCAS database contains soil sample data from areas with high geological, climatic, and land use diversity. In the testing with statistical models, machine learning models (MLP, trees and random forests) with and without data grouping, applied models and deep learning models, relatively good results were obtained when using applied models coupled to PLS inputs processed with MLP algorithms. These modeling results are consistent with those reported in the literature, provided that the results from different areas are truly comparable. Furthermore, the poor results obtained here for the effects of stratification of spectral data were probably caused by methodology, specifically establishing similarity on the principle of Euclidean distance of vectors, which was significantly influenced by fragments not related to the modeled soil variables.
It should be assumed that the practical use of the Vis -NIR spectral response in prediction of soil properties will be dominated by data -based models. The literature reports present various algorithms of model construction with satisfactory prediction quality. In cases of large and diverse databases, such as LUCAS, PLSR, linear models using PCA or linear step regression do not give fully satisfactory results. Large datasets require more sophisticated methods to identify correlations between the spectral response and soil variables, such as PARACUDA II, the three--level MKL, or deep models for example the convolution 1D network, which did not give the best results in the tests. The applied stacking model gives good prediction results, probably partly due to the presence of many soil variables estimations in the input vector. This can be considered a disadvantage if the aim is to predict one or two soil variables, however, the approach is relatively flexible and the choice of the decision module needs testing. This approach, requires a large training dataset, which, also applies to the convolutional model. The advantage of the convolution network lies in the possibility to develop a pre -trained model based on a large and diverse data set. The CNN model can be "fine -tuned" using a relatively small set of local conditions.
The comparison of the results of modeling soil characteristics from different regions is difficult due to the differences in the variability of the mineralogy, soil types, sample size and other factors influencing the spectral response and the range of data variability. The similarity of physiographic conditions is likely to have a positive effect on the prediction error due to the homogeneity of potential disturbances in the relationship between the examined features and the reflectance of the samples. In addition, the machine learning algorithms present in modeling are characterized by a great flexibility in fixing optimization parameters. This is a factor that forces the randomness of the prediction result and the uncertainty as to the best possible architecture. For this reason, it is impossible to indicate the objectively optimal prediction model.
The conducted tests lead to the following conclusions: 1. The use of machine learning models reduces the error of the prediction of soil features based on the analysis of the spectral response in the Vis -NIR range in relation to the statistical linear models: stepwise regression or partial least squares regression. 2. The relatively best prediction result was obtained using a stack regression model with 0-level data obtained from many MLP models. 3. Clustering of the input data, due to the presence of variables not related to the modeled dependencies, does not improve the prediction results. 4. The influence of factors disturbing the "Vis -NIR -soil features" relationship is visible in the attempt to use the stack regression algorithm with cross--validation: the RMSE of the model built on the basis of the entire set is much lower than the average RMSE values from the validation. 5. The prediction errors of some features (Clay, SOC, CaCO 3 , N and CEC) increase monotonically according to their true value. 6. The use of a single prediction algorithm for data from very diverse geological and soil conditions has significant limitations as a method that replaces traditional laboratory analysis. This does not rule out its usefulness as a data source in soil cartography.