
1. Introduction

In the cement production process, it is essential to monitor the quality of products, such as the fineness of raw meal, the free calcium oxide content of clinkers, and so forth. However, online instruments for these indicators are costly and require frequent maintenance. In industrial practice, off-line laboratory analysis is therefore often performed for these indexes every 2 h or more, which results in untimely feedback for real-time control systems. These problems can be addressed by soft-sensing techniques [1,2].

Soft sensing is essentially a regression machine that evaluates quality indexes in real time using other instrumental variables that are available online. That is, given the D-dimensional input variables $X = \{x_i\}_{i=1}^{N}$ (where each element represents an instance) and their corresponding output variables $Y = \{y_i\}_{i=1}^{N}$, the objective of the regression machine is to construct an optimal mapping function using the knowledge implied in the training data, which achieves high prediction accuracy on the test set. Successful soft-sensing applications can be found in diversified industries such as petroleum refining [3], metallurgical processes [4], and energy management [5,6].

Soft-sensing models originate from multivariate statistical regression models, including linear regression (LR), principal component regression (PCR), partial least squares (PLS), and some variants with regularization strategies to balance the empirical error and complexity of the model, such as the least absolute shrinkage and selection operator (LASSO) and ridge regression [7]. Kernel strategies have been extensively studied and combined with the aforementioned algorithms to solve nonlinear regression problems [8,9]. After that, machine learning methods such as k-nearest neighbor regression (k-NNR) [10], classification and regression trees (CARTs) [11,12], and support vector regression (SVR) [13,14] have been proposed for knowledge mining in massive data. To improve the performance of a single tree model, bagging strategies are implemented in random forest (RF) algorithms [15,16]. Similarly, the prediction accuracy of boosting algorithms can be increased by combining a series of iteratively learned weak machines [17,18], such as gradient boosting machines (GBMs) and extreme gradient boosting (XGBoost). Furthermore, breakthroughs in deep learning in image and speech recognition have made neural networks (NNs) [19,20] one of the most popular methods in the field of machine learning, especially when data samples are sufficient. This popularity can be attributed to NNs' powerful feature extraction capabilities with specially designed structures [21].

Among these algorithms, k-NNR is the simplest and one of the most prevalent regression methods. It is widely used in machine learning problems because it requires neither an explicit model structure nor any prior knowledge of the data distribution. However, the strategy of using the average output of the k-nearest neighbors (k-NNs) as the prediction result is also this method's greatest disadvantage. Initially, the k-NNR algorithm employed the Euclidean distance metric to measure sample similarities. However, the magnitudes of the input features can vary greatly, and redundancies and correlations between variables can also be misleading, resulting in an impractical distance metric. To cope with this problem, a generalization of the Mahalanobis distance [22] was proposed, which is equivalent to a weighted Euclidean distance between two linearly projected images. However, in practical applications, the input features tend to have distinct contributions to the output variables. The key is to develop a reliable feature extraction model and apply classical metrics, such as the Euclidean distance and cosine similarity, to the mapped features. Locally linear embedding (LLE) reconstructs the samples in a low-dimensional space using a locally linear weighting method and achieves dimension reduction by minimizing the reconstruction error [23]. Nevertheless, the adjacency relation constructed by the classical Euclidean metric in a high-dimensional space cannot meet the needs of all classification tasks. Thus, researchers usually try to transform the input features into a scaled space [24,25] and to obtain the weight coefficients that predict the label by means of local reconstruction in that space. However, this approach depends heavily on a careful design of the transformation model. For example, in a fuzzy transformation, the basis function and the division of fuzzy intervals may have a great influence on the prediction result, because the meaningful information contained in the output labels is not fully exploited. To address this issue, Weinberger and Saul [26] introduced the concept of Mahalanobis distance metric learning, which allows the inverse covariance matrix in the Mahalanobis distance to be replaced by any positive semidefinite matrix. Similar to the idea of linear discriminant analysis (LDA) [27], the Mahalanobis distance metric is learned by maximizing the ratio of the average between-class distance to the average within-class distance. Xing et al. [28] constructed a convex optimization problem for metric learning by taking the average between-class distance as the optimization target and the average within-class distance as the constraint. This method has been applied to semi-supervised data clustering problems.

The above methods are mainly designed for classification problems. For regression problems, Nguyen et al. [27] established a convex optimization problem by maximizing the consistency of the input and output distances over a set of constraint triplets in the neighborhood of each instance. However, the researchers did not elaborate on the solution for the transformation matrix A in metric learning; the weight matrix W is optimized only under the condition of a given transformation matrix A. Moreover, the tradeoff parameter C tends to have a significant impact on the performance of the algorithm. Linear metric learning (LML) has limited power in feature representation, especially for high-dimensional samples such as image and text data. Deep metric learning (DML) uses deep neural network (DNN) models instead of linear transformations to extract features in order to achieve metric learning [29–31]. One of the greatest differences between LML and DML lies in the form of the loss function. For example, Song et al. [30] minimized the distances between samples from the same class and maximized, up to a margin, the distances between samples from different classes. In general, these methods involve the construction of triplet sets, which consist of an anchor, a positive point, and a negative point. This implies that they cannot be directly applied to regression problems.

In addition, using the average of the k-NNs as the output prediction often yields conservative results. Take the wine quality assessment dataset from the University of California, Irvine (UCI) machine learning repository as an example: the k-NNR algorithm does not distinguish well between particularly high-grade and inferior wines. So how does a human operator predict the label? First, the operator identifies the cases in the historical data that are most similar to the current sample as references, and then modifies the label according to the change in the input features. We summarize this process and propose the local quadratic embedding learning (LQEL) algorithm. However, the coefficient matrix of the quadratic embedding function is difficult to obtain. Fortunately, the matrix depends on the location of the expansion point, that is, the current sample mentioned above. Thus, the coefficient matrix can be estimated by NNs, taking the current sample as the input. However, an appropriate network scale must be determined; otherwise, the model becomes over-fitted. To this end, ensemble methods that integrate multiple NNs are utilized to improve the generalization ability of the NN model [20,32]. The literature shows that standardizing the output of the hidden layers in the network by batch normalization (BN) can prevent distribution changes during the training process [33], which accelerates the convergence of networks. It has also been pointed out that the dropout strategy can improve the generalization ability of an NN [34]. Moreover, superimposing a certain intensity of Gaussian noise on sample data can increase the number of training samples and thus improve the robustness of the model [35]. In general, these approaches improve the generalization of NNs in two ways: first, by increasing the number of training samples; second, by adding constraints to the network structure, reducing its complexity, and thus improving its predictive ability. This paper follows the latter route.

In this paper, metric learning is first accomplished to determine the neighborhood of a given instance by maximizing the consistency of the distances between the input and output spaces. This makes full use of the information contained in the target labels and achieves the first step of the operators' strategy. Then, a local quadratic coefficient matrix is generated by a well-trained NN to make predictions based on neighboring references; this prevents the model degradation caused by sensor drift and unmeasured variables by means of differential compensation. Furthermore, a second NN assigns weights to the predictions provided by different neighbors according to their confidence, which balances the prediction errors and measurement noises and thereby minimizes the prediction errors. The parameters of these two networks can be optimized by end-to-end training with stochastic gradient descent (SGD) algorithms. Empirical studies on several regression datasets, including two practical industrial datasets from a cement production process and a hydrocracking process, show that, in most cases, the proposed method outperforms popular regression methods.

The rest of this paper is organized as follows. In Section 2, a metric learning model is introduced and the optimization problem is proved to be equivalent to a convex optimization problem. In Section 3, the framework of the proposed LQEL is presented. In Section 4, several empirical studies, including a validation using actual industrial cases, are reported. The conclusions and contributions of this paper are summarized in Section 5.


2. Metric learning

A metric distance is a function $d(\cdot,\cdot)$ that satisfies the following, for any $u$, $v$, and $w$:

(1) Non-negativity: $d(u, v) \geq 0$, where the equality holds if and only if $u = v$;

(2) Symmetry: $d(u, v) = d(v, u)$;

(3) Triangle inequality: $d(u, w) \leq d(u, v) + d(v, w)$.

Given a set of D-dimensional input variables $X = \{x_i\}_{i=1}^{N}$ and corresponding output labels $Y = \{y_i\}_{i=1}^{N}$, metric learning aims to find an implied metric function from these training data. In this metric space, instances with similar output labels are gathered together, and dissimilar samples are pushed far away. Studies in this field focus a great deal of attention on Mahalanobis metric learning (MML) [26] due to its simplicity and clarity. In addition, this problem can usually be transformed into a simple convex optimization, making it extremely convenient to find the global optimum. The model structure of MML is defined as follows:

$$d_M(u, v) = \sqrt{(u - v)^{\mathrm{T}} M (u - v)}$$

where M is a positive definite metric matrix to be learned, and u and v are two different instances. The objective of MML is to obtain the optimal matrix M that meets the purpose of metric learning.
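As an illustration (not code from the paper), the Mahalanobis-type distance above can be evaluated directly once a positive semidefinite M is available; the following minimal NumPy sketch assumes M has already been learned and simply looks up neighbors under it:

```python
import numpy as np

def mahalanobis_distance(u, v, M):
    """d_M(u, v) = sqrt((u - v)^T M (u - v)) for a PSD matrix M."""
    diff = u - v
    return float(np.sqrt(diff @ M @ diff))

def k_nearest_neighbors(x, X_train, M, k=5):
    """Indices of the k training samples closest to x under d_M."""
    dists = np.array([mahalanobis_distance(x, xi, M) for xi in X_train])
    return np.argsort(dists)[:k]
```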

We hope to use the information implied in the output labels to guide the direction of metric learning. The basic principle is that similar input samples lead to similar target labels. The consistency of the distances between the input and output spaces, from a statistical point of view, can be described with the Pearson correlation coefficient. Therefore, the optimization problem is formulated as follows:

where $d^{y}_{ij} = \|y_i - y_j\|^2$ is the squared distance between the ith and jth instances in the target space, $\Delta x_{ij}$ is the difference in the input space, denoted as $\Delta x_{ij} = x_i - x_j$, and N is the sample number. Since the numerator and denominator of the objective function are homogeneous in M, Eq. (2) can be equivalently converted to the optimization problem shown in Eq. (3):

Here, we prove that the above problem has a unique global optimal solution and that the solution can be obtained by relaxing the constraints. The reconstructed problem after constraint relaxation is shown in Eq. (4):

Then, the first-order and second-order partial derivatives of g(M) with respect to M are as follows:

where $\otimes$ denotes the Kronecker product and $\mathrm{vec}(M)$ is the column-wise vectorization of M. For any M, the inequality in Eq. (7) shows that the function g(M) is convex.

This implies that the constraints in Eq. (4) make the feasible domain a convex set. Meanwhile, the second-order partial derivatives of the objective function J(M) with respect to M are calculated as follows:

In summary, the problem in Eq. (4) is demonstrated to be a convex optimization problem; that is, it has a unique global optimal solution [36]. Denote the optimal solution as M*; we claim that g(M*) = 1. Otherwise, 0 < g(M*) < 1. Denote M' = M*/g(M*) and substitute it into Eq. (4). It is not difficult to verify that M' lies within the feasible domain. In addition, the objective satisfies J(M') > J(M*), which contradicts the optimality of M*. This conclusion indicates that the problem in Eq. (3) has a unique global optimal solution, which can be obtained by solving the convex optimization problem in Eq. (4).
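For readers who want to experiment, the relaxed problem can be handed to a generic convex solver. The sketch below is only one plausible instantiation, assuming Eq. (4) maximizes the covariance between the output-space distances and the M-induced input-space distances, subject to a unit bound on the centered spread of the latter and M being positive semidefinite; the paper's exact formulation may differ.

```python
import numpy as np
import cvxpy as cp

def learn_metric(X, y):
    """Illustrative sketch of a relaxed metric-learning problem (assumed form,
    not necessarily the paper's Eq. (4)): maximize the covariance between
    output-space distances and M-induced input-space distances, subject to a
    unit bound on the centered spread of the latter, with M PSD."""
    N, D = X.shape
    # For large N, a random subsample of pairs keeps the problem tractable.
    pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]
    d_y = np.array([(y[i] - y[j]) ** 2 for i, j in pairs], dtype=float)
    d_y_centered = d_y - d_y.mean()

    M = cp.Variable((D, D), PSD=True)
    # dx_ij^T M dx_ij, which is affine in M
    d_m = cp.hstack([
        cp.sum(cp.multiply(M, np.outer(X[i] - X[j], X[i] - X[j])))
        for i, j in pairs
    ])
    d_m_centered = d_m - cp.sum(d_m) / len(pairs)

    objective = cp.Maximize(cp.sum(cp.multiply(d_y_centered, d_m_centered)))
    constraints = [cp.sum_squares(d_m_centered) <= 1.0]
    cp.Problem(objective, constraints).solve(solver=cp.SCS)
    return M.value
```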


3. Local quadratic embedding learning

Most k-NNR algorithms take the weighted average of the neighboring outputs as the prediction result. This leads to moderate predictions, as the prediction cannot exceed the range spanned by the neighboring outputs. Given an instance x, its k-NNs $x_1, x_2, \ldots, x_k$, and their corresponding outputs $y_1, y_2, \ldots, y_k$, an intuitive idea is to take a linear weighted average of the neighboring labels as the prediction. However, the result determined by $\hat{y} = \sum_{j=1}^{k} w_j y_j$ (with $w_j \geq 0$ and $\sum_j w_j = 1$) always satisfies $\min_j y_j \leq \hat{y} \leq \max_j y_j$, which leads to conservative prediction results. To address this issue, we intend to establish a local linear mapping model between the differences in the two spaces, along with an independent model to distinguish the reliability of different neighboring predictions, that is, to assign different weights to the predictions based on different neighbors.

The scheme of the LQEL algorithm is shown in Fig. 1. To obtain the output label corresponding to sample x, the k-NNs are first determined using the conclusion of the metric learning in Section 2 (the ellipse on the left of the figure). Suppose a function f is learned to describe the mapping from the difference in the input space to the difference in the output space. Then, for each $x_j$ adjacent to the instance x, the jth estimation can be performed as $\hat{y}_j = y_j + f(x - x_j)$. Finally, the above prediction results are linearly combined with an appropriate set of weights to obtain the final output: $\hat{y} = \sum_{j=1}^{k} w_j \hat{y}_j$.
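A toy numerical illustration (all values invented) of why the differential correction matters: a plain k-NN average can never leave the range spanned by the neighboring labels, whereas the difference-based estimates above can.

```python
import numpy as np

# Hypothetical 1D example: the true relation is y = 2*x,
# with neighbors at x = 1.0, 1.1, 1.2 and a query at x = 1.5.
x_neighbors = np.array([1.0, 1.1, 1.2])
y_neighbors = 2.0 * x_neighbors              # [2.0, 2.2, 2.4]
x_query = 1.5                                # lies outside the neighbor range

knn_average = y_neighbors.mean()             # 2.2, capped by the neighbor labels
slope = 2.0                                  # a learned local coefficient (here exact)
diff_estimates = y_neighbors + slope * (x_query - x_neighbors)
lqel_style = diff_estimates.mean()           # 3.0, the correct extrapolated value
```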

Denote the real mapping function from input to output as $y = g_0(x)$, and define $\hat{g}_0$ as its second-order Taylor expansion about the point $x_0$ within the δ neighborhood.

Then, for $x \in N_\delta(x_0)$, the difference in the output space can be calculated as follows:

where $N_\delta(x_0)$ represents the δ neighborhood of $x_0$ in the metric space defined in Section 2, and W is the weight coefficient matrix of the resulting linear mapping.
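The following is a hedged reconstruction of the reasoning behind Eq. (10), written for a scalar output for simplicity (the paper's exact notation may differ): taking the difference of the second-order expansion at two nearby points leaves a term that is linear in the input difference.

$$\hat{g}_0(x) = g_0(x_0) + \nabla g_0(x_0)^{\mathrm{T}}(x - x_0) + \tfrac{1}{2}(x - x_0)^{\mathrm{T}} H(x_0)(x - x_0)$$

$$\hat{g}_0(x) - \hat{g}_0(x_0) = \Big[\nabla g_0(x_0) + \tfrac{1}{2} H(x_0)\,\Delta x\Big]^{\mathrm{T}} \Delta x$$

Here $H(x_0)$ denotes the Hessian of $g_0$ at $x_0$ and $\Delta x = x - x_0$; the bracketed coefficient, collected into W, depends on the expansion point (and only weakly on the small difference $\Delta x$). When $g_0$ is exactly quadratic, $H$ is constant and the relation is exact.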


Fig. 1. The scheme of LQEL.

The result of Eq. (10) implies that a linear model can be designed for prediction within the neighborhood. The matrix W expanded at different reference points can be estimated by an independent NN, for example, by using an NN that takes $x_0$ as input and outputs an approximation of W. Considering that the coefficient matrix W tends to vary more slowly with the reference point than the output label itself in most practical circumstances, the NN required here can be much simpler than one used to estimate the output label directly. In particular, when $g_0$ is a quadratic function, the matrix does not change with the reference point; in this case, a simple linear NN works well. In general, these procedures effectively reduce the complexity of the model and improve its generalization.
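A minimal sketch of what such a coefficient-generating network (NN I) might look like in PyTorch; the module name, layer sizes, and the scalar-target assumption are ours rather than the paper's:

```python
import torch
import torch.nn as nn

class CoefficientNet(nn.Module):
    """NN I: maps a reference sample x to a local coefficient vector W(x),
    so that the output difference is approximated by W(x) . (x - x_j).
    Assumes a scalar target; for an m-dimensional target the output would
    grow to m * in_dim entries."""
    def __init__(self, in_dim, hidden=4):   # very small hidden layer, in line
        super().__init__()                  # with the lightweight design
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Linear(hidden, in_dim),      # one coefficient per input feature
        )

    def forward(self, x):                   # x: (batch, in_dim)
        return self.net(x)                  # W(x): (batch, in_dim)
```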

This strategy provides k estimation results for each instance, one from each nearest neighbor, but their reliabilities can vary considerably. From an intuitive perspective, the predictions given by distant neighbors tend to have high uncertainty. This implies that different weights should be assigned to each of the predictions. Prediction uncertainties caused by the presence of measurement noise can be suppressed by averaging. Inspired by this idea, we intend to design a machine that generates different weights according to the relative locations of the instance and its neighbors, thereby minimizing the expectation of the mean square error (MSE).

Denote the measurement noise superimposed on the output label y as $\varepsilon$, which is subject to a normal distribution $N(0, \sigma^2)$. The error of the ith prediction is calculated as follows:

where $\hat{y}_i$ is the estimation given by the ith neighbor, $e_i$ is the estimation error, and $\varepsilon$ represents the uncertainty. The target is to obtain a set of weights that minimize the objective H(w):

This problem can be solved with the Lagrange multiplier method and the Karush–Kuhn–Tucker conditions. All the variables involved in Eq. (12), and hence the optimal weight group for the neighbors of $x_0$, are determined by the instance and its neighbors. Instead of solving the optimization problem explicitly, an NN is introduced here to generate the weights, whose inputs are the differences between the instance and its neighbors, that is, $x_0 - x_i$.
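Likewise, a minimal sketch of the weight-generating network (NN II); the softmax head, which enforces non-negative weights that sum to one, is our reading of the constraint in Eq. (12) and should be treated as an assumption:

```python
import torch
import torch.nn as nn

class WeightNet(nn.Module):
    """NN II: maps the k difference vectors (x - x_j) to k mixing weights."""
    def __init__(self, in_dim, hidden=4):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, diffs):                 # diffs: (batch, k, in_dim)
        s = self.score(diffs).squeeze(-1)     # (batch, k) raw scores
        return torch.softmax(s, dim=-1)       # weights sum to 1 per sample
```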

In summary, the specific framework of the proposed method is shown in Fig. 2. For a certain instance x, the k-NNs are first determined with the results acquired by metric learning in Section 2. Denote these samples as $x_1, x_2, \ldots, x_k$ and the corresponding target labels as $y_1, y_2, \ldots, y_k$.

Second, the coefficient matrix W is calculated for the sample x with NN I and is used to estimate the output differences; the jth prediction for x is calculated as $\hat{y}_j = y_j + W(x - x_j)$. Finally, the weight $w_j$ for each estimation is provided by NN II, taking $(x - x_j)$ as input. The final prediction $\hat{y}$ is obtained as a linear combination of these estimates: $\hat{y} = \sum_{j=1}^{k} w_j \hat{y}_j$.

In this paper, we introduce state-of-the-art strategies for NNs, such as BN and dropout. The MSE is employed as the loss function. The parameters of the proposed model, including the weights and biases in the two NNs, are optimized by the SGD algorithm.
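Putting the pieces together, a condensed end-to-end training sketch under the same assumptions as the two module sketches above (CoefficientNet and WeightNet are the hypothetical modules defined earlier; neighbor indices are assumed to have been precomputed with the learned metric):

```python
import torch

def lqel_forward(coef_net, weight_net, x, nbr_x, nbr_y):
    """x: (B, D); nbr_x: (B, k, D); nbr_y: (B, k). Returns (B,) predictions."""
    diffs = x.unsqueeze(1) - nbr_x                      # (B, k, D)
    W = coef_net(x).unsqueeze(1)                        # (B, 1, D)
    preds = nbr_y + (W * diffs).sum(-1)                 # y_j + W(x) . (x - x_j)
    weights = weight_net(diffs)                         # (B, k), rows sum to 1
    return (weights * preds).sum(-1)

def train(coef_net, weight_net, loader, epochs=100, lr=1e-3):
    """End-to-end SGD training with an MSE loss; `loader` is assumed to yield
    mini-batches (x, nbr_x, nbr_y, y) of roughly 30 samples each."""
    opt = torch.optim.SGD(list(coef_net.parameters()) +
                          list(weight_net.parameters()), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for x, nbr_x, nbr_y, y in loader:
            opt.zero_grad()
            loss = loss_fn(lqel_forward(coef_net, weight_net, x, nbr_x, nbr_y), y)
            loss.backward()
            opt.step()
```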


Fig. 2. The model structure of the proposed method.


4. Empirical learning

To assess how well the proposed algorithm works, we use real-world benchmark regression datasets along with two practical industrial datasets for verification. A series of classical approaches is briefly introduced for comparison with the proposed method. Finally, the experimental results are reported with tables and figures.


4.1. Descriptions of datasets

4.1.1. Benchmark datasets

The details of the datasets [37–39] are shown in Table 1. For example, the red wine dataset shown in the first line contains 1599 samples. Each record contains 12 feature variables and a target label to be predicted. The objective is to establish a mathematical model to evaluate red wine quality through color, composition, and so forth. In this case, the quality of red wine is divided into nine grades from high to low, and only samples between the third and eighth grades are included in the dataset.


Table 1 Details of the datasets used in this paper.

CASP: critical assessment of protein structure prediction; RMSD: root mean square deviation.

4.1.2. Powder fineness dataset

The aim of the first practical industrial application is to make online prediction of powder fineness in the raw meal preparation process. The details of this technological process are presented in Fig. 3. In the raw meal preparation process, raw materials that consist of three or four minerals are transported onto the center of the grinding table. The materials are continuously pushed outward across the rotating grinding table due to centrifugal force. Rocks are crushed into small particles by the squeezing of the grinding rollers and the grinding table before leaving the grinding disk. When high-speed hot wind enters the mill from the bottom, finer particles are blown into the chamber, while larger particles fall to the bottom and are transported back to the entrance of the mill by a bucket elevator. High-speed airflow driven by an induced draft fan brings those finer particles into a high-efficiency dynamic classifier, where unqualified particles fall back to the mill table along the cone and get reground. Fine products gathered from cyclones and the electric dust collector are finally transported into a homogenization silo for storage.


Fig. 3. Process flow chart of the raw meal preparation process.

The most important indicator of this process is the fineness of the product, which further influences the product quality and energy consumption of the subsequent calcination process. However, samples are collected and analyzed every 2 h due to the limited capacity for manual analysis in the lab, resulting in time lags for real-time process control and further resulting in fluctuations in raw meal fineness. Therefore, the aim is to estimate the powder fineness in real time with other available and relevant online variables—that is, to achieve soft sensing for raw meal fineness.

All of the variables that may affect or represent the fineness are considered to be auxiliary variables. These include the current of the draft fan, the current of the classifier, the current of the driven motor, the current of the bucket elevator to transport the product, the current of the bucket elevator to transport the rejected slags, the differential pressure, the inlet temperature, the outlet temperature, the feed quantity, and so forth. In general, an 80 μm sieve residue and a 200 μm sieve residue are considered to be the indicators of raw meal fineness, with the former being more sensitive. Therefore, the dataset is constructed with 14 auxiliary variables and one output label, with a total of 959 instances (about 4 months).

4.1.3. Hydrocracking process dataset

The simplified flow diagram of a typical hydrocracking process is shown in Fig. 4. The feedstock is mixed with externally supplied hydrogen, which is heated to a specified temperature and then enters the two cascade reactors. The first reactor is loaded with a hydrotreating catalyst to remove most of the sulfur and nitrogen, as well as some heavy metal compounds. The second reactor, where the cracking reaction is completed, is loaded with hydrocracking catalyst. In these reactors, low-temperature hydrogen is directly added to absorb the heat released by the exothermic reaction to maintain a stable temperature. The reaction product passes through a high-pressure separator to recycle unreacted hydrogen and then passes through a low-pressure separator to separate some light gases. Finally, the separation of different components is achieved by a fractionation tower. Six kinds of products are collected: light end (LE), light naphtha (LN), heavy naphtha (HN), kerosene (KE), diesel (DI), and bottom oil (BO).

Due to fluctuations in product prices and changes in market supply and demand, the yields of the different products must be reallocated accordingly in order to maximize the total profit. Therefore, it is essential to accurately predict the yield of each product in time to guide operation optimization. In this paper, we take the yield of DI as an example to establish a prediction model. In this problem, the sampling period is 4 h and the dataset covers a total of 15 months. Finally, 2052 samples with 55 related input variables, including the feed mass flow rate, the volume flow of fresh hydrogen gas, and so forth, are collected.


Fig. 4. Process flowchart of the hydrocracking process.


4.2. User-specified parameters

Seven typical regression algorithms are involved in this work:

(1) MML-based k-NNR adopts the MML approach proposed in Ref. [27]. The model first defines the constraints based on triplets and then formulates the optimization problem as a convex quadratic programming problem. In this algorithm, the number of nearest neighbors is the parameter to be determined.

(2) SVR achieves a tradeoff between structural risks and empirical risks by means of the regularization coefficient C and achieves nonlinear mapping by introducing kernel methods. In this paper, different kernels such as the linear kernel, the Gaussian kernel, and the polynomial kernel are compared with each other, and the Gaussian kernel is demonstrated to be better for these regression problems. Thus, the regularization coefficient C and the kernel parameter are to be optimized.

(3) RF is one of the most famous bagging algorithms. It ensembles multiple weak models to reduce the variance of model predictions. Random row sampling and column sampling strategies further improve the generalization ability of the model. In this algorithm, the maximum depth and the number of estimators are optimized by fivefold cross-validation.

(4) XGBoost generates a series of estimators by fitting the residual information with weak models, which are estimated via a second-order Taylor expansion of the loss function. The algorithm has been demonstrated to have a stable and accurate prediction ability in many practical application scenarios. LightGBM [40] is one of the improved branches of XGBoost; it uses a leaf-wise splitting method and applies histogram-based preprocessing to accelerate computing. It has been demonstrated that LightGBM can greatly improve computing performance while maintaining prediction performance. Therefore, we chose LightGBM as one of the comparison methods. The parameters to be tuned include the maximum number of leaf nodes $N_{\mathrm{lgb}}$, the learning rate $lr_{\mathrm{lgb}}$, and the norm regularization coefficient.

(5) NNs are effective tools for solving regression problems. We implemented strategies including BN and dropout, which have been demonstrated to be state of the art in various fields [35]. To be specific, the batch size is chosen to be 30, the dropout rate is 0.3, and the number of hidden neurons is chosen by fivefold cross-validation.

(6) Deep factorization machines (DeepFMs) [41] have made great progress in click-through rate (CTR) prediction [42] and stock market prediction [43] tasks. A DeepFM aims to learn both low- and high-order feature interactions by combining a factorization machine (FM) and a DNN. The embedding vector obtained by the FM is used as the initial embedding state of the algorithm. We apply it to regression problems in which there is only one feature attribute in each field. A two-layer NN is used, which includes BN layers and dropout layers. In this algorithm, the embedding dimension and the number of hidden layers in the NN need to be optimized by cross-validation.

(7) DML-based k-NNR algorithms aim to find essential features by using DML algorithms and to find the most similar samples based on these features. Since this paper solves regression problems, it is impossible to construct triplet sets [29–31]. Following the principle that similar inputs lead to similar outputs, we employ the loss function, as shown in Eq. (13):

where $\phi(\cdot)$ represents the embedding operator obtained by metric learning. On this basis, similar points in the feature space have similar labels; then, the k-NNR algorithm can be used to make a prediction. The parameters in this predictor include the embedding dimension $d_{\mathrm{dml}}$ and the number of nearest neighbors $k_{\mathrm{dml}}$.
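Because Eq. (13) is not reproduced above, the following PyTorch sketch shows only one plausible loss that is consistent with the stated principle that similar inputs lead to similar outputs; it is not necessarily the paper's exact form:

```python
import torch

def dml_regression_loss(z, y):
    """Penalize mismatch between pairwise embedding distances and pairwise
    label distances within a batch (a plausible reading of Eq. (13)).
    z: (B, d) embeddings produced by the DML network; y: (B,) labels."""
    dz = torch.cdist(z, z)                          # (B, B) embedding distances
    dy = (y.unsqueeze(0) - y.unsqueeze(1)).abs()    # (B, B) label distances
    return ((dz - dy) ** 2).mean()
```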

For the proposed LQEL algorithm, the parameters that need to be determined include the dimension of the metric embedding, the numbers of hidden neurons in the two employed NNs, and the number of neighbors $k_{\mathrm{lqel}}$. In this paper, fivefold cross-validation tests are carried out for each of these parameters to obtain the best selection scheme. The user-specified parameters utilized in our experiments are provided in Table 2.
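As an illustration of the tuning procedure described above (the grids, estimators, and variable names are placeholders rather than the paper's settings), fivefold cross-validation with scikit-learn might look as follows:

```python
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

cv = KFold(n_splits=5, shuffle=True, random_state=0)

rf_search = GridSearchCV(
    RandomForestRegressor(),
    {"max_depth": [4, 8, 16], "n_estimators": [100, 300, 500]},
    cv=cv, scoring="neg_mean_squared_error",
)
svr_search = GridSearchCV(
    SVR(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    cv=cv, scoring="neg_mean_squared_error",
)
# rf_search.fit(X_train, y_train); svr_search.fit(X_train, y_train)
# rf_search.best_params_ then reports the selected hyper-parameters
```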


Table 2 Hyper-parameters employed in case study.

The results in the table show that the number of nearest neighbors in the LQEL varies across datasets. First, it depends on the scale of the dataset, which determines the density of the samples in the space. For example, in the critical assessment of protein structure prediction (CASP) dataset, the instances are numerous enough for the neighbors to serve as good references for prediction. This implies that a large number of nearest neighbors can effectively improve the prediction ability of the model. However, for the industrial fineness dataset, limited samples are available for modeling. In addition, it is difficult to use the values of the instrumental variables for state representation. For example, the quantity of slag rejection in a vertical roller mill (VRM) is often evaluated by the current of the bucket elevator, but current drift occurs when regular maintenance is carried out (approximately once every 2 days), especially when lubricating oil is added. Therefore, it is necessary to pay more attention to changes in the current. Under these circumstances, the nearest neighbors in the space may not be as instructive as those of the CASP dataset; therefore, the model chooses a small number of neighboring samples for prediction. The table also implies that the proposed LQEL model with simple feedforward NNs can perform well in regression problems. Compared with the feedforward NN model, there are fewer hidden neurons in the LQEL model (no more than four), and a smaller number of parameters must be estimated. This reduces the model complexity, thereby improving the model generalization.


4.3. Performance comparison of different datasets

To compare the performance of the proposed method with the abovementioned classical methods, a total of nine regression problems on seven datasets were used. Each experiment was repeated 30 times, and the MSE and mean absolute error (MAE) on the test sets were recorded. Then, statistical analyses were carried out on these indexes to validate the robustness of the algorithm.

Table 3 shows the average indexes of each algorithm on different datasets. The best performance for each line is marked in bold. It can be seen that, for the nine verification tests listed below, the LQEL algorithm proposed in this paper achieves the best performance on most of the datasets. Moreover, the LQEL algorithm achieves a performance comparable to those of the best-performing LightGBM and RF algorithms, and it has clear advantages when compared with other algorithms.


Table 3 Performance comparison of different algorithms.

Bold values in each line indicate the best performance among different algorithms.

Moreover, to evaluate the robustness of the algorithms, it is necessary to compare the distributions of the obtained indexes. The MSE and MAE distributions of the repeated tests are shown as box plots in Figs. 5 and 6, respectively. The figures imply that the LQEL has the most remarkable stability on most datasets, except for the wine quality, CASP, and fineness datasets. Although its performance fluctuates slightly more than that of some of the other algorithms on these datasets, its overall MSE and MAE are significantly lower; that is, the algorithms with more stable performance often sacrifice precision as the cost. In particular, strategies such as dropout, batch learning, and BN are implemented in both the NN and LQEL algorithms, but the latter outperforms the former.


Fig. 5. MSE box plots of algorithms tested on different datasets. MML: MML-based k-NNR; LGB: LightGBM; DML: DML-based k-NNR.


Fig. 6. MAE box plots of algorithms tested on different datasets.

Figs. 7 and 8 show scatter plots of the prediction results of the different algorithms on the two industrial datasets, in which the abscissa is the ground-truth value and the ordinate is the predicted value. The coefficient of determination (R²) is marked in the top left corner and indicates that the LQEL algorithm has advantages over the other algorithms in these two soft-sensing applications. This can be attributed to two aspects:


Fig. 7. Scatter plots of the prediction results for different algorithms on the fineness dataset.


Fig. 8. Scatter plots of the prediction results for different algorithms on the hydrocracking dataset.

(1) The absolute values of the variables in these industrial datasets cannot describe the process state well. The method proposed in this paper makes corrections to the nearest neighbors according to the changes in the auxiliary variables, which puts greater emphasis on the differences and thus reduces the risk posed by the above problem.

(2) This method employs two extremely simple NNs to achieve LQEL. One NN aims to find the coefficients of local quadratic functions, and the other realizes the weight assignment for predictions given by nearest neighbors. Based on these advantages, the generalization ability of the proposed algorithm can be effectively improved.


5. Conclusions

This paper proposed an LQEL algorithm for regression problems. MML is first improved by optimizing the consistency of the distances between samples in the input and output spaces. By relaxing the constraints, the modified problem is proved to be a convex optimization problem while retaining the same solution as the original problem. On this basis, a locally quadratic embedding model is developed, and different weights are assigned to the prediction results to minimize the expectation of the prediction error. In this framework, two extremely simple NNs are implemented to learn the quadratic embedding matrix and the weight assignments of the neighboring predictions. We aim to build a unified end-to-end model that prevents the independent two-layer optimization from getting stuck in a local optimum. The proposed LQEL model has the following advantages:

• A global consistency for distances in the input and output space is achieved via improved metric learning.

• The information contained in output labels is better exploited, which leads to a better determination of the neighborhood for a certain instance.

• An LQEL framework is proposed based on the local quadratic embedding hypothesis. Two specially designed networks improve generalization by simplifying the model structure from a global and a local perspective, respectively.

• The experimental results show that the LQEL can achieve more precise and comparably robust predictions while employing only lightweight NNs.


Acknowledgments

This work was supported by the National Key Research and Development Program of China (2016YFB0303401), the International (Regional) Cooperation and Exchange Project (61720106008), the National Science Fund for Distinguished Young Scholars (61725301), and the Shanghai AI Lab.


Compliance with ethics guidelines

Yaoyao Bao, Yuanming Zhu, and Feng Qian declare that they have no conflict of interest or financial conflicts to disclose.