Abstract: The weighted total least-squares (WTLS) estimate is very sensitive to outliers in the partial errors-in-variables (EIV) model. This paper presents a new outlier detection procedure based on data snooping. First, a two-step iterated method for computing the WTLS estimates of the partial EIV model, based on standard least-squares (LS) theory, is proposed. Second, the corresponding w-test statistics are constructed to detect outliers when the observations and the coefficient matrix are contaminated, and a specific detection algorithm is suggested. When the variance factor is unknown, it may be estimated by the least median of squares (LMS) method. Finally, simulated and real data for the two-dimensional affine transformation are analyzed. The numerical results show that the new test procedure can determine whether the outliers are located in the x component, the y component or both components of the coordinates when the observations and the coefficient matrix are contaminated with outliers.
Keywords: Partial EIV model, Two-step iterated method, Weighted total least-squares, Outlier detection, Data-snooping, Two-dimensional affine transformation.
Resumo: O estimador dos Mínimos Quadrados Total é muito sensível à presença de outliers no modelo de observações de erro. Neste trabalho apresenta-se um novo modelo para detecção de outliers baseado na técnica data-snooping. Primeiro, é proposto um método iterativo para determinar o estimador dos Mínimos Quadrados Total na teoria dos Mínimos Quadrados. Em seguida, o teste estatístico w é construído para detectar outliers enquanto as observações e a matriz de coeficientes são contaminadas com a presença de outliers, sendo sugerido um algoritmo específico para detecção de outliers. Quando o fator de variância é desconhecido, ele deve ser estimado pelo método dos Mínimos Quadrados Medianos. Foram analisados dados simulados e reais. Os resultados numéricos mostraram que o método proposto é capaz de identificar se os outliers se encontram nas componentes em x ou em y, enquanto as observações e a matriz de coeficientes são contaminadas com outliers.
Palavras-chave: Modelo EIV Parcial, Método Iterativo Two-step, Estimador dos Mínimos Quadrados Total, Detecção de Outlier, Data-snooping, Transformação Afim bidimensional.
Article
OUTLIER DETECTION IN PARTIAL ERRORS-IN-VARIABLES MODEL
Detecção de outliers usando um modelo de observações de erro
Received: 18 July 2015
Accepted: 11 June 2016
The Gauss-Markov (G-M) model and the least-squares (LS) method are widely used in geodetic science. In many applications, such as coordinate transformation (Akyilmaz, 2007; Li et al., 2012; Li et al., 2013; Fang, 2014), the elements of the coefficient matrix are themselves observations with statistical properties. The LS estimates of the unknown parameters are then no longer optimal, because the statistical properties of the elements of the coefficient matrix are ignored. The errors-in-variables (EIV) model and the so-called total least-squares (TLS) method, named by Golub et al. (1980), are more rigorous than the LS method in this case. Many algorithms exist for computing the TLS estimate (Golub et al., 1980; Schaffrin, 2006) or the weighted TLS (WTLS) estimate (Schaffrin and Wieser, 2008; Shen et al., 2011; Xu et al., 2012; Amiri-Simkooei and Jazaeri, 2012; Mahboub, 2012; Fang, 2013; Jazaeri et al., 2014).
Unfortunately, like the LS estimate, the WTLS estimate is extremely vulnerable to outliers in the EIV model. Although methods for detecting outliers in the G-M model have been investigated extensively (Baarda, 1968; Pope, 1976; Kok, 1984; Huber 1981; Hekimoglu, 2005; Gui et al. 1999, 2005a, 2005b, 2007, 2011; Guo et al., 2007; Hekimoglu and Erenoglu, 2009; Lehmann, 2013; Hekimoglu et al., 2014), they cannot be applied directly to outliers in the EIV model. Schaffrin and Uzun (2011) generalized the mean-shift method to detect a single outlier located either in the observations or in the coefficient matrix of the EIV model. The reliability was also analyzed (Schaffrin and Uzun, 2012). Amiri-Simkooei and Jazaeri (2013) applied the data-snooping procedure to identify outliers based on the WTLS method formulated with standard LS theory (Amiri-Simkooei and Jazaeri, 2012). However, that test procedure has to be carried out more than once when the same random elements are repeated at different locations of the coefficient matrix, as in the two-dimensional affine transformation.
The partial EIV model is a generalized EIV model that avoids the need to consider the correlations between repeated random elements in the coefficient matrix (Xu et al., 2012). It is therefore better suited to cases in which the coefficient matrix is structured. Unfortunately, test statistics for detecting outliers cannot be clearly derived from the existing WTLS methods. For this reason, a new two-step iterated approach for computing the WTLS estimates within the framework of LS theory is developed in this paper, so that test statistics for identifying outliers in the partial EIV model can be constructed.
The remainder of the paper is organized as follows. In Section 2, a two-step iterated method for the partial EIV model that takes advantage of LS theory is proposed. In Section 3, the corresponding w-test statistics are constructed to detect outliers when the observations, the coefficient matrix or both are contaminated, and an algorithm for detecting outliers in the partial EIV model is designed. If the variance factor is not known, the least median of squares (LMS) method is employed to estimate it. In a later section, simulated data and real data for the two-dimensional affine transformation are used to verify the validity of the proposed method. Finally, some concluding remarks are presented.
In practice, not all elements of the coefficient matrix are random, and some random elements are repeated at different locations of the coefficient matrix, as in coordinate transformation. As a result, the correlations between the repeated random elements must be taken into account. The five rules (Mahboub, 2012) can be used to determine the variance-covariance matrix of the coefficient matrix. However, if the partial EIV model proposed by Xu et al. (2012) is adopted, these correlations need not be considered and the additional burden is reduced. The partial EIV model is therefore preferable. The functional model is as follows:
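$$L = (X^{T} \otimes I_n)(h + B\bar{a}) + \Delta, \qquad a = \bar{a} + e \tag{1}$$

where $X$ = t×1 vector of unknown parameters; $L$ = n×1 vector of observations; $I_n$ = n×n identity matrix; $h$ = nt×1 vector consisting of zeros and the fixed elements of the coefficient matrix $A$; $B$ = nt×s known structured matrix; s = the number of distinct random elements of $A = \mathrm{invec}(h + B\bar{a})$; $\bar{a}$ = s×1 vector of the true values of $a$; $e$ = s×1 vector of random errors of $a$; $\Delta$ = n×1 vector of random errors of the observations; invec is a mathematical function that transforms an nt×1 vector into an n×t matrix; $\otimes$ = Kronecker product operator. The stochastic model is expressed as follows:

$$\begin{bmatrix} \Delta \\ e \end{bmatrix} \sim \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \sigma^{2} \begin{bmatrix} Q_L & 0 \\ 0 & Q_a \end{bmatrix} \right) \tag{2}$$

where $Q_L$ = n×n cofactor matrix of $L$; $Q_a$ = s×s cofactor matrix of $a$; $\sigma^2$ = unknown variance factor.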
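The role of the Kronecker product and of invec in the functional model can be checked numerically. The following is a minimal sketch, assuming column-wise stacking for vec and a small made-up problem; the helper name `invec`, the layout of `h` and `B`, and all numbers are illustrative and not taken from the paper.

```python
import numpy as np

def invec(v, n, t):
    """Reshape an (n*t)-vector into an n x t matrix, column by column."""
    return np.asarray(v).reshape((n, t), order="F")

# Hypothetical toy problem: n = 3 observations, t = 2 parameters.
n, t = 3, 2
# Coefficient matrix A = [a_i, 1]: first column random, second column fixed.
a_bar = np.array([1.0, 2.0, 3.0])               # s = 3 distinct random elements
h = np.concatenate([np.zeros(n), np.ones(n)])   # fixed part of vec(A)
B = np.vstack([np.eye(n), np.zeros((n, n))])    # places a_bar in column 1 of A

A = invec(h + B @ a_bar, n, t)
X = np.array([0.5, -1.0])

# (X^T kron I_n) vec(A) equals A X, so both forms of the model agree.
lhs = np.kron(X.reshape(1, -1), np.eye(n)) @ (h + B @ a_bar)
rhs = A @ X
print(np.allclose(lhs, rhs))
```

The point of the construction is that only the s random elements enter through `a_bar`, while the fixed entries of A live in `h`.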
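A two-step iterated method for computing the WTLS estimate of the partial EIV model is proposed in order to develop an outlier detection method suited to this model. For any given $X^{(0)}$, the model (1) can be transformed as follows:

$$L - (X^{(0)T} \otimes I_n)h = (X^{(0)T} \otimes I_n)B\bar{a} + \Delta, \qquad a = \bar{a} + e \tag{3}$$

Furthermore, the model (3) can be rewritten as

$$l = C\bar{a} + \varepsilon \tag{4}$$

where $l = \begin{bmatrix} L - (X^{(0)T} \otimes I_n)h \\ a \end{bmatrix}$, $C = \begin{bmatrix} (X^{(0)T} \otimes I_n)B \\ I_s \end{bmatrix}$ and $\varepsilon = \begin{bmatrix} \Delta \\ e \end{bmatrix}$. If we denote

$$P = \begin{bmatrix} P_L & 0 \\ 0 & P_a \end{bmatrix} = \begin{bmatrix} Q_L^{-1} & 0 \\ 0 & Q_a^{-1} \end{bmatrix} \tag{5}$$

the estimate of $\bar{a}$ can be derived by the LS principle (Koch, 1999). As a result, we have

$$\hat{\bar{a}} = (C^{T} P C)^{-1} C^{T} P\, l \tag{6}$$

The residual vector of a is

$$V_a = \hat{\bar{a}} - a \tag{7}$$

Inserting $\hat{\bar{a}}$ into the first equation of the model (1) yields

$$L = (X^{T} \otimes I_n)(h + B\hat{\bar{a}}) + \Delta \tag{8}$$

If the inverse transformation of the mathematical operator vec (invec) is used, we can obtain

$$\hat{A} = \mathrm{invec}(h + B\hat{\bar{a}}), \qquad (X^{T} \otimes I_n)(h + B\hat{\bar{a}}) = \hat{A}X \tag{9}$$

Then the model (8) is easily rewritten as follows:

$$L = \hat{A}X + \Delta \tag{10}$$

Similarly, based on the LS principle (Koch, 1999), the estimate of X is

$$\hat{X} = (\hat{A}^{T} P_L \hat{A})^{-1} \hat{A}^{T} P_L L \tag{11}$$

and the residual vector of L is

$$V_L = \hat{A}\hat{X} - L \tag{12}$$

The posterior estimate of the variance factor, which can be obtained from Equation 7 and Equation 12, is

$$\hat{\sigma}^{2} = \frac{V_L^{T} P_L V_L + V_a^{T} P_a V_a}{n - t} \tag{13}$$

The data-snooping method suggested by Baarda (1968) is employed extensively in geodetic data processing for detecting outliers (Kok, 1984; Koch, 1999). If the observations or the coefficient matrix in the partial EIV model are contaminated with outliers, the following w-test statistics can be constructed based on Equation 6 or Equation 11 to detect them: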
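$$w_{a_i} = \frac{c_i^{T} P_a V_a}{\sigma \sqrt{c_i^{T} P_a Q_{V_a} P_a c_i}} \sim N(0,1) \tag{14}$$

$$w_{L_j} = \frac{c_j^{T} P_L V_L}{\sigma \sqrt{c_j^{T} P_L Q_{V_L} P_L c_j}} \sim N(0,1) \tag{15}$$

where $V_a$ and $V_L$ are the residual vectors of a and L; $P_a = Q_a^{-1}$ and $P_L = Q_L^{-1}$; $Q_{V_a}$ and $Q_{V_L}$ are the cofactor matrices of $V_a$ and $V_L$; $c_i$ and $c_j$ are unit vectors with the ith and jth element equal to 1, respectively; N(0,1) represents the standard normal distribution.

In general, when the variance factor is unknown, its posterior estimate $\hat{\sigma}^{2}$ can be adopted (Pope, 1976). Then we have

$$\hat{w}_{a_i} = \frac{c_i^{T} P_a V_a}{\hat{\sigma} \sqrt{c_i^{T} P_a Q_{V_a} P_a c_i}} \sim \tau_n \tag{16}$$

and

$$\hat{w}_{L_j} = \frac{c_j^{T} P_L V_L}{\hat{\sigma} \sqrt{c_j^{T} P_L Q_{V_L} P_L c_j}} \sim \tau_n \tag{17}$$

where $\tau_n$ = τ distribution with n degrees of freedom. The computation of the τ distribution can be found in Baselga (2007) and Guo and Zhao (2012).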
Robust estimation offers an efficient way to estimate the variance factor. By employing the least median of squares (LMS) method (Rousseeuw and Leroy, 1987), the variance factor may be estimated by
(18)
or
(19)
With (18) and (19), the test statistics (14) and (15) become
(20)
and
(21)
The advantage of these two test statistics is that they are robust to outliers and are therefore more reliable for outlier detection. Note that they do not strictly follow a normal distribution, and their exact probability distributions are very hard to derive. To simplify the computation of the threshold value used to identify outliers, the upper percentage point of the standard normal distribution is nevertheless used in the identification rule.
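As an illustration of a robust scale estimate, the sketch below computes the preliminary LMS scale of Rousseeuw and Leroy (1987), 1.4826 (1 + 5/(n − p)) sqrt(med r_i²), from a set of residuals. Whether Equations 18 and 19 use exactly this form is an assumption; the function name and the sample residuals are hypothetical.

```python
import math
import statistics

def lms_scale(residuals, n_params):
    """Preliminary LMS scale: 1.4826 * (1 + 5/(n - p)) * sqrt(median(r_i^2))."""
    n = len(residuals)
    if n <= n_params:
        raise ValueError("need more residuals than parameters")
    med_sq = statistics.median(r * r for r in residuals)
    return 1.4826 * (1.0 + 5.0 / (n - n_params)) * math.sqrt(med_sq)

# Hypothetical residuals (metres); the last one mimics a gross error.
res = [0.010, -0.012, 0.008, -0.009, 0.011, 0.100]
sigma_lms = lms_scale(res, n_params=2)
# Classical estimate from the sum of squared residuals, for comparison.
sigma_rms = math.sqrt(sum(r * r for r in res) / (len(res) - 2))
print(f"LMS scale: {sigma_lms:.4f}, classical estimate: {sigma_rms:.4f}")
```

Because the median ignores the size of the largest residuals, the LMS scale is much less inflated by the gross error than the classical estimate.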
The procedure for detecting outliers in the partial EIV model is summarized as follows:
Step 1. Give a, L, h, B, QL and Qa, and define the corresponding weight matrices.
Step 2. Set the initial value X(0).
Step 3. For any k, compute the quantities of Equations 4 to 6 with the current estimate X(k), which give the estimate of the true values of a.
Step 4. Compute the updated coefficient matrix from Equation 9.
Step 5. Compute the new estimate X(k+1) from Equation 11 and the residual vectors.
Step 6. If the change between two successive estimates of X is smaller than ε, the iteration stops, where ε is a given value. Otherwise, return to Step 3.
Step 7. Compute the estimate of the variance factor and the w-test statistics of a and L.
Step 8. According to the data-snooping procedure, for a single outlier, if the w-test statistics of ai and Lj both exceed uα in absolute value, one can judge that the outlier is located in the observation equation containing the observation Lj and the coefficient matrix element ai. For multiple outliers, if several pairs of w-test statistics satisfy these conditions, the corresponding observation equations containing the observations Lj and coefficient matrix elements ai are deemed to be contaminated with outliers. However, one still cannot confirm whether the outliers are located in the observations, the coefficient matrix or both. Here uα is the upper α-percentage point of the standard normal distribution.
Step 9. If multiple outliers exist in the observations or the coefficient matrix, Steps 1 to 8 should be repeated until all the w-test statistics are smaller than the threshold value.
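The steps above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the starting value (weighted LS with the observed a), the residual cofactor matrices (derived separately for each LS step from standard LS theory) and the Baarda-type form of the w statistics are all assumptions, and the straight-line data are made up.

```python
import numpy as np

def two_step_wtls(L, a, h, B, QL, Qa, t, tol=1e-10, max_iter=100):
    """Alternate LS estimation of a_bar (given X) and of X (given a_bar)."""
    n = len(L)
    PL, Pa = np.linalg.inv(QL), np.linalg.inv(Qa)
    invec = lambda v: v.reshape((n, t), order="F")    # column-wise un-stacking
    A = invec(h + B @ a)
    X = np.linalg.solve(A.T @ PL @ A, A.T @ PL @ L)   # assumed starting value
    for _ in range(max_iter):
        # First step: with X fixed, the model is linear in a_bar.
        K = np.kron(X.reshape(1, -1), np.eye(n))      # X^T kron I_n
        S = K @ B
        N = S.T @ PL @ S + Pa
        a_hat = np.linalg.solve(N, S.T @ PL @ (L - K @ h) + Pa @ a)
        # Second step: with a_bar fixed, weighted LS for X.
        A = invec(h + B @ a_hat)
        M = A.T @ PL @ A
        X_new = np.linalg.solve(M, A.T @ PL @ L)
        done = np.linalg.norm(X_new - X) < tol
        X = X_new
        if done:
            break
    Va, VL = a_hat - a, A @ X - L
    sigma2 = (VL @ PL @ VL + Va @ Pa @ Va) / (n - t)
    # Residual cofactor matrices of each LS step (assumption, see lead-in).
    QVa = Qa - np.linalg.inv(N)
    QVL = QL - A @ np.linalg.solve(M, A.T)
    return X, a_hat, Va, VL, sigma2, QVa, QVL

def w_statistics(V, P, QV, sigma):
    """Baarda-type w statistics: (P V)_i / (sigma * sqrt((P QV P)_ii))."""
    return (P @ V) / (sigma * np.sqrt(np.diag(P @ QV @ P)))

# Hypothetical example: straight line L = X1 * a + X2 with errors in a and L.
rng = np.random.default_rng(0)
n, t, sigma0 = 8, 2, 0.01
a_true = np.linspace(0.0, 7.0, n)
X_true = np.array([2.0, 1.0])
a_obs = a_true + sigma0 * rng.standard_normal(n)
L_obs = X_true[0] * a_true + X_true[1] + sigma0 * rng.standard_normal(n)
L_obs[3] += 0.1                                       # simulated outlier
h = np.concatenate([np.zeros(n), np.ones(n)])
B = np.vstack([np.eye(n), np.zeros((n, n))])
X, a_hat, Va, VL, s2, QVa, QVL = two_step_wtls(
    L_obs, a_obs, h, B, np.eye(n), np.eye(n), t)
w_a = w_statistics(Va, np.eye(n), QVa, sigma0)
w_L = w_statistics(VL, np.eye(n), QVL, sigma0)
print(np.round(w_a, 2), np.round(w_L, 2))
```

Each pass minimizes the same weighted sum of squared residuals, first over the true values of a and then over X, so the iteration is a block-wise descent on the WTLS objective. The w statistics of a and L for the same observation equation are then compared with the threshold as in Step 8.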
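The mathematical model for the two-dimensional affine transformation is expressed as follows:

$$x_t = a_1 x_s + b_1 y_s + c_1, \qquad y_t = a_2 x_s + b_2 y_s + c_2 \tag{22}$$

where $(x_s, y_s)$ and $(x_t, y_t)$ are the coordinates of a point in the start system and the target system, respectively, and $a_1, b_1, c_1, a_2, b_2, c_2$ are the six transformation parameters.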

The data, taken from Amiri-Simkooei and Jazaeri (2013), are displayed in Table 1. In this example there are ten points in total, so the partial EIV model is
(23)
where the vectors and matrices are the quantities of model (1) written out for the ten points.
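The structure of h and B for the affine case can be sketched as follows. The ordering of the unknowns, X = (a1, b1, c1, a2, b2, c2), of the observations (all xt first, then all yt) and of the random elements, a = (xs of all points, then ys of all points), is an assumption made for this illustration, as are the function name and the sample coordinates.

```python
import numpy as np

def affine_partial_eiv(xs, ys):
    """Return (h, B, a) such that invec(h + B a) is the affine design matrix."""
    m = len(xs)                  # number of points
    n, t, s = 2 * m, 6, 2 * m    # observations, parameters, random elements
    h = np.zeros(n * t)
    B = np.zeros((n * t, s))
    idx = lambda row, col: col * n + row   # position in column-wise vec(A)
    for i in range(m):
        # Row i (x_t equation): [x_s, y_s, 1, 0, 0, 0]
        B[idx(i, 0), i] = 1.0
        B[idx(i, 1), m + i] = 1.0
        h[idx(i, 2)] = 1.0
        # Row m + i (y_t equation): [0, 0, 0, x_s, y_s, 1]
        B[idx(m + i, 3), i] = 1.0
        B[idx(m + i, 4), m + i] = 1.0
        h[idx(m + i, 5)] = 1.0
    a = np.concatenate([xs, ys])
    return h, B, a

xs = np.array([0.0, 10.0, 10.0, 0.0])   # hypothetical start coordinates
ys = np.array([0.0, 0.0, 10.0, 10.0])
h, B, a = affine_partial_eiv(xs, ys)
A = (h + B @ a).reshape((8, 6), order="F")
print(A)
```

Each start coordinate appears twice in the coefficient matrix but only once in a, which is why the partial EIV formulation does not need correlations between repeated elements.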
To evaluate the proposed outlier detection method reliably, the following five schemes for adding outliers are discussed. The significance level for determining the critical value is set to 0.05, a very commonly used value (Gao et al. 1992).
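For a two-sided test at the 0.05 significance level, the critical value is the 0.975 quantile of the standard normal distribution. A minimal sketch using Python's standard library follows; the variable and function names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
u = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 point of N(0, 1)
print(round(u, 2))

def is_suspect(w, threshold=u):
    """Flag a w-test statistic whose absolute value exceeds the threshold."""
    return abs(w) > threshold
```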
Scheme 1: Following Amiri-Simkooei and Jazaeri (2013), an outlier of magnitude 0.1 m, which is 10 times the a priori standard deviation, is added to the xs component of point 4 in the start system.
The residuals of the observations and of the random vector a, together with the corresponding w-test statistics, are displayed in Table 2. Clearly, the absolute values of the residuals of the x components of point 4 in the start system and the target system are greater than the others. Moreover, both corresponding w-test statistics exceed the threshold value u0.975 = 1.96. We therefore conclude that there is an outlier in the x component of the start system, the target system or both, which agrees with the simulated case. However, we cannot determine the specific position of the outlier.
Scheme 2: An outlier of magnitude 0.1 m is added to both components of point 4 in the start system.
The residuals and w-test statistics are shown in Table 3. The absolute values of the residuals of the x components of point 4 in both coordinate systems are greater than the others. In particular, both w-test statistics for the x component of point 4 are beyond the threshold value 1.96, and the w-test statistic for the yt component of point 4 in the target system exceeds 1.96 too. Although the w-test statistic for the ys component of point 4 in the start system is smaller than the threshold value 1.96, the absolute values of the w-test statistics and of the corresponding residuals are very large. Thus, both components of point 4 are considered to be contaminated with outliers. Unfortunately, we cannot discriminate the specific positions of these outliers.
Scheme 3: An outlier of magnitude 0.1 m is added to the xs component of point 4 in the start system and to the yt component of point 4 in the target system.
The residuals and the w-test statistics are displayed in Table 4. Both w-test statistics for the x component of point 4 exceed the threshold, which shows that the x component of point 4 is possibly contaminated with an outlier. Although the absolute value of the residual for the yt component of point 4 in the target system is small, the corresponding w-test statistic and the absolute value of the residual for the ys component of point 4 in the start system demonstrate that there is an outlier in the y component.




Scheme 4: An outlier of magnitude 0.1 m is added to the y component of point 4 in both the start system and the target system.
The results are presented in Table 5. It is easy to see from Table 5 that both w-test statistics of the y component of point 4 exceed 1.96, whereas the absolute values of the other w-test statistics are smaller than 1.96. This means that only the y component of point 4 contains an outlier, which is consistent with the simulated case. If point 4 is deleted in both coordinate systems, new residuals and w-test statistics are obtained, which are displayed in Table 6. All w-test statistics of a and L are then smaller than the threshold value 1.96, which demonstrates that the remaining observations are free of outliers.
Schemes 1 to 4 only cover the case in which the outliers are located at the same point in the two systems. In practice, there may be multiple outliers at different points in the two-dimensional coordinate transformation. Hence, Scheme 5 below is used to assess the efficiency of the proposed procedure for detecting multiple outliers in the partial EIV model.
Scheme 5: In this simulation, two outliers of magnitude 0.1 m are added to the xs component of point 2 in the start system and the yt component of point 4 in the target system, respectively.
The detailed results for the residuals and w-test statistics are listed in Table 7. The w-test statistics of the x component of point 2 in both systems indicate that this component contains an outlier. Likewise, judging from the w-test statistics of the y component of point 4, this component is probably contaminated with an outlier. Because the outliers may lie at different locations, point 2 is first deleted in both coordinate systems. The new residuals and w-test statistics can be found in Table 8. According to the criterion for identifying outliers, there is evidently an outlier in the y component in the start system, the target system or both. As a result, point 4 should be deleted in both coordinate systems. After the outlying observations are removed, the new residuals and w-test statistics are presented in Table 9, which indicates that no outlier remains in the observations of either coordinate system.


This example concerns map rectification, in which the 2D affine transformation is used to rectify a map. The scale of the map in Figure 1 is 1:500.

There are ten common points whose theoretical coordinates are known in advance, and their coordinates are sampled on the distorted map. The sampled coordinates and the theoretical coordinates of the common points are treated as the coordinates in the start system and the target system, respectively, and are displayed in Table 10.
The transformation parameters can be estimated from the common points with the 2D affine transformation. By employing the proposed algorithm, the residuals and w-test statistics of the observations and of the random vector a are derived, as shown in Table 11. Because the w-test statistics of point 7 exceed the threshold in both systems, point 7 is suspected to be an outlier and should be deleted. The new residuals and w-test statistics can be found in Table 12. For point 9, one w-test statistic exceeds the threshold but the w-test statistic in the target system does not; according to the criterion for identifying outliers in Section 3, there are therefore no further outliers in the observations. Thus only one outlier is identified. After that, the transformation parameters are estimated by the WTLS method, and the results are presented in Table 13. To check the reliability of the proposed method, fifteen non-common points are employed to evaluate its performance, and the root mean square error (RMSE) is used to judge the influence of the outlier on the coordinates. The RMSE is 0.00892 for the data-snooping procedure but 0.032786 for the WTLS method with outliers. The reason is that the transformation parameters estimated by the WTLS method are disturbed by the outliers.
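The RMSE over check points can be computed as sketched below. The text does not state whether the RMSE is taken per coordinate or per point; this sketch assumes the per-point form (square root of the mean squared 2D distance), and the function name and sample coordinates are hypothetical.

```python
import math

def rmse_2d(transformed, reference):
    """RMSE of 2D position differences between two lists of (x, y) points."""
    if len(transformed) != len(reference) or not transformed:
        raise ValueError("point lists must be non-empty and of equal length")
    sq = [(xt - xr) ** 2 + (yt - yr) ** 2
          for (xt, yt), (xr, yr) in zip(transformed, reference)]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical check points: transformed vs. known coordinates (metres).
est = [(100.003, 200.004), (150.000, 250.005)]
ref = [(100.000, 200.000), (150.000, 250.000)]
print(rmse_2d(est, ref))
```

Running the same function on coordinates transformed with and without the suspected points gives the kind of comparison reported in the text.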




The WTLS estimate of the partial EIV model may be strongly influenced by outliers. The aim of this paper is to develop an approach for detecting outliers in the partial EIV model. First, we propose a two-step iterated method for computing the WTLS estimates of the partial EIV model based on standard LS theory. Then the corresponding w-test statistics are constructed to detect outliers when the observations, the coefficient matrix or both are contaminated. If the variance factor is unknown, it may be estimated by the LMS method. Making use of the proposed two-step iterated method, an algorithm for detecting outliers in the partial EIV model is proposed. The numerical results for the two-dimensional affine transformation show that, when a single outlier is considered, the proposed procedure needs to be carried out only once, in contrast to the previous approach. For multiple outliers, a repeated step-by-step test is suggested. However, we still cannot discriminate whether the outliers are located in the observations, the coefficient matrix or both, which remains an open problem for future work.
The authors are grateful to the editor and two anonymous reviewers for their constructive comments and suggestions, which substantially improved the paper. This research has been supported jointly by the National Natural Science Foundation of China (Grant No. 41174005, 414714009) and the State Key Laboratory of Geodesy and Earth’s Dynamics (SKLGED 2017-3-2-E).













