lørdag 25. februar 2017

Trend when outliers in the data

This is the third post in a series of six that deals with mathematics for calculation of correlation and trend in data with outliers. The posts are numbered 1 to 6. They should be read consecutively.

Post 1  Introduction to Statistical analysis of data with outliers
Post 2  Correlation when outliers in the data.
Post 3  Trend when outliers in the data.
Post 4  Correlation and trend when an outlier is added.    Example.
Post 5  Compare Kendall-Theil and OLS trends.              Simulations.
Post 6  Detect serial correlation when outliers.                 Simulations.

The posts are gathered in this pdf document.

Start of post 3: Trend when outliers in the data

The method most commonly used in linear regression analysis is to calculate the trend based on the data values. But it is more robust against outliers to calculate it based on the ranks of the data. This blog post discusses the mathematics behind both methods.

A trend line is a simple model of the connection between an independent variable X and a dependent variable Y. In our case, X is usually the time and Y is a climate value (snow depth, precipitation, days with skiing conditions, etc.).

a is the intercept between the trend line and the Y axis, and b is the slope of the trend line.

Linear regression analysis uses a set of xi yi measurements to estimate a and b so that the resulting trend line is a “best fit” to the measurements.

Each yi measurement deviates more or less from the trend line. The vertical distance ei between the yi measurement and the trend line is treated as the error (the residual) of that measurement.

The hat symbol above a and b means the estimate of those values. From now on I will omit the hat symbol.

3.1 OLS - Ordinary Least Square

Ordinary least squares (OLS) is the most commonly used method to calculate trends. It calculates the slope as shown in (3.3). The mean values of x and y are used to calculate the intercept, as shown in (3.4).

The OLS equations show that the calculation of the slope and the intercept is very sensitive to outliers in the data, and it is therefore not robust. OLS is included in this blog post mainly to show that.

I derived and explained the formulas for OLS in three blog posts in June 2014. The posts contain references. The posts are Linear regression analysis, Hypothesis testing of temperature trends and Confidence intervals around temperature trend lines. The latter two posts concentrate on how to quantify the uncertainties of the slope and the intercept.

3.2 Kendall-Theil robust line

The Kendall-Theil method to calculate the slope and the intercept of the trend line is based on medians, and it is therefore robust against outliers. The Kendall-Theil robust line is called by many names, as explained on this Wikipedia site.

We start with an x and a y vector, each with n values. X is the independent variable, often the time and therefore monotonically increasing. We assume that there are no ties among the X values. There are N possible combinations of the xy pairs. (3.5) shows how N is calculated.

We first calculate the slopes between all the N xy pairs. Then we select the median of these slopes, as shown in (3.6). This median is the slope of the Kendall-Theil line. The selection is usually done by first sorting the slopes in ascending order, and then picking the slope in the middle. The vector with the slopes sorted in ascending order will later be used to calculate the 95% confidence interval of the slope.

The intercept is calculated using the slope from (3.6) and the median of x and y, as shown in (3.7). The Kendall-Theil trend line passes through the point with the coordinates (median x, median y), just as the OLS trend line passes through the point with the coordinates (mean x, mean y).

An alternative method to calculate the intercept is to calculate the intercepts of all the N slopes between the xy pairs, and then select the median of these intercepts. But (3.7) is evaluated to be more robust, and it is therefore the preferred equation.

Mann-Kendall trend test

The calculation of the uncertainty of the Kendall tau-b correlation coefficient is described in section 2.3 in the previos blog post. That calculation and the calculation of the uncertainty of the slope of the Kendall-Theil robust line are almost identical. Both use the S statistics, and both assume that the null hypothesis is true. The formulas to calculate the uncertainty of the Kendall tau-b are, however, more complicated because there may be ties in both x and y. When the slope of the Kendall-Theil robust line is calculated, there can be no ties among the x values. Otherwise there would be one or more zeros in the denominator in (3.6).

The equations below use S as calculated in (2.4), m0 in (2.10) and my in (2.12) in the previos blog post.

(3.8) calculates the standard deviation of S.

ZS is the standardized form of S, and it is calculated as shown in (3.9). It approximately follows a standard normal distribution N(0,1) when the null hypothesis is true and n is greater than 10.

The p-value of the trend is the probability to get such a large slope, positive or negative, if the null hypothesis were true. It is calculated as shown in (3.10). F() is the cumulative standard normal distribution. It is multiplied with 2 because the hypothesis test is two-tailed.

When the p-value is less than 0.05 the trend is statistically significant at the 0.05 level.

Serial correlation

The residual vector e is defined in (3.11).

Positive serial correlation occurs when we have measured (sampled) the y values so often that the noise in a measurement is not independent of the noise in the previous measurement. Then consecutive residuals in the e vector tend to be of the same sign and size. Then the effective (independent) number of measurements is less than the actual number of measurements, and the uncertainty of the trend is therefore larger than calculated with the equations (3.8) to (3.10). This may be compensated for.

(3.8) calculates σS , which is the one sigma standard deviation of the S statistics. When there is positive serial correlation in the residuals, (3.8) it too optimistic; i.e. σS should have been larger. We therefore modify σS as shown in (3.12) to (3.14).

rRk in (3.13) is the serial correlation coefficient with lag k. The equation tells that we can use many lags (nsl is an abbreviation for number of serial correlation lags). In practice I will only use the lag 1 and 2 coefficients.

The R superscript tells that the serial correlation coefficient is calculated based on the ranks of the residuals. I use the notation er instead of e when the residual value is replaced with its rank. (3.14) shows how rRk is calculated.

The modified σS is used in (3.9), and thereafter the p-value is calculated as shown in (3.10).

The modification factor n divided with nindependent is often denoted as v. It will vary when it is calculated with different data sets, even when the data sets are of the same type, eg. snow depth. I will therefore calculate it for many data sets of the same type, and thereafter evaluate the different factors and then decide on a factor to be used for that type of data. I never use a factor less than 1.

(The OLS mathematics compensates for serial correlation in a similar way, but the equations are not identical. Equation 2.4 in the blog post Hypothesis testing of temperature trends calculates a v factor which is used in the same way as the ratio between the real and the independent number of measurements in (3.12). The OLS mathematics calculates the serial correlation coefficients in the same way as in (3.14), except that it uses the residual values instead of their ranks.

Confidence Interval

The 95% confidence interval of the Kendall-Theil slope is calculated based on the S statistics and on the vector containing the sorted slopes from which the median was selected in (3.6). The vector with the sorted slopes contains N slopes, including the slopes that are zero due to tied y values. The standard deviation of S, σS, is therefore not reduced due to tied y values (3.15). This σS is used to calculate the indexes Ru and Rl in the vector with the sorted slopes. The slope with index Ru is the upper limit of the 95 % confidence interval, and the slope with index Rl is the lower limit.

The confidence limit α is 0.05 when the 95% confidence interval is calculated. The notation with α is used in (3.16) and (3.17) to make the equations more general. zα/2 is the z value for which the cumulative distribution function of N(0,1) is equal to (1-α/2).

Reference for the mathematics

The article Statistical Methods in Water Resources, written by D.R. Helsel and R.M. Hirsch from the U.S. Geological Survey, shows in chapter 10.1.1 how to calculate the Kendall-Theil robust trend line. Chapter 10.1.4 tells how to calculate the 95% confidence interval of the slope.

Chapter 17.3.2 in the EPA document Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities Unified Guidance tells how to calculate the z-score of the S statistics and how to use it to calculate the p-value of the slope. Chapter 17.3.3 in the same document shows how to calculate the slope and the intercept.

The article The influence of autocorrelation on the ability to detect trend in hydrological series by Sheng Yue et al explains how to compensate for serial correlation when calculating the uncertainty of the Kendall-Theil slope. (Autocorrelation is another word for serial correlation).

Previous and Next post in the series

Ingen kommentarer:

Legg inn en kommentar