Tuesday, 28 February 2017

Correlation and trend when an outlier is added

This is the fourth blog post in a series of six on the mathematics of calculating correlation and trend in data with outliers. The posts are numbered 1 to 6 and should be read in order.

Post 1  Introduction to Statistical analysis of data with outliers
Post 2  Correlation when outliers in the data.
Post 3  Trend when outliers in the data.
Post 4  Correlation and trend when an outlier is added. Example
Post 5  Compare Kendall-Theil and OLS trends. Simulations.
Post 6  Detect serial correlation when outliers. Simulations.

The posts are gathered in this pdf document.

Start of post 4:
Correlation and trend when an outlier is added.


This blog post contains an example that demonstrates the shortcomings of the most commonly used methods for calculating trends and correlations when an outlier is added to the data. It also shows that alternative methods based on medians and ranks are more robust against outliers.

This blog post is based on an example from a statistics course at Penn State. I chose this example for two reasons. First, the course uses different equations for the Kendall tau-b correlation coefficient than my equations (2.7), (2.8) and (2.9) in the second blog post in this series. Second, the example contains tied values. The course gives the answer to four decimal places, which allows me to check that my equations are equivalent to those in the course and that my implementation is correct.

The x and y variables in the example are:
x: 23   23  27   27   39   41   45   49   50   53   53   54   56   57   58   58   60   61
y:  9.5 27.9 7.8 17.8 31.4 25.9 27.4 25.2 31.1 34.7 42.0 29.1 32.5 30.3 33.0 33.8 41.1 34.5
They are shown graphically in Figure 4.1.

4.1   Correlation

The correlation coefficients calculated with my equations are shown in the legend of Figure 4.1. They are exactly the same as those calculated by Penn State.

Figure 4.1: Correlation between the Penn State x and y values calculated with Pearson (r), Spearman (rho) and Kendall (tau-b). The coefficients are exactly the same as calculated by Penn State.

The Kendall tau-b correlation coefficient is smaller in absolute value than the Pearson and the Spearman correlation coefficients. According to Penn State this is usual. I have experienced the same with other data sets too. The calculated p-value is 0.0001 with Pearson, 0.0003 with Spearman and 0.0011 with Kendall.
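The three coefficients can be recomputed from the listed x and y values with a few lines of plain Python. This is a minimal sketch using textbook definitions, not the equations (2.7) to (2.9) from the second post. The function names are my own, Spearman is taken as the Pearson correlation of average ranks, and p-values are not computed.

```python
from math import sqrt

x = [23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61]
y = [9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1,
     34.7, 42.0, 29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5]


def pearson(a, b):
    """Pearson product-moment correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / sqrt(saa * sbb)


def average_ranks(v):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman rho as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def kendall_tau_b(a, b):
    """Kendall tau-b, counting all pairs directly (O(n^2))."""
    conc = disc = only_a_tied = only_b_tied = 0
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            da, db = a[i] - a[j], b[i] - b[j]
            if da * db > 0:
                conc += 1
            elif da * db < 0:
                disc += 1
            elif da == 0 and db != 0:
                only_a_tied += 1
            elif db == 0 and da != 0:
                only_b_tied += 1
            # pairs tied in both variables contribute to nothing
    denom = sqrt((conc + disc + only_a_tied) * (conc + disc + only_b_tied))
    return (conc - disc) / denom


print(f"Pearson r     = {pearson(x, y):.4f}")
print(f"Spearman rho  = {spearman(x, y):.4f}")
print(f"Kendall tau-b = {kendall_tau_b(x, y):.4f}")
```

The pairwise loop is quadratic in the number of points, which is fine for 18 observations and keeps the tie handling easy to follow.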

4.2   Trend

In climate analysis, x is normally a monotonically increasing time, and consequently it does not contain tied values. The equations in the third blog post in this series therefore assume no tied x values when calculating trends. The Penn State example has four pairs of tied x values (23, 27, 53 and 58). I therefore added 0.1 to the second x value in each pair before calculating the trends with both OLS and Kendall-Theil. The trends are shown in Figure 4.2.
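The tie adjustment can be sketched as follows. The helper name `break_pair_ties` and its `step` parameter are hypothetical. The sketch assumes that x is sorted and that ties occur only in pairs, as they do in this example.

```python
x = [23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61]


def break_pair_ties(values, step=0.1):
    """Add `step` to every value that equals its predecessor.

    Assumes `values` is sorted and that ties come in pairs only; a run of
    three or more equal values would leave the later ones tied with each other.
    """
    adjusted = list(values)
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            adjusted[i] = values[i] + step
    return adjusted


x_adjusted = break_pair_ties(x)
print(x_adjusted)
```

A step of 0.1 is small compared with the spacing of the x values, so the order of the points is preserved.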

Figure 4.2: The OLS and the Kendall-Theil trends in the Penn State example. The example does not contain any outliers, and the two trends are therefore close to each other. The legend text shows the p-values and the 95% Confidence Intervals calculated with each method.

The OLS and the Kendall-Theil trends are close to each other. Both are statistically significant, and both are well within each other's 95% confidence intervals.

The Pearson, Spearman and Kendall tau-b correlation coefficients are barely affected by this small adjustment of the x values. The p-values of the Spearman and Kendall tau-b correlation coefficients improve (not shown) because the adjustment increases the number of xy pairs that contribute to the calculations.

4.3   Add an outlier

I changed the fifth-to-last y value from 30.3 to -17.8, turning it into an outlier. I then recalculated the trends, as shown in Figure 4.3. The OLS trend is strongly influenced by the outlier: it is no longer statistically significant, and its 95% confidence interval is much wider than before the outlier was added. The Kendall-Theil trend is barely affected by the outlier, in either its slope or its uncertainty. This demonstrates that the Kendall-Theil trend line is much more robust against outliers than the OLS trend line.

Figure 4.3: The OLS and the Kendall-Theil trends when the Penn State example is modified to contain an outlier. The OLS trend is strongly influenced by the outlier, and it is no longer statistically significant. The Kendall-Theil trend is only slightly influenced by the outlier, and it is still statistically significant with a good margin. Both trends are within each other's 95% confidence intervals.
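The effect of the outlier on the two slopes can be reproduced roughly with the sketch below. It assumes that the Kendall-Theil slope is the median of all pairwise slopes, which is the usual definition; the equations from the third post are not reproduced here. The function names are my own, and confidence intervals and p-values are left out.

```python
from statistics import median

# Tie-adjusted x values: 0.1 added to the second value of each tied pair.
x = [23, 23.1, 27, 27.1, 39, 41, 45, 49, 50,
     53, 53.1, 54, 56, 57, 58, 58.1, 60, 61]
y_clean = [9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1,
           34.7, 42.0, 29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5]
y_outlier = list(y_clean)
y_outlier[13] = -17.8  # fifth-to-last value, 30.3 in the original data


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


def kendall_theil_slope(xs, ys):
    """Median of the slopes between all pairs of points with distinct x."""
    n = len(xs)
    slopes = [(ys[j] - ys[i]) / (xs[j] - xs[i])
              for i in range(n) for j in range(i + 1, n)
              if xs[j] != xs[i]]
    return median(slopes)


for label, ys in (("no outlier", y_clean), ("with outlier", y_outlier)):
    print(f"{label:12s}  OLS slope {ols_slope(x, ys):.3f}  "
          f"Kendall-Theil slope {kendall_theil_slope(x, ys):.3f}")
```

Only 17 of the 153 pairwise slopes involve the modified point, so the median moves very little, whereas the OLS slope is pulled by the squared distance of a single point.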

The outlier reduces the correlation coefficient between x and y. With Pearson r the reduction is large, from 0.79 to 0.35. The reduction is much smaller with the two rank correlation coefficients; Spearman rho is reduced from 0.76 to 0.62 and Kendall tau-b from 0.59 to 0.49.
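The Kendall figures can be checked by hand. The count below is my own and is not taken from the course. It assumes the tie-adjusted x values from section 4.2, so that none of the 18·17/2 = 153 pairs is tied, and it uses C and D as my own symbols for the number of concordant and discordant pairs.

```latex
\tau_b = \frac{C - D}{n(n-1)/2}, \qquad n = 18
\\[4pt]
\text{without the outlier:}\quad \tau_b = \frac{122 - 31}{153} = \frac{91}{153} \approx 0.59
\\[4pt]
\text{with the outlier:}\quad \tau_b = \frac{114 - 39}{153} = \frac{75}{153} \approx 0.49
```

The modified point becomes the smallest y value, which turns eight pairs from concordant to discordant and leaves the other 145 pairs unchanged. This is why the rank-based coefficient drops only moderately.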

Without the outlier, the correlation coefficients are statistically significant with all three methods. With the outlier added, the Pearson correlation coefficient is far from significant (p-value 0.16), while the rank correlation coefficients remain statistically significant with a good margin (p-value 0.01 for both Spearman and Kendall).

This demonstrates that the rank correlation coefficients are more robust against outliers than the Pearson coefficient.

Often we do not know whether an outlier is a measurement error or a real value. Either way, it is wrong to let one or a few outliers have much more influence on the result than the majority of the measurements. Therefore, when the data contains outliers, it may be best to apply trend calculations based on medians, as the Kendall-Theil calculation is, and correlation calculations based on ranks, as the Kendall tau-b calculation is.


