Friday 17 February 2017

Introduction to Statistical analysis of data with outliers

This is the first blog post in a series of six on the mathematics for calculating correlation and trend in data with outliers. The posts are numbered 1 to 6 and should be read in order. This first post is just an introduction.

Post 1  Introduction to Statistical analysis of data with outliers
Post 2  Calculate correlation when outliers in the data.
Post 3  Calculate trend when outliers in the data.
Post 4  Correlation and trend when an outlier is added. Example.
Post 5  Compare Kendall-Theil and OLS trends. Simulations.
Post 6  Detect serial correlation when outliers. Simulations.

The posts are gathered in this PDF document.

Start of post 1 Introduction to Statistical analysis of data with outliers


Five blog posts in June 2014 deal with the mathematics that is most commonly used when analysing global temperature series. That mathematics is not well suited when there are large outliers in the data. The first blog post in that series gives an overview of those five posts.

Ordinary least squares (OLS) is the most commonly used method for calculating trends. It is based on the data values themselves, so it performs poorly when there are large outliers in the data. Global temperatures do not have large outliers, due both to the inertia in the global climate system and to the thorough processing before the temperature data is released. Other climate data, such as precipitation, snow depth and skiing conditions at specific locations, do have large outliers, and the OLS mathematics is not suitable for those data.

The calculation of the Pearson correlation coefficient is also based on data values. This is the most commonly used method to calculate correlation between variables. It too performs poorly when there are large outliers in the data.
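To make the sensitivity of value-based methods concrete, here is a minimal sketch in plain Python. The data are made up for illustration (a perfectly linear series in which one value is replaced by a large outlier), and the helper names `ols_slope` and `pearson_r` are my own, not taken from any library.

```python
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson correlation coefficient, computed from the data values."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: y = 2x + 1 exactly, then the last value is replaced
# by a single large outlier.
x = list(range(10))
y_clean = [2.0 * v + 1.0 for v in x]
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

print("clean:  ", ols_slope(x, y_clean), pearson_r(x, y_clean))
print("outlier:", ols_slope(x, y_outlier), pearson_r(x, y_outlier))
```

With these numbers a single outlier more than triples the OLS slope and pulls the Pearson coefficient well below 1, even though nine of the ten points still lie exactly on the original line.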

Mathematics based on data ranks performs better than mathematics based on data values when analysing data with large outliers. In this series of blog posts I will describe the rank mathematics that I use to calculate the Kendall tau-b correlation coefficient and the Kendall-Theil robust trend line. For comparison I also briefly describe the Pearson and the OLS mathematics.
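As a rough preview of the rank-based approach, the sketch below uses the usual textbook definitions: Kendall tau-b counts concordant and discordant pairs with a correction for ties, and the Kendall-Theil slope is taken as the median of all pairwise slopes. The function names and the data are my own illustration, and the details in the later posts may differ.

```python
from itertools import combinations
from math import sqrt
from statistics import median


def kendall_tau_b(x, y):
    """Kendall tau-b from concordant/discordant pair counts, with tie correction."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x2 - x1, y2 - y1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
        elif dx == 0 and dy != 0:
            tied_x_only += 1
        elif dy == 0 and dx != 0:
            tied_y_only += 1
        # pairs tied in both x and y contribute to neither count
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + tied_x_only) * (untied + tied_y_only)
    )


def kendall_theil_slope(x, y):
    """Median of the slopes between all pairs of points with distinct x."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
        if x2 != x1
    ]
    return median(slopes)


# Hypothetical data: y = 2x + 1 with the last value replaced by a large outlier.
x = list(range(10))
y_outlier = [2.0 * v + 1.0 for v in x]
y_outlier[-1] = 100.0

print(kendall_tau_b(x, y_outlier), kendall_theil_slope(x, y_outlier))
```

In this made-up series the outlier does not change the ordering of the values, so tau-b stays at 1, and because most pairwise slopes are still 2 the median slope is unchanged as well.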

As will be seen, the mathematics used to calculate the Kendall tau-b correlation coefficient and the Kendall-Theil robust trend line is rather simple and easy to explain. The mathematics used to quantify their uncertainties, namely p-values and confidence intervals, is more complicated.

Next post in the series
