Friday 20 June 2014

Statistical power of temperature trends

This is the fourth blog post in a series of five that analyse trends in the global surface temperatures. The posts put emphasis on the mathematics and the statistics used in the analyses. The posts are numbered 1 to 5. They should be read consecutively.

Post 1    Linear regression analysis
Post 2    Hypothesis testing of temperature trends
Post 3    Confidence intervals around temperature trend lines
Post 4    Statistical power of temperature trends
Post 5    Piecewise linear regression applied to temperature trends

The posts are gathered in this pdf document.

Start of post 4 Statistical power of temperature trends

β (beta) is the probability of not rejecting the null hypothesis H0 when it is false. This is a type II error. Statistical power is the probability of rejecting a false null hypothesis. It is 1 minus β.

We assume that the null hypothesis is false and that the alternative hypothesis H1 is true, i.e. that there is a true long term temperature trend different from zero. But we do not know the true trend, only that it is different from zero. The t-value is the slope of the trend divided by its 1-σ uncertainty. We need the t-value of a trend under the alternative hypothesis in order to calculate β and statistical power. Lacking better information, we may assume that the trend calculated from a set of temperature measurements (later called a dataset) is the true trend, and therefore use the t-value of that trend as the t-value under the alternative hypothesis. With this approach we calculate the post-hoc (retrospective) statistical power. Another approach is to estimate the trend and its noise from available knowledge that is independent of the specific dataset being analyzed, and then use this estimate as the trend under the alternative hypothesis. With this approach we calculate the a-priori (prospective) statistical power.

The post-hoc statistical power is strongly influenced by the specific dataset being analyzed. It provides much the same information as the calculated slope and the calculated p-value do, and it therefore tends to support the conclusions already drawn from these values. This is not the case for the a-priori statistical power, which we regard as the more valuable of the two. We will therefore concentrate on the a-priori statistical power in the rest of this chapter.
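The close link between post-hoc power and the p-value can be made concrete with a small sketch. It assumes a large-sample normal approximation instead of the t-distribution used in these posts, and the function name `posthoc_power_from_p` is hypothetical. Under that assumption the post-hoc power is a function of the p-value and α alone.

```python
from statistics import NormalDist


def posthoc_power_from_p(p_value, alpha=0.05):
    """Post-hoc power of a two-sided test, given only its p-value.

    Assumption: large-sample normal approximation. The observed z-value is
    taken to be the true one, which is exactly the post-hoc assumption.
    """
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p_value / 2)   # |z| implied by the two-sided p-value
    z_crit = nd.inv_cdf(1 - alpha / 2)    # critical value of the test
    # beta: probability that a new z lands between -z_crit and +z_crit
    beta = nd.cdf(z_crit - z_obs) - nd.cdf(-z_crit - z_obs)
    return 1 - beta


for p in (0.20, 0.05, 0.01):
    print(f"p = {p:.2f}  ->  post-hoc power = {posthoc_power_from_p(p):.2f}")
```

In this approximation a result that is exactly at the significance threshold (p = α) always has a post-hoc power of about 0.5, whatever the dataset, which is why post-hoc power adds little beyond the p-value itself.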

The statistical power is a function of:
  • The slope of the trend used under the alternative hypothesis.
    This is the effect size. A big effect size increases the statistical power. That's because it's easier to distinguish a big slope from zero than to do the same with a small slope.
  • The noise and the autocorrelation in the temperature measurements.
    Noise and autocorrelation reduce the statistical power. That's because they may hide the true slope.
  • The length of the time interval which the trend is calculated over.
    Long intervals increase the statistical power. That's because long intervals contain more information than short intervals, and because the noise tends to level out in long intervals. 
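The three dependencies listed above can be illustrated with a small Monte Carlo sketch. It is not the calculation behind the figures in this post: it assumes white (uncorrelated) Gaussian noise, a normal critical value of 1.96, and hypothetical values for slope and noise. The function names are also hypothetical. The power is estimated as the fraction of simulated monthly series whose fitted trend is statistically significant.

```python
import math
import random


def ols_slope_t(y, dt):
    """Return the OLS slope and its t-value for evenly spaced data.

    dt is the spacing between samples (1/12 year for monthly data).
    """
    n = len(y)
    x = [i * dt for i in range(n)]
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))            # standard error of the regression
    return slope, slope / (s / math.sqrt(sxx))


def simulated_power(slope, sigma, years, trials=400, seed=1):
    """Fraction of simulated monthly series with a significant trend.

    slope is in degrees C per year, sigma is the noise standard deviation.
    Assumption: white noise, so no autocorrelation correction is applied.
    """
    rng = random.Random(seed)
    n = int(round(12 * years))
    t_crit = 1.96                            # normal approximation, fine for large n
    hits = 0
    for _ in range(trials):
        y = [slope * i / 12 + rng.gauss(0.0, sigma) for i in range(n)]
        _, t = ols_slope_t(y, 1 / 12)
        if abs(t) > t_crit:
            hits += 1
    return hits / trials


# Hypothetical settings: 0.015 C/year true slope, 0.3 C white noise.
for years in (5, 10, 15):
    print(years, "years:", simulated_power(0.015, 0.3, years))
```

Doubling the slope, halving the noise or lengthening the interval each raises the estimated power. With a true slope of zero the same function returns roughly the significance level α, since every rejection is then a type I error.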

The red curve in Figure 4.1 is the pdf (probability density function) for the t-value of a trend used under the alternative hypothesis. The calculated t-value is shown with the dotted red line. It is greater than the critical t-value of the null hypothesis. The null hypothesis is therefore rejected.

Figure 4.1: β and statistical power shown as areas under the probability density function for the t-value of a trend used under the alternative hypothesis. The plot assumes that the null hypothesis is false.

β is the red area in Figure 4.1. The p-value is 0.027 (not shown in the figure). Despite this low p-value, the statistical power is only 0.63. The probability of failing to reject the null hypothesis is substantial, even if the true trend is equal to the calculated trend and the latter is statistically significant by a good margin.

The best way to increase the statistical power is to increase the time interval which the trend is calculated over. Experience shows that the interval should be at least 20 years to obtain a statistical power higher than 80%, which is the recommended lower threshold. When the statistical power is 80%, the probability of failing to reject a false null hypothesis is 20%. The significance level α = 0.05 is a stricter threshold: we are less willing to reject a true null hypothesis than we are to fail to reject a false one.

We know neither the true temperature trend nor the noise and autocorrelation in the monthly temperatures. We will now set these values equal to what we regard as the most probable ones based on available knowledge. This a-priori knowledge will be used when calculating the statistical power for different lengths of the interval which a trend is calculated over.

The a-priori knowledge may be based on the temperatures in a reference period. Satellites have improved the accuracy of the global surface temperature measurements. A good alternative is therefore to use the temperatures since the start of the satellite era in 1979 as the a-priori knowledge. We do so in the calculations behind Figure 4.2.

The slope of the trend from January 1979 to December 2013, i.e. in the satellite era, is 0.154°C/decade. Figure 4.2 shows the a-priori statistical power for effect sizes less than, close to and greater than that value.

Figure 4.2: The probability for a calculated trend to be statistically significant. It is drawn as a function of the length of the interval which the trend is calculated over, and it is drawn for three different effect sizes. The noise and the autocorrelation are fixed, i.e. independent of effect size and interval length.

The standard error of the regression is 0.127°C and the autocorrelation compensation factor is 12.0 in the HadCRUT4 temperatures between January 1979 and December 2013. These two values are the a-priori knowledge used in the calculations behind Figure 4.2. With these two values fixed, and with the effect size (slope) as a parameter, statistical power is a function of the length of the interval which the trend is calculated over.

We often require that the statistical power shall be at least 0.8, i.e. above the horizontal violet line in Figure 4.2. When it is less than 0.8, we should not put much emphasis on a lack of statistical significance. The line for the effect size 0.15°C/decade crosses the violet line at 19.5 years. The lines for the effect sizes 0.10 and 0.20°C/decade cross the violet line at 25.5 and 16.5 years, respectively. This tells us that we should use intervals longer than 25 years to detect long term trends reliably. WMO defines climate as average weather over 30 years. At that interval length the statistical power is almost 100% for the effect sizes 0.20 and 0.15°C/decade, and it is approximately 95% for the effect size 0.10°C/decade. When the trend is calculated over shorter time intervals, the true trend may be hidden behind noise and natural cycles.
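A rough sketch of how such a-priori power curves can be computed is given below. It uses the two a-priori values quoted above (0.127°C and 12.0), but the formula for the slope uncertainty is an assumption: the effective number of independent values is taken as n/ν and the white-noise standard error is inflated accordingly. Formulas (2.2) to (2.5) in post 2 are not reproduced here and may differ, so the numbers need not match Figure 4.2 exactly. The function names are hypothetical, and the t-distribution is evaluated by simple numerical integration so that fractional degrees of freedom are allowed.

```python
import math


def t_pdf(u, df):
    """Density of the Student t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))


def t_cdf(x, df, steps=2000):
    """Student t CDF via Simpson's rule on [0, |x|]; steps must be even."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_crit_two_sided(df, alpha=0.05):
    """Critical t-value of a two-sided test, found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def apriori_power(slope_per_decade, years, s=0.127, nu=12.0, alpha=0.05):
    """A-priori power for a monthly series of the given length.

    s  : standard error of the regression (degrees C), from the post
    nu : autocorrelation compensation factor, from the post
    Assumption: effective sample size n/nu, df = n/nu - 2.
    """
    n = 12 * years                              # number of monthly values
    df = n / nu - 2                             # effective degrees of freedom
    if df <= 0:
        raise ValueError("interval too short for this autocorrelation factor")
    sxx = n * (n * n - 1) / 12 / 144            # sum of squared time deviations, years^2
    se_white = s / math.sqrt(sxx)               # slope uncertainty for white noise, C/year
    se = se_white * math.sqrt((n - 2) / df)     # inflated for autocorrelation (assumed form)
    t_alt = abs(slope_per_decade / 10) / se     # t-value under the alternative hypothesis
    tc = t_crit_two_sided(df, alpha)
    beta = t_cdf(tc - t_alt, df) - t_cdf(-tc - t_alt, df)
    return 1 - beta


for effect in (0.10, 0.15, 0.20):
    row = [round(apriori_power(effect, y), 2) for y in (10, 20, 30)]
    print(f"{effect:.2f} C/decade, 10/20/30 years:", row)
```

The power rises with both interval length and effect size, as in Figure 4.2. The pdf under the alternative hypothesis is treated here as a central t-distribution shifted by the t-value, which is the picture drawn in Figure 4.1.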

Mathematics

Figure 4.1 shows the statistical power as the green areas underneath the pdf of the t-value of the trend used under the alternative hypothesis. The statistical power is calculated as shown below:
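One way to write (4.1) and (4.2) that is consistent with Figure 4.1 and with the description in the following paragraphs is sketched here. It is an assumption about the exact form: F is taken to be the cumulative distribution function of the Student t-distribution with the degrees of freedom of the trend, evaluated with its argument shifted by |t|, and t_crit is the two-sided critical t-value at significance level α.

```latex
\beta = F\bigl(t_{\mathrm{crit}} - |t|\bigr) - F\bigl(-t_{\mathrm{crit}} - |t|\bigr) \tag{4.1}

\text{statistical power} = 1 - \beta \tag{4.2}
```

The first term is the probability that a calculated t-value falls below the upper critical value, and the second term removes the small probability that it falls below the lower critical value, where the null hypothesis would also be rejected.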


The F() function in (4.1) is the cumulative distribution function of the t-value. It tells the probability that a calculated t is less than the input parameter to F(), given that the true trend is equal to the trend used under the alternative hypothesis. See more details in Figure 2.5 and in (2.8) and (2.9) in post 2.

The |t| in (4.1) is the absolute value of the t calculated for a trend under the alternative hypothesis. If the trend calculation is based on a specific dataset, (4.1) and (4.2) calculate the post-hoc statistical power for that dataset. If the trend calculation is based on a-priori knowledge together with a specific length of the time interval which the trend is calculated over, (4.1) and (4.2) calculate the a-priori statistical power for that length of the time interval.

(2.2) to (2.5) in post 2 show that the estimated 1-σ uncertainty of the slope is a function of the length of the interval which the trend is calculated over. This is shown with the dotted line in Figure 4.3. (3.1) in post 3 shows that this is also the case for the ± limits of the slope confidence interval, as shown with the solid line in Figure 4.3. Both the standard error of the regression and the autocorrelation compensation factor are kept at constant values in the calculations behind these lines.

Figure 4.3: The uncertainty of the slope as a function of the length of the interval which the trend is calculated over.

The decline in the uncertainties in Figure 4.3 is the reason for the increase in the statistical power in Figure 4.2 as the length of the interval which the trend is calculated over increases.

A calculated trend is statistically significant when the 95% confidence interval of its slope does not cover zero. The horizontal lines in Figure 4.3 are the effect sizes for which the statistical power was calculated in Figure 4.2. The green line for the effect size 0.15°C/decade crosses the solid line of the ± limit of the confidence interval when the interval length is 16 years. At that length, achieving statistical significance is just as likely as not achieving it, i.e. the statistical power is 50%. Figure 4.2 confirms this.
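The 50% value follows from a short calculation, sketched here under the assumption that β has the shifted-t form β = F(t_crit − |t|) − F(−t_crit − |t|), where F is the cumulative distribution function of the Student t-distribution and t_crit is the critical t-value. The ± limit of the confidence interval is t_crit times the 1-σ uncertainty of the slope, so an effect size equal to that limit has |t| = t_crit.

```latex
|t| = t_{\mathrm{crit}}
\;\Rightarrow\;
\beta = F(0) - F\bigl(-2\,t_{\mathrm{crit}}\bigr)
      = 0.5 - F\bigl(-2\,t_{\mathrm{crit}}\bigr)
      \approx 0.5
```

F(0) is exactly 0.5 because the t-distribution is symmetric, and F(−2 t_crit) is negligible, so the statistical power 1 − β is very close to 50%.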

References for the mathematics

Hans von Storch, Francis W. Zwiers. 2001. Statistical Analysis in Climate Research is our main reference. Chapter 6 'The statistical test of a Hypothesis' explains hypothesis testing in general.

Park, Hun Myoung. 2008. Hypothesis Testing and Statistical Power of a Test explains statistical power of a test. See especially chapter 2 'Type II Error and Statistical Power of a Test'.

Thom Baguley. 2004. Understanding statistical power in the context of applied research discusses both post-hoc (retrospective) and a-priori (prospective) statistical power. The paper strongly warns against misuse of post-hoc statistical power.


1 comment:

  1. Hi

    Thanks for the nice blog :)

    Correct, we should focus on the a-priori power, not the post-hoc power.
    The post-hoc power is directly correlated with the p-value
    (when calculating based on the observed effect).

    The null assumption distribution should be non-central t, which is similar to a "shifted t", but not symmetric

    You can see an accurate dynamic chart with h1 uses "non-central t" distribution in the following calculator:
    http://www.statskingdom.com/30_test_power.html
    just use "t" distribution
