San José State University Department of Economics
applet-magic.com, Thayer Watkins, Silicon Valley & Tornado Alley, USA
The Determination of Trends of Temperature and Other Variables Which are the Cumulative Sum of Random Disturbances
This material is concerned with the objective determination of trends for data generated by a process of the form

Y(t) = Y(t−1) + U(t)
where U(t) is a random variable. Such data will appear to have trends even if the expected value of U(t) is zero for all t. This is illustrated below; each new random sample yields a time series with a different apparent trend.
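As a stand-in for the interactive illustration on the original page, here is a minimal Python sketch (the sample size and the normal distribution are illustrative assumptions, not from the original) that accumulates zero-mean disturbances; rerunning it plays the role of clicking REFRESH, and the fitted slope is frequently far from zero even though no trend is present.

```python
import numpy as np

rng = np.random.default_rng()   # a fresh generator plays the role of the REFRESH button

n = 150                                          # number of observations Y(0), ..., Y(n-1)
U = rng.normal(loc=0.0, scale=1.0, size=n - 1)   # zero-mean disturbances U(1), ..., U(n-1)
Y = np.concatenate(([0.0], np.cumsum(U)))        # Y(t) = Y(t-1) + U(t), starting from Y(0) = 0

# Regress the levels on time; the fitted slope is often far from zero
# even though the process has no trend at all.
t = np.arange(n)
slope, intercept = np.polyfit(t, Y, 1)
print(f"apparent trend (regression slope) = {slope:.4f}")
```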
Temperature statistics have such a stochastic structure because the rate of change of temperature T for some region is given by

C·(dT/dt) = v(t)
where T and t are temperature and time, respectively, v(t) is the net energy inflow and C is the heat capacity coefficient of the region. Thus

T(t) = T(0) + (1/C)·∫ v(s) ds, the integral running from 0 to t,

so the temperature is the cumulative sum of the scaled net energy inflows.
This stochastic structure also applies to humidity and soil moisture. It also might apply to wind as a vector quantity, but that is a separate topic.
When there is no variation in the parameters of the probability distribution of the random disturbances, the temperature reaches a level at which the average value of the temperature changes is zero.
For comparison with the trends shown above for a variable which is the cumulative sum of random disturbances, the average global temperatures for the period 1855 to 2003 are shown below.
The changes in global average temperature from one year to another for 1856 to 2003 are shown below:
In general, when confronted with time series data and asked about possible trends, a statistical analyst would regress the data on time and ascertain whether the regression slope coefficient is statistically significantly different from zero. For data of this type that procedure is inappropriate: if there were only one unusually large random increment part way through the data interval it would make the data appear to have a permanent shift, when all that was involved was one atypical value. Instead the proper procedure is to compute the first differences of the data and carry out the statistical analysis on those first differences. This analysis could include a regression on time and tests of the statistical significance of the regression intercept and slope coefficients, but it would also include simpler analyses.
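The contrast between the two procedures can be sketched in a few lines of Python (an illustrative example, not from the original page, using a simulated driftless random walk): regress the levels on time, and separately test whether the mean of the first differences differs from zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 150
Y = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, 1.0, n - 1))))  # driftless random walk
t = np.arange(n)

# Naive procedure: regress the levels on time.
level_fit = stats.linregress(t, Y)
print(f"levels:      slope = {level_fit.slope:.4f}, p-value = {level_fit.pvalue:.3g}")

# Proper procedure: analyze the first differences U(t) = Y(t) - Y(t-1).
U = np.diff(Y)
diff_test = stats.ttest_1samp(U, popmean=0.0)
print(f"differences: mean  = {U.mean():.4f}, p-value = {diff_test.pvalue:.3g}")
```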
The raw data consist of n observations on a variable Y. The values of Y are labeled from 0 to (n−1). The number of random increments is (n−1) and their values are labeled from 1 to (n−1). The variable Y can be considered the cumulative sum of the random changes; i.e.,

Y(t) = Y(0) + U(1) + U(2) + … + U(t)
For simplicity let y(t) and u(t) denote the deviations from the interval means.
The time variable t can also be expressed as a deviation from its interval average; denote this deviation Δt, which is a function of time. If there are n sequential observations labeled 0 to n−1, then the average value of t is (n−1)/2 and Δt = t−(n−1)/2. Thus the Δt values range from −(n−1)/2 to +(n−1)/2. Note that their sum over t = 0 to n−1, written ΣΔt below, is necessarily zero.
In the above notation the regression slope coefficient b is

b = [Σ Δt·y(t)] / [Σ (Δt)²]

with both sums running over t = 0 to n−1.
For convenience let Σ(Δt)², the sum over t = 0 to n−1, be denoted as V.
Let the interval average value of Y be denoted as Ȳ. Then

y(t) = Y(t) − Ȳ and b = (1/V)·Σ Δt·[Y(t) − Ȳ]

Note that

Y(t) = Y(0) + U(1) + U(2) + … + U(t)

This means that b is a weighted sum of Y(0), Ȳ and the disturbances U(1) through U(n−1). Since ΣΔt = 0, the terms involving Y(0) and Ȳ drop out, and collecting the coefficient on each U(s) gives

b = (1/V)·Σ U(s)·[sum of Δt from t = s to t = n−1]

with the outer sum running over s = 1 to n−1. This latter expression, the sum of Δt from t = s to t = n−1, evaluates to

s(n−s)/2

Thus the weight of U(s) in the computation of the regression trend is

κ(s) = s(n−s)/(2V)

This is a parabolic function of s with a maximum near the middle of the interval.
The sum of the squared deviations for the time variable, V, evaluates to n(n²-1)/12.
The regression slope can be expressed as

b = Σ κ(s)·U(s), summed over s = 1 to n−1, where κ(s) = s(n−s)/(2V) = 6s(n−s)/[n(n²−1)]
The profile of these weights is a parabola, starting at 0 at s=0, rising to a maximum at s=n/2 and falling back to 0 at s=n. This means that the random changes in the middle of the interval have an unduly large influence on the regression estimate of the trend in the temperature.
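As a numerical check on this decomposition (a sketch under the assumption of standard normal disturbances; the function and variable names are illustrative, not from the original), the weights κ(s) can be computed and the weighted sum of the disturbances compared with the ordinary least-squares slope fitted to the levels.

```python
import numpy as np

def kappa_weights(n):
    """Weights kappa(s) = s(n-s)/(2V) for s = 1..n-1, with V = n(n^2-1)/12."""
    s = np.arange(1, n)
    V = n * (n**2 - 1) / 12.0
    return s * (n - s) / (2.0 * V)

n = 7
rng = np.random.default_rng(1)
U = rng.normal(size=n - 1)                       # disturbances U(1)..U(n-1)
Y = np.concatenate(([0.0], np.cumsum(U)))        # cumulative sums Y(0)..Y(n-1)

ols_slope = np.polyfit(np.arange(n), Y, 1)[0]    # regression of the levels on time
weighted_sum = kappa_weights(n) @ U              # sum of kappa(s)*U(s)

print(kappa_weights(n))          # 0.1071, 0.1786, 0.2143, 0.2143, 0.1786, 0.1071 for n = 7
print(ols_slope, weighted_sum)   # the two agree up to rounding
```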
It is worthwhile to verify the above algebraic relationships for a couple of cases. Let n=5. Then the data are:
Time t | Δt | (Δt)² | Y | ΣΔt (t to n−1) | κ(t)
---|---|---|---|---|---
0 | −2 | 4 | Y(0) | |
1 | −1 | 1 | Y(0)+U(1) | 2 | 0.2
2 | 0 | 0 | Y(0)+U(1)+U(2) | 3 | 0.3
3 | 1 | 1 | Y(0)+U(1)+U(2)+U(3) | 3 | 0.3
4 | 2 | 4 | Y(0)+U(1)+U(2)+U(3)+U(4) | 2 | 0.2
Sums | 0 | 10 | | 10 | 1.0
For n=7 the data are:
Time t | Δt | (Δt)² | Y | ΣΔt (t to n−1) | κ(t)
---|---|---|---|---|---
0 | −3 | 9 | Y(0) | |
1 | −2 | 4 | Y(0)+U(1) | 3 | 0.1071
2 | −1 | 1 | Y(0)+U(1)+U(2) | 5 | 0.1786
3 | 0 | 0 | Y(0)+U(1)+U(2)+U(3) | 6 | 0.2143
4 | 1 | 1 | Y(0)+U(1)+U(2)+U(3)+U(4) | 6 | 0.2143
5 | 2 | 4 | Y(0)+U(1)+U(2)+U(3)+U(4)+U(5) | 5 | 0.1786
6 | 3 | 9 | Y(0)+U(1)+U(2)+U(3)+U(4)+U(5)+U(6) | 3 | 0.1071
Sums | 0 | 28 | | 28 | 1.0
It is statistically inappropriate and inefficient to give a much higher weight to the random terms in the middle of the interval compared to those at the ends of the interval. For n=5 the weight for the middle terms is 50 percent higher than the weight for the ends. For n=7 the middle terms have weights which are 100 percent larger than the weights for the end terms. For n=21 the weights for the middle terms are 450 percent larger than the weights for the end terms.
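The ratio of the middle weight to the end weight quoted above can be reproduced directly from κ(s) ∝ s(n−s); a small check with illustrative names:

```python
import numpy as np

def middle_to_end_ratio(n):
    """Ratio of the largest weight kappa(s) to the weight at s = 1 (or s = n-1)."""
    s = np.arange(1, n)
    w = s * (n - s)            # proportional to kappa(s); the factor 1/(2V) cancels
    return w.max() / w[0]

for n in (5, 7, 21):
    r = middle_to_end_ratio(n)
    print(f"n = {n:2d}: middle weight is {100 * (r - 1):.0f} percent larger than the end weight")
# prints 50, 100 and 450 percent, matching the figures in the text
```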
Nevertheless the fact that the weights κ(t) sum to unity indicates that the regression coefficient is an unbiased estimate of the trend in Y(t). Thus the problem is one of statistical efficiency rather than unbiasedness.
The graphs below illustrate the effect by showing three cases in which there is only one nonzero disturbance over the data interval but differing in when during the interval the disturbance occurs. The regression lines are shown in blue.
As seen in the graphs the regression estimate of the trend is much higher for the disturbance in the middle of the interval than for a disturbance at either end of the interval.
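The same effect can be reproduced numerically (a hypothetical illustration: a single unit disturbance is placed near the start, in the middle, and near the end of an interval with n = 21):

```python
import numpy as np

n = 21
t = np.arange(n)

for position in (1, n // 2, n - 1):              # disturbance near the start, middle and end
    U = np.zeros(n - 1)
    U[position - 1] = 1.0                        # single unit disturbance U(position) = 1
    Y = np.concatenate(([0.0], np.cumsum(U)))
    slope = np.polyfit(t, Y, 1)[0]
    print(f"disturbance at s = {position:2d}: regression slope = {slope:.4f}")
# the slope is largest when the single disturbance falls in the middle of the interval
```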
However there is another unbiased estimate of the trend which gives equal weight to all of the random changes. That is

b* = [Y(n−1) − Y(0)]/(n−1) = [U(1) + U(2) + … + U(n−1)]/(n−1)
This estimate of trend is illustrated for the three cases previously considered.
As seen above the trend is the same for all three cases in contrast to the trend estimated using regression.
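Continuing the same hypothetical single-disturbance cases, the estimate b* = [Y(n−1) − Y(0)]/(n−1) depends only on the endpoints, so the location of the disturbance does not matter:

```python
import numpy as np

n = 21
for position in (1, n // 2, n - 1):
    U = np.zeros(n - 1)
    U[position - 1] = 1.0                        # the same single unit disturbance as before
    Y = np.concatenate(([0.0], np.cumsum(U)))
    b_star = (Y[-1] - Y[0]) / (n - 1)            # equal-weight estimate of the trend
    print(f"disturbance at s = {position:2d}: b* = {b_star:.4f}")
# b* = 1/(n-1) = 0.05 in all three cases
```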
This estimate of the trend is unbiased only if the interval of analysis is not selected with reference to the trend. There are some infamous cases of climatologists selecting an interval which had an extreme trend and presenting that trend estimate as the long-term trend in the variable. Such selection of the interval of analysis produces a figure that is statistically unrelated to the parameters of the distribution of the variable; such estimates are meaningless.
In general the statistical analysis of time series of variables which are the cumulative sum of random disturbances should be carried out on the values of U(t), the first differences Y(t)−Y(t−1). Whether there is a trend in the Y variable depends upon the expected value of the random variables U(t); i.e., the Y variable has a trend if and only if

E{U(t)} ≠ 0
Take the general case in which the trend estimate b# is a weighted average of the disturbances; i.e.,

b# = w(1)·U(1) + w(2)·U(2) + … + w(n−1)·U(n−1)
Assume that the expected values of the random disturbances and their variances and covariances are constant over time and given by

E{U(s)} = μ, Var(U(s)) = σ², Cov(U(r), U(s)) = γ for r ≠ s

Then

b# − E{b#} = Σ w(s)·[U(s) − μ]

Therefore the expected value and variance of b# are given by

E{b#} = μ·Σ w(s)
Var(b#) = σ²·Σ w(s)² + γ·Σ w(r)·w(s), the second sum running over all pairs r ≠ s
Now the question can be asked as to which weights give the smallest variance. Without a constraint on the sum of the weights the answer would be all-zero weights. Assume instead that the weights must sum to unity, i.e., Σ w(s) = 1. This guarantees that b# is an unbiased estimate of μ, since then E{b#} = μ·Σ w(s) = μ.
The question is now: what values of the weights, subject to the condition that their sum is unity, will minimize Var(b#)? The Lagrangian multiplier method can be used to answer this question. The first order condition for a constrained minimum is

2σ²·w(s) + 2γ·[1 − w(s)] − λ = 0 for each s

where λ is the Lagrangian multiplier. This condition reduces to

w(s) = [λ − 2γ]/[2(σ² − γ)], the same value for every s
The second order conditions for a minimum are satisfied as well.
Since the weights sum to unity this means that the weights must all be equal to 1/(n-1). (Note that n is the number of observations, n-1 is the number of random disturbances.)
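As a numerical sanity check (a sketch assuming uncorrelated disturbances, so that γ = 0 and Var(b#) is proportional to Σ w(s)²), the equal weights can be compared against the regression weights κ(s) and against arbitrary weights normalized to sum to one:

```python
import numpy as np

def variance_factor(w):
    """Var(b#)/sigma^2 = sum of w(s)^2 when the disturbances are uncorrelated."""
    return np.sum(np.asarray(w) ** 2)

n = 7
s = np.arange(1, n)
V = n * (n**2 - 1) / 12.0

equal_w = np.full(n - 1, 1.0 / (n - 1))          # the efficient, equal weights
kappa_w = s * (n - s) / (2.0 * V)                # the regression weights
rng = np.random.default_rng(2)
random_w = rng.random(n - 1)
random_w /= random_w.sum()                       # arbitrary weights normalized to sum to one

for name, w in [("equal", equal_w), ("kappa", kappa_w), ("random", random_w)]:
    print(f"{name:7s} weights: Var(b#)/sigma^2 = {variance_factor(w):.4f}")
# the equal weights give the smallest value, 1/(n-1) = 0.1667 for n = 7
```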
Thus the most efficient estimate of the value of E{U(t)}, the trend rate, is the mean value of the random variable given previously as

b* = [U(1) + U(2) + … + U(n−1)]/(n−1) = [Y(n−1) − Y(0)]/(n−1)
This estimate can be compared with the standard deviation of U computed from the values for the interval, divided by the square root of the number of disturbances to give the standard error of the mean, to establish whether or not the estimated trend is significantly different from zero.
If the U(t)'s are not correlated with each other then the variance of b* is equal to σ²/(n−1), where σ² is the variance of the U(t)'s. Presuming no correlation of the U(t)'s, the variance of the regression estimate of the trend is 1.04 times that of b* for n=5 and 1.071 times for n=7. The reciprocal of this ratio of variances may be called the statistical efficiency of the linear regression estimate. Thus the efficiency of the regression estimate is 96 percent for n=5 and 93 percent for n=7. The linear regression estimate of the trend becomes progressively less efficient as the sample size increases but asymptotically approaches a limiting value of about 83 percent. The case of n = 20 is computed in the sketch below, along with several other sample sizes.
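These efficiency figures can be reproduced from the weights (again a sketch assuming uncorrelated disturbances): the variance ratio is (n−1)·Σκ(s)² and the efficiency is its reciprocal.

```python
import numpy as np

def regression_efficiency(n):
    """Efficiency of the OLS trend estimate relative to b*, assuming uncorrelated U(t)."""
    s = np.arange(1, n)
    V = n * (n**2 - 1) / 12.0
    kappa = s * (n - s) / (2.0 * V)
    variance_ratio = (n - 1) * np.sum(kappa**2)   # Var(regression slope) / Var(b*)
    return 1.0 / variance_ratio

for n in (5, 7, 20, 100, 10_000):
    print(f"n = {n:5d}: efficiency = {100 * regression_efficiency(n):.1f} percent")
# 96.2 and 93.3 percent for n = 5 and n = 7, declining toward roughly 83 percent
# (the limit under these assumptions works out to 5/6) as n grows large
```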
(To be continued.)