This chapter considers the comparison of a continuous outcome between two independent groups.
Illustrative data WCGS.ZIP (Selvin, 1991, p. 41). To illustrate techniques, we consider cholesterol levels (mg/dl) in Type A and Type B men. Data are:
Type A: 233, 291, 312, 250, 246, 197, 268, 224, 239, 239, 254, 276, 234, 181, 248, 252, 202, 218, 212, 325
Type B: 344, 185, 263, 246, 224, 212, 188, 250, 148, 169, 226, 175, 242, 252, 153, 183, 137, 202, 194, 213
Data are structured as a numeric dependent (outcome) variable and a dichotomous independent (group) variable as CHOL and BEHAVIOR, respectively. The first three records and last record of this data set look like this:
REC CHOL BEHAVIOR
--- ---- --------
1 233 A
2 291 A
3 312 A
etc.
40 213 B
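As a rough illustration outside Epi Info, the same one-record-per-subject layout can be sketched in Python. The names type_a, type_b, labeled, and records are assumptions made for this example and are not part of the WCGS file.

```python
# Illustrative sketch: one record per subject, with the outcome (CHOL)
# and the group (BEHAVIOR) held as separate variables.
type_a = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
type_b = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]

# Pair every value with its group label, then number the records from 1.
labeled = [(chol, "A") for chol in type_a] + [(chol, "B") for chol in type_b]
records = [(rec, chol, behavior)
           for rec, (chol, behavior) in enumerate(labeled, start=1)]

# Show the first three records and the last record.
print("REC CHOL BEHAVIOR")
for rec, chol, behavior in records[:3] + records[-1:]:
    print(f"{rec:>3} {chol:>4} {behavior}")
```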
Descriptive statistics for the two groups are computed with a two-variable MEANS command, applied as follows:
EPI6> READ <DATASET>
EPI6> MEANS <DV> <IV>
where <DV> represents the name of the dependent variable and <IV> represents the name of the independent variable.
For the illustrative data set, the following commands are issued:
EPI6> READ WCGS
EPI6> MEANS CHOL BEHAVIOR
Five sections of output are produced (a frequency table, summary statistics, ANOVA table, Bartlett's test, Kruskal-Wallis test). Summary statistics are printed below the frequency table. For the illustrative data, the summary statistics are:
MEANS of CHOL for each category of BEHAVIOR
BEHAVIOR Obs Total Mean Variance Std Dev
A 20 4901 245.050 1342.366 36.638
B 20 4206 210.300 2336.747 48.340
Difference 34.750
BEHAVIOR Minimum 25%ile Median 75%ile Maximum Mode
A 181.000 221.000 242.500 261.000 325.000 239.000
B 137.000 179.000 207.000 244.000 344.000 137.000
Thus, n1 = 20 and n2 = 20 (listed under Obs), and the type A men in the sample have a higher mean cholesterol level than the type B men (245.1 vs. 210.3 mg/dl). In addition, the type A group shows less variability than the type B group (standard deviations: 36.6 vs. 48.3 mg/dl).
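For readers without Epi Info, the summary statistics can be cross-checked with a short Python sketch using only the standard library. The function name describe and the dictionary layout are assumptions made for this example; the data are the illustrative values listed earlier.

```python
import statistics

type_a = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
type_b = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]

def describe(values):
    """Return n, total, mean, sample variance, sample SD, and median."""
    return {
        "n": len(values),
        "total": sum(values),
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # divisor is n - 1
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
    }

summary = {"A": describe(type_a), "B": describe(type_b)}
difference = summary["A"]["mean"] - summary["B"]["mean"]

for group, s in summary.items():
    print(group, s["n"], s["total"], round(s["mean"], 3),
          round(s["variance"], 3), round(s["sd"], 3), s["median"])
print("Difference", round(difference, 3))
```

The rounded values should agree with the Obs, Total, Mean, Variance, Std Dev, and Median columns printed by the MEANS command.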
The observed mean difference (34.750 in this instance) is the point estimate of the expected mean difference μ1 - μ2. To calculate a 95% confidence interval for μ1 - μ2, first calculate (by hand) the standard error of the mean difference as follows:
se = SQRT[(MSW)(1/n1 + 1/n2)]
where MSW is the Mean Square Within as reported in Epi Info's ANOVA table:
ANOVA
Variation SS df MS F statistic p-value t-value
Between 12075.625 1 12075.625 6.564 0.013853 2.562113
Within 69903.150 38 1839.557
Total 81978.775 39
For the illustrative data, se = SQRT[(1839.557)(1/20 + 1/20)] = 13.56 mg/dl.
A 95% confidence interval for μ1 - μ2 is given by:
(mean difference) ± (t(n1+n2-2, .975))(se)
where (mean difference) = mean1 - mean2, t(n1+n2-2, .975) represents the 97.5th percentile of a t distribution with n1 + n2 - 2 degrees of freedom, and se represents the standard error of the mean difference (described above). Thus, the 95% confidence interval for μ1 - μ2 for the illustrative data = (245.05 - 210.30) ± (t(38, .975))(13.56) = 34.75 ± (2.02)(13.56) = (7.4, 62.1) mg/dl. This interval places the population mean difference between 7.4 and 62.1 mg/dl with 95% confidence.
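The hand calculation can be retraced with a minimal Python sketch. It assumes the raw illustrative data and takes the t percentile (2.02) from the text rather than computing it, since the Python standard library has no t distribution. The variable names are illustrative.

```python
import math
import statistics

type_a = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
type_b = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]

n1, n2 = len(type_a), len(type_b)
mean_diff = statistics.mean(type_a) - statistics.mean(type_b)

# Mean Square Within is the pooled variance: each group's variance
# weighted by its degrees of freedom.
msw = ((n1 - 1) * statistics.variance(type_a)
       + (n2 - 1) * statistics.variance(type_b)) / (n1 + n2 - 2)
se = math.sqrt(msw * (1 / n1 + 1 / n2))

# Assumption: 97.5th percentile of t with 38 df, rounded as in the text.
T_CRIT_38 = 2.02
lower = mean_diff - T_CRIT_38 * se
upper = mean_diff + T_CRIT_38 * se

print(f"MSW = {msw:.3f}, se = {se:.2f}")
print(f"95% CI = ({lower:.1f}, {upper:.1f}) mg/dl")
```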
Epi Info calculates the equal variance independent t test for H0: μ1 = μ2 in its ANOVA table:
Variation SS df MS F statistic p-value t-value
Between 12075.625 1 12075.625 6.564 0.013853 2.562113
Within 69903.150 38 1839.557
Total 81978.775 39
Thus, data demonstrate tstat = 2.56 with 38 degrees of freedom (p = .014). Most investigators would consider this "significant" evidence against H0.
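As a cross-check, the t statistic can be rebuilt from the summary statistics alone. The sketch below is not Epi Info's own code; it applies the pooled-variance formula to the n, mean, and variance values printed by the MEANS command and confirms that, with two groups, the ANOVA F statistic is the square of t. No p value is computed because the Python standard library does not provide a t distribution.

```python
import math

# Summary statistics copied from the MEANS output.
n1, mean1, var1 = 20, 245.050, 1342.366
n2, mean2, var2 = 20, 210.300, 2336.747

df = n1 + n2 - 2
# Pooled variance; this equals the "Within" mean square in the ANOVA table.
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean1 - mean2) / se

# With exactly two groups, F = t squared.
f_stat = t_stat ** 2

print(f"t = {t_stat:.3f} with {df} df, F = {f_stat:.3f}")
```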
The above confidence interval and test statistic assume that (1) data are free of bias (information bias, selection bias, and confounding), (2) groups and individuals within groups are independent, (3) the sampling distribution of the mean difference is normal, and (4) variances in the two populations are equal (homoscedasticity). Although violations of assumptions (3) and (4) may distort results, numerous studies have shown that these methods tolerate considerable departures from normality and equal variance while still providing stable results. Robustness to violations of these last two assumptions is good when sample sizes are equal (n1 = n2), samples are large (n > 30), and a two-sided test is used (Zar, 1996, p. 128). Furthermore, statistical models need not be realistic in order to be useful.
Statistical models are sometimes misunderstood in epidemiology. Statistical models are never true. The question of whether a model is true is irrelevant. A more appropriate question is whether we obtain the correct scientific conclusion if we pretend that the process under study behaves according to a particular statistical model. (Zeger 1991).
The Mann-Whitney test (equivalent to the Kruskal-Wallis test for two samples) is a non-parametric analogue of the independent t test. Epi Info computes the Kruskal-Wallis test as part of its MEANS command. Here are the results for the illustrative data:
Mann-Whitney or Wilcoxon Two-Sample Test (Kruskal-Wallis test for two groups)
Kruskal-Wallis H (equivalent to Chi square) = 6.333
Degrees of freedom = 1
p value = 0.011853
Thus, χ²stat = 6.33 with 1 degree of freedom (p = 0.012).
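To see where H comes from, here is a minimal Python sketch of the textbook Kruskal-Wallis statistic with midranks and the usual tie correction. It is an illustration under those assumptions, not Epi Info's own routine, and the function names midranks and kruskal_wallis are hypothetical. The p value uses the fact that, for 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).

```python
import math
from collections import Counter

type_a = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
type_b = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]

def midranks(values):
    """Map each distinct value to its average rank in the pooled sample."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # 1-based positions i+1 .. j share this value; use their average.
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return ranks

def kruskal_wallis(*groups):
    """Return the tie-corrected Kruskal-Wallis H statistic."""
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)
    ranks = midranks(pooled)
    rank_term = sum(sum(ranks[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1 - ties / (n_total ** 3 - n_total))

h_stat = kruskal_wallis(type_a, type_b)
p_value = math.erfc(math.sqrt(h_stat / 2))  # valid only for 1 df

print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```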
Comment: The Kruskal-Wallis procedure is slightly less powerful than the independent t test when data come from normally distributed populations. The loss of efficiency is surprisingly small when the test is used in non-normal populations.
With two samples, Bartlett's test addresses H0: σ²1 = σ²2, where σ²i represents the variance in population i. Epi Info performs this test whenever the MEANS command is used. Here are the results for the illustrative data:
Bartlett's test for homogeneity of variance
Bartlett's chi square = 1.404 deg freedom = 1 p-value = 0.236005
Thus, χ² = 1.40, df = 1, p = .24. This provides little or no reason to reject H0.
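For illustration, Bartlett's statistic can be sketched in Python from its standard textbook definition (the pooled log variance compared with the weighted group log variances, divided by a small-sample correction factor). This is an assumption about the formula in use, not a copy of Epi Info's code, and the function name bartlett is hypothetical. As with the previous test, the p value shortcut applies only because there is 1 degree of freedom.

```python
import math
import statistics

type_a = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
type_b = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]

def bartlett(*groups):
    """Return (chi-square statistic, degrees of freedom) for Bartlett's test."""
    k = len(groups)
    dfs = [len(g) - 1 for g in groups]
    variances = [statistics.variance(g) for g in groups]
    df_within = sum(dfs)
    pooled = sum(d * v for d, v in zip(dfs, variances)) / df_within
    numerator = (df_within * math.log(pooled)
                 - sum(d * math.log(v) for d, v in zip(dfs, variances)))
    correction = 1 + (sum(1 / d for d in dfs) - 1 / df_within) / (3 * (k - 1))
    return numerator / correction, k - 1

chi_sq, df = bartlett(type_a, type_b)
p_value = math.erfc(math.sqrt(chi_sq / 2))  # chi-square upper tail, 1 df only

print(f"Bartlett's chi square = {chi_sq:.3f}, df = {df}, p = {p_value:.3f}")
```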
Comment: Bartlett's test is reliable only when used in normal populations. When the population distribution is platykurtic, the true p value is less than the calculated p value, i.e., the test is conservative (Maurais & Ouimet, 1986). When the population distribution is leptokurtic, the true p value is greater than the calculated p value, i.e., the test is liberal. Because t tests are relatively reliable in the face of unequal variance, many statisticians question the use of Bartlett's test as a preliminary test before the independent t test. Consider:
It has been shown that in the commonly occurring case in which group sizes are equal, or not very different, the [independent t test] is affected surprisingly little by variance inequalities. Since this test is also known to be very insensitive to non-normality it would be best to accept the fact that it can be used safely under most practical conditions. To make the preliminary test on variances is rather like putting to sea in a row boat to find out whether conditions are sufficiently calm for an ocean liner to leave port! (Box, 1953)
To achieve 80% power for α = 0.05 (two-sided), each group should have:
n = (16σ² / δ²) + 1
where δ = a "difference worth detecting" and σ = a good estimate of the within-group standard deviation (e.g., sp). Suppose we want to detect a difference of 25 units and assume the standard deviation of the outcome variable is 45. Then the required sample size per group is n = (16)(45²) / 25² + 1 = 52.84 ≈ 53.
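The rule of thumb is easy to script. The sketch below simply evaluates the formula above and rounds up to the next whole subject; the function name n_per_group and its parameter names are assumptions made for this example.

```python
import math

def n_per_group(sd, difference):
    """Rule-of-thumb n per group for 80% power at two-sided alpha = 0.05."""
    return 16 * sd ** 2 / difference ** 2 + 1

raw_n = n_per_group(sd=45, difference=25)

# Always round up: a fraction of a subject cannot be recruited.
print(f"n = {raw_n:.2f}, so recruit {math.ceil(raw_n)} per group")
```

Because n depends on the squared ratio of the standard deviation to the difference, halving the difference worth detecting roughly quadruples the required sample size.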
Power is the probability of achieving a "significant" result under a given set of assumptions when H0 is false. For example, we might ask, "What is the probability of achieving statistical significance at α = .05 (two-sided), assuming μ1 = 50, μ2 = 40, σ = 45, and n1 = n2 = 20?" The answer is ".10," meaning the test has only a 10% chance of rejecting the false null hypothesis. Try using the Web power calculator located at http://www.health.ucalgary.ca/~rollin/stats/ssize/n2.html to calculate power for the type of problem presented in this chapter.
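A power figure of this kind can be approximated without a Web calculator. The sketch below uses a normal approximation to the two-sample test (an assumption; it is not necessarily the method used by the calculator cited above), so it gives roughly 0.11 rather than exactly the .10 quoted. The function name approx_power is hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(mu1, mu2, sd, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test."""
    z = NormalDist()
    se = sd * math.sqrt(2 / n_per_group)   # se of the mean difference
    shift = abs(mu1 - mu2) / se            # true difference in se units
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

power = approx_power(mu1=50, mu2=40, sd=45, n_per_group=20)
print(f"approximate power = {power:.2f}")
```

When the two means are set equal, the function returns alpha itself, which is a quick sanity check on the approximation.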
(1) TWOGRPS.ZIP. Scores from Two Groups. Two groups demonstrate the following scores on a psychological profile test:
Group 1: 86, 99, 96, 95, 72, 73, 95, 125, 97, 95
Group 2: 110, 126, 89, 106, 98, 105, 93, 127, 130, 92
Computerize these data, remembering to create separate variables for SCORE and GROUP, and then compute the descriptive and inferential statistics described in this chapter. Report your findings in plain language.
(2) FEV.ZIP (Rosner, 1990, p. 40; Tager et al., 1985). Data are from a respiratory health survey of children and adolescents. Codes in the file are as follows:
Variable | Type | Len | Description |
ID | Integer | 5 | Identification number |
AGE | Integer | 2 | Age of participant at beginning of the study (years) |
FEV | Real (#.####) | 6 | Forced expiratory volume (liters/second) |
HEIGHT | Real (##.#) | 4 | Height (inches) |
SEX | Integer | 1 | Sex: 0 = female, 1 = male |
SMOKE | Integer | 1 | Current smoking status: 0 = non-smoker, 1 = smoker |
Compare the smokers and non-smokers in this file with respect to their age.