Continuous Outcome, Two Independent Groups


Background

This chapter considers the comparison of a continuous outcome from two independent groups.

Illustrative data WCGS.ZIP (Selvin, 1991, p. 41). To illustrate techniques, we consider cholesterol levels (mg/dl) in Type A and Type B men. Data are:

Type A: 233, 291, 312, 250, 246, 197, 268, 224, 239, 239, 254, 276, 234, 181, 248, 252, 202, 218, 212, 325
Type B: 344, 185, 263, 246, 224, 212, 188, 250, 148, 169, 226, 175, 242, 252, 153, 183, 137, 202, 194, 213

Data are structured as a numeric dependent (outcome) variable and a dichotomous independent (group) variable as CHOL and BEHAVIOR, respectively. The first three records and last record of this data set look like this:

REC  CHOL  BEHAVIOR
---  ----  --------
  1   233  A
  2   291  A
  3   312  A
etc.
 40   213  B

Descriptive Statistics

Descriptive statistics for the two groups are computed with a two variable MEANS command applied as follows:

EPI6> READ <DATASET>
EPI6> MEANS <DV> <IV>

where <DV> represents the name of the dependent variable and <IV> represents the name of the independent variable.

For the illustrative data set, the following commands are issued:

EPI6> READ WCGS
EPI6> MEANS CHOL BEHAVIOR

Five sections of output are produced (a frequency table, summary statistics, ANOVA table, Bartlett's test, Kruskal-Wallis test). Summary statistics are printed below the frequency table. For the illustrative data, the summary statistics are:

MEANS of CHOL for each category of BEHAVIOR

BEHAVIOR     Obs   Total     Mean   Variance   Std Dev
A             20    4901  245.050   1342.366    36.638
B             20    4206  210.300   2336.747    48.340
Difference                 34.750

BEHAVIOR     Minimum   25%ile   Median   75%ile   Maximum     Mode
A            181.000  221.000  242.500  261.000   325.000  239.000
B            137.000  179.000  207.000  244.000   344.000  137.000

Thus, n₁ = 20 and n₂ = 20 (listed under Obs), and the Type A men in the sample have a higher mean cholesterol level than the Type B men (245.1 vs. 210.3 mg/dl). In addition, the Type A group shows less variability than the Type B group (standard deviations: 36.6 vs. 48.3 mg/dl).
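For readers working outside Epi Info, the summary statistics above can be cross-checked with a short Python sketch. It uses only the standard library; the names `type_a`, `type_b`, and `describe` are illustrative assumptions and are not part of Epi Info.

```python
from statistics import mean, median, stdev, variance

# Cholesterol levels (mg/dl) from the illustrative data set
type_a = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
type_b = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]


def describe(values):
    """Return the basic summary statistics for one group."""
    return {
        "n": len(values),
        "total": sum(values),
        "mean": mean(values),
        "variance": variance(values),  # sample variance (n - 1 denominator)
        "sd": stdev(values),           # sample standard deviation
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


summary_a = describe(type_a)
summary_b = describe(type_b)
mean_difference = summary_a["mean"] - summary_b["mean"]

for label, summary in (("A", summary_a), ("B", summary_b)):
    print(label, {key: round(value, 3) for key, value in summary.items()})
print("Difference in means:", round(mean_difference, 3))
```

Quartiles and the mode are left out on purpose, because software packages differ in how they define percentiles.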

Inferential Statistics

Confidence Interval

The observed mean difference (34.750 in this instance) is the point estimate of the expected mean difference μ₁ − μ₂. To calculate a 95% confidence interval for μ₁ − μ₂, first calculate (by hand) the standard error of the mean difference as follows:

se = SQRT[(MSW)(1/n₁ + 1/n₂)]

where MSW is the Mean Square Within as reported in Epi Info's ANOVA table:

ANOVA
Variation         SS  df         MS  F statistic   p-value   t-value
Between    12075.625   1  12075.625        6.564  0.013853  2.562113
Within     69903.150  38   1839.557
Total      81978.775  39

For the illustrative data, se = SQRT[(1839.557)(1/20 + 1/20)] = 13.56 mg/dl.

A 95% confidence interval for μ₁ − μ₂ is given by:

(mean difference) ± (t_{n₁+n₂−2, .975})(se)

where (mean difference) = mean₁ − mean₂, t_{n₁+n₂−2, .975} represents the 97.5th percentile of a t distribution with n₁ + n₂ − 2 degrees of freedom (available from a t table), and se represents the standard error of the mean difference (described above). Thus, the 95% confidence interval for μ₁ − μ₂ for the illustrative data = (245.05 − 210.30) ± (t_{38, .975})(13.56) = 34.75 ± (2.02)(13.56) = (7.4, 62.1) mg/dl. This interval places the population mean difference between 7.4 and 62.1 mg/dl with 95% confidence.
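The hand calculation above can be reproduced with the following Python sketch. It assumes that the pooled variance of the two groups equals the Mean Square Within from the ANOVA table (true for a one-way ANOVA with two groups), and it takes the critical value 2.02 from the text rather than computing it. The function name `pooled_ci` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, variance

type_a = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
type_b = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]


def pooled_ci(group1, group2, t_crit):
    """Equal-variance confidence interval for the difference in means."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance; this is the Mean Square Within (MSW)
    msw = ((n1 - 1) * variance(group1)
           + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    se = sqrt(msw * (1 / n1 + 1 / n2))
    diff = mean(group1) - mean(group2)
    return diff, se, (diff - t_crit * se, diff + t_crit * se)


# t_crit = 2.02 is the tabled 97.5th percentile for 38 df used in the text
diff, se, (lower, upper) = pooled_ci(type_a, type_b, t_crit=2.02)
print(f"difference = {diff:.2f}, se = {se:.2f}")
print(f"95% CI = ({lower:.1f}, {upper:.1f}) mg/dl")
```

Passing the critical value as an argument keeps the sketch free of third-party libraries; a statistics package could supply the percentile directly.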

Independent t Test

Epi Info calculates the equal variance independent t test for H₀: μ₁ = μ₂ in its ANOVA table:

Variation         SS  df         MS  F statistic   p-value   t-value
Between    12075.625   1  12075.625        6.564  0.013853  2.562113
Within     69903.150  38   1839.557
Total      81978.775  39

Thus, data demonstrate t_stat = 2.56 with 38 degrees of freedom (p = .014). Most investigators would consider this "significant" evidence against H₀.
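As a cross-check on the ANOVA table, the sketch below computes the equal-variance t statistic directly as the mean difference divided by its standard error, and confirms that t² equals the F statistic. The two-sided p value is approximated by integrating the t density numerically with Simpson's rule; this is an illustrative substitute for a statistics library and is not how Epi Info computes it. The helper names are assumptions.

```python
from math import gamma, pi, sqrt
from statistics import mean, variance

type_a = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
type_b = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]


def pooled_t(group1, group2):
    """Return the equal-variance t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    msw = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    se = sqrt(msw * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / se, df


def t_pdf(x, df):
    """Density of Student's t distribution."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * central_area


t_stat, df = pooled_t(type_a, type_b)
p_value = two_sided_p(t_stat, df)
print(f"t = {t_stat:.3f}, df = {df}, t squared = {t_stat ** 2:.3f}")
print(f"two-sided p = {p_value:.4f}")
```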

Assumptions of Confidence Interval and t Test

The above confidence interval and test statistics assume that (1) data are free of bias (information bias, selection bias, and confounding), (2) groups and individuals within groups are independent, (3) the sampling distribution of the mean difference is normal, and (4) variances in the two populations are equal (homoscedasticity). Although assumptions (3) and (4) may be violated in practice, numerous studies have shown that these methods tolerate considerable departures from normality and equal variance while still providing stable results. Robustness to violations of these last two assumptions is good when sample sizes are equal (n₁ = n₂), samples are large (n > 30), and a two-sided test is used (Zar, 1996, p. 128). Furthermore, the statistical model behind a test need not be realistic for the test to be useful.

Statistical models are sometimes misunderstood in epidemiology. Statistical models are never true. The question of whether a model is true is irrelevant. A more appropriate question is whether we obtain the correct scientific conclusion if we pretend that the process under study behaves according to a particular statistical model (Zeger, 1991).

Mann-Whitney / Kruskal-Wallis Test

The Mann-Whitney test and the Kruskal-Wallis test (for two samples) are non-parametric analogues of the independent t test. Epi Info computes the Kruskal-Wallis test as part of its MEANS command. Here are the results for the illustrative data:

Mann-Whitney or Wilcoxon Two-Sample Test (Kruskal-Wallis test for two groups)
Kruskal-Wallis H (equivalent to Chi square) = 6.333
Degrees of freedom = 1
p value = 0.011853

Thus, χ²_stat = 6.33 with 1 degree of freedom (p = 0.012).

Comment: The Kruskal-Wallis procedure is slightly less powerful than the independent t test when data come from normally distributed populations. The loss of efficiency is surprisingly small when the test is used in non-normal populations.
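To show where the H statistic comes from, here is a minimal Python sketch of the Kruskal-Wallis calculation for the illustrative data. It assumes the usual textbook form: pooled observations are ranked with tied values sharing their average rank, and H is divided by a tie correction factor. Epi Info's internal algorithm is not documented here, so treat this as an independent cross-check. The function names are hypothetical.

```python
from math import erfc, sqrt

type_a = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
type_b = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]


def midranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic."""
    pooled = [x for group in groups for x in group]
    n_total = len(pooled)
    ranks, tie_sizes = midranks(pooled)
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n_total ** 3 - n_total)
    return h / correction


h_stat = kruskal_wallis_h([type_a, type_b])
# Upper tail of a chi-square distribution; this shortcut holds for 1 df only
p_value = erfc(sqrt(h_stat / 2))
print(f"H = {h_stat:.3f}, p = {p_value:.6f}")
```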

Bartlett's Test

When addressing two samples, Bartlett's test addresses H₀: σ²₁ = σ²₂, where σ²ᵢ represents the variance in population i. Epi Info performs this test whenever the MEANS command is used. Here are the results for the illustrative data:

Bartlett's test for homogeneity of variance
Bartlett's chi square = 1.404
deg freedom = 1
p-value = 0.236005

Thus, χ² = 1.40, df = 1, p = .24. This provides little or no support for rejecting H₀.
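Bartlett's statistic can be checked with the sketch below, which assumes the standard textbook form of the test: the difference between the log of the pooled variance and the weighted logs of the group variances, divided by a small-sample correction factor. The function name `bartlett_chi_square` is hypothetical, and the p value shortcut applies only to the two-group case (1 degree of freedom).

```python
from math import erfc, log, sqrt
from statistics import variance

type_a = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
type_b = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]


def bartlett_chi_square(groups):
    """Bartlett's chi-square statistic for homogeneity of variance."""
    k = len(groups)
    dfs = [len(group) - 1 for group in groups]
    variances = [variance(group) for group in groups]
    df_within = sum(dfs)
    pooled = sum(df * v for df, v in zip(dfs, variances)) / df_within
    numerator = df_within * log(pooled) - sum(
        df * log(v) for df, v in zip(dfs, variances))
    correction = 1 + (sum(1 / df for df in dfs) - 1 / df_within) / (3 * (k - 1))
    return numerator / correction


chi_square = bartlett_chi_square([type_a, type_b])
# Upper tail of a chi-square distribution; this shortcut holds for 1 df only
p_value = erfc(sqrt(chi_square / 2))
print(f"Bartlett's chi square = {chi_square:.3f}, p = {p_value:.4f}")
```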

Comment: Bartlett's test is reliable only when used in normal populations. When the population distribution is platykurtic, the true p value is less than the calculated p value, i.e., the test is conservative (Maurais & Ouimet, 1986). When the population distribution is leptokurtic, the true p value is greater than the calculated p value, i.e., the test is liberal. Because t tests are relatively reliable in the face of unequal variance, many statisticians question the use of Bartlett's test as a preliminary test before the independent t test. Consider:

It has been shown that in the commonly occurring case in which group sizes are equal, or not very different, the [independent t test] is affected surprisingly little by variance inequalities. Since this test is also known to be very insensitive to non-normality it would be best to accept the fact that it can be used safely under most practical conditions. To make the preliminary test on variances is rather like putting to sea in a row boat to find out whether conditions are sufficiently calm for an ocean liner to leave port! (Box, 1953)

Power and Sample Size

Sample Size Requirements

To achieve 80% power for α = 0.05 (two-sided), each group should have:

n = (16 s² / d²) + 1

where d = a "difference worth detecting" and s = a good estimate of within-group standard deviation (e.g., s_p). Suppose we want to detect a difference of 25 units and assume the standard deviation of the outcome variable is 45. Then, the required sample size per group is n = (16)(45²) / 25² + 1 = 52.84 ≈ 53.
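The sample size rule is easy to script. The sketch below evaluates the formula for the worked example and, as background, shows that the constant 16 is close to 2(z₀.₉₇₅ + z₀.₈₀)², the factor from the normal-approximation sample size formula for 80% power and two-sided α = 0.05. That derivation is offered as a common explanation and is not stated in the text above. The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sd, difference):
    """Approximate n per group for 80% power at two-sided alpha = 0.05."""
    return 16 * sd ** 2 / difference ** 2 + 1


n_exact = n_per_group(sd=45, difference=25)
n_required = ceil(n_exact)  # always round up to a whole participant
print(f"n = {n_exact:.2f}, so recruit {n_required} per group")

# Where the 16 comes from (normal approximation, assumed derivation)
z = NormalDist()
factor = 2 * (z.inv_cdf(0.975) + z.inv_cdf(0.80)) ** 2
print(f"2 * (z_0.975 + z_0.80)^2 = {factor:.2f}")
```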

Power

Power is the probability of achieving a "significant" result under a given set of assumptions, assuming H₀ is false. For example, we might ask, "What is the probability of achieving statistical significance at α = .05 (two-sided) assuming μ₁ = 50, μ₂ = 40, s = 45, and n₁ = n₂ = 20?" The answer is ".10," meaning the test has only a 10% chance of rejecting the false null hypothesis. Try using the Web power calculator located at http://www.health.ucalgary.ca/~rollin/stats/ssize/n2.html to calculate power for the type of problem presented in this chapter.
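The power figure quoted above can be approximated without a web calculator. This sketch uses a normal approximation to the two-sample test (an assumption; an exact calculation would use the noncentral t distribution and give a slightly different value). The function name `approx_power` and its parameter names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(mu1, mu2, sd, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test."""
    z = NormalDist()
    se = sd * sqrt(2 / n_per_group)   # standard error of the mean difference
    shift = abs(mu1 - mu2) / se       # true difference in standard-error units
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


power_n20 = approx_power(mu1=50, mu2=40, sd=45, n_per_group=20)
print(f"approximate power with n = 20 per group: {power_n20:.2f}")

# Same function applied to the sample size example: difference 25, sd 45, n 53
power_n53 = approx_power(mu1=25, mu2=0, sd=45, n_per_group=53)
print(f"approximate power with n = 53 per group: {power_n53:.2f}")
```

The second call ties the two sections together: the sample size chosen to give about 80% power should return a value near 0.80.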

Exercises

(1) TWOGRPS.ZIP. Scores from Two Groups. Two groups demonstrate the following scores on a psychological profile test:

Group 1: 86, 99, 96, 95, 72, 73, 95, 125, 97, 95
Group 2: 110, 126, 89, 106, 98, 105, 93, 127, 130, 92

Enter these data into the computer, creating separate variables for SCORE and GROUP, and then compute the descriptive and inferential statistics described in this chapter. Report your findings in plain language.

(2) FEV.ZIP (Rosner, 1990, p 40; Tager et al., 1985). Data are from a respiratory health survey of children and adolescents. Codes in the file are as follows:

Variable  Type           Len  Description
ID        Integer          5  Identification number
AGE       Integer          2  Age of participant at beginning of the study (years)
FEV       Real (#.####)    6  Forced expiratory volume (liters/second)
HEIGHT    Real (##.#)      4  Height (inches)
SEX       Integer          1  Sex: 0 = female, 1 = male
SMOKE     Integer          1  Current smoking status: 0 = non-smoker, 1 = smoker

Compare the smokers and non-smokers in this file with respect to their age.
