Background
In this chapter we compare a nominal outcome among several groups. The study outcome is stored as a text or numerically encoded dependent variable. The study predictor is stored in a separate text or numerically encoded independent variable.
Illustrative example. Techniques in this chapter will be illustrated with the case-control data set BD1.REC, which consists of 200 cases with esophageal cancer and 775 neighborhood-matched controls. The independent variable (defining groups) is CASE (coded 1 = case, 2 = control). The dependent variable (i.e., the variable allowed to vary according to the case-control sampling method) is ALC (alcohol consumption, coded: 1 = 0-39 gms/day, 2 = 40-79 gms/day, 3 = 80-119 gms/day, 4 = 120+ gms/day). The data come from a study by Tuyns et al. (1977) as reported by Breslow & Day (1980, p. 123). The first three records and the last record are:
Record | CASE | ALC |
1 | 2 | 1 |
2 | 2 | 1 |
3 | 2 | 1 |
. | . | . |
975 | 1 | 4 |
Suggestion: Download the data set, unzip it and open the file. View its content and take note of its structure.
Before data are tested, they are cross-tabulated to form an r-by-c contingency table, where r represents the number of rows in the table and c represents the number of columns. For our illustrative example, the cross-tabulated data are:
Alcohol Consumption (grams / day) | Cases (n = 200) | Controls (n = 775) |
0-39 | 29 | 386 |
40-79 | 75 | 280 |
80-119 | 51 | 87 |
120+ | 45 | 22 |
Notice that these data might just as easily have been set up as a 2-by-4 contingency table, with case status represented along rows and alcohol consumption level represented along columns. This would not materially change the data or the conclusions to follow.
For the sake of consistency in this chapter, let us arrange our tables with the dependent variable listed down table rows and the independent variable (groups) listed across table columns. Let $n_i$ represent the size of group i (i: 1, 2, . . ., c), and let $m_j$ represent the number of people falling into category j (j: 1, 2, . . ., r). The total sample size is N. The number of people in category j (row) who belong to group i (column) is $x_{j,i}$, so the first subscript always indexes the row. Table cell labels are therefore:
Rows: Variable R (dependent var); columns: Variable C (independent var)
R \ C | 1 | 2 | ... | c | Total |
1 | $x_{1,1}$ | $x_{1,2}$ | ... | $x_{1,c}$ | $m_1$ |
2 | $x_{2,1}$ | $x_{2,2}$ | ... | $x_{2,c}$ | $m_2$ |
. | ... | ... | ... | ... | . |
r | $x_{r,1}$ | $x_{r,2}$ | ... | $x_{r,c}$ | $m_r$ |
Total | $n_1$ | $n_2$ | ... | $n_c$ | N |
To cross-tabulate data in Epi Info, issue the commands:
EPI6> READ <x:\path\dataset.rec>
EPI6> TABLES <rowvar> <columnvar>
For example, to cross-tabulate the illustrative example, issue the commands:
EPI6> READ A:\BD1
EPI6> TABLES ALC CASE
Output is:
CASE
ALC | 1 2 | Total
-----------+---------------+------
1 | 29 386 | 415
2 | 75 280 | 355
3 | 51 87 | 138
4 | 45 22 | 67
-----------+---------------+------
Total | 200 775 | 975
Suggestion: Cross-tabulate the data in Epi Info.
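For readers working outside Epi Info, the same cross-tabulation can be sketched in Python. This is only an illustration and not part of the Epi Info workflow: the file BD1.REC is not read here, so the `records` list is rebuilt from the published cell counts, and the names `cell_counts`, `records`, and `crosstab` are hypothetical.

```python
from collections import Counter

# Published cell counts, keyed by (ALC, CASE).
cell_counts = {
    (1, 1): 29, (1, 2): 386,
    (2, 1): 75, (2, 2): 280,
    (3, 1): 51, (3, 2): 87,
    (4, 1): 45, (4, 2): 22,
}

# Rebuild one (ALC, CASE) record per person (a stand-in for reading BD1.REC).
records = [pair for pair, n in cell_counts.items() for _ in range(n)]


def crosstab(rows):
    """Count how many records fall into each (row value, column value) pair."""
    return Counter(rows)


table = crosstab(records)
alc_levels = sorted({alc for alc, _ in table})
case_levels = sorted({case for _, case in table})

# Print the table with row totals, then the column totals.
print("ALC  " + "".join(f"{c:>6}" for c in case_levels) + "  Total")
for alc in alc_levels:
    counts = [table[(alc, c)] for c in case_levels]
    print(f"{alc:<5}" + "".join(f"{n:>6}" for n in counts) + f"{sum(counts):>7}")
col_totals = [sum(table[(a, c)] for a in alc_levels) for c in case_levels]
print("Total" + "".join(f"{n:>6}" for n in col_totals) + f"{sum(col_totals):>7}")
```

Counting (row, column) pairs is all a cross-tabulation does, which is why a plain `Counter` is enough for this sketch.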
We want to compute relative frequencies (percentages) within groups. The percentage of people in group i with characteristic j = $x_{j,i} / n_i$. For example, the percentage of people in group 1 falling into category 1 = $x_{1,1} / n_1$. In the illustrative example, $x_{1,1} / n_1$ = 29 / 200 = 14.5%.
To have Epi Info calculate cell percentages, issue the command:
EPI6> SET PERCENTS = ON
Then reissue the TABLES command:
EPI6> TABLES ALC CASE
Output is:
CASE
ALC | 1 2 | Total
-----------+---------------+------
1 | 29 386 | 415
> 7.0% 93.0% > 42.6%
| 14.5% 49.8% |
2 | 75 280 | 355
> 21.1% 78.9% > 36.4%
| 37.5% 36.1% |
3 | 51 87 | 138
> 37.0% 63.0% > 14.2%
| 25.5% 11.2% |
4 | 45 22 | 67
> 67.2% 32.8% > 6.9%
| 22.5% 2.8% |
-----------+---------------+------
Total | 200 775 | 975
| 20.5% 79.5% |
Observe that each cell now displays a count and two percentages. The row percent (indicated with a ">") shows the cell count as a percentage of the row total. The column percent shows the cell count as a percentage of the column total. Since we are currently interested in percentages within groups, we should focus on column percentages. Notice that cases are more likely than controls to fall into high ALC levels.
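The row and column percentages can also be checked by hand or with a few lines of code. The following Python sketch assumes the cross-tabulated counts are typed in directly as a list of rows; the variable names are illustrative only.

```python
# Observed counts: rows are ALC levels 1-4, columns are CASE 1 (case) and 2 (control).
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Row percent: cell count as a percentage of its row total.
row_pct = [[100 * x / rt for x in row] for row, rt in zip(observed, row_totals)]

# Column percent: cell count as a percentage of its column (group) total.
col_pct = [[100 * x / ct for x, ct in zip(row, col_totals)] for row in observed]

for level, (rp, cp) in enumerate(zip(row_pct, col_pct), start=1):
    print(f"ALC {level}: row % = " + ", ".join(f"{p:.1f}" for p in rp)
          + " | column % = " + ", ".join(f"{p:.1f}" for p in cp))
```

The column percentages are the within-group figures of interest here, because the columns define the groups (cases and controls).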
Inferential methods in this chapter are based on the chi-square probability function, as introduced by Karl Pearson (circa 1900) with later development by R. A. Fisher. Chi-square distributions are asymmetrical probability functions with long right tails. The area under the distribution is used to quantify the probability of random occurrences. Although chi-square distributions have many uses, this chapter focuses on their use in testing whether the row and column classifications of a contingency table are independent. This test is called the chi-square test of independence.
The null hypothesis is that the row and column variables are independent. The alternative hypothesis is that the row and column variables are dependent. This is equivalent to:
H0: no association between the row and column variables
H1: association between the row and column variables
The alpha level is set before submitting data to testing. In this instance, let alpha = .01 (just for a change).
Pearson's chi-square statistic is used to perform the test. This statistic is based on a comparison of observed cell counts to expected counts. Expected counts represent hypothetical values that would occur if there were no association between the variables being tested. The expected count in each table cell is calculated:
expected count = (row total * column total) / N
For the illustrative example, expected counts are:
Alcohol Consumption (grams / day) | Esophageal Cancer: Yes | Esophageal Cancer: No | Total |
0-39 | (415 * 200) / 975 = 85.13 | (415 * 775) / 975 = 329.87 | 415 |
40-79 | (355 * 200) / 975 = 72.82 | (355 * 775) / 975 = 282.18 | 355 |
80-119 | (138 * 200) / 975 = 28.31 | (138 * 775) / 975 = 109.69 | 138 |
120+ | (67 * 200) / 975 = 13.74 | (67 * 775) / 975 = 53.26 | 67 |
Total | 200 | 775 | 975 |
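The expected counts follow mechanically from the marginal totals, so they are easy to verify. Here is a minimal Python sketch, assuming the observed counts are entered as a list of rows (the variable names are illustrative):

```python
# Observed counts: rows are alcohol levels, columns are cases and controls.
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# expected count = (row total * column total) / N, for every cell.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

labels = ["0-39", "40-79", "80-119", "120+"]
for label, row in zip(labels, expected):
    print(f"{label:<7}" + "".join(f"{e:>9.2f}" for e in row))
```

Note that each row of expected counts sums to its observed row total, and each column to its observed column total; only the distribution within the table changes.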
The differences between observed and expected counts are called residuals.
residual = observed - expected
The residuals for the illustrative example are:
Alcohol Consumption | case | control |
0 - 39 gm/day | 29 - 85.1 = -56.1 | 386 - 329.9 = 56.1 |
40 - 79 gm/day | 75 - 72.8 = 2.2 | 280 - 282.2 = -2.2 |
80 - 119 gm/day | 51 - 28.3 = 22.7 | 87 - 109.7 = -22.7 |
120+ gm/day | 45 - 13.7 = 31.3 | 22 - 53.3 = -31.3 |
These residuals let the user know how much each observed value is deviating from expected.
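The residual table can be reproduced with the short Python sketch below, which assumes the observed counts are entered by hand and recomputes the expected counts from the marginal totals. The variable names are illustrative.

```python
# Observed counts: rows are alcohol levels, columns are cases and controls.
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# residual = observed - expected, cell by cell.
residuals = [
    [obs - exp for obs, exp in zip(o_row, e_row)]
    for o_row, e_row in zip(observed, expected)
]

labels = ["0-39", "40-79", "80-119", "120+"]
for label, row in zip(labels, residuals):
    print(f"{label:<7}" + "".join(f"{r:>8.1f}" for r in row))
```

Because observed and expected counts share the same row totals, the residuals in each row sum to zero. In a two-column table this means the case and control residuals are equal in size and opposite in sign, as the table above shows.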
Pearson's chi-square test statistic is calculated:
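Pearson's chi-square = $\sum \left[ \text{residual}^2 / \text{expected} \right]$
For the illustrative example, Pearson's chi-square = $\frac{(-56.1)^2}{85.1} + \frac{56.1^2}{329.9} + \frac{2.2^2}{72.8} + \frac{(-2.2)^2}{282.2} + \frac{22.7^2}{28.3} + \frac{(-22.7)^2}{109.7} + \frac{31.3^2}{13.7} + \frac{(-31.3)^2}{53.3}$ = 36.98 + 9.54 + 0.07 + 0.02 + 18.21 + 4.70 + 71.51 + 18.38 = 159.41. Because the residuals and expected counts were rounded to one decimal place, this hand-calculated value differs slightly from the 158.95 obtained without intermediate rounding.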
Under the null hypothesis, Pearson's chi-square statistic has a chi-square distribution with (r - 1)(c - 1) degrees of freedom, where r represents the number of rows in the table and c represents the number of columns. Since we are currently testing a 4-by-2 table, df = (4 - 1)(2 - 1) = 3. This information is used to compute a p value for the problem. For the illustrative example, the chi-square statistic = 158.95 (computed without intermediate rounding) with 3 df, p < .0005.
Pearson's chi-square statistic is computed by Epi Info whenever data are cross-tabulated. Output for the illustrative example is:
Chi square = 158.95
Degrees of freedom = 3
p value = 0.00000000 <---
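The statistic, degrees of freedom, and p value can also be computed without Epi Info. The Python sketch below uses only the standard library. It is an illustration under stated assumptions, not a description of how Epi Info works internally: the function names `pearson_chi_square` and `chi_square_upper_tail` are hypothetical, and the tail area relies on the closed-form expressions that exist for integer degrees of freedom.

```python
import math

# Observed counts: rows are alcohol levels, columns are cases and controls.
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]


def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for an r-by-c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for row, rt in zip(table, row_totals):
        for obs, ct in zip(row, col_totals):
            exp = rt * ct / n          # expected count under independence
            stat += (obs - exp) ** 2 / exp
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


def chi_square_upper_tail(x, df):
    """Upper-tail area P(X >= x) for a chi-square variable with integer df >= 1."""
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!.
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: a normal-tail term plus a finite correction sum.
    terms = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


stat, df = pearson_chi_square(observed)
p_value = chi_square_upper_tail(stat, df)
print(f"Chi square = {stat:.2f}")
print(f"Degrees of freedom = {df}")
print(f"p value = {p_value:.3g}")
```

No intermediate rounding is applied here, so the statistic should agree with the Epi Info output rather than with the hand calculation. The p value is far too small to show in eight decimal places, which is why Epi Info prints it as zero.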
The chi-square statistic, its associated degrees of freedom, and sample size basis (N) are usually reported when presenting chi-square information. The APA Publication Manual suggests the following format: χ²(df, N = xxx) = xx.xx, p = .xxx. The APA format for the illustrative example is: χ²(3, N = 975) = 158.95, p < .0005.
In instances when data are presented in cross-tabulated form, use the web r × c Contingency Table calculator at http://www.physics.csbsju.edu/stats/contingency.html to calculate chi-square statistics.
The chi-square test of independence is based on a combination of normal approximations, and hence assumes that expected cell counts are greater than or equal to 5. When this assumption is not met, an exact alternative should be used (e.g., Fisher's exact test, which is based on the hypergeometric distribution).
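The expected-count assumption is simple to check before running the test. The Python sketch below flags any cell whose expected count falls below 5. The function name `cells_below_minimum` and the `sparse` table are made up for illustration; only `observed` comes from the chapter's example.

```python
def cells_below_minimum(table, minimum=5):
    """Return (row index, column index, expected count) for each cell
    whose expected count is below `minimum`."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    flagged = []
    for r, rt in enumerate(row_totals):
        for c, ct in enumerate(col_totals):
            exp = rt * ct / n
            if exp < minimum:
                flagged.append((r, c, exp))
    return flagged


# Illustrative example from this chapter: no cell should be flagged.
observed = [[29, 386], [75, 280], [51, 87], [45, 22]]
print(cells_below_minimum(observed))

# A made-up sparse table in which one expected count is too small.
sparse = [[1, 9], [9, 81]]
print(cells_below_minimum(sparse))
```

An empty list means the assumption holds for every cell. A non-empty list suggests switching to an exact test.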
(1) Fill in the blank: With a continuous outcome, descriptive statistics are based on sums and averages. With a categorical outcome, descriptive statistics are based on _________________ and ___________________.
ANS: counts and proportions
(2) What type of test is used to determine statistical significance between a continuous dependent variable and categorical independent variable?
ANS: An independent t test or analysis of variance.
(3) What type of test is used to determine statistical significance between a continuous dependent variable and continuous independent variable?
ANS: A t test can be used via the regression model.
(4) What type of test is used to determine statistical significance between a categorical dependent variable and categorical independent variable?
ANS: A chi-square test, as described in this chapter.
(5) List the (two-sided) null hypotheses used by each of the tests listed in (2) - (4), above.
ANS:
Independent t test: H0: μ1 = μ2
Regression test: H0: β1 = 0
Chi-square test of independence: H0: "no association"
(6) List the assumptions required of each of the above tests.
ANS: Using short descriptors,
Independent t test: Independence, Normality, Equal Variance
Regression test: Linearity, Independence, Normality, Equal Variance
Chi-square test: Independence, Expected Values >= 5
Data come from a survey of smoking and socioeconomic status. Five socioeconomic status groups are considered, with group 1 representing the lowest SES and group 5 representing the highest. Cigarette smoking status is categorized as 1 = current smoker, 2 = non-smoker. Data have already been cross-tabulated, as follows:
Smoking Status \ Socio-Economic Status | 1 | 2 | 3 | 4 | 5 |
1 (smoker) | 17 | 76 | 34 | 32 | 20 |
2 (non-smoker) | 40 | 195 | 88 | 53 | 30 |
You are familiar with this data set from previous examples. Briefly, data are from a respiratory health survey of children and adolescents from the East Boston, MA, area. Let us now focus on the relationship between SMOKE (dependent variable) and SEX (independent variable).
American Psychological Association [APA]. (1994). Publication Manual (4th ed.). Washington, DC: Author.
Chang, C. L., Selvin, S., Langhauser, C. (1983). Biology and Public Health Statistics: BioEnv 130A. Unpublished instructional material, University of California, Berkeley.
Breslow, N. E., & Day, N. E. (1980). Statistical Methods in Cancer Research. Volume 1--The Analysis of Case-Control Studies. Lyon: International Agency for Research on Cancer.
Tuyns, A. J., Péquignot, G., & Jensen, O. M. (1977). Le cancer de l'oesophage en Ille-et-Vilaine en fonction des niveaux de consommation d'alcool et de tabac. Des risques qui se multiplient. Bull Cancer, 64, 45-60.
Zar, J. H. (1996). Biostatistical Analysis. (3rd Ed.) Upper Saddle River, NJ: Prentice Hall.