Categorical Outcome, Independent Groups

Background

In this chapter we compare a nominal outcome among several groups. The study outcome is stored as a text or numerically encoded dependent variable. The study predictor is stored in a separate text or numerically encoded independent variable.

Illustrative example. Techniques in this chapter will be illustrated with the case-control data set BD1.REC, which consists of 200 cases with esophageal cancer and 775 neighborhood-matched controls. The independent variable (defining groups) is CASE (coded 1 = case, 2 = control). The dependent variable (i.e., the variable allowed to vary according to the case-control sampling method) is ALC (alcohol consumption, coded: 1 = 0-39 gms/day, 2 = 40-79 gms/day, 3 = 80-119 gms/day, 4 = 120+ gms/day). The data comes from a study by Tuyns et al. (1977) as reported by Breslow & Day (1980, p. 123). The first three records and last record are:

Record	CASE	ALC
1	2	1
2	2	1
3	2	1
.	.	.
975	1	4

Suggestion: Download the data set, unzip it and open the file. View its content and take note of its structure.

Descriptive Statistics

Before data are tested, they are cross-tabulated to form an r-by-c contingency table, where r represents the number of rows in the table and c represents the number of columns. For our illustrative example, the cross-tabulated data are:

Alcohol Consumption (grams / day)	Cases (n = 200)	Controls (n = 775)
0-39	29	386
40-79	75	280
80-119	51	87
120+	45	22

Notice that these data might just as easily have been set up as a 2-by-4 contingency table, with case status represented along rows and alcohol consumption level represented along columns. This would not materially change the data or the conclusions to follow.

For the sake of consistency in this chapter, let us arrange our tables with the dependent variable listed down table rows and the independent variable (groups) listed across table columns. Let n_i represent the size of group i (i: 1, 2, . . ., c), and let m_j represent the number of people falling into category j (j: 1, 2, . . ., r). The total sample size is N, and the number of people in category j (row j) who belong to group i (column i) is x_j,i. Table cell labels are therefore:

Variable R (dependent var)	Variable C (independent var)
Variable R (dependent var)	1	2	...	c	Total
1	x_1,1	x_1,2	...	x_1,c	m₁
2	x_2,1	x_2,2	...	x_2,c	m₂
.	...	...	...	...	.
r	x_r,1	x_r,2	...	x_r,c	m_r
Total	n₁	n₂	...	n_c	N

To cross-tabulate data in EpiInfo, issue the commands:

EPI6> READ <x:\path\dataset.rec>
EPI6> TABLES <rowvar> <columnvar>

For example, to cross-tabulate the illustrative example, issue the commands:

EPI6> READ A:\BD1
EPI6> TABLES ALC CASE

Output is:

                CASE
ALC        |      1      2 | Total
-----------+---------------+------
         1 |     29    386 |   415
         2 |     75    280 |   355
         3 |     51     87 |   138
         4 |     45     22 |    67
-----------+---------------+------
     Total |    200    775 |   975

Suggestion: Cross-tabulate the data in Epi Info.
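For readers without Epi Info, the same cross-tabulation can be sketched in Python. This is only an illustration: the records are rebuilt from the published cell counts rather than read from BD1.REC, and the names cell_counts, records, and table are hypothetical.

```python
from collections import Counter

# Cell counts from the chapter's table: (ALC level, CASE code) -> count.
cell_counts = {
    (1, 1): 29, (1, 2): 386,
    (2, 1): 75, (2, 2): 280,
    (3, 1): 51, (3, 2): 87,
    (4, 1): 45, (4, 2): 22,
}

# Rebuild one (ALC, CASE) record per subject, standing in for the raw file.
records = [key for key, n in cell_counts.items() for _ in range(n)]

# Cross-tabulate: count the records falling in each (ALC, CASE) cell,
# then accumulate the row (ALC) and column (CASE) totals.
table = Counter(records)
row_totals = Counter(alc for alc, _ in records)
col_totals = Counter(case for _, case in records)

cases = sorted(col_totals)
for alc in sorted(row_totals):
    cells = [table[(alc, case)] for case in cases]
    print(alc, cells, row_totals[alc])
print("Total", [col_totals[case] for case in cases], len(records))
```

The printed rows should agree with the TABLES output above, with ALC levels down the rows and CASE codes across the columns.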

We want to compute relative frequencies (percentages) within groups. The percentage of people in group i with characteristic j = x_j,i / n_i, where x_j,i is the count in row j and column i of the table. For example, the percentage of people in group 1 falling into category 1 = x_1,1 / n₁. In the illustrative example, x_1,1 / n₁ = 29 / 200 = 14.5%.

To have Epi Info calculate cell percentages, issue the command:

EPI6> SET PERCENTS = ON

Then reissue the TABLES command:

EPI6> TABLES ALC CASE

Output is:

                CASE
ALC        |      1      2 | Total
-----------+---------------+------
         1 |     29    386 |   415
           >   7.0%  93.0% > 42.6%
           |  14.5%  49.8% |
         2 |     75    280 |   355
           >  21.1%  78.9% > 36.4%
           |  37.5%  36.1% |
         3 |     51     87 |   138
           >  37.0%  63.0% > 14.2%
           |  25.5%  11.2% |
         4 |     45     22 |    67
           >  67.2%  32.8% >  6.9%
           |  22.5%   2.8% |
-----------+---------------+------
     Total |    200    775 |   975
           |  20.5%  79.5% |

Observe that each cell now displays a count and two percentages. The row percent (indicated with a ">") shows the cell count as a percentage of the row total. The column percent shows the cell count as a percentage of the column total. Since we are currently interested in percentages within groups, we should focus on column percentages. Notice that cases are more likely than controls to fall into high ALC levels.
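The column percentages can also be checked by hand or with a few lines of Python. The following is a minimal sketch, assuming the counts are typed in as a list of rows (ALC levels 1 to 4) with columns ordered as cases, then controls; the variable names are illustrative only.

```python
# Observed counts: rows are ALC levels 1-4, columns are (cases, controls).
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]

# Column totals are the group sizes n_1 (cases) and n_2 (controls).
col_totals = [sum(col) for col in zip(*observed)]

# Column percent = cell count as a percentage of its column (group) total.
col_percents = [
    [100 * count / total for count, total in zip(row, col_totals)]
    for row in observed
]

for level, row in enumerate(col_percents, start=1):
    print(level, [f"{pct:.1f}%" for pct in row])
```

Each column of percentages sums to 100, which is a quick way to confirm that the group totals were used as denominators.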

Test of Independence

Inferential methods in this chapter are based on the chi-square probability function, as introduced by Karl Pearson (circa 1900) with later development by R. A. Fisher. Chi-square distributions are asymmetrical probability functions with long right tails. The area under the distribution is used to quantify the probability of random occurrences. Although chi-square distributions have many uses, this chapter focuses on their use in testing whether two categorical variables are independent. This test is called the chi-square test of independence.

The null hypothesis is that the row and column variables are independent. The alternative hypothesis is that the row and column variables are dependent. This is equivalent to:

H₀: no association between the row and column variables
H₁: association between the row and column variables

The alpha level is set before submitting data to testing. In this instance, let alpha = .01 (just for a change).

The chi-square test is used to perform the test. This test is based on a comparison of observed cell counts to expected counts. Expected counts represent hypothetical values that would occur if there were no association between the variables being tested. The expected count in each table cell is calculated:

expected count = (row total * column total) / N

For the illustrative example, expected counts are:

Alcohol Consumption (grams / day)	Esophageal Cancer
Alcohol Consumption (grams / day)	Yes	No	Total
0-39	(415 * 200) / 975 = 85.13	(415 * 775) / 975 = 329.87	415
40-79	(355 * 200) / 975 = 72.82	(355 * 775) / 975 = 282.18	355
80-119	(138 * 200) / 975 = 28.31	(138 * 775) / 975 = 109.69	138
120+	(67 * 200) / 975 = 13.74	(67 * 775) / 975 = 53.26	67
Total	200	775	975
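The expected counts in the table above can be reproduced with a short Python sketch. It assumes the observed counts are entered by hand as a list of rows; the names observed and expected are illustrative.

```python
# Observed counts: rows are ALC levels 1-4, columns are (cases, controls).
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # total sample size N

# expected count = (row total * column total) / N for every cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 2) for value in row])
```

Note that the expected counts in each row and column add up to the same marginal totals as the observed counts.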

The differences between observed and expected counts are called residuals.

residual = observed - expected

The residuals for the illustrative example are:

Alcohol Consumption	case	control
0 - 39 gm/day	29 - 85.1 = -56.1	386 - 329.9 = 56.1
40 - 79 gm/day	75 - 72.8 = 2.2	280 - 282.2 = -2.2
80 - 119 gm/day	51 - 28.3 = 22.7	87 - 109.7 = -22.7
120+ gm/day	45 - 13.7 = 31.3	22 - 53.3 = -31.3

Each residual shows how far an observed count deviates from the count expected under the null hypothesis.
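As a rough check on the residuals, the subtraction can be sketched in Python. The counts are entered by hand and the variable names are illustrative, not part of Epi Info.

```python
# Observed counts: rows are ALC levels 1-4, columns are (cases, controls).
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# residual = observed - expected, cell by cell.
residuals = [
    [obs - exp for obs, exp in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

for row in residuals:
    print([round(value, 1) for value in row])
```

Because observed and expected counts share the same row totals, the residuals within each row cancel, which explains why the two columns of a two-group table mirror each other with opposite signs.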

Pearson's chi-square test statistic is calculated:

Pearson's chi-square = SUM [residual² / expected]

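For the illustrative example, Pearson's chi-square = [(-56.1)²/85.1 + 56.1²/329.9 + 2.2²/72.8 + (-2.2)²/282.2 + 22.7²/28.3 + (-22.7)²/109.7 + 31.3²/13.7 + (-31.3)²/53.3] = 36.98 + 9.54 + 0.07 + 0.02 + 18.21 + 4.70 + 71.51 + 18.38 = 159.41. Because the residuals and expected counts were rounded to one decimal place, this hand calculation differs slightly from the value obtained at full precision.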
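To see the effect of rounding, the statistic can be recomputed without rounding any intermediate values. This is a minimal Python sketch of the formula above; the counts are typed in by hand and the names are illustrative.

```python
# Observed counts: rows are ALC levels 1-4, columns are (cases, controls).
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson's chi-square = SUM [ (observed - expected)^2 / expected ].
chi_square = 0.0
for obs_row, r in zip(observed, row_totals):
    for obs, c in zip(obs_row, col_totals):
        exp = r * c / n
        chi_square += (obs - exp) ** 2 / exp

# Degrees of freedom = (rows - 1) * (columns - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(f"chi-square = {chi_square:.2f}, df = {df}")
```

At full precision the statistic comes to about 158.95, a little below the 159.41 obtained from residuals rounded to one decimal place.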

Under the null hypothesis, Pearson's chi-square statistic has a chi-square distribution with (r - 1)(c - 1) degrees of freedom, where r represents the number of rows in the table and c represents the number of columns. Since we are currently testing a 4-by-2 table, df = (4 - 1)(2 - 1) = 3. This information is used to compute a p value for the problem. For the illustrative example, the chi-square statistic computed without intermediate rounding = 158.95 with 3 df, p < .0005. Because p is less than alpha = .01, the null hypothesis of independence is rejected.

Pearson's chi-square statistic is computed by Epi Info whenever data are cross-tabulated. Output for the illustrative example is:

Chi square         = 158.95
Degrees of freedom = 3
p value            = 0.00000000 <---
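The p value prints as zero only because it is smaller than the eight decimal places displayed. A rough Python sketch of the upper-tail area is given below. It relies on a closed-form expression that holds only for 3 degrees of freedom, and the function name chi_square_sf_3df is hypothetical.

```python
import math


def chi_square_sf_3df(x):
    """Upper-tail probability P(X >= x) for a chi-square variable with 3 df.

    Uses the closed form erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2),
    which is valid only when df = 3.
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


chi_square = 158.95  # statistic reported in the Epi Info output
alpha = 0.01         # alpha level chosen before testing

p_value = chi_square_sf_3df(chi_square)
print(f"p = {p_value:.3g}")
print("reject H0" if p_value < alpha else "retain H0")
```

For tables with other degrees of freedom this shortcut does not apply, and a statistical package or chi-square table should be used instead.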

Reporting the Chi-Square Statistic

The chi-square statistic, its associated degrees of freedom, and sample size basis (N) are usually reported when presenting chi-square information. The APA Publication Manual suggests the following format: χ²(df, N = xxx) = xx.xx, p = .xxx. The APA format for the illustrative example is: χ²(3, N = 975) = 158.95, p < .0005.

Web Calculator

In instances when data are presented in cross-tabulated form, use the web r×c Contingency Table calculator at http://www.physics.csbsju.edu/stats/contingency.html to calculate chi-square statistics.

Assumption of the Chi-square Test of Independence

The chi-square test of independence is built from normal approximations to the cell counts, and hence assumes expected cell counts to be greater than or equal to 5. When this assumption is not met, an exact test should be used instead (e.g., Fisher's exact test, which is based on the hypergeometric distribution).
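The assumption can be checked before testing by finding the smallest expected count. The sketch below does this for the illustrative example, using the threshold of 5 stated above; the variable names are illustrative.

```python
# Observed counts: rows are ALC levels 1-4, columns are (cases, controls).
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# The chi-square approximation is considered adequate here when every
# expected count is at least 5.
min_expected = min(value for row in expected for value in row)
assumption_met = min_expected >= 5

print(f"smallest expected count = {min_expected:.2f}")
print("assumption met" if assumption_met else "consider an exact test")
```

In the illustrative example the smallest expected count is about 13.74 (cases in the 120+ category), so the chi-square test is appropriate.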

Review Questions

(1) Fill in the blank: With a continuous outcome, descriptive statistics are based on sums and averages. With a categorical outcome, descriptive statistics are based on _________________ and ___________________.

ANS: counts and proportions

(2) What type of test is used to determine statistical significance between a continuous dependent variable and categorical independent variable?

ANS: An independent t test or analysis of variance.

(3) What type of test is used to determine statistical significance between a continuous dependent variable and continuous independent variable?

ANS: A t test can be used via the regression model.

(4) What type of test is used to determine statistical significance between a categorical dependent variable and categorical independent variable?

ANS: A chi-square test, as described in this chapter.

(5) List the (two-sided) null hypotheses used by each of the tests listed in (2) - (4), above.

ANS:
Independent t test: H₀: μ₁ = μ₂
Regression test: H₀: β₁ = 0
Chi-square test of independence: H₀: "no association"

(6) List the assumptions required of each of the above tests.

ANS: Using short descriptors,
Independent t test: Independence, Normality, Equal Variance
Regression test: Linearity, Independence, Normality, Equal Variance
Chi-square test: Independence, Expected Values >= 5

Exercises

(1) SESSMOKE: Prevalence of Smoking by Socioeconomic Status (Chang, et al., 1983)

Data come from a survey of smoking and socioeconomic status. Five socioeconomic status groups are considered, with group 1 representing the lowest SES and group 5 representing the highest. Cigarette smoking status is categorized as 1 = current smoker, 2 = non-smoker. Data have already been cross-tabulated, as follows:

	Socio-Economic Status
	1	2	3	4	5
1 (smoker)	17	76	34	32	20
2 (non-smoker)	40	195	88	53	30

(A) Calculate the proportion (prevalence) of smoking within each SES category. (Comment: Column totals must first be calculated. Prevalence represents the relative frequency of smoking within each category group.)

(B) By hand, calculate the table of expected values. Are any expected counts less than 5?

(C) Using the Web calculator, test the hypothesis of no association. State the null and alternative hypotheses. Let alpha = .05. Report the chi-square statistic, its degrees of freedom, and the total sample size, using the suggested APA format. State your conclusion. Is there a significant difference in the proportion of smokers by SES?

(2) BD1.REC: Tobacco Use in Esophageal Cancer Patients and Controls (Breslow & Day, 1980)

You are familiar with this data set from its illustrative use in the chapter. In addition to alcohol consumption, this study considered tobacco consumption (variable TOB: 1 = 0-9 gms/day, 2 = 10-19 gms/day, 3 = 20-29 gms/day, 4 = 30+ gms/day). The case status of subjects is contained in variable CASE (1 = case, 2 = control).

(A) Cross-tabulate the data, listing tobacco consumption status along rows and case status along columns. Report the cross-tabulation.

(B) Report the distribution of tobacco consumption percentages by case status.

(C) Test the hypothesis of no association. State the null and alternative hypotheses. Let alpha = .05. Report the chi-square statistic, its degrees of freedom, and the total sample size, using the suggested APA format. State your conclusion. Is there a significant difference in tobacco consumption in cases and controls? How so (i.e., which group tends to consume more tobacco)?

(3) FEV.REC: Smoking in Boys and Girls

You are familiar with this data set from previous examples. Briefly, data are from a respiratory health survey of children and adolescents from the East Boston, MA, area. Let us now focus on the relationship between SMOKE (dependent variable) and SEX (independent variable).

(A) Cross-tabulate the data, listing the dependent variable along the rows and the independent variable along the columns of the table.

(B) Report the proportion of boys and girls who smoke.

(C) Perform a test of association. State the null and alternative hypotheses. Let alpha = .05. Report the chi-square statistic, its degrees of freedom, and the total sample size. (Use APA format, if possible.) State your conclusion. Are data significant? If so, how?

References

American Psychological Association [APA]. (1994). Publication Manual (4th ed.). Washington, DC: Author.

Chang, C. L., Selvin, S., & Langhauser, C. (1983). Biology and Public Health Statistics: BioEnv 130A. Unpublished instructional material, University of California, Berkeley.

Breslow, N. E., & Day, N. E. (1980). Statistical Methods in Cancer Research. Volume 1--The Analysis of Case-Control Studies. Lyon: International Agency for Research on Cancer.

Tuyns, A. J., Péquignot, G., & Jensen, O. M. (1977). Le cancer de l'oesophage en Ille-et-Vilaine en fonction des niveaux de consommation d'alcool et de tabac. Des risques qui se multiplient. Bull Cancer, 64, 45-60.

Zar, J. H. (1996). Biostatistical Analysis. (3rd Ed.) Upper Saddle River, NJ: Prentice Hall.
