14.1 Distinctions. Identify which of the statements are true and which are false.
14.2 Memory of food intake. Retrospective studies on diet and health often rely on recall of distant dietary histories for at least part of their data. It is well known that the accuracy of such information is suspect. To study this issue, Dwyer et al. (1989) asked middle-aged adults (median age 50) to recall food intakes at ages 5-7 years, 18 years, and 30 years using food frequency questionnaires. This information was then compared to historical information collected during these earlier time periods. Correlation between recalled and historical consumption for foods and food groups rarely exceeded r = 0.3. Based on this information, what do you conclude about the reliability of memory as a means to measure past food intake?
14.3 Doll's ecological study of smoking and lung cancer. In 1955, Richard Doll published an ecological study of smoking and lung cancer. Smoking was measured as per capita cigarette consumption in 1930 (CIG). Lung cancer mortality was measured per 100,000 person-years in 1950 (LUNGCA). Data may be downloaded as doll-ecol.sav and are shown in the table below. (There are 11 observations.)
(A) Construct a scatterplot of the relation between cigarette consumption and lung cancer.
Consider the form, direction, and strength of the relationship. Are
there any outliers? [Hints: 1) Make certain you put the explanatory variable on the horizontal axis.
2) See the APA Publication Guide §3.75 - §3.77 for guidelines on figure production.]
(B) Calculate the correlation coefficient for the problem. Interpret this
statistic.
(C) Test the correlation coefficient for significance. Show all
hypothesis testing steps (null hypothesis statement, test statistic, P
value, conclusion).
(D) Optional: Replicate all analyses in SPSS. Label your output. [Menu choices: Graph >
Scatter and Analyze > Correlate > Bivariate.]
(E) What % of LUNGCA is "explained" by CIG?
i COUNTRY CIG LUNGCA
1 USA 1300 20
2 Great Brit 1100 46
3 Finland 1100 35
4 Switzerland 510 25
5 Canada 500 15
6 Holland 490 24
7 Australia 480 18
8 Denmark 380 17
9 Sweden 300 11
10 Norway 250 9
11 Iceland 230 6
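For readers who want to check hand calculations for parts (B) and (C), the following is a minimal Python sketch, not part of the original exercise. It computes r from sums of squares and cross-products and then the t statistic for testing H0: rho = 0 with df = n - 2. The function names pearson_r and t_statistic are illustrative assumptions. The P value must still be obtained from a t table or statistical software.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient from sums of squares and cross-products."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return ss_xy / math.sqrt(ss_xx * ss_yy)

def t_statistic(r, n):
    """t statistic and degrees of freedom for testing H0: rho = 0."""
    df = n - 2
    return r * math.sqrt(df / (1 - r ** 2)), df

# Data copied from the table above (hypothetical variable names in lowercase).
cig = [1300, 1100, 1100, 510, 500, 490, 480, 380, 300, 250, 230]
lungca = [20, 46, 35, 25, 15, 24, 18, 17, 11, 9, 6]

r = pearson_r(cig, lungca)
t_stat, df = t_statistic(r, len(cig))
print(f"r = {r:.3f}, t = {t_stat:.2f}, df = {df}")
```

The same two functions can be reused for the other exercises in this set by swapping in the relevant data columns.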
14.4 Sodium and blood pressure. Data (n = 10) on daily SODIUM intake (mg) and systolic blood pressure (BP; mm Hg) are stored in na-bp.sav and are shown below.
(A) Which variable is the explanatory variable in this analysis? Which is
the response variable?
(B) Construct a scatter plot of these data. (Use graph paper or a computer
to generate your plot. Axis-labels should conform to APA style guidelines.)
Discuss your plot by considering its form, direction of association, and
strength of association. Are there any outliers?
(C) Compute r. Interpret this statistic.
(D) What % of BP is explained by SODIUM?
(E) Test the correlation for significance. Show all hypothesis testing
steps (null hypothesis, test statistic, P value, conclusion).
i SODIUM BP
1 6.8 154
2 7.0 167
3 6.9 162
4 7.2 175
5 7.3 190
6 7.0 158
7 7.0 166
8 7.5 195
9 7.3 189
10 7.1 186
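Part (D) asks for the percentage of variance in BP "explained" by SODIUM, which is the coefficient of determination r^2 expressed as a percentage. A minimal Python sketch follows, assuming the data are typed in from the table above. The helper name pearson_r is an illustrative assumption, not something defined in the original exercise.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Data copied from the table above.
sodium = [6.8, 7.0, 6.9, 7.2, 7.3, 7.0, 7.0, 7.5, 7.3, 7.1]
bp = [154, 167, 162, 175, 190, 158, 166, 195, 189, 186]

r = pearson_r(sodium, bp)
r_squared = r ** 2  # proportion of variance in BP accounted for by SODIUM
print(f"r = {r:.3f}, r^2 = {r_squared:.3f} ({100 * r_squared:.1f}%)")
```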
14.5 Gravid iguanas. Data on post-partum body weight (kilograms) and the number of eggs produced by gravid iguanas are shown below (Hampton, 1994, p. 157; iguana.sav).
(A) Construct a scatter plot of the data. (Make certain you put the
explanatory variable on the horizontal axis.) Interpret your plot.
(B) Calculate the correlation coefficient. Interpret this statistic.
(C) Test the correlation coefficient for statistical significance.
i WEIGHT EGGS
1 0.90 33
2 1.55 50
3 1.30 46
4 1.00 33
5 1.55 53
6 1.80 57
7 1.50 44
8 1.05 31
9 1.70 60
14.6 Graduation rates at Big Ten universities. The most reliable factors that predict graduation are scholastic aptitude
and motivation. To quantify this, a researcher collects data on many
factors. Data are stored in bigten.sav. Graduation rates
by university (percentage of students graduating within 5
years of entry) are stored in the variable UPERCENT. The average ACT score of incoming freshmen (ACT)
is the predictor variable for this analysis.
(A) Plot these data. Interpret your plot.
(B) Calculate r. Interpret this statistic.
(C) Test it for statistical significance. Interpret your results.
(D) Calculate r^2. What does this tell you about
the variability of graduation rates?
UPERCENT ACT
76.2 27
57.6 24
55.4 24
59.7 23
86.0 28
46.2 22
66.7 23
14.7 Occupational study of smoking and lung cancer. An occupational health study in England looked at the relation between cigarettes smoked and lung cancer mortality in 25 different occupational groups. The explanatory variable (SMOKING) was standardized to 100 when men in the group had typical smoking rates for their age. The response variable was the standardized mortality ratio (SMR) for lung cancer mortality in that occupational group. Data can be viewed at lib.stat.cmu.edu/DASL/Datafiles/SmokingandCancer.html and are stored online in the file occupational_smr.sav.
(A) Plot SMR against SMOKING. Interpret the plot (i.e., consider its form, direction, strength, and if any outliers are present).
(B) Compute the correlation coefficient and interpret this result.
(C) Test the correlation for significance. State the null hypothesis, test statistic, its df, and P value. State your conclusion.
14.8 Maternal mortality and health care during birth. This study explored the relation between the percentage of births attended by physicians, nurses, and midwives (ATTENDED) and maternal mortality per 100,000 live births (MAT_MORT). The values for a random sample of 11 countries are shown below and are stored online in ../datasets/mat_mort.sav. Data are a sample from Pagano & Gauvreau (2000, p. 407), as originally published by the United Nations Children's Fund (1994).
COUNTRY ATTENDED MAT_MORT
Bangladesh 5 600
Chile 98 67
Iran 70 120
Kenya 50 170
Nepal 6 830
Netherlands 100 10
Nigeria 37 800
Pakistan 35 500
Panama 96 60
United States 99 8
Vietnam 95 120
(A) What is the independent variable in this
analysis? What is the dependent variable?
(B) Plot the data as a scatterplot. Interpret what you see (form, direction, strength,
outliers if any). Make certain your plot is accurate and labeled in a way that is kind to your reader.
[You are
encouraged to use computational tools when analyzing your data.]
(C) Calculate r. Interpret this statistic.
(D) Test the correlation for statistical significance. Show all
hypothesis testing steps.
(E) Identify lurking variables that may confound the observed
relationship. Explain how confounding may occur.
14.9 Need and demand for mental health care. This example uses data from an 1854 study of mental health care in the fourteen counties of Massachusetts. The study was conducted by Edward Jarvis, then president of the American Statistical Association. The explanatory variable is the reciprocal of the distance to the nearest mental health care center (REC_DIST; miles^-1). The response variable is the percentage of patients cared for in the home (PHOME). The relation between the percentage of patients cared for at home and distance to the nearest health care center remains important today: it is still recommended that numerous small mental hospitals be erected at scattered locations rather than having one large central facility. [Source: http://lib.stat.cmu.edu/DASL/Stories/lunatics.html and http://lib.stat.cmu.edu/DASL/Datafiles/lunaticsdat.html]
(A) Create a scatterplot of the relation between PHOME and REC_DIST. Describe the relationship. Are there any outliers?
(B) Calculate the correlation coefficient using all 14 data points.
(C) Nantucket is clearly an outlier in this data set. Remove this outlier from the dataset and recalculate the correlation coefficient. Did this improve the "fit" of the correlation model?
COUNTY | PHOME | REC_DIST |
BERKSHIRE | 77.00 | .01031 |
FRANKLIN | 81.00 | .01613 |
HAMPSHIRE | 75.00 | .01852 |
HAMPDEN | 69.00 | .01923 |
WORCESTER | 64.00 | .05000 |
MIDDLESEX | 47.00 | .07143 |
ESSEX | 47.00 | .10000 |
SUFFOLK | 6.00 | .25000 |
NORFOLK | 49.00 | .07143 |
BRISTOL | 60.00 | .07143 |
PLYMOUTH | 68.00 | .06250 |
BARNSTABLE | 76.00 | .02273 |
NANTUCKET | 25.00 | .01299 |
DUKES | 79.00 | .01923 |
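Parts (B) and (C) compare the correlation with and without Nantucket. One way to organize that comparison is sketched below in Python. This is an illustration under the assumption that the table above is typed in as (county, PHOME, REC_DIST) records. The names records, trimmed, and pearson_r are hypothetical and not part of the original exercise.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# (county, PHOME, REC_DIST) copied from the table above.
records = [
    ("BERKSHIRE", 77.0, 0.01031), ("FRANKLIN", 81.0, 0.01613),
    ("HAMPSHIRE", 75.0, 0.01852), ("HAMPDEN", 69.0, 0.01923),
    ("WORCESTER", 64.0, 0.05000), ("MIDDLESEX", 47.0, 0.07143),
    ("ESSEX", 47.0, 0.10000), ("SUFFOLK", 6.0, 0.25000),
    ("NORFOLK", 49.0, 0.07143), ("BRISTOL", 60.0, 0.07143),
    ("PLYMOUTH", 68.0, 0.06250), ("BARNSTABLE", 76.0, 0.02273),
    ("NANTUCKET", 25.0, 0.01299), ("DUKES", 79.0, 0.01923),
]

def correlation(rows):
    """r between REC_DIST (explanatory) and PHOME (response)."""
    return pearson_r([row[2] for row in rows], [row[1] for row in rows])

# Drop the outlying county and recompute on the remaining 13 counties.
trimmed = [row for row in records if row[0] != "NANTUCKET"]

r_all = correlation(records)
r_trimmed = correlation(trimmed)
print(f"all 14 counties: r = {r_all:.3f}")
print(f"without Nantucket: r = {r_trimmed:.3f}")
```

Filtering by county name rather than by row position keeps the comparison correct even if the rows are reordered.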
14.10 Cancer correlates. Statistical packages can calculate correlations for multiple pairings of variables, often reporting the results in a correlation matrix. A correlation matrix reports the correlation coefficient for every pairing of quantitative variables. We are going to create a correlation matrix for the per capita number of cigarettes smoked (sold) in 43 states and the District of Columbia in 1960 and death rates for various forms of cancer. The data, originally from Fraumeni et al. (1968), can be downloaded as an SPSS data set or text file. Use SPSS to calculate correlation coefficients for each variable pairing. Interpret the correlation coefficients. Which cancers are associated with smoking?
Variable | Description |
CIG | cigarettes sold per capita |
BLAD | bladder cancer deaths per 100,000 |
LUNG | lung cancer deaths per 100,000 |
KID | kidney cancer deaths per 100,000 |
LEUK | leukemia cancer deaths per 100,000 |
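The exercise asks for SPSS, but the idea of a correlation matrix can also be sketched in a few lines of Python. The sketch below is an illustration only. Because the Fraumeni data are not reproduced in this handout, the demonstration uses the FAT_CAL and CHD values from exercise 14.11 as stand-in columns. The function name correlation_matrix is an assumption, and the same function would accept the CIG, BLAD, LUNG, KID, and LEUK columns once they are loaded.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return ss_xy / math.sqrt(ss_xx * ss_yy)

def correlation_matrix(columns):
    """Return {row_name: {col_name: r}} for every pairing of the given columns."""
    names = list(columns)
    return {
        row: {col: pearson_r(columns[row], columns[col]) for col in names}
        for row in names
    }

# Stand-in data from exercise 14.11 (the Fraumeni data are not shown here).
columns = {
    "FAT_CAL": [8, 20, 33, 36, 37, 39],
    "CHD": [0.5, 1.4, 3.8, 5.5, 5.7, 7.1],
}

matrix = correlation_matrix(columns)
for row_name, row in matrix.items():
    cells = "  ".join(f"{value:6.3f}" for value in row.values())
    print(f"{row_name:8s} {cells}")
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal need to be interpreted.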
14.11 Atherosclerotic heart disease as a function of fat calories (fat_cal.sav). Following World War II, it became clear that northern European countries with high dietary fat consumption were experiencing notable increases in what was then called degenerative heart disease. Data in this exercise are a fictionalized version of data from early ecological studies reported by Keys (1952, also see EKS p. 195). Data for calories from fat as a % of total calories (FAT_CAL) and CHD mortality per 1000 50- to 59-year-olds are:
COUNTRY | FAT_CAL | CHD |
Japan | 8 | 0.5 |
Italy | 20 | 1.4 |
England | 33 | 3.8 |
Australia | 36 | 5.5 |
Canada | 37 | 5.7 |
USA | 39 | 7.1 |
(A) Which of the variables in this data set is the independent variable? Which is the dependent (response) variable?
(B) Plot the data.
(C) Can the relation be described with a straight line?
(D) ...to be continued...