Michael Kahn

Wheaton College

Journal of Statistics Education Volume 13, Number 2 (2005), www.amstat.org/publications/jse/v13n2/datasets.kahn.html

Copyright © 2005 by Michael Kahn, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.

**Key Words:**Forced expiratory volume; Semester-long discussions; Statistics education.

Statisticians know that good analyses require time and effort to learn about numerous issues concerning the science and the
data. There are too many disheartening stories of scientists who took a typical introductory statistics class, often their
only formal exposure to statistics, who later in their career seek statistical help thinking A) *this must be an easy
question*, meaning they believe there is a menu option in their favorite software for obtaining the p-value they wish to
report, or B) *modeling will not take long once the data are cleaned up*. These scientists sadly learned from their
introductory statistics courses that statistical analyses are quick (as quick as the next assignment's due date) and easy
to do once the data are ready and the appropriate methods and software menu options are identified. This article discusses
the use of a scientific problem whose statistical analyses take time and consideration while remaining immediately
accessible to students at all levels and providing relevant motivation for their attention.

Rosner is a coauthor on the two Tager, et al, manuscripts cited above. In his book, Rosner (1999) presents data for yet another analysis from the data in the Tager studies. In the problem presented in Rosner the investigation concerns measuring the subjects' respiratory function, as well as whether the subjects themselves smoke. This new data analysis, which is the one students are asked to carry out, is not a longitudinal study of second-hand smoking effects, but a cross-sectional subset from the Tager studies for investigating the relationship between subjects FEV and their current smoking status. In this problem the measured outcome of interest is forced expiratory volume (FEV), which is, essentially, the amount of air an individual can exhale in the first second of a forceful breath. The data recorded in the dataset include the following: FEV (liters), AGE (years), HEIGHT (inches), GENDER (M/F), SMOKE (Y/N). (Note that an unencumbered version of this dataset was given in Kahn (2003).)

The dataset is appropriate for teaching statistics at many levels, but most especially introductory statistics, as it provides an accessible, human context for students along with enough statistical substance to provide a lively context for discussions on a variety of general issues. These include:

- Cross-sectional studies versus longitudinal studies
- Observational studies (self-selection) versus randomized trials (random assignment)
- Self-reported (survey) data
- Statistical adjustment for confounding factors (post-sampling) versus adjustment at the time of sampling (weighted, stratified, matched, case-control sampling)
- Appropriate use of data subsets
- Graphical methods for investigating multivariate relationships (simple methods ala Minitab © graphics versus more sophisticated methods such as Trellis graphics (see R Development Core Team, 2004, or Becker, et al, 1996))

After starting to understand what is being investigated, students typically ask questions about the variables contained in this dataset. How were these measurements taken and, for future studies, what might be measured to get a better indication of any effects (i.e. to lower nuisance and/or unexplained variation in the subjects' FEV measurements)? After this they have some idea of what FEV is and how it is measured. Presumably, height and gender are straightforward to measure, but what about smoking? How is it measured? How might it be measured? How is it coded in this dataset? What are some alternatives to this coding? Note that even though the Tager articles consider longitudinal analyses of both direct and second-hand smoking effects on children, the data used in this paper simply consider whether the child reports smoking cigarettes regularly. If the child smokes, then SMOKE is coded 1 and if the child does not smoke, then SMOKE is coded 0.

At this point summaries of the data might be in order. The following numerical summaries of FEV by SMOKE indicate some interesting features.

Variable | smoke | N | Mean | Median | TrMean | StDev |

fev | No | 589 | 2.5661 | 2.4650 | 2.5229 | 0.8505 |

Yes | 65 | 3.2769 | 3.1690 | 3.2666 | 0.7500 | |

Variable | smoke | SE Mean | Minimum | Maximum | Q1 | Q3 |

fev | No | 0.0350 | 0.7910 | 5.7930 | 1.9195 | 3.0490 |

Yes | 0.0930 | 1.6940 | 4.8720 | 2.7770 | 3.7680 |

Figure 1 shows side-by-side boxplots of FEV for smokers and non-smokers and exhibits these summaries more dramatically. In particular, the summaries indicate that, on average, smokers have higher FEV (i.e. stronger lung function) than nonsmokers.

Figure 1. Two-sample comparison of FEV for smokers and nonsmokers.

Of course, the issues mentioned earlier must be addressed. First, this is an observational study where the subjects self-select the smoking or nonsmoking groups. Do we really believe that smoking, in and of itself, is associated with stronger lung function? Second, subjects self-report smoking status. How trustworthy is this variable? Third, how long or hard could a 10-year-old have smoked? Fourth, is FEV adjusted for body size? Once the students think about these sorts of issues they are asked to consider other variables' relationships to FEV and whether these relationships confound or modify the relationship between smoking and FEV. This leads to discussions of various conditional summary statistics and, most effectively, some multivariable summaries and models. Figure 2 and Figure 3 provide two interesting examples.

Figure 2. Comparison of FEV for smokers and nonsmokers, accounting for age.

Figure 3. Comparison of FEV for smokers and nonsmokers, accounting for height.

Notice the use of a scatterplot smoother, namely *lowess*
(Cleveland, 1979). Smoothers are relatively old tools that ought to be
standard parts of introductory statistics courses by now, even if the methods' subtleties cannot be explored in detail.
Virtually no introductory (even few upper-level undergraduate) texts consider these to be as worthwhile as simple
scatterplots and automated computation of correlation coefficients. This seems a regressive mindset since including some
notion of trend in the scatterplot allows one to assess the “linearity” of the co-relationship (which is what the
correlation coefficient is intended to summarize) and avoids letting the users' eyes simply imagine the possible
relationship. Smoothers, such as lowess, typically yield richer interpretations of the conditional average value of the
response as a function of the explanatory input(s) than the correlation coefficient (or a simple, linear regression line).

Briefly, note that Figure 2 provides graphic contrast to Figure 1. Though there is much statistical inference work to be done, Figure 2 now suggests that older subjects are associated with larger FEV values (no big surprise) and if we compare, say, the 15-year-old smokers' FEV values to the 15-year-old nonsmokers' FEV values, we see that, generally, the smokers' have lower FEV values than non-smokers. Further, this difference appears to be larger (bigger FEV values for nonsmokers) as we consider such conditional analyses for older subjects. Though it may be beyond the scope in a typical introductory course, the plot suggests a smoking-by-age interaction.

Interestingly, Figure 3 provides yet another contrast from Figure 1 and Figure 2. First, in contrast to the story from Figure 2, Figure 3 suggests that, generally, the difference between the smokers' and nonsmokers' FEV values appears nominal when accounting for height. Further, the average FEV value as a function of height appears curved, possibly quadratic.

It should be noted that students rightfully and inevitably begin discussions of considering subsets of the data (“do we really need to consider 5-year-olds?”). Given well-known stories such as the analyses of the Challenger's risk of O-ring failure as a function of temperature, this becomes an interesting discussion. (Recall that in a comparison of temperature at liftoff with O-ring failures, a subset of launches for which there were zero O-ring failures was omitted when analyzing the data from shuttle launches prior to the Challenger's liftoff, leading to a considerably different conclusion than when these data were included.) Some students wish to omit the youngest subjects arguing that they are irrelevant to any analyses concerning smoking. Other students wish to retain all subjects countering that these subjects provide important information concerning the relationship between FEV and the other subject-specific measurements.

There are also opportunities for meaningful discussions about other topics. For example, many students want to describe the relationships observed in this cross-sectional study in terms of a single individual over time (see discussions of the regression models below) so that discussions of longitudinal versus cross-sectional studies are timely. Also, one can discuss adjusting FEV for age or height before assessing the relationship between FEV and smoking. The study design for these data requires that the adjustment be made post-sampling, through the use of regression models, but can elicit discussion concerning how to control for factors such as age, gender or height by the use of different sampling schemes. For example, one might imagine matching smokers and nonsmokers on age and gender, possibly even height, using a matched pairs design.

Once the principle scientific questions, the data production and the summaries of these data are considered, a reasonable story can be told as to what the data suggest regarding the scientific questions of interest. But, with some basic models (i.e. introductory-level models) one can do better. First, consider a regression model to adjust for the effects of age, height, or gender before assessing the relationship between smoking and FEV. For example, we see that a polynomial (e.g. quadratic) regression model in height fits FEV well. We might also consider interactions between any of these variables. In fact, the scatterplot in Figure 2 suggests a gentler slope for the smokers, as well as a larger intercept, so that a smoking-by-age interaction term appears warranted. All of these models lead to better estimates for quantifying the adjusted relationship between smoking and FEV. The results are sensible and yield more precise quantitative interpretations then any of the more naďve summaries can possibly yield.

Consider a couple of possible models. First, just to show that we can use regression to do a two-sample t-test (even if it is inappropriate for drawing inference in this observational study),

The regression equation is

fev = 2.57 + 0.711 smoke

Predictor | Coef | SE Coef | T | P |

Constant | 2.56614 | 0.03466 | 74.04 | 0.000 |

smoke | 0.7107 | 0.1099 | 6.46 | 0.000 |

S = 0.8412 | R-Sq = 6.0% | R-Sq(adj) = 5.9% |

Recall that the variable smoke is 0 for non-smokers and 1 for smokers. Taken at face value, this model indicates that, on
average, smokers have about a 0.7 liter larger FEV than nonsmokers (95% CI for the difference in means is approximately 0.5
to 0.9 liters). Note that R^{2} is 0.06. We leave it to the reader to fit regression models for FEV on AGE, FEV on
HT and FEV on GENDER. The respective R^{2} values for these three models are 0.572, 0.754 and 0.043. It makes sense that
height, a surrogate for body-size, is most highly associated with FEV, a surrogate for lung capacity. Further note that as
Figure 3 suggests, a quadratic regression of FEV on height and height2 fits even better.

In accordance with Figure 2, we fit a model that adjusts for AGE.

The regression equation is

fev = 0.367 + 0.231 age - 0.209 smoke

Predictor | Coef | SE Coef | T | P |

Constant | 0.36737 | 0.08144 | 4.51 | 0.000 |

age | 0.230605 | 0.008184 | 28.18 | 0.000 |

smoke | -0.20899 | 0.08075 | -2.59 | 0.010 |

S = 0.5651 | R-Sq = 57.7% | R-Sq(adj) = 57.5% |

This provides an estimate of the average difference between smokers' and nonsmokers' FEV conditional on AGE. For smokers and nonsmokers of the same age we expect, on average, that the nonsmokers' FEV is .2 liters larger than the smokers' FEV; a 95% confidence interval for the mean improvement in FEV, conditional on age, is approximately 0.05 to 0.37 liters.

The regression equation is

fev = 6.89 + 0.07 age + 0.10 sex - 0.27 ht + 0.003 ht.sq - 0.13 smoke

Predictor | Coef | SE Coef | T | P |

Constant | 6.895 | 1.499 | 4.60 | 0.000 |

age | 0.069465 | 0.009109 | 7.63 | 0.000 |

sex | 0.09454 | 0.03286 | 2.88 | 0.004 |

ht | -0.27423 | 0.04968 | -5.52 | 0.000 |

ht-sq | 0.0031251 | 0.0004086 | 7.65 | 0.000 |

smoke | -0.13321 | 0.05711 | -2.33 | 0.020 |

S = 0.3951 | R-Sq = 79.4% | R-Sq(adj) = 79.2% |

Recall that sex is coded 0 for females and 1 for males. This model assesses the expected difference between smokers' and nonsmokers' FEV conditional on their age, gender and height, using a quadratic polynomial in height. The effect size and its variance are somewhat smaller than in the model that assesses this effect solely conditional on AGE. Questions arise about other lurking variables and the extent to which they might modify the relationship between FEV and smoking.

Again, the fact that age is a significant confounding variable often leads to a mistaken longitudinal interpretation (i.e. based on model 2, for every one-year increase in a child's age you would expect about a 0.2 liter increase in FEV). This is a useful place to reemphasize the difference between cross-sectional and longitudinal studies and to require an accurate (if a bit pedantic) interpretation of the age effect (i.e. a person from this population one-year older than another person from this population is expected to have 0.2 liters higher FEV). (As a side note, in Tager, et al (1983) there is a lovely discussion of a longitudinal analysis that, indeed, provides a longitudinal estimate of the increase in a child's FEV over a 1-year period. Together these settings give real substance and context for the contrast between longitudinal and cross-sectional studies.)

Finally, I will show the results of two models that include a smoking-by-age interaction, leaving various interpretations for readers to develop. Simply note that in the presence of gender and height (entered quadratically), the smoking-by-age interaction term is not significant.

The regression equation is

fev = 0.253 + 0.243 age + 1.94 smoke - 0.163 age.smoke

Predictor | Coef | SE Coef | T | P |

Constant | 0.25340 | 0.08265 | 3.07 | 0.002 |

age | 0.242558 | 0.008332 | 29.11 | 0.000 |

smoke | 1.9436 | 0.4143 | 4.69 | 0.000 |

age*smoke | -0.16270 | 0.03074 | -5.29 | 0.000 |

S = 0.5537 | R-Sq = 59.4% | R-Sq(adj) = 59.2% |

The regression equation is

fev = 7.06 + 0.0745 age + 0.0979 sex - 0.280 ht + 0.00315 ht-sq + 0.256 smoke
- 0.0295 age.smoke

Predictor | Coef | SE Coef | T | P |

Constant | 7.060 | 1.504 | 4.69 | 0.000 |

age | 0.074508 | 0.009920 | 7.51 | 0.000 |

sex | 0.09789 | 0.03295 | 2.97 | 0.003 |

ht | -0.27954 | 0.04983 | -5.61 | 0.000 |

ht-sq | 0.0031536 | 0.0004090 | 7.71 | 0.000 |

smoke | 0.2555 | 0.3089 | 0.83 | 0.408 |

age*smoke | -0.02949 | 0.02303 | -1.28 | 0.201 |

S = 0.3949 | R-Sq = 79.4% | R-Sq(adj) = 79.3% |

As can be seen from model 4, when just considering age and smoking, there is indeed a significant interaction between the two, yielding a steeper slope in the relationship between FEV and age for nonsmokers than for smokers (as we saw in Figure 2). However, model 5 indicates that the age-by-smoking interaction is not significant in the presence of height (entered quadratically).

None of these models is perfect. Readers are encouraged to fit other models and analyse these data further. In an upper-level class, one might even make use of these data to discuss multicollinearity, selection criteria such as AIC and BIC, or even model averaging techniques (Raftery, Madigan and Hoeting 1997).

Columns | Label | Description |

1-3 | age | discrete, positive integer (years) |

5-10 | fev | continuous measure (liters) |

12-15 | ht | continuous measure (inches) |

19 | sex | discrete, Female coded 0, Male coded 1 |

25 | smoke | discrete, Nonsmoker is 0,Smoker is 1 |

Cleveland, W.S. (1979), “Robust locally weighted regression and smoothing scatterplots,” *Journal of the American
Statistical Association*, 74, 829-836.

Dalal, S., Fowlkes E., and Hoadley B. (1989), “Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure,”
*Journal of the American Statistical Association*, 74, 945-957.

Kahn, M. (2003), “Data Sleuth, *STATS*, 37, 24.

Minitab Release 13.31, Minitab Inc., State College, PA.

R Development Core Team (2004), “R: A language and environment for statistical computing,” www.R-project.org.

Raftery, A.E., Madigan, D. and Hoeting, J.A. (1997), “Bayesian Model Averaging for Regression Models”, *Journal of the
American Statistical Association*, 92, 179-191.

Rosner, B. (1999), *Fundamentals of Biostatistics*, 5th ed., Pacific Grove, CA: Duxbury.

Ruppel, G. (1997), *Manual of Pulmonary Function Testing*, 7th ed., St. Louis, MO: Mosby Year Book.

Tager, I., Weiss, S., Munoz, A., Rosner, B., and Speizer, F. (1983), “Longitudinal Study of the Effects of Maternal Smoking
on Pulmonary Function,” *New England Journal of Medicine*, 309(12), 699-703.

Tager, I., Weiss, S., Rosner, B., and Speizer, F. (1979), "Effect of Parental Cigarette Smoking on the Pulmonary Function
of Children," *American Journal of Epidemiology*, 110(1), 15-26.

Michael Kahn

Department of Mathematics and Computer Science

Wheaton College

Norton, MA

U.S.A.
*mkahn@wheatonma.edu*

Volume 13 (2005) | Archive | Index | Data Archive | Information Service | Editorial Board | Guidelines for Authors | Guidelines for Data Contributors | Home Page | Contact JSE | ASA Publications