Journal of Statistics Education v.2, n.1 (1994)
Copyright (c) 1994 by James A. Hanley and Stanley H. Shapiro, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Experiment; Longevity; Analysis of covariance; Regression; Precision; Survival analysis.
This dataset contains observations on five groups of male fruitflies -- 25 fruitflies in each group -- from an experiment designed to test if increased reproduction reduces longevity for male fruitflies. (Such a cost has already been established for females.) The five groups are: males forced to live alone, males assigned to live with one or eight interested females, and males assigned to live with one or eight non-receptive females. The observations on each fly were longevity, thorax length, and the percentage of each day spent sleeping. The structure of the experiment provokes lively discussion on experimental design and on contrasts, and gives students opportunities to understand and verbalize what we mean by the term "statistical interaction." Because the variable thorax length has a strong effect on survival, it is important to take it into account to increase the precision of between-group contrasts, even though it is distributed similarly across groups. The dataset can also be used to illustrate techniques of survival analysis.
1 Data projects designed to give students experience with multiple regression and allied techniques often involve so many variables that some of the basic ideas in analysis of variance and covariance are overlooked. This dataset, which we assembled entirely from three panels in Figure 2 of an article entitled "Sexual Activity and the Lifespan of Male Fruitflies" (by Linda Partridge and Marion Farquhar, published in Nature, 1981), has allowed students to `stay with the basics' and learn, by doing, the principles involved in these analysis techniques. Below we describe the background to the data and the main uses we have made of the dataset.
2 A cost of increased reproduction in terms of reduced longevity has been shown for female fruitflies, but not for males. The flies used in this study were an outbred stock. Sexual activity was manipulated by supplying individual males with one or eight receptive virgin females per day. The longevity of these males was compared with that of two control types. The first control consisted of two sets of individual males kept with one or eight newly inseminated females. Newly inseminated females will not usually remate for at least two days, and thus served as a control for any effect of competition with the male for food or space. The second control was a set of individual males kept with no females. There were 25 males in each of the five groups, which were treated identically in number of anaesthetizations (using CO2) and provision of fresh food medium.
3 `Compliance' of the males in the two experimental groups was documented as follows: On two days per week throughout the life of each experimental male, the females that had been supplied as virgins to that male were kept and examined for fertile eggs. The insemination rate declined from approximately 7 females/day at age one week to just under 2/day at age eight weeks in the males supplied with eight virgin females per day, and from just under 1/day at age one week to approximately 0.6/day at age eight weeks in the males supplied with one virgin female per day. These `compliance' data were not supplied for individual males, but the authors say that "There were no significant differences between the individual males within each experimental group."
4 One of us came upon the article in Nature and was attracted by the way the raw data were presented in classical analysis of covariance style in Figure 2. There were three panels, each one with thorax length on the x-axis and longevity on the y-axis. Panel A showed the data points for the 25 flies in the `live alone' group; Panel B showed the 25 data points for those living with `1 interested partner' and the 25 data points for those living with `1 not interested' partner; Panel C showed corresponding data for the flies living with eight partners. Panels B and C each showed a pair of parallel lines corresponding to the two groups displayed. We read the data points from the x-y graphs and brought them to the attention of a colleague with whom one of us was teaching the applied statistics course. The colleague thought that with only three explanatory variables (THORAX, plus PARTNERS and TYPE to describe the five groups), it would not be challenging enough as a data-analysis project; he suggested adding another variable. We added SLEEP, a variable not mentioned in the published article. Teachers can contact us about the construction of this variable. (We prefer to divulge the details of this variable only after students have completed their analysis.)
5 For each fruitfly, the dataset contains two variables that indicate to which of the five groups he was randomly assigned: PARTNERS is a numerical variable denoting the number of companions (0, 1 or 8), while TYPE indicates whether the companion was a newly pregnant female, a virgin female, or non-existent (if PARTNERS=0). The two covariates are THORAX, the length of the thorax in mm, and SLEEP, the percentage of each day spent sleeping. Lastly, the response variable LONGEVITY is the lifespan, in days.
6 This has been the most successful and the most memorable dataset we have used in an "applications of statistics" course, which we have taught for ten years. This is a graduate level course for epidemiology and biostatistics students (and occasional statistics students). The students have already had a sequence of two or three biostatistics courses dealing with first principles and with regression methods for continuous and categorical response data. The main activity in the course is the analysis of datasets. Depending on class size, students work in groups of two or three; each group analyzes a different dataset. Weekly classes are devoted to a discussion of work in progress and a review of plans for the next week. We insist that each group present its thinking and its results to all students, and all students are expected to contribute to a discussion of what to do next.
7 Even before we begin any analysis, there is usually lively discussion about the primary contrast. The five fruitfly groups and their special structure allow opportunities for students to understand and verbalize what we mean by the term "statistical interaction."
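One way to write down the contrasts that tend to come up in this discussion is sketched below. The notation is assumed here for illustration and is not taken from the original article: $\mu_{0}$ is the mean longevity of males living alone, and $\mu_{C1}$, $\mu_{E1}$, $\mu_{C8}$, $\mu_{E8}$ are the means for males kept with one or eight newly inseminated (C) or virgin (E) females.

```latex
\begin{align*}
\delta_1 &= \mu_{E1} - \mu_{C1}
  && \text{effect of sexual activity with one partner} \\
\delta_8 &= \mu_{E8} - \mu_{C8}
  && \text{effect of sexual activity with eight partners} \\
\delta_8 - \delta_1 &= (\mu_{E8} - \mu_{C8}) - (\mu_{E1} - \mu_{C1})
  && \text{interaction: does the effect depend on the number of partners?}
\end{align*}
```

On this additive scale, "no statistical interaction" corresponds to $\delta_8 - \delta_1 = 0$. Comparisons such as $\mu_{C1} - \mu_{0}$ would instead address the effect of companionship itself.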
8 The most common analysis techniques have been analysis of variance, analysis of covariance (for those who distinguish analysis of covariance from regression), and multiple regression. Because THORAX is such a strong predictor (it explains about 1/3 of the variance in LONGEVITY), it should be included in the model to increase the precision of between-group contrasts. When students first check and find that the distributions of thorax length, and in particular the mean thorax length, are very similar in the different groups, many of them are willing to say (in epidemiological terminology) that THORAX is not a `confounding' variable, and that it can be omitted from the analysis. We stress the importance of removing the noise added by this variable by asking students to consider only the "Experimental 1" and "Control 1" groups (males kept with one virgin female and with one newly inseminated female, respectively), where the difference in mean longevity of around eight days is not statistically significant at the conventional 0.05 level. However, when the variation in longevity associated with THORAX is accounted for, the difference is significant at the 0.01 level. The sharper p-value results mainly from the increased precision (lower SE) for the adjusted difference, and only slightly from the fact that the adjusted difference is somewhat larger than the unadjusted one.
9 We use this opportunity to get students to actually examine the difference in THORAX lengths between the groups and to calculate the adjustment. We find that students in epidemiology are very aware of the possibilities of biased comparisons when using largely non-experimental data. As a result, they tend to think that when covariates are closely balanced (as they are here in this randomized trial), they do not need to consider them further. Thus, they are inclined to overlook the use of covariates for `noise reduction'. See Hanley (1983) for a more detailed use of this dataset to explain the dual uses of regression methods to make comparisons both `fairer' and `sharper'.
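The adjustment and its effect on the standard error can be sketched for two groups and one covariate using the textbook analysis-of-covariance formulas. This is a minimal illustration, not the analysis the students ran. It assumes a common within-group slope and equal residual variance, the function name `two_group_ancova` is hypothetical, and the numbers are made up rather than taken from the fruitfly data.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def two_group_ancova(x1, y1, x2, y2):
    """Compare two groups on y, with and without adjusting for covariate x.

    Assumes parallel within-group regression lines and a common
    residual variance (the usual analysis-of-covariance assumptions).
    """
    n1, n2 = len(y1), len(y2)
    mx1, mx2, my1, my2 = mean(x1), mean(x2), mean(y1), mean(y2)

    # Pooled within-group sums of squares and cross-products.
    sxx = sum((x - mx1) ** 2 for x in x1) + sum((x - mx2) ** 2 for x in x2)
    syy = sum((y - my1) ** 2 for y in y1) + sum((y - my2) ** 2 for y in y2)
    sxy = (sum((x - mx1) * (y - my1) for x, y in zip(x1, y1))
           + sum((x - mx2) * (y - my2) for x, y in zip(x2, y2)))

    # Unadjusted comparison: difference in means with pooled-variance SE.
    diff_raw = my1 - my2
    se_raw = sqrt(syy / (n1 + n2 - 2) * (1 / n1 + 1 / n2))

    # Adjusted comparison: remove the part of y that is explained by x.
    slope = sxy / sxx
    diff_adj = diff_raw - slope * (mx1 - mx2)
    resid_var = (syy - sxy ** 2 / sxx) / (n1 + n2 - 3)
    se_adj = sqrt(resid_var * (1 / n1 + 1 / n2 + (mx1 - mx2) ** 2 / sxx))

    return {"diff_raw": diff_raw, "se_raw": se_raw,
            "diff_adj": diff_adj, "se_adj": se_adj, "slope": slope}


# Made-up numbers (NOT the fruitfly data): 25 values per group, identical
# covariate distributions, a steep slope, and a 7-unit group difference.
x = [0.64 + 0.01 * i for i in range(25)]
noise_a = [(7 * i) % 5 - 2 for i in range(25)]
noise_b = [(3 * i) % 5 - 2 for i in range(25)]
y_a = [-50 + 130 * xi + 7 + e for xi, e in zip(x, noise_a)]
y_b = [-50 + 130 * xi + e for xi, e in zip(x, noise_b)]

result = two_group_ancova(x, y_a, x, y_b)
print(result)
```

Because the covariate is perfectly balanced in this toy example, the adjusted and unadjusted differences coincide, and the whole gain from adjustment shows up as a smaller standard error. That is the `noise reduction' role of a covariate described above.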
10 One very observant student (now a professor) argued that THORAX cannot be used as a predictor or explanatory variable for the LONGEVITY outcome since fruitflies who die young may not be fully grown, i.e., it is also an `intermediate' variable. One Ph.D. student who had studied entomology assured us that fruitflies do not grow longer after birth; therefore, the THORAX length is not time-dependent!
11 There is also much debate as to whether one should take the SLEEP variable into account. Some students say that it is an intermediate variable. Some students formally test the mean level of SLEEP across groups, find one pair where there is a statistically significant difference, and want to treat it as a confounding variable. A few students muse about how it was measured.
12 There is considerable heteroscedasticity in the LONGEVITY variable. We discuss whether the heteroscedasticity introduces bias and whether it leads to incorrect estimates of the precision of fitted coefficients or of individual variability.
13 Students have recently analyzed the data using techniques from survival analysis. Some students have not considered survival analysis techniques because there are no censored observations. We point out that censoring is not a prerequisite; in fact, we have used the data to illustrate several techniques in survival analysis. (One could easily devise a way to introduce censoring of LONGEVITY if one wished.) It is important to point out that the standard error of the estimated coefficient representing a between-group contrast in a logistic or lifetable regression is not necessarily reduced (and certainly not dramatically so) when one adds a strong covariate like THORAX.
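To illustrate the point that censoring is not a prerequisite, here is a small product-limit (Kaplan-Meier) sketch. With no censored observations it reduces to the empirical proportion still alive after each death time. The helper `censor_at` shows one simple way censoring could be introduced, by cutting off follow-up at a fixed time. Both function names and the example lifetimes are hypothetical and are not drawn from the dataset.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where a death occurred.

    times  -- observed follow-up times
    events -- 1 if the death was observed at that time, 0 if censored
    """
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve


def censor_at(times, cutoff):
    """Administratively censor every lifetime longer than `cutoff`."""
    new_times = [min(t, cutoff) for t in times]
    events = [1 if t <= cutoff else 0 for t in times]
    return new_times, events


# Made-up lifetimes, all deaths observed.
lifetimes = [2, 3, 3, 5]
full_curve = kaplan_meier(lifetimes, [1, 1, 1, 1])

# The same lifetimes with follow-up stopped at time 4.
cut_times, cut_events = censor_at(lifetimes, 4)
cut_curve = kaplan_meier(cut_times, cut_events)
```

In the censored version the last fly contributes to the risk sets up to time 4 but adds no step to the curve, so the estimate stops at the last observed death.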
14 The file fruitfly.dat.txt contains the raw data. The file fruitfly.txt is a documentation file containing a brief description of the dataset.
Columns   Description
-------   -----------
 1- 2     Serial No. (1-25) within each group of 25
          (the order in which data points were abstracted)
 4        Number of companions (0, 1, or 8)
 6        Type of companion
            0: newly pregnant female
            1: virgin female
            9: not applicable (when PARTNERS=0)
 8- 9     Lifespan, in days
11-14     Length of thorax, in mm (x.xx)
16-17     Percentage of each day spent sleeping
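A minimal sketch of reading records laid out in these columns, assuming the data file follows the fixed-width layout exactly. The function names and the field name `ID` are hypothetical, and the other field names follow the variable names used in the text. The sample record is made up to match the layout and is not copied from the file.

```python
def parse_fruitfly_line(line):
    """Split one fixed-width record into named fields.

    Python slices are 0-based and end-exclusive, so columns 1-2
    correspond to line[0:2], column 4 to line[3:4], and so on.
    """
    return {
        "ID": int(line[0:2]),          # columns 1-2: serial number in group
        "PARTNERS": int(line[3:4]),    # column 4: number of companions
        "TYPE": int(line[5:6]),        # column 6: type of companion (0, 1, 9)
        "LONGEVITY": int(line[7:9]),   # columns 8-9: lifespan in days
        "THORAX": float(line[10:14]),  # columns 11-14: thorax length in mm
        "SLEEP": int(line[15:17]),     # columns 16-17: percent of day asleep
    }


def read_fruitfly(path):
    """Parse every non-blank line of a file in the layout above."""
    with open(path) as handle:
        return [parse_fruitfly_line(row) for row in handle if row.strip()]


# A made-up record built to match the column layout.
sample = f"{1:2d} {8:1d} {0:1d} {35:2d} {0.64:4.2f} {22:2d}"
record = parse_fruitfly_line(sample)
```

Slicing by column position, rather than splitting on whitespace, keeps the parser tied to the documented layout. If the actual file uses different spacing, a whitespace split would be the simpler choice.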
Hanley, J. A. (1983), "Appropriate Uses of Multivariate Analysis," Annual Review of Public Health, 4, 155-180.
Partridge, L., and Farquhar, M. (1981), "Sexual Activity and the Lifespan of Male Fruitflies," Nature, 294, 580-581.
James A. Hanley and Stanley H. Shapiro
Department of Epidemiology and Biostatistics
1020 Pine Avenue West
Montreal, Quebec, H3A 1A2
tel: +1 (514) 398-6270 (JH)
+1 (514) 398-6272 (SS)
fax: +1 (514) 398-4503