NAME: Basketball free throws (sequence of successes and failures)
TYPE: Time series of discrete dichotomous data
SIZE: 200 observations, one variable
DESCRIPTIVE ABSTRACT:
This is a simple sequence of dichotomous observations from a series
of 200 consecutive basketball free throw attempts.
SOURCES:
I collected the data myself.
VARIABLE DESCRIPTIONS:
S = success (free throw made), F = failure (free throw missed)
SPECIAL NOTES:
(none)
STORY BEHIND THE DATA:
I teach a Biostatistics class and I like to occasionally give the
students a data set for statistical analysis before they have learned
the appropriate test. I wanted to obtain a simple data set involving
a time sequence of dichotomous data, because there are a variety of
questions that can be asked of such a data set. In our campus
gymnasium I attempted 200 consecutive basketball free throw shots
under standard conditions (basket rim 18 inches in diameter and 10
feet above the ground; horizontal distance from free throw line to
basket 15 feet; leather basketball 9 inches in diameter). I
recorded whether each shot was a success or failure.
PEDAGOGICAL NOTES:
I use these data for an in-class exercise in my undergraduate
Biostatistics class, about halfway through a 14-week semester.*
The students work in groups of 3 or 4 for a full 50-minute class
period. I give them a 1-page handout that shows the data along with
the following 4 questions to guide them:
1. How might you determine whether there was a pattern to this sequence?
2. What is your null hypothesis?
3. Can you think of a statistic whose value would vary depending on
whether there was a sequential pattern?
4. Is there more than one type of non-random, sequential pattern one
could observe with data like these? (i.e., is there more than one
way the data could deviate from a random sequence?)
I don't need to provide any motivation for the students; they dive
right in. While students work I circulate around the room to field
questions, discuss ideas, and gently steer them down more productive
avenues. I strongly encourage students not to consult their texts
(we use Zar 1998) unless they need to look up a formula or table of
critical values that they are already familiar with. (This is so they
do not immediately discover and apply the runs test before they get
an opportunity to thoroughly think about the data.)
Towards the end of the class session, each group of students briefly
describes the approach they used. Most students do not complete
their analysis during the class period. Typically, students have
tried diverse analyses. Some of the common approaches that students
come up with are:
1. Analyzing whether the data are consistent with a success rate of
50% (using a binomial test, usually with a normal approximation).
Although common, this approach does not address the main questions.
2. Calculating the probability of streaks of a given length (either
makes or misses), again using the binomial distribution. Somehow
they have to relate these individual calculations to the overall
expected distribution under the null hypothesis that each free throw
is independent. This is difficult for them.
3. Analyzing whether the success rate varies in time. Students do
this by breaking up the data into blocks of 10 or 20 observations,
then testing for heterogeneity of successes vs. failures among blocks
(as it turns out, there is significant heterogeneity by a G-test or
chi-square test). This is a peripheral issue but is straightforward
to analyze, and is also interesting to connect to the issue of
streaks.
4. I've never had students come up with the simple concept of runs
(streaks of 1 or more consecutive misses or makes, without
distinguishing between long and short streaks) on their own.
However, students sometimes develop more complicated versions of this
idea, so I can steer them towards the simpler idea of runs, and how
the number of runs would make a convenient test statistic.
5. On rare occasions a student has realized that each pair of
successive observations could be considered as a transition
(make-make, make-miss, miss-make, and miss-miss), and that the
numbers of each type of transition are informative about whether the
data are sequentially independent. Using this method, one constructs
a table of the observed frequencies of the four transitions and
compares it to the expected frequencies under the null hypothesis
that each observation is independent. Thus, this data set can be
used to teach students a little bit about the simplest type of Markov
process.
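Approach 1 above (the binomial test with a normal approximation) can be
sketched in a few lines of Python (my own illustration; the counts used
are hypothetical placeholders, since the raw sequence is not reproduced
in this description):

```python
# Approach 1: normal approximation to the binomial test of H0: p = 0.5.
# The counts used below are hypothetical, NOT the actual dataset.
import math

def binom_z(successes, n, p0=0.5):
    """z statistic for the normal approximation to the binomial test."""
    expected = n * p0
    se = math.sqrt(n * p0 * (1 - p0))
    return (successes - expected) / se

z = binom_z(120, 200)  # hypothetical: 120 makes in 200 attempts
```

As the notes above point out, this tests the overall success rate but
says nothing about sequential pattern.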
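The streak-length question in approach 2 is easier to answer by
simulation than by exact calculation. A minimal Python sketch (again my
own illustration, not part of the original exercise) builds the null
distribution of the longest streak of identical outcomes for a sequence
of independent shots:

```python
import random

def longest_run(seq):
    """Length of the longest streak of identical outcomes in seq."""
    best = cur = 1
    for prev, nxt in zip(seq, seq[1:]):
        cur = cur + 1 if nxt == prev else 1
        best = max(best, cur)
    return best

def longest_run_null(n=200, p=0.5, reps=2000, seed=1):
    """Simulated null distribution of the longest streak for n
    independent shots made with probability p."""
    rng = random.Random(seed)
    return [longest_run(['S' if rng.random() < p else 'F' for _ in range(n)])
            for _ in range(reps)]
```

An observed longest streak can then be compared against this simulated
distribution instead of wrestling with the exact combinatorics.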
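Approach 3 amounts to a chi-square heterogeneity test on a 2 x k table
of successes and failures across blocks. A Python sketch (the sequence
passed in would be the S/F data; the function assumes both outcomes
occur):

```python
def blocks_chi_square(seq, block_size=20):
    """Chi-square statistic for heterogeneity of successes vs. failures
    across consecutive blocks; compare to a chi-square distribution with
    (k - 1) df, where k is the number of blocks.
    Assumes seq contains both 'S' and 'F'."""
    n = len(seq)
    total_s = seq.count('S')
    chi2 = 0.0
    for i in range(0, n, block_size):
        block = seq[i:i + block_size]
        for observed, overall in ((block.count('S'), total_s),
                                  (block.count('F'), n - total_s)):
            expected = overall * len(block) / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2
```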
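The runs test of approach 4 reduces to counting runs and comparing the
count with its expectation under independence. A Python sketch of the
standard normal approximation (symbols 'S' and 'F' as in the data;
assumes both occur):

```python
import math

def runs_test(seq):
    """Number of runs and the normal-approximation z statistic for the
    runs test. Assumes seq contains both 'S' and 'F'."""
    runs = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
    n1 = seq.count('S')
    n2 = len(seq) - n1
    n = n1 + n2
    mean = 1 + 2 * n1 * n2 / n                                  # E[runs]
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))  # Var[runs]
    return runs, (runs - mean) / math.sqrt(var)
```

Too many runs (a large positive z) suggests alternation; too few (a
large negative z) suggests streakiness.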
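Approach 5 can be carried out as an ordinary 2 x 2 chi-square test of
independence on the (current shot, next shot) pairs. A Python sketch of
that tabulation (illustrative only; the statistic is approximate since
successive pairs overlap):

```python
def transition_chi_square(seq):
    """Pearson chi-square statistic (1 df) comparing observed transition
    counts (SS, SF, FS, FF) with those expected if successive shots are
    independent. Assumes both outcomes occur among the pairs."""
    pairs = list(zip(seq, seq[1:]))
    total = len(pairs)
    obs = {(a, b): pairs.count((a, b)) for a in 'SF' for b in 'SF'}
    row = {a: obs[(a, 'S')] + obs[(a, 'F')] for a in 'SF'}
    col = {b: obs[('S', b)] + obs[('F', b)] for b in 'SF'}
    chi2 = 0.0
    for a in 'SF':
        for b in 'SF':
            expected = row[a] * col[b] / total
            chi2 += (obs[(a, b)] - expected) ** 2 / expected
    return chi2
```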
During the following class period, we revisit this data set as I
illustrate the various ways one could analyze it. I present the runs
test, a contingency-table analysis of transition frequencies, and a
contingency-table test for heterogeneity in time. I try to connect
each of these analyses to the approaches students had come up with on
their own during the previous class. This leads to a broader
discussion of the different ways one can analyze certain data sets.
When we discuss the runs test, it also provides an opportunity to
discuss where tables of critical values come from. I use a simple
MATLAB simulation to generate a null distribution of runs using the
same data set, and we find that the observed number of runs is very
close to the median of this null distribution. This provides an
example of the value of randomization approaches, which I like to
emphasize often in my course.
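The MATLAB simulation itself is not reproduced here; the following
Python sketch shows the same randomization idea. Repeatedly shuffling
the observed sequence preserves the numbers of makes and misses while
destroying any sequential pattern, so the resulting run counts form a
null distribution for the runs statistic:

```python
import random

def count_runs(seq):
    """Number of runs (maximal streaks of identical outcomes)."""
    return 1 + sum(a != b for a, b in zip(seq, seq[1:]))

def null_runs(seq, reps=5000, seed=0):
    """Null distribution of the run count, holding the numbers of
    makes and misses fixed and shuffling their order."""
    rng = random.Random(seed)
    pool = list(seq)
    dist = []
    for _ in range(reps):
        rng.shuffle(pool)
        dist.append(count_runs(pool))
    return dist
```

Locating the observed run count within the sorted null distribution
then gives an empirical P value directly, without any table.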
* Note on student background: at this point the students are
familiar with standard introductory concepts in frequentist
statistics, including estimation, sampling distributions, and
hypothesis testing, and have done both parametric and non-parametric
analyses of continuous and discrete data. They have worked with
one-, two-, and paired-sample tests. They are also familiar with the
concept of randomization tests, which I introduce within the first
week of the course because the students have found that it gives them
a much better understanding of what a P value means. Immediately
prior to this exercise, the students have been analyzing data
involving discrete categorical variables, including work with
binomial and Poisson distributions, R X C contingency tables, and
goodness-of-fit tests. We do this exercise before the students have
learned any methods for analyzing temporal sequences of data (e.g.,
runs tests) because I want them to experience this as a novel problem
that they are unfamiliar with.
REFERENCES:
Zar, Jerrold H. 1998. Biostatistical Analysis, 4th ed. Prentice Hall.
SUBMITTED BY:
Stephen C. Adolph
Department of Biology
Harvey Mudd College
301 Platt Blvd.
Claremont, CA 91711 USA
adolph@hmc.edu