Journal of Statistics Education Volume 14, Number 3 (2006), www.amstat.org/publications/jse/v14n3/datasets.winner.html
Copyright © 2006 by Larry Winner, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Kendall’s $\tau$; Matched pairs; Ordinal data; Spearman’s $\rho$; Sports statistics.
The Winston Cup series is currently made up of 36 races per year, with 43 cars competing in each race (both of these numbers have varied over the years). The series generates a rich set of data and possibilities for comparisons among drivers/crews. Students throughout the country have been exposed to NASCAR through national telecasts of races, as well as many promotional activities among the drivers. Interested students could “mine” the data and come up with many questions to answer as well as many ways to graphically describe the data.
The most important outcomes of each race are the driver’s finishing position and prize winnings. The first driver to cross the start/finish line on the last lap of the race wins the race. Once the winner has crossed the finish line (received the checkered flag), all other drivers remaining on the track complete their current lap. Second place goes to the second car on the “lead lap” to cross the finish line. Thus, all cars on the lead lap that cross the finish line will have completed the maximum number of laps. The finishing positions are ordinal outcomes as opposed to quantitative outcomes. It is not uncommon for over half the cars to finish within a lap or two of each other while several cars may only complete a handful of laps. Also, in terms of prize money, drivers who finish within a lap of each other may receive vastly different prizes. Students typically are exposed quickly in any introductory course to the concept of interval/ratio scaled outcomes (prize money) and ordinal measures (finish position). They may be shocked to observe the differences in prize money among drivers who finish within very short distances of one another in a long race.
The fact that the same driver/crew teams participate each week (for the most part) allows for some interesting statistical questions to be posed. Unlike golf, where many of the top competitors play selective schedules, virtually all top teams compete in each race of the season. Throughout this paper, we will simply identify the driver/crew team by the driver.
Instructors may assign their students many different events for which to obtain (empirical) probabilities, and give them the opportunity to think through the steps needed to obtain conditional probabilities of interest. Further, students may be exposed to many applications of Bayes’ Rule, and obtain various probability distributions. An in-lab computer project is supplied that lets students obtain various probabilities by sifting through raw data with a spreadsheet. Similar projects could be used to obtain any number of descriptive statistics or graphical displays. Students may use statistical methods to explore the growth of NASCAR’s popularity based on payouts to drivers, make head-to-head comparisons of drivers’ racing skills, and compute and compare rank correlation coefficients between starting and finishing positions. The racing skills of pairs of drivers can be compared by using simple methods of categorical data analysis for matched pairs, such as McNemar’s Test (McNemar, 1947). While many students may not have been exposed to this method, they typically have compared pairs of means and proportions for independent samples, and means based on paired samples; this gives instructors a means to fill in a void, using a very simple statistical test applied to real data. By the nature of the ranking of the starting and finishing positions, students can compare two rank correlation coefficients: Spearman’s $\rho$ (Spearman, 1904) and Kendall’s $\tau$ (Kendall, 1938). While most methods courses describe Spearman’s measure, many students have not been exposed to Kendall’s method, and this gives them an opportunity to compare two “competing” measures. The datasets afford instructors and students a wide range of possibilities to apply methods of descriptive and inferential statistics over various levels of sophistication (all of which are becoming more accessible to students, at least conceptually and computationally).
The driver dataset nascard.dat.txt contains the finishing and starting positions for each driver in 898 Winston Cup races between 1975 and 2003. Also included are the driver’s name, prize winnings for that race, number of laps completed, and car make. The dataset contains 34,884 observations at the driver level. Note that prize winnings are not necessarily monotone decreasing in finish position.
A second dataset, nascarr.dat.txt, contains race-specific characteristics. It contains the series race (1,…,898), year (1975-2003), race within the year, number of cars, total payout, Spearman’s $\rho$, Kendall’s $\tau$, track length, laps completed, road track indicator, number of caution periods, number of lead changes, time to complete race, Consumer Price Index (CPI-U) for the month of the race, latitude/longitude coordinates, and track name. Note that the race distance can be obtained by multiplying the number of laps by the track length (this allows for races that were shortened due to weather). Also, average speed for the winning driver can be obtained by dividing miles by completion time (minutes) and multiplying by 60 (minutes/hour).
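The two derived quantities described above can be sketched in a few lines of Python. The function names and the example race are hypothetical and are not rows from the dataset; only the arithmetic (laps times track length, and miles per minute times 60) comes from the text.

```python
def race_distance_miles(laps_completed, track_length_miles):
    """Race distance = laps completed by the winner x track length (miles)."""
    return laps_completed * track_length_miles


def average_speed_mph(laps_completed, track_length_miles, winning_time_minutes):
    """Winner's average speed: miles per minute, times 60 minutes per hour."""
    miles = race_distance_miles(laps_completed, track_length_miles)
    return miles / winning_time_minutes * 60.0


# Hypothetical race (illustrative values only): 200 laps on a 2.5-mile
# track, completed in 200 minutes.
distance = race_distance_miles(200, 2.5)
speed = average_speed_mph(200, 2.5, 200.0)
print(distance, speed)
```

Using the laps actually completed, rather than the scheduled laps, is what lets the same calculation handle weather-shortened races.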
Data were obtained from the NASCAR website (www.nascar.com), as well as racing-reference.com. These websites give information on all Winston Cup races between 1975 and 2003, and beyond. Information regarding the tracks participating over this period was obtained from these websites, and from web searches for tracks not currently participating in Winston Cup racing. Due to rapidly changing corporate sponsorship, race names are not included in the datasets.
Students may be asked to manage (sort, select special cases within, and/or create new variables from) the datasets to obtain the specified probabilities, or the instructor may prepare worksheets or a program to select the cases for the students to obtain relevant probabilities. We feel that a combination of students doing the operations and instructor pre-preparation of some datasets will maximize the opportunities to obtain a wide range of probabilities. We have recently attempted this in a small computer lab with 24 students working in pairs, but there is no reason it could not be given as home exercises in larger classes (assuming availability of software). The students’ thought processes in determining the order of sorting to get the appropriate conditional probabilities were interesting to observe. The Appendix contains the in-lab project assignment/instructions.
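For instructors who prefer a script to a spreadsheet, the "select the cases, then count" logic behind an empirical conditional probability might look like the following minimal sketch. The records are made-up (race, start, finish) triples, and the helper names are hypothetical.

```python
def conditional_probability(rows, event, given):
    """Empirical P(event | given): among rows where `given` holds,
    the proportion where `event` also holds."""
    conditioning = [row for row in rows if given(row)]
    if not conditioning:
        raise ValueError("conditioning event never occurs in the data")
    return sum(1 for row in conditioning if event(row)) / len(conditioning)


# Hypothetical (race, start, finish) records for four 4-car races.
rows = [
    (1, 1, 1), (1, 2, 3), (1, 3, 2), (1, 4, 4),
    (2, 1, 3), (2, 2, 1), (2, 3, 2), (2, 4, 4),
    (3, 1, 4), (3, 2, 2), (3, 3, 1), (3, 4, 3),
    (4, 1, 2), (4, 2, 1), (4, 3, 4), (4, 4, 3),
]


def started_first(row):
    return row[1] == 1


def finished_top_two(row):
    return row[2] <= 2


# The two conditional probabilities differ because the conditioning sets differ.
p_top2_given_pole = conditional_probability(rows, finished_top_two, started_first)
p_pole_given_top2 = conditional_probability(rows, started_first, finished_top_two)
print(p_top2_given_pole, p_pole_given_top2)
```

Swapping the roles of the two events changes the denominator, which is the point students tend to discover when deciding how to sort the data.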
|Year||Payout ($1Ms)||% Change||Adjusted Payout ($1Ms)|
Figure 1: Total Payout by Year (Millions of Dollars adjusted to 1982-1984).
Table 1 and Figure 1 depict the rapid growth in popularity of NASCAR over the past 30 years as measured by total payout. An average annual growth rate can be computed from the values in the % change column by taking the geometric mean of the growth rates, where the growth rate is computed from the multiplier 1 + G_i = 1 + (% change/100) for each of the n years:
$$\bar{G} = \left[\prod_{i=1}^{n} (1 + G_i)\right]^{1/n} - 1$$
Thus the average annual growth rate (obtained in the manner that an average rate of return is computed in finance) is 15.7%. Students could be asked to compute these for adjusted dollars, or different races, or for individual drivers. Also, students can compare the geometric mean with the arithmetic mean or the median, which are less appropriate for describing growth rates.
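A small sketch of the geometric-mean calculation, next to the arithmetic mean it is being contrasted with. The two percent changes are hypothetical and were chosen so the difference is obvious: a payout that doubles and then halves ends where it started.

```python
import math


def geometric_mean_growth(pct_changes):
    """Average growth rate (%) from the geometric mean of 1 + (% change / 100)."""
    multipliers = [1.0 + pct / 100.0 for pct in pct_changes]
    return (math.prod(multipliers) ** (1.0 / len(multipliers)) - 1.0) * 100.0


def arithmetic_mean_growth(pct_changes):
    """Plain average of the percent changes, shown for comparison."""
    return sum(pct_changes) / len(pct_changes)


# Hypothetical % changes: +100% followed by -50% returns to the start.
pct_changes = [100.0, -50.0]
geo = geometric_mean_growth(pct_changes)
arith = arithmetic_mean_growth(pct_changes)
print(geo, arith)
```

Here the geometric mean reports no net growth, while the arithmetic mean suggests 25% per year.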
Students can estimate trend lines for the payouts, assuming linear and exponential growth models, and compare their fits (and may be disappointed by both). Correlation and regression are being taught earlier in introductory statistics courses, and many students are now being exposed to these methods prior to basic probability (e.g. Moore and McCabe, 2006). Most statistical software packages have options to fit these models. Students may be asked to describe the relationship conceptually, since it does not appear to fit either model well; each places severe restrictions on the form of growth (a combination of the two seems to fit well visually).
While virtually every statistics textbook covers comparisons of two means and proportions for independent samples and comparisons of means for paired samples, most do not fill in an obvious hole: comparisons of proportions for paired samples. A very simple procedure can be used to test for differences in proportions (McNemar, 1947). We describe the test and confidence interval that instructors can easily introduce to their students through this data.
For any pair of drivers (say A and B), we have a starting order (A ahead of B or B ahead of A) and a finishing order. If Driver A starts and finishes ahead of (or behind) B, then they completed the race in the same order. Because starting order generally represents the cars’ levels of performance for that weekend, we can’t say anything about the two drivers’ relative performances based on starting position. However, if Driver A starts ahead of B, and B finishes ahead of A, we might surmise that Driver B outperformed A in that race (or at least covered more ground). Likewise, if A started behind, but finished ahead of B, we could say A outperformed B.
Students can conduct a test to compare proportions based on matched pairs (see Agresti, 2002, Chapter 10 or Agresti, 1996, Chapter 9). The basic idea is to set up a 2x2 table with the driver who started the race ahead forming rows and the driver who finished ahead in the columns. Table 2 shows the general form and notation.
|Start||A ahead||B ahead||Total|
|A ahead||nAA||nAB||nA+|
|B ahead||nBA||nBB||nB+|
|Total||n+A||n+B||N|
Students can test whether the two drivers’ race abilities differ, where $\pi_{A+}$ is the (true) probability that A starts ahead of B and $\pi_{+A}$ is the probability that A finishes ahead. Defining $\delta = \pi_{+A} - \pi_{A+}$, we could say that the two drivers’ racing skills are equal if $\delta = 0$, that is, the probability that driver A beats B is equal to the probability that driver A starts ahead of B. If $\delta > 0$, then driver A tends to outperform B on the track; if $\delta < 0$, B outperforms A.
The following statistic can be used to test whether $\delta = 0$ (see Agresti, 2002, p. 411 or Agresti, 1996, p. 228), where $n_{AB}$ is the number of races in which A started ahead but B finished ahead, and $n_{BA}$ is the number in which B started ahead but A finished ahead:
$$z = \frac{n_{BA} - n_{AB}}{\sqrt{n_{BA} + n_{AB}}}$$
This test statistic is the signed square root of McNemar’s chi-square statistic (McNemar, 1947). For large samples, this statistic is approximately normal. An exact test can be conducted based on the binomial distribution, where $n_{BA}$ is distributed Binomial with $n = n_{BA} + n_{AB}$ and $p = 0.5$ under the hypothesis of no driver skill difference. Based on the normal approximation, values of $z$ above $z_{\alpha/2}$ are evidence in favor of A being the better of the two drivers in racing conditions; values less than $-z_{\alpha/2}$ provide evidence that B is better.
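A minimal sketch of both versions of the test. It assumes the statistic takes the form z = (nBA - nAB) / sqrt(nBA + nAB), with nBA counting races where A started behind B but finished ahead, so that positive values favor A. The counts are hypothetical, and doubling the upper binomial tail is one common convention for a two-sided exact p-value.

```python
import math


def mcnemar_z(n_ab, n_ba):
    """Signed square root of McNemar's chi-square (positive values favor A)."""
    return (n_ba - n_ab) / math.sqrt(n_ba + n_ab)


def exact_two_sided_p(n_ab, n_ba):
    """Exact binomial p-value: under no skill difference, n_ba is
    Binomial(n_ab + n_ba, 0.5). The tail beyond the larger count is doubled."""
    n = n_ab + n_ba
    larger = max(n_ab, n_ba)
    upper_tail = sum(math.comb(n, x) for x in range(larger, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)


# Hypothetical discordant counts for one pair of drivers.
n_ab, n_ba = 30, 50
z = mcnemar_z(n_ab, n_ba)
p_exact = exact_two_sided_p(n_ab, n_ba)
print(round(z, 3), round(p_exact, 4))
```

Only the discordant races (where starting and finishing order disagree) enter either calculation.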
This allows for students to make use of multiple comparisons as well. Suppose they would like to make pairwise head-to-head comparisons among k drivers. Then, they can see they will be making pairwise comparisons among C = k(k - 1)/2 pairs of drivers. If they wish to keep the experimentwise error rate at level $\alpha$, they can use Bonferroni’s (conservative) method, and make each individual comparison at level $\alpha/C$.
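The Bonferroni-adjusted two-sided critical value can be obtained from the standard normal quantile function; the sketch below uses `statistics.NormalDist` from the Python standard library. The function name is hypothetical.

```python
from statistics import NormalDist


def bonferroni_critical_value(k_drivers, alpha=0.05):
    """Number of pairs C = k(k-1)/2 and the two-sided critical value
    when each of the C comparisons is made at level alpha / C."""
    n_pairs = k_drivers * (k_drivers - 1) // 2
    z_crit = NormalDist().inv_cdf(1.0 - alpha / (2 * n_pairs))
    return n_pairs, z_crit


n_pairs, z_crit = bonferroni_critical_value(5)
print(n_pairs, round(z_crit, 2))  # 10 2.81
```

With k = 5 drivers there are 10 pairs, so each two-sided test uses a tail area of 0.05/20 = 0.0025.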
We demonstrate by making pairwise comparisons among the following set of drivers: Dale Earnhardt (Sr.), Jeff Gordon, Darrell Waltrip, Terry Labonte, and Bill Elliott. Students can be assigned different pairs of drivers, or choose pairs of drivers they are familiar with, and be asked to conduct the test for their pair(s). First, students must obtain datasets containing all races for each driver, then merge (side-by-side) the datasets for each pair by race, including only races that both drivers competed in. Also, note that start and finish variables must be labeled differently for the 2 drivers (e.g. startde, finishde, startjg, and finishjg when comparing Dale Earnhardt and Jeff Gordon). This gives students a challenging problem in managing and combining large datasets (without the risk of permanently damaging or destroying them). Table 3 gives the results for all C = 10 pairs of drivers. The critical value, based on Bonferroni’s method, with $\alpha$ = 0.05 is $z_{\alpha/(2C)} = z_{0.0025}$ = 2.81. Also included are simultaneous 95% confidence intervals for the differences, $\delta$, of the form $d \pm z_{\alpha/(2C)} \widehat{SE}(d)$. The estimate d of $\delta$ and its estimated standard error can be computed as follows, where $N = n_{AA} + n_{AB} + n_{BA} + n_{BB}$ is the number of races in which both drivers competed (see e.g. Agresti, 2002, pp. 410 - 411 or Agresti, 1996, pp. 227 - 229, although the notation for standard error is given in different forms):
$$d = \frac{n_{BA} - n_{AB}}{N}, \qquad \widehat{SE}(d) = \frac{1}{N}\sqrt{(n_{AB} + n_{BA}) - \frac{(n_{BA} - n_{AB})^2}{N}}$$
Note that the standard error for the confidence interval does not place the constraint that the true proportions are equal, and is more complicated than that for the test. Students may be asked how this is analogous to the case for independent samples.
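The merge-then-count workflow and the interval calculation can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce Table 3: each driver's results are held in a hypothetical dictionary keyed by race, the estimate is taken as d = (nBA - nAB)/N, and the standard error is taken as sqrt(nAB + nBA - (nBA - nAB)^2 / N) / N, where N is the number of shared races. All counts and positions are made up.

```python
import math


def paired_table(results_a, results_b):
    """Cross-classify shared races by who started ahead (first letter)
    and who finished ahead (second letter).
    results_*: dict mapping race id -> (start position, finish position)."""
    counts = {"AA": 0, "AB": 0, "BA": 0, "BB": 0}
    for race in sorted(results_a.keys() & results_b.keys()):
        start_a, finish_a = results_a[race]
        start_b, finish_b = results_b[race]
        started = "A" if start_a < start_b else "B"
        finished = "A" if finish_a < finish_b else "B"
        counts[started + finished] += 1
    return counts


def difference_ci(counts, z_crit):
    """Estimate d, its standard error, and the interval d +/- z_crit * SE."""
    n_races = sum(counts.values())
    diff = counts["BA"] - counts["AB"]
    d = diff / n_races
    se = math.sqrt(counts["AB"] + counts["BA"] - diff ** 2 / n_races) / n_races
    return d, se, (d - z_crit * se, d + z_crit * se)


# Hypothetical results; race 4 is dropped because only driver B entered it.
driver_a = {1: (3, 1), 2: (5, 10), 3: (2, 4), 5: (8, 2)}
driver_b = {1: (7, 5), 2: (9, 3), 3: (1, 2), 4: (4, 4), 5: (3, 6)}
counts = paired_table(driver_a, driver_b)

# Hypothetical larger table, with the Bonferroni critical value 2.81.
d, se, ci = difference_ci({"AA": 100, "AB": 30, "BA": 50, "BB": 20}, z_crit=2.81)
print(counts, round(d, 3), round(se, 4), ci)
```

Intersecting the race keys plays the role of the side-by-side merge that keeps only races both drivers competed in.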
We make the following conclusions (with an experimentwise Type I error rate of 0.05):
We can summarize the results by ordering the drivers and joining pairs of drivers who do not differ significantly with lines.
JG BE DW TL DE
|Driver A||Driver B||nAA||nAB||nBA||nBB||95% CI for $\delta$|
We treat Spearman’s $\rho$ for a race as a random variable in the sense that today’s race is one realization of a conceptual population of races that could have been run. This quantity has been computed in the nascarr.dat.txt dataset, but can be directly computed from the full dataset nascard.dat.txt. Students can compute this statistic on their own and also observe the empirical distribution of this statistic in repeated samples. Further, students may try to “explain” the variation in this measure (and Kendall’s $\tau$ below) by fitting a regression model, relating the correlation measure(s) to: track length, numbers of laps, caution flags, lead changes, and drivers. Students can be challenged by asking which of these predictors should be used in the model if the goal is to predict the measure before the race begins. They may also compare the fits of the two models.
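For students computing the rank correlation themselves, a minimal sketch follows. It uses the standard no-ties formula for Spearman's coefficient, 1 - 6 * sum(d_i^2) / (n(n^2 - 1)), where d_i is the difference between a driver's starting and finishing positions; the assumption is that this matches the definition used for the dataset. The positions are hypothetical.

```python
def spearman_rho(start, finish):
    """Spearman's rank correlation for two rankings of 1..n without ties."""
    n = len(start)
    sum_sq_diff = sum((s - f) ** 2 for s, f in zip(start, finish))
    return 1.0 - 6.0 * sum_sq_diff / (n * (n ** 2 - 1))


# Hypothetical five-car race: starting and finishing positions per driver.
start = [1, 2, 3, 4, 5]
finish = [2, 1, 3, 5, 4]
rho = spearman_rho(start, finish)
print(round(rho, 3))
```

Because starting and finishing positions are already ranks, no ranking step is needed before applying the formula.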
Kendall’s $\tau$ has also been computed for each race. For a given race, there are n(n - 1)/2 pairs of drivers. Beginning with the driver who finished first, we count how many drivers started behind him/her, then we proceed to driver 2, and see how many drivers that finished behind him/her started behind him/her and so on (Kendall 1938, Kendall and Gibbons, 1990). The total count will be called k. Thus, if a driver who won the race had started first, (s)he would contribute n - 1 to k, while if a driver who won had started last, (s)he would contribute 0 to k. Then, for a race with n drivers, we have:
$$\tau = \frac{4k}{n(n-1)} - 1$$
Note that if the drivers end in the exact order they start, k = (n - 1) + (n - 2) + ... + 1 = (n - 1)n/2 and Kendall’s $\tau$ takes on the value 1; similarly, if drivers perfectly reversed their order it would take on the value -1.
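The counting procedure translates directly into code. The sketch below assumes the coefficient is obtained from the count k as 4k / (n(n - 1)) - 1, which gives 1 when k = n(n - 1)/2 and -1 when k = 0, in agreement with the two limiting cases in the text. The input lists each finisher's starting position in finishing order; the example values are hypothetical.

```python
def kendall_tau(starts_in_finish_order):
    """Kendall's tau from the count k: for each finisher, the number of
    drivers finishing behind who also started behind (no ties assumed)."""
    n = len(starts_in_finish_order)
    k = 0
    for i, start_i in enumerate(starts_in_finish_order):
        k += sum(1 for start_j in starts_in_finish_order[i + 1:] if start_j > start_i)
    return 4.0 * k / (n * (n - 1)) - 1.0


# Hypothetical race: the winner started 2nd, the runner-up started 1st, etc.
tau = kendall_tau([2, 1, 3, 5, 4])
print(round(tau, 3))
```

For this example k = 8 of the 10 possible pairs, so the value is 4(8)/20 - 1 = 0.6.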
Tests of independence between starting and finishing position can be conducted based on both Spearman’s $\rho$ and Kendall’s $\tau$. The test statistics (based on no ties among the starting or finishing positions) are:
$$z_\rho = \rho\sqrt{n-1}, \qquad z_\tau = \frac{\tau}{\sqrt{\dfrac{2(2n+5)}{9n(n-1)}}}$$
Both statistics are approximately standard normal for large samples when there is no association. If we use these to test for each race whether there is a positive association between starting and finishing position, based on the $\alpha$ = 0.05 significance level (concluding there is a positive association if $z \ge z_{0.05}$ = 1.645), we obtain the results in Table 4.
|Spearman (rows) / Kendall (columns)||Positive Association||No Association|
|Positive Association||637 (70.9%)||12 (1.3%)|
|No Association||12 (1.3%)||237 (26.4%)|
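The per-race tests behind a table like this one can be sketched as below. The code assumes the usual large-sample forms: the Spearman coefficient multiplied by sqrt(n - 1), and the Kendall coefficient divided by sqrt(2(2n + 5) / (9n(n - 1))). Whether these match the exact forms used to build Table 4 is an assumption, and the race values are hypothetical.

```python
import math
from statistics import NormalDist


def z_spearman(rho, n):
    """Large-sample z statistic for Spearman's coefficient (no ties)."""
    return rho * math.sqrt(n - 1)


def z_kendall(tau, n):
    """Large-sample z statistic for Kendall's coefficient (no ties)."""
    return tau / math.sqrt(2.0 * (2 * n + 5) / (9.0 * n * (n - 1)))


# One-sided critical value at the 0.05 level (about 1.645).
critical = NormalDist().inv_cdf(0.95)

# Hypothetical race with 43 cars.
n, rho, tau = 43, 0.40, 0.28
z_rho = z_spearman(rho, n)
z_tau = z_kendall(tau, n)
print(round(z_rho, 2), round(z_tau, 2), z_rho >= critical, z_tau >= critical)
```

Applying both functions to every race and cross-tabulating the two decisions reproduces the layout of the table.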
Thus, the two tests virtually always give the same conclusion regarding association between starting and finishing positions. Note that students could apply McNemar’s test here to determine whether one measure is more/less likely to conclude there is a positive association than the other.
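As a quick check of that suggestion, the two discordant counts in Table 4 (12 races where only Spearman's test found a positive association, and 12 where only Kendall's did) can be plugged into the signed square root form of McNemar's statistic. The variable names are hypothetical.

```python
import math

# Discordant cells of Table 4.
spearman_only = 12  # Spearman: positive association, Kendall: none
kendall_only = 12   # Kendall: positive association, Spearman: none

z = (spearman_only - kendall_only) / math.sqrt(spearman_only + kendall_only)
print(z)
```

Because the discordant counts are equal, the statistic is exactly zero, so these data give no indication that either measure is more likely than the other to signal a positive association.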
A plot of Spearman’s $\rho$ and Kendall’s $\tau$ across time is given in Figure 2, where we combine the measures over each year, treating races as blocks (Taylor, 1987). We average the measures over each year with weights equal to the number of cars in the race. Note that the level of correlation between starting and finishing positions appears to have dropped off quite a bit since the mid 1990s, possibly due to increased competition among teams and more money being spent on equipment as the payouts have grown. Students may think of alternative explanations of this and further investigate it, as many rule changes and changes in equipment have been made over the years.
Figure 2: Plot of (Weighted) Average Rank Correlations versus Year.
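The yearly weighted averages plotted in Figure 2 could be computed along the following lines, with each race's correlation weighted by its number of cars. The tuple layout and the three example races are hypothetical.

```python
from collections import defaultdict


def weighted_yearly_average(races):
    """Average a per-race measure within each year, weighting by field size.
    races: iterable of (year, number_of_cars, correlation) tuples."""
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for year, n_cars, corr in races:
        weighted_sum[year] += n_cars * corr
        total_weight[year] += n_cars
    return {year: weighted_sum[year] / total_weight[year] for year in sorted(weighted_sum)}


# Hypothetical races: (year, cars, rank correlation).
races = [(1975, 30, 0.50), (1975, 40, 0.40), (1976, 36, 0.30)]
yearly = weighted_yearly_average(races)
print(yearly)
```

The same function works for either rank correlation, so both series in the figure can be produced from one pass each over the race-level file.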
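As a result, we have 898 pairs $(\rho_i, \tau_i)$. Students can compute Pearson’s product moment coefficient of correlation as (where $\bar{\rho}$ and $\bar{\tau}$ are the sample means for each measure):
$$r = \frac{\sum_{i=1}^{898} (\rho_i - \bar{\rho})(\tau_i - \bar{\tau})}{\sqrt{\sum_{i=1}^{898} (\rho_i - \bar{\rho})^2 \sum_{i=1}^{898} (\tau_i - \bar{\tau})^2}}$$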
For this series, the correlation coefficient is r = .9908. Thus (not surprisingly) there is a strong correlation between these two rank correlation coefficients. A scatterplot of the rank correlations is given in Figure 3. While these measures are based on different criteria, their levels are very highly correlated across samples. Students could empirically obtain the sampling distribution of the correlation coefficient r when the correlation is high, by taking many random samples of races and observing the distribution of the sample values.
Figure 3: Plot of Spearman’s Rho versus Kendall’s Tau.
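A minimal sketch of the resampling exercise: compute Pearson's r from paired values, then repeat the calculation on many random subsets of races to see how r varies from sample to sample. The eight pairs of values, the sample size, and the seed are all hypothetical stand-ins for the 898 race-level pairs.

```python
import math
import random


def pearson_r(x, y):
    """Pearson's product moment correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def resampled_r(x, y, sample_size, n_samples, seed=0):
    """Pearson's r in repeated random samples (without replacement) of races."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        idx = rng.sample(range(len(x)), sample_size)
        values.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    return values


# Hypothetical per-race Spearman and Kendall values.
rho_values = [0.10, 0.22, 0.31, 0.38, 0.45, 0.52, 0.60, 0.71]
tau_values = [0.07, 0.15, 0.20, 0.27, 0.31, 0.37, 0.42, 0.52]
r_all = pearson_r(rho_values, tau_values)
r_samples = resampled_r(rho_values, tau_values, sample_size=5, n_samples=200)
print(round(r_all, 4), round(min(r_samples), 4), round(max(r_samples), 4))
```

A histogram of `r_samples` would show the skewed, bunched-near-one shape that is typical when the underlying correlation is high.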
In this paper, we have introduced datasets containing results from all NASCAR races from 1975-2003 inclusive at the driver and race levels. Examples have been chosen to demonstrate activities for students that involve: obtaining basic and conditional probabilities; describing growth in payouts in real terms, percent changes, and adjusted terms; learning to conduct a simple test for proportions based on paired samples; and making use of the ordinality of start and finish positions to work with two measures of rank correlation. A series of other potential applications is also offered to instructors and students. We feel with the growing popularity of NASCAR among both males and females, these datasets would be of interest to statistics, economics, and math instructors and their students.
The file nascarr.dat.txt is a text file containing 898 rows. Each row corresponds to a particular race in a particular year. The file nascarr.txt is a documentation file describing the variables.
Variable layout for nascard.dat.txt (driver level):
|Columns||Variable||Format|
|1 - 3||Series Race||1, 2, ... ,898|
|6 - 9||Year||1975, ..., 2003|
|12 - 13||Race/Year||Format F2.0|
|16 - 17||Finishing Position||Format F2.0 (1=Winner)|
|20 - 21||Starting Position||Format F2.0|
|24 - 26||Laps Completed||Format F3.0|
|29 - 35||Winnings||Format F7.0 (In dollars)|
|38 - 39||Number of cars in race||Format F2.0|
|42 - 50||Car Make||String of Length 9|
|53 - 82||Driver||String of Length 30|
Variable layout for nascarr.dat.txt (race level):
|Columns||Variable||Format|
|1 - 3||Series Race||1, 2, ... ,898|
|6 - 9||Year||1975, ..., 2003|
|12 - 13||Race/Year||Format F2.0|
|16 - 17||Number of cars in race||Format F2.0|
|20 - 26||Total race payout||Format F7.0|
|29 - 33||Monthly CPI-U||Format F5.2|
|36 - 42||Spearman’s $\rho$||Format F7.4|
|45 - 51||Kendall’s $\tau$||Format F7.4|
|54 - 58||Track Length||Format F5.3 (miles)|
|61 - 63||Laps Completed by winner||Format F3.0|
|66||Road Indicator||1=Road Course, 0=Loop|
|69 - 70||Caution Flags||Format F2.0|
|73 - 74||Lead Changes||Format F2.0|
|78 - 83||Winning Time||Format F6.2 (minutes)|
|86 - 90||Track Latitude||Format F5.2|
|93 - 98||Track Longitude||Format F6.2|
|101 - 103||Track Code||String of length 3|
|106 - 141||Track Name||String of length 36|
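Since both files are fixed-width text, a record can be read by slicing on the column positions. The sketch below does this for the driver-level layout (the first table above, ending with Driver in columns 53 - 82). The field names are hypothetical, the sample line is fabricated for illustration, and it is an assumption that the unused columns between fields are blank.

```python
def to_int(raw):
    """Convert a numeric field to int, tolerating a trailing decimal point."""
    return int(float(raw))


# (name, first column, last column, converter); columns are 1-based, inclusive.
DRIVER_FIELDS = [
    ("series_race", 1, 3, to_int),
    ("year", 6, 9, to_int),
    ("race_in_year", 12, 13, to_int),
    ("finish", 16, 17, to_int),
    ("start", 20, 21, to_int),
    ("laps", 24, 26, to_int),
    ("winnings", 29, 35, to_int),
    ("n_cars", 38, 39, to_int),
    ("car_make", 42, 50, str),
    ("driver", 53, 82, str),
]


def parse_driver_line(line):
    """Slice one fixed-width record into a dictionary of typed values."""
    record = {}
    for name, first, last, convert in DRIVER_FIELDS:
        raw = line[first - 1:last].strip()
        record[name] = convert(raw)
    return record


# Fabricated record laid out on the documented columns (not real data).
sample_line = (
    f"{1:3d}  {1975:4d}  {1:2d}  {1:2d}  {3:2d}  {191:3d}  "
    f"{12345:7d}  {35:2d}  {'Dodge':<9s}  {'Example Driver':<30s}"
)
record = parse_driver_line(sample_line)
print(record)
```

The race-level file can be handled the same way by swapping in a field list built from the second table.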
Worksheet 1: All races held from 1975-2003 (race level data):
Probability there were less than or equal to 3 caution flags ________________
Probability there were more than 10 lead changes _____________________
Probability there were at least twice as many lead changes as cautions ____________
Probability the average speed was over 150 miles per hour __________________
Probability the average speed was below 100 miles per hour _________________
Worksheet 2: All drivers participating in Daytona 500 races from 1975-2003
Probability the driver who started first finished first _______________________
Probability the driver who started first finished in top ten _______________________
Probability the driver who finished first started first _______________________
Probability the driver who finished first started in top ten _______________________
Worksheet 3: All Drivers starting first and second (side-by-side format)
Probability that driver starting first beats driver starting second _______
Probability that the first driver beats the second given track length is 1 mile ________
Probability that the first driver beats the second given track length is 2.0 miles ____________
Probability driver starting first drove a Ford _________________
Probability driver starting first drove a Chevy ____________________
Worksheet 4: All Drivers finishing first and second (side-by-side format)
Probability that driver finishing first started ahead of driver finishing second _______
Probability that event described above occurred given race length 350 miles _______
Probability driver finishing first drove a Ford _________________
Probability driver finishing first drove a Chevy ____________________
Notes on variables:
Race Length (Miles)= Laps completed by winner x Track length
Completion Time is measured in minutes, divide by 60 to change to hours
Use these to compute speeds in miles per hour
Agresti, A. (1996), An Introduction to Categorical Data Analysis, New York: Wiley.
Agresti, A. (2002), Categorical Data Analysis, 2nd Ed., Hoboken, New Jersey: Wiley.
Fotheringham, A.S., Brunsdon, C., and Charlton, M. (2000), Quantitative Geography, London: Sage.
Kendall, M.G. (1938), “A New Measure of Rank Correlation,” Biometrika, 30, 81-93.
Kendall, M. and Gibbons, J.D. (1990), Rank Correlation Methods, 5th Ed., London: Edward Arnold.
Mansfield, E. (1999), Managerial Economics, 4th Ed., New York: W.W. Norton.
McNemar, Q. (1947), “Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages,” Psychometrika, 12, 153-157.
Moore, D.S. and McCabe, G.P. (2006), Introduction to the Practice of Statistics, 5th Ed., New York: W.H. Freeman.
NASCAR Record & Fact Book (2004 Ed.), St. Louis, MO.: Sporting News Books.
Spearman, C. (1904), “The Proof and Measurement of Association Between Two Things,” American Journal of Psychology, 15, 72-101.
Taylor, J.M.G. (1987), “Kendall’s and Spearman’s Correlation Coefficients in the Presence of a Blocking Variable,” Biometrics, 43, 409-416.
Department of Statistics
University of Florida
Gainesville, FL 32611-8545