Copyright (c) 1993 by Robin H. Lock, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.
Key Words: Cars; Classroom data; Dataset; Introductory statistics.
The 93CARS dataset contains information on 93 new cars for the 1993 model year. Measures given include price, mpg ratings, engine size, body size, and indicators of features. The 26 variables in the dataset offer sufficient variety to illustrate a broad range of statistical techniques typically found in introductory courses.
1 The 1993 New Car data were inspired by a similar dataset for 1989 model cars, which has been included among the sample data for the Student Edition of Execustat (PWS-KENT 1990). We have used Execustat's CARS89 data to demonstrate many points in both introductory and second-level courses in applied statistics. In what follows, we give a brief description of the updated and expanded 93CARS dataset and suggest several ways it might be used in class.
2 Data were obtained from two sources, The 1993 Cars - Annual Auto Issue from Consumer Reports and the PACE New Car & Truck 1993 Buying Guide. Passenger cars or vans that were included in both sources were eligible for selection. A random sample of models given in the PACE Buying Guide was chosen and matched to cars covered in the Consumer Reports issue until a desired sample size of 93 was reached. Vehicles in the Pickup Truck and Sport/Utility category were excluded due to incomplete information in the Consumer Reports article. We also avoided multiple inclusions of cars which were essentially the same model (such as the Dodge Shadow and Plymouth Sundance).
3 Each data case starts with the MANUFACTURER (e.g., Chevrolet, Audi, Honda,...), MODEL (e.g., Caprice, 90, Accord,...), and a TYPE (Small, Sporty, Compact, Midsize, Large, Van). The types were determined by the Consumer Reports classifications. The other 23 variables are all numeric. Three PRICE variables give a "minimum" cost for a basic model, a "maximum" for a model equipped with lots of options, and a "midrange" as the average of the two extremes. The EPA fuel efficiency ratings are given as both CITY and HIGHWAY miles per gallon (MPG).
4 Several measures reflect relative size and power of the standard engine. These include the number of CYLINDERS, engine displacement SIZE (in liters), maximum HORSEPOWER, and the revolutions per minute (RPM) at which the maximum power is achieved. A measure that might be less familiar to most students is the number of REVOLUTIONS of the engine needed for the car to travel one mile in its highest gear (automatic transmission).
5 Indications of each car's size are the LENGTH, WIDTH, WHEELBASE, U-TURN diameter, REAR seat room, LUGGAGE capacity, and size of the FUEL tank. Car weights differed somewhat between the two data sources. We used the WEIGHT given by Consumer Reports which included a full gas tank and air-conditioning, if available.
6 Other variables note the presence of standard AIR BAGS (driver or passenger), the type of DRIVETRAIN (front-wheel, rear-wheel, or all-wheel), and an option for a MANUAL transmission. A final variable categorizes the manufacturer as domestic (U.S.) or foreign, although this distinction is becoming less and less clear.
7 The only missing values are for CYLINDERS in the rotary engine Mazda RX-7, REAR SEAT room for the two-seaters (Corvette and RX-7), and LUGGAGE capacity for the vans and two-seaters.
8 A detailed key to the variables in the file can be found in the Appendix and the 93cars.txt file which is available in the data archives.
9 This is a multi-purpose dataset which can be used at many points in a course. We have often used Execustat's similar CARS89 data as an initial example for demonstrating the statistical package to students in the second week of an introductory course. This class typically is held in a classroom equipped with a computer and projection system, with the instructor "driving" the software. Despite having only studied some descriptive techniques, students are easily drawn into a discussion of the interesting features of the data. They tend to be familiar with most of the variables (and specific car models). They anticipate relationships between the variables, are quick to generate both questions and explanations, and enjoy guessing at the identity of outliers in the plots. Inevitably, the class period ends long before the stream of questions is exhausted.
10 In addition to numerous good numeric variables, the data provide several interesting options for dividing cars into different comparison groups (e.g., by DOMESTIC, TYPE, AIRBAGS, DRIVETRAIN, or MANUAL transmission). Most of our early analyses use only basic summary statistics or graphics, yet a discussion of side-by-side boxplots of highway MPG for domestic vs. foreign manufacturers lays a good foundation for later, more formal, work on testing hypotheses. As those techniques are subsequently developed, we can continue to come back to the car data -- establishing a familiar thread that can run throughout the course. We don't always have to be "on-line" in a computer session to use the data. Often a few summary statistics may be all that are required to motivate an example. Students can also be encouraged to do their own independent explorations.
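The group comparison behind side-by-side boxplots can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the quartile rule (the "inclusive" method in Python's statistics module) is one of several conventions a statistical package might use, and the MPG values are placeholders invented for the example, not values from the 93CARS file.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


def upper_outliers(values):
    """Return the values above the upper fence Q3 + 1.5 * IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = q3 + 1.5 * (q3 - q1)
    return [v for v in values if v > fence]


# Placeholder highway MPG values, NOT taken from the 93CARS data.
groups = {
    "domestic": [24, 25, 26, 27, 28, 29, 30],
    "foreign": [25, 27, 29, 30, 31, 33, 50],
}

for name, mpg in groups.items():
    print(name, five_number_summary(mpg), upper_outliers(mpg))
```

Printing the summaries side by side gives students the same information a pair of boxplots would show, including which cases fall beyond the upper fence.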
11 As one might expect, there are numerous relationships among the variables which provide excellent examples for discussing scatterplots, correlation, and regression techniques. One can easily find pairs of variables which demonstrate strong or weak, positive or negative associations. PRICE and MPG variables tend to be popular choices as "dependent" variables in studying regression models, although students need to exhibit some care in approaching multiple regression situations since many of the potential predictors are often highly correlated among themselves.
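One way to make the warning about correlated predictors concrete is to screen pairs of candidate predictors before fitting a multiple regression. The sketch below computes Pearson correlations and flags pairs above a cutoff. The function names, the 0.8 cutoff, and the data values are assumptions made for illustration; the numbers are placeholders, not values from the 93CARS file.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.8):
    """List predictor pairs whose |r| meets or exceeds the threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson_r(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Placeholder values, NOT taken from the 93CARS data.
predictors = {
    "WEIGHT": [2300, 2700, 3000, 3400, 3900],
    "SIZE": [1.5, 2.0, 2.4, 3.0, 3.8],
    "RPM": [5600, 5200, 6000, 4800, 5400],
}

print(correlated_pairs(predictors))
```

A pair flagged here is a prompt for discussion rather than a rule: students can be asked what happens to the coefficients when both members of the pair are entered in the same model.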
12 We conclude by suggesting some specific ways the data may be used to illustrate certain topics. A little time spent exploring the data will quickly stimulate additional possibilities.
13 Box-whisker plot: PRICE or MPG variables give good examples of somewhat skewed data with potential outliers beyond the upper fences.
14 Small sample confidence interval for a mean: Look at HORSEPOWER or RPM within one TYPE of car. Different students may be assigned different TYPEs.
15 Difference in means between two independent samples: Compare PRICE levels between DOMESTIC and FOREIGN cars. Also watch out for significant differences in the variances between these two groups.
16 One-way ANOVA for difference in means: Check out city MPG ratings between the three DRIVETRAIN categories.
17 Contingency table: Construct and analyze a two-way table of AIRBAGS by FOREIGN/DOMESTIC.
18 Scatterplot: Plot PRICE by MPG. Identify any unusual points.
19 Regression/correlation: REVOLUTIONS per mile tends to be a less familiar variable to the students. Investigate relationships and/or build models for REVOLUTIONS based on a restricted subset of other engine or body size variables.
20 An exam question: Provide computer output for investigating the relationship between number of CYLINDERS and MPG using only 4, 6, and 8 cylinder cars. Have students interpret the results of both a one-way ANOVA and a simple linear regression. Which approach is more appropriate?
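For instructors who want to generate the output for the exam question themselves, the two analyses can be sketched as follows. The function names are hypothetical and the MPG values are placeholders invented for the example, not values from the 93CARS file. The one-way ANOVA treats the cylinder count as an unordered category, while the simple regression treats it as a number and assumes that mean MPG changes linearly with it.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA; groups maps label -> list of values."""
    all_values = [v for g in groups.values() for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = {k: sum(g) / len(g) for k, g in groups.items()}
    ss_between = sum(len(g) * (means[k] - grand_mean) ** 2
                     for k, g in groups.items())
    ss_within = sum((v - means[k]) ** 2
                    for k, g in groups.items() for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def simple_regression(x, y):
    """Least-squares fit of y on x; returns (intercept, slope, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy ** 2 / (sxx * syy)


# Placeholder city MPG values by cylinder count, NOT from the 93CARS data.
mpg_by_cyl = {4: [30, 32, 34], 6: [24, 26, 28], 8: [18, 20, 22]}

# Flatten the groups into paired (cylinders, mpg) observations.
cyl = [c for c, g in mpg_by_cyl.items() for _ in g]
mpg = [v for g in mpg_by_cyl.values() for v in g]

print("ANOVA F:", one_way_anova_f(mpg_by_cyl))
print("Regression (intercept, slope, R^2):", simple_regression(cyl, mpg))
```

Because the regression spends one degree of freedom on a straight-line trend and the ANOVA spends two on arbitrary differences among the three group means, comparing the two outputs gives students a concrete basis for arguing which approach suits the data.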
21 The file 93cars.dat.txt contains the raw data. The file 93cars.txt is a documentation file containing a brief description of the dataset.
PACE New Car & Truck 1993 Buying Guide (1993), Milwaukee, WI: Pace Publications Inc.
Student Edition of Execustat (1990), Boston, MA: PWS-KENT Publishing Co.
Consumer Reports: The 1993 Cars - Annual Auto Issue (April 1993), Yonkers, NY: Consumers Union.
Robin H. Lock
St. Lawrence University
Canton, NY 13617