A random sample of Wake County, North Carolina residential real estate plots

Roger Woodard and Jason Leone
North Carolina State University

Journal of Statistics Education Volume 16, Number 3 (2008), www.amstat.org/publications/jse/v16n3/datasets.woodard.html

Copyright 2008 by Roger Woodard and Jason Leone, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.

Data Description

Name: A random sample of Wake County, North Carolina residential real estate plots
Type: Random Sample
Size: N = 100, 11 variables

Descriptive Abstract:
The information for this data set was taken from a Wake County, North Carolina real estate database. Wake County is home to the capital of North Carolina, Raleigh, and to Cary. These cities are the fifteenth and eighth fastest growing cities in the USA respectively, helping Wake County become the ninth fastest growing county in the country. Wake County boasts a 31.18% growth in population since 2000, with a population of approximately 823,345 residents.

This data includes 100 randomly selected residential properties from the Wake County registry, each identified by its real estate ID number. For each selected property, 11 variables are recorded. These variables include year built, square feet, adjusted land value, and address, among others.

Source:
Wake County, via http://services.wakegov.com/realestate/, on 3-25-08

Variable Descriptions:
ID # - the county-given identification number for the selected plot
Year Built - the listed year in which the structure was built (by year)
Sq. Ft. - the area of the floor plan (in square feet)
Story - how many stories the structure has (in stories)
Acres - how many acres are included in the plot (in acres)
No. Baths - the number of bathrooms at the residence (in bathrooms)
Fireplaces - the number of fireplaces in the residence (in fireplaces)
Total $ - the total assessed value of the property (in dollars)
Land $ - the assessed value of the land (in dollars)
Building $ - the assessed value of the building (in dollars)
Zip - the zip code of the property

Empty cells represent a value not included in the property record

Story Behind the Data:
With Wake County nationally ranked for its growth over recent years, large databases of public data on its properties are becoming more readily available. Dr. Woodard uses these databases in one of the courses he teaches, through a CAUSEweb.org activity, because the many variables listed above lend themselves to correlation analysis. This data set was collected as a tool for showing results and comparing them with results from students' data sets collected in the same manner.

Special Notes:
This data set does not consist of the first 100 randomly generated real estate identification numbers. Approximately 140 numbers were tried in order to obtain this set of 100; the numbers not included corresponded either to non-residential plots or to records that do not exist. The real estate ID numbers, which ranged between approximately 1 and 200000, were randomly generated using Microsoft Excel. All the data were found on the Wake County website and were not altered in any way.
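
The selection procedure described above amounts to drawing random ID numbers and discarding unusable ones until 100 residential records remain. The following is a minimal Python sketch of that idea, not the authors' Excel workflow. The `lookup` function is a hypothetical stand-in for looking up an ID on the county website, `fake_lookup` is invented so the sketch can run, and skipping repeated IDs is an assumption that the source does not mention.

```python
import random


def draw_sample(lookup, target=100, low=1, high=200000, seed=None):
    """Draw random IDs until `target` usable records have been found.

    `lookup` is a hypothetical callable: given an ID it returns a record
    for a residential plot, or None when the ID does not exist or belongs
    to a non-residential plot.
    Returns the accepted records and the number of distinct IDs tried.
    """
    rng = random.Random(seed)
    seen = set()   # every distinct ID tried so far
    sample = {}    # accepted ID -> record
    # Stop early if every possible ID has been tried without success.
    while len(sample) < target and len(seen) < high - low + 1:
        rid = rng.randint(low, high)  # inclusive on both ends
        if rid in seen:
            continue  # assumption: a repeated ID is simply redrawn
        seen.add(rid)
        record = lookup(rid)
        if record is not None:
            sample[rid] = record
    return sample, len(seen)


def fake_lookup(rid):
    """Invented stand-in: accepts about 5 of every 7 IDs (roughly 100 of 140)."""
    return {"id": rid} if rid % 7 < 5 else None


sample, tried = draw_sample(fake_lookup, seed=2008)
print(len(sample), "records kept out of", tried, "IDs tried")
```

In a classroom setting, `lookup` would be replaced by whatever manual or automated step retrieves a property record, and `tried` gives the count to compare with the approximately 140 numbers reported here.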

There is an activity posted on CAUSEweb.org by Dr. Woodard in which students collect their own version of this data set. A PDF version of this activity can be found at http://www.causeweb.org/repository/Realestate/Realestate.pdf.

Pedagogical Notes:
The most prominent statistical feature of this data set is the presence of a natural outlier: the property with real estate ID number 78570. This property is an outlier in two ways, both of which are easy to see graphically, and this helps students visualize the effect an outlier has on regression lines. It includes 39.38 acres, while no other entry has more than 2 acres. The large acreage pushes the land value and total value above 4.75 million dollars, far larger than the values of the other plots. Students can use this outlier to examine the impact of an outlier on regression and on correlation. The students can also be asked to identify the reason or reasons why this entry is an outlier.
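
To illustrate how a single extreme point can change a fitted line and a correlation coefficient, the sketch below fits a least-squares line with and without one outlier. All numbers are made up for illustration and are not rows from the data set; the extra point only mimics the general pattern of a very large plot with a very large total value.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Made-up (acres, total value in dollars) pairs, NOT taken from the data set.
acres = [0.2, 0.3, 0.5, 0.8, 1.0, 1.5]
total = [250000, 150000, 300000, 180000, 220000, 200000]

# The same points plus one made-up extreme property.
acres_with = acres + [40.0]
total_with = total + [5000000]

without_outlier = fit_line(acres, total)
with_outlier = fit_line(acres_with, total_with)

print("without outlier: slope=%.0f intercept=%.0f r=%.3f" % without_outlier)
print("with outlier:    slope=%.0f intercept=%.0f r=%.3f" % with_outlier)
```

With these toy values, the one added point is enough to reverse the sign of the slope and to move the correlation from weak to nearly perfect, which is the kind of contrast students can look for when they refit the real data with and without ID number 78570.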

Simple linear regression can be used to determine which of the variables are good predictors of total value. Students can be asked to graph each variable against total value (for example, square feet versus total value), then examine the correlation coefficient and the fitted regression equation and compare them with those for the other variables. Multiple regression can be used to investigate which sets of variables are good predictors of total value; for example, Year Built, Sq. Ft. and Land $ do quite well when the million dollar homes are removed.
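
For readers who want to see what a multiple regression fit involves, here is a minimal sketch that solves the normal equations in plain Python. It is one possible approach, not the method used by the authors. The rows are made up, and the rescaling (years since 1900, square feet and land value in thousands) is an assumption chosen to keep the arithmetic well behaved. The response is generated from known coefficients so that the fit can be checked.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def multiple_regression(rows, y):
    """Return least-squares coefficients [intercept, b1, b2, ...]."""
    design = [[1.0] + list(r) for r in rows]  # prepend a column of ones
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up predictors, NOT rows from the data set:
# (years since 1900, square feet in thousands, land value in $1000s)
rows = [(95, 1.8, 60), (70, 2.6, 45), (100, 1.2, 80),
        (85, 3.0, 50), (60, 1.5, 70), (105, 2.2, 90)]

# Made-up total values (in $1000s) generated from known coefficients.
true_coefs = [10.0, 2.0, 50.0, 1.5]
y = [true_coefs[0] + sum(b * v for b, v in zip(true_coefs[1:], r)) for r in rows]

coefs = multiple_regression(rows, y)
print([round(c, 6) for c in coefs])
```

Because the toy response has no noise, the recovered coefficients should match the ones used to generate it. With the real data the fit will not be exact, and students would compare models by how well each set of predictors explains total value.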

Link to Data Set: http://www.amstat.org/publications/jse/v16n3/woodard.xls

http://services.wakegov.com/realestate/, on 11-2-08

Roger Woodard, Ph.D.
Undergraduate Program Director
Department of Statistics
North Carolina State University
Raleigh, NC 27695
919 515-1938

Jason Leone
North Carolina State University
Department of Statistics
Raleigh, NC 27695
