NAME: msnbclength.dat (Internet Data Analysis for Undergrad Curriculum)

TYPE: Observational

SIZE: 50000 rows, one for each user.

DESCRIPTIVE ABSTRACT: The data set gives a random sample of the lengths of visits by users entering the msnbc.com web site on September 28, 1999. The length of a visit is an estimate of the total number of clicks or pages seen by each user. It is based on web server logs, so it counts only pages recorded by the server. Pages cached in the user's browser or in a caching proxy server are not counted. The data set used in the paper is much larger than the one made available here, but that larger data set is also available from a page cited in the references.

SOURCE: The data were extracted from the clickstream data set in the UCI KDD Archive, which itself comes from Internet Information Server (IIS) logs for msnbc.com and news-related portions of msn.com, processed by Heckerman, 2003. The reader is welcome to request from the authors the Perl program that converts the clickstream data into the length data described here.

VARIABLE DESCRIPTIONS:
Length: Numerical variable summarizing the length of the visit to the msnbc.com site.
There are no missing values.

STORY BEHIND THE DATA: Once a user enters a web site, how many pages or links within the site does that user visit? The answer to this question may suggest actions to improve the site. If similar distributions for the number of pages visited per user are observed at different web sites, then perhaps some general laws can be established for all sites. Research efforts in this area are directed at finding these laws. This is a small part of the current effort to understand human behavior on the web.

PEDAGOGICAL NOTES: The length data set is useful for introducing students to the notion of a skewed distribution with thick tails, where rare events are not so rare. This is a common feature of much Internet data, and it makes the probability distributions we usually teach inappropriate for modeling their behavior.
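As a rough illustration of "rare events are not so rare", the sketch below compares the mean with the median and measures the share of visits above the upper box-plot fence (third quartile plus 1.5 times the interquartile range). The function names `tail_summary` and `load_lengths` are hypothetical, and the file layout (one integer count per line, possibly preceded by a header line) is an assumption that should be checked against the actual file. The demonstration uses synthetic lognormal counts as a stand-in, not the msnbc data.

```python
import random
import statistics


def tail_summary(lengths):
    """Return simple skewness and tail indicators for a list of visit lengths."""
    q1, median, q3 = statistics.quantiles(lengths, n=4)
    upper_fence = q3 + 1.5 * (q3 - q1)  # upper box-plot whisker limit
    beyond = sum(1 for x in lengths if x > upper_fence)
    return {
        "mean": statistics.fmean(lengths),
        "median": median,
        "max": max(lengths),
        "share_beyond_fence": beyond / len(lengths),
    }


def load_lengths(path):
    """Read visit lengths, ASSUMING one integer per line.

    Lines that are not plain non-negative integers (for example a header)
    are skipped. This layout is an assumption, not documented above.
    """
    lengths = []
    with open(path) as handle:
        for line in handle:
            token = line.strip()
            if token.isdigit():
                lengths.append(int(token))
    return lengths


if __name__ == "__main__":
    # Synthetic stand-in for the real file: rounded lognormal counts, at least 1.
    rng = random.Random(0)
    sample = [max(1, round(rng.lognormvariate(1.0, 1.2))) for _ in range(50_000)]
    for name, value in tail_summary(sample).items():
        print(f"{name}: {value}")
```

With the real file in hand, `tail_summary(load_lengths("msnbclength.dat"))` would give the same indicators for the actual visits. For comparison, normally distributed data would put only about 0.35% of values above the upper fence, so a much larger share, together with a mean well above the median, points to a thick right tail.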
In a calculus-based Introductory Statistics class, or a mathematical statistics class, the length data set gives students a chance to discover the inverse Gaussian distribution and to make q-q plots of the data against the distributions suggested in the literature. Histograms, q-q plots, and summary statistics should be computed for lengths less than 100, because some lengths in the data set are much higher and obscure the behavior below 100. In a lower-division Introductory Statistics class, the data can be used to illustrate with box plots that the outliers in the skewed distribution are numerous, too many to be just outliers, and to introduce the notion of thick-tailed distributions. All the standard descriptive data analysis can also be done, as can sampling to illustrate the Central Limit Theorem.

REFERENCES:
http://www.stat.ucla.edu/~jsanchez/oid03/csstats/index.htm (this site contains the large data set used in the paper, msnbclength.txt)
http://kdd.ics.uci.edu/databases/msnbc/msnbc.html

SUBMITTED BY:
Juana Sanchez
UCLA Department of Statistics
8125 Math Sciences Building
Box 951554
Los Angeles, CA 90095-1554
jsanchez@stat.ucla.edu