Internet Data Analysis for the Undergraduate Statistics Curriculum

Juana Sanchez
UCLA

Yan He
UCLA

Journal of Statistics Education Volume 13, Number 3 (2005), www.amstat.org/publications/jse/v13n3/datasets.sanchez.html

Copyright © 2005 by Juana Sanchez and Yan He, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.


Key Words:Exponential; Internet traffic; Inverse Gaussian; Maximum likelihood; Negative Binomial; Poisson; Web server log data.

Abstract

Statistics textbooks for undergraduates have not caught up with the enormous amount of analysis of Internet data that is taking place these days. Case studies that use Web server log data or Internet network traffic data are rare in undergraduate Statistics education. And yet these data provide numerous examples of skewed and bimodal distributions, of distributions with thick tails that do not follow the usual models studied in class, and many other interesting statistical curiosities. This paper summarizes the results of research in two areas of Internet data analysis: users' web browsing behavior and network performance. We present some of the main questions analyzed in the literature, some unsolved problems, and some typical data analysis methods used. We illustrate the questions and the methods with large data sets. The data sets were obtained from the publicly available pool of data and had to be processed and transformed to make them available for classroom exercises. Students in Introductory Statistics classes as well as Probability and Mathematical Statistics courses have responded to the stories behind these data sets and their analysis very well. The message in the stories can be conveyed at a descriptive or a more advanced level.

1. Introduction

Statistics textbooks for undergraduates have not caught up with the enormous amount of analysis of Internet data that is taking place these days. Case studies that use Web server log data, Internet survey data or Internet network traffic data are rare in undergraduate Statistics education. We had to conduct a large amount of research and computer work to be able to assess the key areas of current Internet data analysis that are more likely to have an impact on business and the population of users in general. The two data sets and the story around each of them that we present below summarize that research. Those interested, can find more activities done with Internet data analysis at the CS-STATS web site that the main author created for instructors.

The two activities introduced in this paper, as well as those in the CS-STATS web site, are suitable for any level of Statistics pedagogy. Instructors interested in using them for service-type Introductory Statistics classes can choose the descriptive analysis and the quantile-quantile plots and curve fitting of the probability distributions appropriate at the introductory level. On the other hand, instructors in calculus-based Introductory Statistics or Mathematical Statistics classes can move beyond the descriptive analysis to maximum likelihood estimation, alternative distributions such as the inverse Gaussian and other distributions with thick tails if they wish. In Introduction to Probability courses, the stories presented below and the research literature on which they are based, can help expand the curriculum by introducing students to phenomena that may not have the usual distributions studied in the course. Classes focused on Data Analysis or Applied Statistics can use these activities too. Moreover, because of the size of the data set on web browsing behavior, instructors can use the data sets as the population from which random samples could be taken to illustrate the Central Limit theorem or sampling theory results.

One common thread in the two stories told below is that there is an increasing demand for Internet data analysis. Computer Scientists are being called more and more frequently to provide computer log data that can be used to understand users' web browsing behavior, to make web pages more responsive to users' needs. In addition to that, communication networks are providing data that can help understand how the network itself responds to users, to make the quality of the network better for users.

For the teaching of Statistics, there are some things in common in the two stories. Because the data and graphs that appear in Internet data analysis behave sometimes differently from what we teach students in the descriptive data analysis module, we can use these stories to reinforce what students already know, by contrast. In addition to that, the data sets are huge, much bigger than the ones we usually use in our teaching, presenting the student with the mystery of dealing with such monsters. Finally, but not the least, there are no definite answers yet, so the students are really being exposed to the ongoing search for new paradigms in the engineering, computer science and statistics community.

2. Internet traffic

Can we predict how the Internet network will perform at any time? An answer to this question would have the practical implication of helping optimize service provision to keep all customers and network administrators happy. However the Internet network is so heterogeneous and unregulated, and the lack of cooperation among individual servers is so prevalent that monitoring the network is a very challenging task. To monitor, good models are needed. But before the modeling stage is reached, a good understanding of the data is necessary. There seems to be now a consensus among researchers about the nature of the data but not about the best way to model that data. In this section we only unveil the nature of the data set we analyze and summarize some of the issues. The language is very specialized, and for this reason, we introduce some get-acquainted kind of information below (Gautam, 2003).

Messages that flow from a source to a destination through a network are also known as traffic. This traffic and the network conditions are extremely random in nature.

There are three types of telecommunication networks --telephony (telephone network for voice calls, fax, and also dial up connections), cable-TV networks (cable, web-TV, etc.) and high speed networks such as the Internet. We are concerned with high-speed networks.

Traffic flowing through the networks can be classified into several types. Two of the most common traffic types are Ethernet packets/frames and ATM cells. The length or size of an Ethernet packet ranges anywhere from 60 bytes to 1500 bytes and generally follows a bimodal distribution. The length of ATM cells is fixed at 53 bytes. Therefore the network traffic comprises of millions and billions of these little packets or cells. We are concerned with packets traffic.

The packets arise because when a message needs to be sent from a source to a destination, it is broken down into small packets and transported that way from the source to the destination.

The protocols responsible for this transport of packets over networks are user datagram protocol (UDP) and transmission control protocol (TCP). With UDP, the destination does not acknowledge the receipt of packets to the source. TCP is an acknowledgment (ACK) based protocol. In this paper, we are concerned with packets transported by TCP protocol.

There are several network performance measures that contribute to the Quality of Service of a network. Among others, we have: (a) loss probability, or the probability of delivering a message with some data loss; (b) delay or latency, the time lag between the source sending a message and the destination receiving it; (c) delay-jitter or measure of the variation of the delay; (d) bandwidth or rate at which messages are processed. These measures can be used for optimal design and admission control of the networks. Design deals with buffer sizes, link capacities, network parameters, traffic shaping parameters, and other. Admission control involves rejecting or accepting an incoming request for connection. These variables are very hard to measure; some researchers have come up with inference methods to estimate them. These methods are very complicated and difficult to understand by beginning statistics students. Because the nature of the data is so different from that of data we are more used to in our undergraduate classes, students should be able to understand that scientists are having a very hard time coming up with good models.

There are two main areas of research on Internet traffic data. One area is concerned with understanding how traffic data perform within a single route, i.e., in the connection between a pair of nodes in the network, which allows the use of stochastic process modeling (Willinger, Taqqu, Leland, and Wilson 1995; Willinger and Paxson 1998; Paxson and Floyd 1995). It has been found that the models used for telephone networks are not good for Internet traffic (Willinger, et al. 1995). The other area is more focused in modeling the simultaneous activity across all the nodes in the network, i.e., traffic measurements at different nodes based on carefully done sampling at different nodes. This latter approach is known as “network tomography” (Castro, Coates, Liang, Nowak, and Yu 2004) and it is very recent. The goal of both approaches is to predict measures of performance of the networks. The single route approach supporters believe that by careful selection of processes to model the traffic, more theoretical analysis can be done for multiple routes as well. Many attempts are being made by this group at maintaining the old queueing models that work so well with telephone networks, with some adaptations to the different behavior of Internet data (Guerin, Nyberg, Perrrin, Resnick, Rootzen, and Starica 2003).

It should be pointed out that because of the privacy of network data researchers are having a very hard time obtaining the data they need. But for the data sets that have been made publicly available, there are standardized ways to reduce data on traffic to statistics-friendly data. For example, Jeff Mogul used the steps described at this site ita.ee.lbl.gov/html/contrib/sanitize-readme.txt to make the TCP data set we use in this section user friendly.

The data discussed here can be obtained from the Internet Traffic Archive at: ita.ee.lbl.gov/html/contrib/DEC-PKT.html and is labeled dec-pkt-1.tcp. This data set summarizes traces of one hour's worth of TCP traffic between Digital Equipment Corporation and the rest of the world on March 8th, 1995. Paxson and Floyd (1995) also analyzed this data set. It describes 2,153,462 million packets and contains the following 6 variables.

The packet size and number of packets per unit of time are very important variables, and we will analyze them here.

Our analysis in this paper will be limited to some of the preliminary descriptive data analysis usually done. Since the data set we use is only for an hour, there are many things that other studies look at that we can not investigate. The R commands for the analysis done here can be found in Appendix A.

2.1 Descriptive analysis of packet size


Figure 1

Figure 1. Histogram of the number of data bytes in a packet.


A histogram of the number of bytes per packet is shown in Figure 1. We can see that it is bimodal as expected. The minimum value is 1 and the maximum is 1460. The most frequent packet size is around 512. Packets of that size arrive uniformly throughout the day, they are not concentrated in any particular hour. The correlation coefficient between the timestamp and the databytes variables is 0.01128, illustrating the lack of relation for a large majority of the data. However, one could do a plot of timestamp against databytes (not shown here) to see that the largest packages, those larger than 600, don't arrive uniformly throughout the hour.

2.2 Fitting an exponential to inter-arrival time

A question of interest in the traffic literature is: do the times between arrivals of two consecutive packets to Digital Equipment follow an exponential distribution? This question is important because if that is the case, there is still the possibility that some variant of Poisson models could be used for modeling the number of packets arriving per unit of time. A histogram of the timestamp between two arrivals, shown in Figure 2, reveals that inter-arrival times appear exponential, but the tail seems to be too long. Summary statistics show that according to the median, half the packets arrive within 0.000976 seconds of each other, while half take longer. The smallest interarrival time is 0, for packages that arrive simultaneously, and the maximum amount is 0.114. The consensus in the literature is that this variable is exponential. We present a qqplot to determine whether that is the case for our data.


Figure 2

Figure 2. Distribution of interarrival times of packets.


There is some indication of exponential-like behavior for small interarrival times, but the distribution is heavy tailed, and the exponential is not a good model due to large interarrival times. A quantile-quantile plot of the data against simulated exponential random numbers with the same average as our data, shows that there is lack of fit for large interarrival times, i.e., those interarrivals beyond the range one would expect from an exponential model. Figure 3 reveals that the quantiles of the data do not correspond to the same values as the quantiles in the simulated exponential. If the empirical data came from an exponential population, the points should fall approximately along the diagonal reference line. The greater the departure from this reference line, the greater the evidence for the conclusion that the data set have come from a population with a different distribution. Could the long tail be due to outliers? A boxplot (not shown) reveals that there are too many points outside the whisker to consider these tail points outliers. Rare interarrival times are not so rare in this data set.


Figure 3

Figure 3. QQ plot of the interarrival times and the exponential distribution.


2.3 Fitting a Poisson distribution to the number of packages per second

Unlike in telephone networks, where the number of arrivals per unit of time is Poisson, in Internet networks exponential inter-arrivals do not translate into Poisson behavior for the number of packets per unit of time or for the number of data bytes per unit of time. The mean number of packets arriving per second is 384.262 and the variance is 15306.35, clearly much larger than the mean. The lack of fit of the Poisson distribution to this data set is clearly shown in the quantile-quantile plot of Figure 4, which shows that the quantiles of number of packets per second are very different from the quantiles of a Poisson distribution with the same mean as the data. In fact, the fit is very poor.


Figure 4

Figure 4. QQ plot of the number of packets per second and a Poisson distribution.


This lack of fit has been attributed by computer scientists and engineers to the burstiness of packets per unit of time over time, regardless of the time scales at which we measure them. Many papers have been written showing how this periodic outbursts of activity over time are present in almost all the publicly available data sets. The challenge has been and still is to determine why. Most papers have resorted to asymptotic explanations such as long range dependence and self similarity, and to this day there is no consensus on this matter.

The data set we use in this section is too short to investigate the burstiness of the number of packets per unit of time thoroughly. But it illustrates what this behavior looks like in a plot. We created a data set called ``pacpersecond'' representing the number of packets arriving every second, and plotted it against an index variable representing seconds. Figure 5 shows that the number of packets per second displays burstiness. Changing the scale does not make the burstiness disappear; we created another variable called ``pacperminute'' that represents the number of packets arriving per minute. Figure 6 shows the trace of this variable against an index variable representing minutes. Again, the burstiness is present. If the counts were Poisson, burstiness in the data would disappear when the scale is aggregated. The fact that the burstiness in the data does not change whether we measure the number of packets per minute, per second, or millisecond is another indication that the data is not Poisson. In fact, any distribution whose characteristic is that more aggregated scales will not change the burstiness will not fit the packet data well. And this is what is making the modeling so difficult. (Willinger and Paxson 1998)


Figure 5

Figure 5. Trace of the number of packets per second.


Figure 6

Figure 6. Trace of the number of packets per minute.


The above descriptions of the data we are using in this paper indicate that this data set follows the same behavior as most of the data sets analyzed in the literature on network traffic. With these pieces of information, researchers are trying to determine what kind of model of network traffic will capture these characteristics. There are many other interesting types of analysis that one can do with longer data sets, but they all lead to the same conclusions. These other data sets can be found in the address given earlier in this section.

2.4 One Alternative to the Poisson distribution

The negative binomial distribution is very often proposed as an alternative model for counts when there is extra variation not explained by the Poisson. For this reason, we considered fitting a Negative Binomial to the packet data. The negative binomial has size parameter equal to 10.92. Figure 7 compares the box plots of the packets per second and the negative binomial. And Figure 8 shows the quantile-quantile plot of the data against the negative binomial.


Figure 7

Figure 7. Box plot of packets per second compared with box plot of a negative binomial distribution.


Figure 8

Figure 8. Q-Q plot of packets per second against the negative binomial distribution.


We can see in the quantile-quantile plot that the negative binomial fits the data better than the Poisson, however at very low and very high values of the variable pacpersecond the fit is not so good. The negative binomial is then not a perfect alternative to the Poisson but only an improvement. Instructors in Introductory Statistics or Mathematical Statistics courses could expand the exercises done here by making students fit other models that might account better for the thickness of the tail, like some of the distributions considered in the literature referenced at the end of the paper. This activity will certainly engage students in the current debate about the nature of Internet traffic data.

3. Web browsing behavior at the user level

Once a user enters a web site how many pages or links within the site does that user visit? The answer to this question may suggest actions to improve the site. If similar distributions for the number of pages visited per user are observed at different web sites, then maybe some laws can be established for all sites. Research efforts in this area are directed at finding these laws. Examples of these efforts are Hansen and Sen (2003), , and Huberman, Pirolli, Pitkow, and Lukose (1998), each analyzing a different data set.

Some of the analysis done in the literature to answer that question can be illustrated with data published in the UCI KDD Archive (Heckerman). We processed these data to obtain observations on the number of different pages visited by users who entered the msnbc.com page on September 28, 1999 and other information. A random sample of this data set was used by Cadez, et al. (2003) to do cluster analysis and visualization of the patterns (order) of visits followed by users, i.e., to see the frequency of whole sequences. This is a very important question, too. But we don't look into it in this paper.

The original data comes from Internet Information Server (IIS) logs for msnbc.com and news-related portions of msn.com for the entire day of September 28, 1999 (Pacific Standard Time). Each sequence in the dataset corresponds to page views of a user during that twenty-four hours period. Since there are 989818 users, there are 989818 sequences. This is a 22.6 MB size data set. Each event in a sequence corresponds to a user's request for a page. Requests are not recorded at the finest level of detail--that is, not at the level of URL, but rather, they are recorded at the levels of page category (as determined by a site administrator). The categories are ``frontpage'', ``news'', ``tech'', ``local'', ``opinion'', ``on-air'', ``misc'', ``weather'', ``health'', ``living'', ``business'', ``sports'', ``summary'', ``bbs (bulletin board service)'', ``travel'', ``msn-news'', and ``msn-sports''. As an example, we write below the sequence for the first three users in the data set (one line per user):

User 1 frontpage, frontpage
User 2: news
User 3: tech,news,news,local,news,news,news,tech,tech

We processed the original data set to obtain the variable ''length'', which represents the actual total number of links visited by each user. For example, user one has length=2 user two has length=1, and user three has length=9.

3.1 Descriptive Analysis of the length of visits

The average number of pages visited is 4.747, the median is 2 pages, the minimum is 1 and the maximum is 14800 pages. The histogram, in Figure 9, is very skewed.


Figure 9

Figure 9. Histogram of the length of visits with an Inverse Gaussian distribution with the same mean and variance superimposed.


Notice that the histogram contains only values of length less than or equal to 100, excluding those users that visited more than 100 pages. The longest visits are probably crawlers or maybe different people logged into the same IP address. One of the problems with web server log data is precisely what to do with these crawlers. Should they be included, should they not? Although we did not include them all in the graphs, all the numbers were used for the computations of the statistics. An important fact to observe is that most users visit few pages, but the tails are very long, indicating that some users visit a lot of pages.

3.2 Fitting the Inverse Gaussian distribution to the length of visits

What model should we use for this behavior? Huberman, et al. (1998) and other authors, recommended an inverse Gaussian distribution for the variable length (L). This distribution has two parameters and is described by the formula

The mean and variance , where is a scale parameter. This distribution “has a very long tail, which extends much farther than that of a normal distribution with comparable mean and variance. This implies a finite probability for events that would be unlikely if described by a normal distribution. Consequently, large deviations from the average number of user-clicks computed at a site will be observed” (Huberman, et al. 1998, pg. 95). Another property is that “because of the asymmetry of the distribution function, the typical behavior of users will not be the same as their average behavior. Thus, because the mode is lower than the mean, care must be exercised with available data on the average number of clicks, as this average overestimates the typical depth being surfed.”

It can be shown that the cumulative distribution function of the inverse Gaussian distribution is

where is the standard normal distribution function.

Is the inverse Gaussian really a good model for the data we have? It is instructive to follow the guidelines given in the references mentioned above to answer this question.

Theoretically, by maximizing the likelihood function, the equations for the maximum likelihood estimators (MLE) of and in the inverse Gaussian distribution given above can be found to be

and

For the msnbc.com data, we find:

The inverse Gaussian with these estimates is fitted to the histogram in Figure 9. Visually, it is not a perfect fit for lower values of length, which is where the majority of the data are concentrated. And we don't show the tails, so we can not conclude from this plot that the fit is good over the whole distribution.

Commands in R to do these computations and the ones that follow can be found in Appendix B.

To see how good is this model, Huberman, et al. (1998) and Hansen and Sen (2003) compared the cumulative distribution function implied by the model to the empirical cumulative distribution function derived from the data. Then they use a probability-probability plot against the fitted distribution. We do the same with the length variable; the plots can be seen in Figure 10.The p-p plot reveals a misfit of the inverse Gaussian model to our data at the lower values of length. Hansen and Sen (2003) got similar results with the bell-labs.com data set they used.


Figure 10

Figure 10. Left: Empirical (ooo) and Inverse Gaussian (---) cumulative distributions.
Right: p-p plot.


Another way of investigating whether the inverse Gaussian is a good model, is based on the following fact: If you take logs on both sides of the inverse Gaussian formula you obtain

Thus a plot of log(L) vs log(frequency) should show a straight line whose slope approximates -3/2 for small values of L and large values of the variance. This is because if we substitute for on the right hand side of the formula, i.e., which follows from the formula for the variance, the variance appears in the denominator and the mean in the numerator. For small mean, which is the case here, and large variance, which is also the case for our data, the second term is almost 0. A plot of the frequency distribution of surfing clicks on the log-log scale can be seen in Figure 11. According to this plot andthe theoretical result, the regression line for the whole range of the data has a slope of -1.9, which is not too far from -1.5, so this result holds approximately for our data.


Figure 11

Figure 11. Plot of frequency distribution of length in log-log scale.


The log-log plot also helps us notice that, up to a constant given by the third term, the probability of finding a group surfing at a given level scales inversely in proportion to its length, . This is a characteristic that appears in a lot of Internet data sets. We don't pursue it further here, but it is at the heart of the debate about the nature of the data and the best possible model.

The appeal of the inverse Gaussian is in its decision theoretic foundations: it is the distribution that would result if visitors to the web site were optimizing their utility (Huberman, et al. 1998). But based on the results above, would we recommend the inverse Gaussian model for the length of visits (or number of links that a user visits) in the msnbc.com data set or other web server log data? This is one of the questions still unanswered and in need of more research. Several authors have tried other distributions with other data sets, for example, the geometric, the Poisson and a power law, but none of these distributions have fit the data well, either in the upper tail or in the lower levels of length. See, for example, Baldi, Frasconi, and Smyth (2003). A power law distribution tends to give a good fit at the tail, but it fails in fitting the lower values, which is where most of the observations are concentrated.

We shall not dwell further on Web browsing behavior research. But before we conclude this section, we would like to point out that the above is just the tip of the iceberg. Once the distribution of ``length'' is settled, the next step for researchers is to model the sequence of requests by users. Huberman, et al. (1998) model them using a simple first order Markov model. Hansen and Sen(2003) try a first and second-order Markov model, a finite mixture of first-order models, and a Bayesian approach. Cadez, et al. (2003) investigate simple Markov models for different clusters of users. The objective of these modeling attempts is to determine the best model to predict a user's next page request. Pages with higher probability of being requested can then be made more accessible.

Interested readers can experiment in class with many of these questions, either using the raw data, or three different processed data sets that we extracted from this raw data set, and that are available at the CS-STATS web site that we will be glad to provide upon request. Perhaps the reader can obtain Web server log data from the school where this material will be taught. In the latter case, be aware that raw log server data with URLs and detailed computer information needs to be converted to something like the raw data of Heckerman using Perl or similar programs. After that, you can process it further to use it for data analysis. The CS-STATS web site has some Perl scripts that could be used to that end.

4. Conclusion

As the reader may be able to ascertain from the above stories, there are plenty of interesting projects for our students in the analysis of Internet data. The data sets that most researchers are using to test their models are publicly available. These data sets, after appropriate processing, can be assigned to students to investigate questions similar to those summarized above or more advanced ones. The instructor in the AP Statistics or College level introductory statistics class can do simple descriptive analysis, and instructors in calculus-based Introductory statistics or Mathematical Statistics classes can include the maximum likelihood analysis and model fitting.

What is most relevant is that the analysis of Internet data is at its early stages and therefore there are many unsolved questions and no established paradigm. By involving students in those questions, we are making them active participants in current debates, without leaving the classroom. Interested readers can contact the authors to access exercises and data sets that we have prepared for use in the classroom.

5. Getting the Data

Earlier in Section 2 of the paper we gave the web address where the large trace of tcp traffic could be obtained. Most of the analysis in this paper is done with the whole data set. But that data set is 61.1 MB large, so we extracted from it 50000 observations or about 2 minutes worth of traffic. This reduced data set is in the file packetdata.dat.txt. The file packetdata.txt contains the description of this data set. Notice that because of the time series nature of the data, extracting a random sample from the whole data file would destroy its meaning. The file msnbclength.dat.txt contains the length of the clickstreams of a random sample of visitors to the msnbc web site during the period studied in this paper. The file msnbclength.txt contains the description of the data. We did the analysis using the large dataset, which is 22.6 MB large. Those interested in replicating the analysis done here with the larger data set, can access the large file at the web address given in the file msnbclength.txt.


Appendix A

We present here the R program used to obtain the results and graphs of Section 2. The reader using the data set packetdata will be reading only 50000 observations, or about 2 minutes of traffic, hence you will not be able to see Figure 5 and Figure 6, so you can exclude all the part of the program computing the pacpersecond and pacperminute. To find these, you will need to go to the whole data set. You can find the address in the paper, Section 2 and in packetdata.doc. Alternatively, you can obtain the large data set from the CS-STATS web site . The conclusions reached from the smaller data set are the same as those derived from the larger one, so for classes, the 50000 observations is plenty (with exception made for the plots mentioned above). The first author uses this smaller file for class assignments.

dec1 = read.table("packetdata.dat.txt",header=TRUE)

timestamp = dec1$timestamp[dec1$databytes>0]
databytes = dec1$databytes[dec1$databytes>0]
source=dec1$source[dec1$databytes>0]
destination=dec1$destination[dec1$databytes>0]
sourceport=dec1$sourceport[dec1$databytes>0]
destport=dec1$destport[dec1$databytes>0]

hist(databytes)
summary(databytes)

plot(timestamp,databytes)

arrivals = matrix(timestamp)
n=(length(timestamp))-1
interarrivals = matrix(rep(0,n),ncol=1)
for(i in 1:n) {interarrivals[i]= arrivals[i+1]-arrivals[i]} #we are measuring the time between each two arrivals

par(mfrow=c(2,1)) # open space to put two graphs together
hist(interarrivals, main="histogram of interarrivals in the data") # don't copy this graph yet

summary(interarrivals)
lambda=1/mean(interarrivals)
exponential =rexp(n,lambda) #generate exponential random variables
hist(exponential, main="histogram of simulated exponential", xlim=c(0,max(interarrival)) #copy and paste the graph window now.
dev.off() #close the graph window
qqplot(interarrivals,exponential,main="qqplot of data vs model")
abline(0,1)

pacpersecond = matrix(c(rep(0,3600)))

par(mfrow=c(1,2))
negbinomial = rnbinom(3450, mu = 384.2629, size = 384.2629^2/(15306.35-384.2629))
boxplot(pacpersecond,ylim=c(0,1300),main="Packets per second")
boxplot(negbinomial,ylim=c(0,1300),main="Negative binomial")
qqplot(pacpersecond,negbinomial,main="quantile-quantile plot")
abline(0,1)

for(i in 1:3600) {pacpersecond[i]=length(timestamp[(floor(timestamp))==i])}
pacpersecond=pacpersecond[pacpersecond>0]
plot(pacpersecond,type=''b'')
qqplot(pacpersecond, rpois(length(pacpersecond), mean(pacpersecond)),ylab="Poisson")
abline(0,1)

pacperminute = matrix(c(rep(0,60)))
pacperminute[1]= sum(pacpersecond[1:60])
for(i in 2:60)
{j=(i-1)*60+1
k=i*60
pacperminute[i]=sum(pacpersecond[j:k])
}
plot(pacperminute,type=''b'')


Appendix B

We present below the R program used to obtain the results and graphs of Section 3. Notice that the data set read in this program is the reduced version, a random sample of 50000 users from the larger, 22.6 MB file used in this paper. The address for the latter can be found in the file msnbclength.txt. The results are about the same for a file with 50000 as they are for the larger number of users in the big file. For class assignments, the first author uses only the 50000 observations.

# Read data, do histogram and superimpose the inverse Gaussian

length=read.table("msnbclength.dat.txt")
length=length$length
length=length[length<=100]
library(SuppDists)
postscript("histogram.eps",horizontal=FALSE)
par(mar=par("mar")+c(0,0,15,0))
hist(length,prob=T)
pts=seq(from=par("usr")[1],to=par("usr")[2],len=length(length))
lines(pts,dinvGauss(pts,4.747129,3.081917),xpd=T)
dev.off()

# Do the plot of log length against log frequency and
# find regression estimates

freqtable = read.table("frequencytable",header=TRUE)
length=freqtable$length
frequency=freqtable$frequency
plot(log(length),log(frequency))
lm(log(frequency)~log(length))

library(SuppDists)
length=read.table("msnbclength.txt",header=TRUE)
length=length$length

n=length(length)
lengthsum=sum(length)
lengthtable=table(length)
lengthfrequency=table(length)/n
cumlength=cumsum(lengthfrequency)

mu=4.747129
lambda=3.081917

l=seq(1,100)
prob=pinvGauss(l,mu,lambda)

par(mfrow=c(1,2))
plot(l,cumlength[1:100], type="p", xlab="length", ylab="Cumulative probability")
lines(l,prob)
plot(prob,cumlength[1:100], xlab="Inverse Gaussian CDF", ylab="Empirical CDF")
lines(l/100,l/100)
dev.off()


Acknowledgments

We thank the referees and W. Robert Stephenson, the editor of the JSE, for their very helpful comments and suggestions, which have improved the paper considerably. The research in this paper was funded by the Office of Instructional Development at UCLA, under year 2003 Instructional Improvement Grant OID IIP 03-20 to the main author, to whom all correspondence should be addressed. The second author is a graduate student in Statistics who collaborated in the material of Section 3. The contents of the paper are a small part of a larger project intended to create activities and data sets for undergraduate students in Computer Science taking their first course in Statistics. The first author of the paper has used the activities in almost all her classes, at one level or another, with good response from students. We would also like to thank Walter Rosenkrantz, Jeff Mogul, Zhiyi Chi and Mark Hansen who provided some references for consultation at the beginning of the project.


References

Baldi, P., Frasconi, and P., and Smyth, P. (2003), “Modeling the Internet and the Web,” in Probabilistic Methods and Algorithms, New York: Wiley & Sons.

Cadez, I., Heckerman, D., Meek, C., Smyth, P., and White, S. (2003). “Model-based clustering and visualization of navigation patterns on a Web site,” Journal of Data Mining and Knowledge Discovery, 7(4), 399-424.

Castro,R., Coates M., Liang, G., Nowak R., and Yu, B. (2004), “Network Tomography: recent developments,” Statistical Science, 19(3), 499-517.

CS-STATS www.stat.ucla.edu/~jsanchez/oid03/csstats/cs-stats.html

Digital Equipment Corporation. The traces were made by Jeff Mogul (mogel@pa.dec.com) of Digital's Western Research Lab (WRL). The trace correspond to DEC-WRL-1

Gautam, N. (2003), “Stochastic Models in Telecommunications for Optimal Design, Control and Performance Evaluation,” Handbook of Statistics, Vol. 21, eds. D.N. Shanbhag and C.R. Rao, Elsevier Science B.V.

Guerin, C.A., Nyberg, H., Perrin, O., Resnick, S., Rootzen, H., and Starica, C. (2003) “Empirical Testing of the Infinite Source Poisson Data Traffic Model,” Stochastic Models, 19(2), 151-200.

Hansen, M.H. and Sen, R. (2003), “Predicting Web User's next access based on log data,” Journal of Computational and Graphical Statistics, 12(1), 143-155.

Heckerman, D. The UCI KDD Archive (kdd.ics.uci.edu) Irvine, CA: University of California, Department of Information and Computer Science. The URL for the data used in this paper is kdd.ics.uci.edu/databases/msnbc/msnbc.dat.txta.html

Huberman, B.A., Pirolli, P.L.T., Pitkow, J.E., and Lukose, R.M. (1998), “Strong Regularities in World Wide Web Surfing,” Science, 280, 95-97.

Paxson, V. and Floyd S.(1995). “Wide-Area Traffic: The Failure of Poisson Modeling,” IEEE/ACM Transactions on Networking, 3(3), 226-244.

Willinger, W. and Paxson, V. (1998), “Where Mathematics meets the Internet,” Notices of the AMS, September 1998, 961-970.

Willinger, W., Taqqu, M.S., Leland, W.E. and Wilson, D.V. (1995), “Self-similarity in High-Speed PAcket Traffic: Analysis and Modeling of Ethernet Traffic Measurements,” Statistical Science, 10(1), 67-85.


Juana Sanchez
Department of Statistics
UCLA
Los Angeles, CA
U.S.A.
jsanchez@stat.ucla.edu

Yan He
Department of Statistics
UCLA
Los Angeles, CA
U.S.A.
yanhe@stat.ucla.edu


Volume 13 (2005) | Archive | Index | Data Archive | Information Service | Editorial Board | Guidelines for Authors | Guidelines for Data Contributors | Home Page | Contact JSE | ASA Publications