PART I. Graphical Analysis of Recurrent Events Data
Wayne Nelson, Consultant, Schenectady, NY, WNconsult@aol.com
Recurrent events data arise in many fields. Previous models and methods apply only to the observed numbers of events. This introductory lecture presents a new nonparametric model and graphical and analytical methods for such data which include costs or other numerical values associated with events. Examples include treatment costs and durations of recurrent disease episodes and costs or downtime for product repairs. The methods also apply to a mix of different types of events, such as different patient outcomes or types of repairs. The methods are illustrated with applications to recurrent diseases, product repairs, and births of babies to statisticians. Needed theoretical developments will be mentioned.
PART II. Regression Analysis of Recurrent Events Data
Jerry Lawless, University of Waterloo, jlawless@uwaterloo.ca
A sequel to Wayne Nelson's lecture on graphical analysis, this IOL considers ways in which covariates can be incorporated in an analysis, focusing on models and methodology that are closely related to the framework described by Dr. Nelson. Robust methods for the analysis of rate of recurrence and mean cumulative functions will be described and illustrated on data arising in both industrial and medical contexts. The analyses considered can be carried out using standard software, and details concerning how to do this with SAS or Splus/R will be provided. Finally, some extensions to the methodology will be discussed, as well as topics needing development.
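As a rough illustration of the mean cumulative function (MCF) mentioned in the two lectures above, the sketch below computes the usual nonparametric estimate: at each event time, the total value of the events at that time is divided by the number of units still under observation, and the increments are summed. The function name, the data layout, and the three unit histories are hypothetical and chosen only for illustration. Setting every event value to 1 gives the mean cumulative number of events; using costs instead gives the mean cumulative cost.

```python
from collections import defaultdict

def mcf_estimate(histories):
    """Nonparametric MCF estimate.

    histories: list of (events, end_of_observation) pairs, where events is a
    list of (time, value) tuples with time <= end_of_observation.
    Returns a list of (time, estimated MCF) pairs at the distinct event times.
    """
    # Total value of all events observed at each distinct time.
    totals = defaultdict(float)
    for events, _ in histories:
        for time, value in events:
            totals[time] += value

    mcf, running = [], 0.0
    for time in sorted(totals):
        # Units whose observation period has not ended before this time.
        at_risk = sum(1 for _, end in histories if end >= time)
        running += totals[time] / at_risk
        mcf.append((time, running))
    return mcf

# Hypothetical histories: each event counts as 1 (use costs for a cost MCF).
histories = [
    ([(2, 1.0), (5, 1.0)], 8),  # unit A: two events, observed to time 8
    ([(3, 1.0)], 4),            # unit B: one event, observed to time 4
    ([], 6),                    # unit C: no events, observed to time 6
]
mcf = mcf_estimate(histories)
for time, value in mcf:
    print(time, round(value, 4))
```

In this toy data set, unit B leaves observation at time 4, so the event at time 5 is averaged over the two remaining units rather than all three.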
Organizer: Angela Dean
Time: Sunday, August 7, 2005, 4.00-5.50pm
This is a novel session in which ten speakers will present outlines of different types of case studies. These will be made available to members of the audience via a website after the conference. The case studies can be used for teaching materials for a wide range of audiences. The talks are as follows:
Statistics for New Products: Concept to Market in 10 Minutes
Fred Hulting and Jon Coltz, General Mills, Inc.
Taking a new product from idea to reality requires many difficult steps. Among them are the identification and optimization of the concept and product, the startup and refinement of the associated manufacturing process, and the development of distribution, sales, and marketing strategies. Each step in this new product development process requires the integration of multiple sources of information in order to make sound, data-based business decisions. Thus, statistics plays a vital role throughout this process. This ten-minute case study will highlight the role of statistics in developing and delivering a new food product to the marketplace.
Computer Experiments: Some Knee-Jerk Applications
William I. Notz, The Ohio State University
Computer experiments arise when complex computer code is used to model a physical phenomenon and experimentation is done on the code, either in addition to or in place of experimentation on the physical phenomenon. In this talk, I will briefly describe some of the basic statistical issues that arise in computer experiments, outline some of the basic statistical methodology that has been used to design and analyze data from computer experiments, and then illustrate the methodology in the design of hip and knee prostheses. The prosthesis design problems will also serve as a context for my description of methods that have been developed to design and analyze data from computer experiments.
Designing an Aging Study
Joanne Wendelberger, Los Alamos National Laboratory
Aging studies are often conducted to monitor a population for possible signs of degradation. Depending on the particular application, such studies may involve several experimental factors, multiple responses, repeated measures over time, and other interesting statistical issues. A case study will be used to illustrate the process of designing an aging study, taking into account the objectives of the study and the statistical challenges posed by the nature of the particular case study. Tradeoffs between alternative experimental plans will be examined. Practical considerations and interactions with the subject matter experts play a key role in carrying out the design phase of the aging study.
Seven Error Terms! Are You Kidding???
Thomas M. Loughin, Kansas State University
A single experiment may involve several different factors, each applied to experimental units of completely different shapes and sizes. The appropriate ANOVA model may not be obvious, especially if the design is not one found in any textbook. As long as one can identify the experimental units for each factor or combination of factors, however, the task becomes relatively straightforward. An example of an agronomic field trial is presented in which four factors are arranged in such a way that seven distinct experimental units are created. Identification of these units and their incorporation into an ANOVA model is demonstrated.
Repeated Measures, Split Plots, and Missing Data: A Mobile Computing Field Study
Daniel T. Voss and Jennie J. Gallimore, Wright State University, and Mary McWesler, University of Dayton Research Institute
This talk concerns the design and data analysis for an experiment conducted to study the use of mobile computing devices for real-time navigation and communication. Attention is restricted to studying the effects of display type and visual presentation format on navigational performance. The planned experiment was a 2x3 design in two treatment factors, with 12 subjects and six observations per subject. The experiment was run over 24 days. Each subject was involved for two distinct days, with three observations collected per day on the subject. Days are viewed as whole plots and runs within a day as split plots, with one between-whole-plots factor and one within-whole-plots factor. To collect an observation, a subject navigated one of six paths. The design used is balanced for path and run order. Each path included north, south, east and west legs. The belated inclusion of leg direction as a third factor of interest made this a split-split-plot design. The observation for one run could not be used.
Designing and Identifying Multi-Stage Experiments
Derek Bingham, Simon Fraser University
The statistical design and analysis of experiments is frequently used in industrial applications. Complete randomization of the experimental trials is frequently impractical when it is too costly or even impossible to change the level settings of some factors. In this talk a general approach to the construction of factorial designs with randomization restrictions at multiple stages is discussed. The approach is illustrated on an investigation of a spiked plutonium alloy manufacturing process that took place at Los Alamos National Laboratory.
Statistical Modeling of a Chemical Reaction
Martha Gardner, General Electric Global Research
This case study involves the modeling of a chemical reaction in a Methyl-Chloro-Silane (MCS) process. Different catalysts and levels were studied through a designed experiment, but due to the nature of the experimental process and the effect of time, the experiment has an element of repeated measures as well.
The Cross-over design: 2-D or not 2-D?
Reid D. Landes, University of Arkansas for Medical Sciences, and John VanDyk, Iowa State University
We describe a two-treatment, two-period cross-over experiment and its subsequent analysis. The experiment was designed to evaluate the effectiveness of an experimental teaching method as compared to a standard. The standard and experimental teaching methods were, respectively, the use of two-dimensional (2-D) diagrams and interactive three-dimensional (3-D) object movies of (insect) specimens for acquiring identification skills of morphological features. This particular experiment is useful in teaching the cross-over experimental design as it appeals to a variety of disciplines, the data set is small enough for hand calculations, and the results are clear.
The BHH Tomato Fertilizer Experiment Revisited
Rob Easterling, Itinerant Professor
Archie Bunker once told his son-in-law, "Don't give me no stastistics, Meathead. I want facts!" We statisticians get our kicks from stastistics (i.e., the technical aspects of statistical data analysis), while our clients and collaborators are turned on by the facts (the science or business insights provided by data). If we want these professionals to be passionate about the use of statistical methods in their work, we need to connect their enthusiasm for their chosen fields with our enthusiasm for statistical methods. Statistical texts are limited in space and often cannot make this connection, but instructors can by embedding textbook examples in credible scientific or business contexts. I illustrate this approach with a tomato-fertilizer example from the classic experimental design text by Box, Hunter, and Hunter. This example has realistic features, not brought out in the text, such as a lurking variable and an apparent outlier that can only be seen in the context of this lurking variable, and it can also be extended to consider the implications of the analysis for subsequent actions by either a casual gardener or a commercial grower.
Eye, Robot: Experiences with driving simulator data
Russell Lenth, University of Iowa
Organizer: Jacqueline Hughes-Oliver
Time: Monday, August 8, 2005, 2.00-3.50pm
Sponsored by the Chemometrics group
Combinatorial chemistry libraries are synthesized by putting together a full factorial of chemical building blocks. For example, all ten As might be combined with all twenty Bs giving 200 reaction products. A resulting design problem is how to pick the ten As and twenty Bs from among the thousands of each that are available. Once the 200 resulting compounds are tested in a biological system, how should the analysis proceed? Given the resulting statistical model, how should the next set of As and Bs be selected? Successful application of statistics will make this expensive process more effective. The talks are as follows:
How Does One Describe a Molecule to a Computer?
Yvonne Martin, Abbott Laboratories
Because molecules are flexible three-dimensional objects that project many different properties into three-dimensional space, in order to apply any computational algorithm to chemical compounds one must transform the molecular structure into a vector or matrix of numbers. This talk will discuss the various approaches used and the interplay between the types of descriptors and the purpose for which they will be used.
A three-block analysis of chemical reaction data
S. Stanley Young and Sean Ge, National Institute of Statistical Sciences, and Salvadore Profeta, Jr., University of South Carolina
In planning the synthesis of a combinatorial chemical library, it is typical to do a large-scale rehearsal of the potential reactions. In a three-part reaction, a subset of the A, B, and C combinations is reacted to assess the potential success of the full factorial of reactions. With a good design, the three faces of the full factorial cube are completely or nearly completely populated. Each A, B, and C reagent can be numerically characterized. Our idea is to use an L-shaped three-block analysis to predict the reaction success for a face of the full factorial cube. The apex block of data gives the reaction yields. The other two blocks give the numerical characteristics of two of the reagents. Using the analysis of the three faces, we should be able to predict the combinations of A, B and C that are likely to be successful in the full library synthesis.
Identifying quantitative structure-activity relationships using Optimal Bit String Trees
Ke Zhang and Jacqueline Hughes-Oliver, North Carolina State University, and S. Stanley Young, National Institute of Statistical Sciences
A new method called the Optimal Bit String Tree (OBSTree) is proposed to identify quantitative structure-activity relationships (QSARs) to relate chemical structural features to biological activity. The method introduces the concept of a chromosome to describe the presence/absence context of a combination of descriptors. A descriptor set combined with its optimal chromosome forms the splitting variable in OBSTree, and this splitting variable is optimized under a new stochastic searching scheme. Results from simulation studies and application to real data show that OBSTree is advantageous in accurately and effectively identifying QSAR rules in combinatorial chemistry libraries.
Discussant: Douglas M. Hawkins, University of Minnesota
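The full-factorial library described in this session's overview (ten As combined with twenty Bs to give 200 reaction products) can be sketched in a few lines of Python. The building-block labels below are placeholders rather than real reagents, and the sketch only enumerates the combinations; it says nothing about how the As and Bs should be chosen.

```python
from itertools import product

# Placeholder labels for the selected building blocks (not real reagents).
a_blocks = [f"A{i}" for i in range(1, 11)]   # ten As
b_blocks = [f"B{j}" for j in range(1, 21)]   # twenty Bs

# Full factorial: every A is combined with every B.
library = list(product(a_blocks, b_blocks))

print(len(library))   # 10 * 20 = 200 reaction products
print(library[:3])    # first few (A, B) pairs
```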
Organizer: Angela Dean
Time: Tuesday, August 9, 2005, 10.30am-12.20pm
In a choice or conjoint experiment, an experimental design defines sets of products that vary on brand, price, and a variety of other product-specific attributes. Subjects choose from among the alternatives, just as consumers choose products from a shelf. Different aspects of designing and modelling such experiments will be explored and discussed. Statistical models that incorporate response time will be discussed. These provide information beyond the scope of preference for product concepts. New experimental designs for choice experiments will be presented. The talks are as follows:
The Design of Stated Choice Experiments
Warren F. Kuhfeld and Randall D. Tobias, SAS Institute Inc.
Marketing research and choice modeling are increasingly becoming concerns for statisticians in industry. Choice designs define sets of products that vary on brand, price, and a variety of product-specific attributes. Subjects choose between the alternative products just as consumers choose products from a shelf. Choice modeling is used to understand attribute importance, how the attributes influence choice, and ultimately to design products that will maximize consumer interest and profit. The experimental designs for choice models are often quite large and complex relative to most other DOE applications. This is because the designs must realistically represent a complex marketplace, because the cost of data collection is relatively low, and because choice designs consist of sets of products with one factor for every attribute in every set. We will introduce the problem of designing choice experiments and discuss a general approach that uses both combinatorial methods and integer optimization to make a huge variety of both regular and nonregular designs including highly restricted designs.
Efficient Designs for Conjoint Analysis and Discrete Choice Experiments
Rainer Schwabe, Otto von Guericke University
In many fields like market research, market segmentation or personnel decisions, conjoint analysis or discrete choice experiments are performed to measure the potential decision behavior of consumers. The information will be used to judge the acceptance of potential future products in the market. The utility of such a product is determined by a variety of attributes. This utility cannot be measured directly. For discrete choice experiments, only (binary) preferences are available from a set of possible alternatives, while in linear paired comparisons the quantitative magnitude of the preference may be determined. Moreover, due to the complexity of the decision task, often only partial profiles are presented, i.e. the alternatives are specified by a (small) subset of attributes. For these situations optimal or efficient designs are presented which determine the choice of the alternatives. The results are strongly related to balanced incomplete block designs and orthogonal arrays. Extensions are indicated for situations where interactions of attributes may also influence the utility.
An Integrated Model of Choice and Response Time with Applications to Conjoint Analysis
Greg Allenby and Thomas Otter, The Ohio State University
With the proliferation of computer and web-based interviewing tools, process data such as response time arises as a natural by-product of many conjoint applications. Despite the immediate availability of this data, surprisingly few attempts to integrate response time into choice models have been made. Available models that take response time into account treat the observed response times as another explanatory variable and conduct conditional inference on the probability of choice. We investigate use of a model that treats choice and time as dependent variables of an underlying psychological process. Product profiles in a choice set induce a signal generating process in the respondent's mind, where more attractive profiles are assumed to generate signals at a faster rate. The profile that first generates a cumulative number of signals equal to a consumer-specific threshold is chosen. Thus the underlying psychological process links the conditional choice probability to the marginal density of response time via some parameter vector that needs to be estimated. Our model integrates choice data, which indicates a relative ranking of the choice alternatives, with response times that point to the magnitude by which the best alternative is preferred.
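The signal-accumulation mechanism in the last abstract can be illustrated with a small simulation. This is only one simple reading of the description, not the authors' model: it assumes each profile emits signals as an independent Poisson process, so the time a profile needs to reach a threshold of k signals follows a gamma distribution with shape k and scale 1/rate. The rates, the threshold, and the function name are all hypothetical.

```python
import random

def simulate_choice(rates, threshold, rng):
    """Simulate one choice task.

    rates: signal rate per profile (higher = more attractive).
    threshold: number of signals a profile must accumulate to be chosen.
    Returns (index of the chosen profile, response time).
    """
    # Time for a Poisson process with the given rate to produce `threshold`
    # signals is gamma distributed with shape=threshold and scale=1/rate.
    finish_times = [rng.gammavariate(threshold, 1.0 / rate) for rate in rates]
    chosen = min(range(len(rates)), key=finish_times.__getitem__)
    return chosen, finish_times[chosen]

rng = random.Random(0)          # fixed seed so the run is reproducible
rates = [2.0, 1.0, 0.5]         # hypothetical attractiveness of three profiles
results = [simulate_choice(rates, threshold=5, rng=rng) for _ in range(10_000)]

# Share of tasks in which each profile was chosen, and the mean response time.
shares = [sum(1 for c, _ in results if c == i) / len(results)
          for i in range(len(rates))]
mean_time = sum(t for _, t in results) / len(results)
print(shares, mean_time)
```

Under these assumptions the choice and the response time come from the same process: making one profile more attractive both raises its choice share and shortens the time to a decision.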
Organizer: Angela Dean
Time: Wednesday, August 10, 2005, 8.30-10.20am
There has been considerable recent interest in the collection and analysis of internet traffic data in order to estimate quality of service parameters and to design and monitor computer and communications networks. Voice communication is moving from the traditional public switched telephone network (PSTN) to the Voice-over-IP (VoIP) technology of the Internet. VoIP technology considers the statistical properties of an aggregate of calls and makes sure the bandwidth of each link can handle the statistical fluctuations and provide quality-of-service with high probability. This session will feature talks on various aspects of the statistical problems associated with internet traffic. The planned talks are as follows:
Developments in Network Tomography
Earl Lawrence, George Michailidis and Vijayan Nair, University of Michigan
The Internet is a rapidly changing and important environment. One minute your site sits in total obscurity. The next minute, your traffic is slashdotted into a frenzy. The minute after that, a bunch of hackers orchestrate a DDoS on your now popular Internet domain. The minute after that, you return to obscurity as the flash crowds decide that they can't get through and your site probably wasn't that interesting anyway. Monitoring and avoiding these issues has become an important field of study for both computer scientists and statisticians. In this talk, we will present procedures for estimating link-level packet loss rates and delay distributions based on end-to-end active measurements. Our active probing packets are considered part of a spatio-temporal time series. Our scheme deconvolves the end-to-end data into a link-level model. Further, we present methods for monitoring this model in order to detect and locate changes over time. We allow the active measurements to be calibrated based on passive monitoring on selected links. We explore issues involving the selection of probing experiments and the placement of link-level monitoring devices. Additionally, we consider the multiple source model in which the network is probed from different locations.
Statistical Estimation in Network Tomography
Gang Liang, University of California at Irvine, and Bin Yu, University of California at Berkeley
Network monitoring and diagnosis are key to improving network performance. The difficulties of performance monitoring lie in today's fast-growing Internet, accompanied by increasingly heterogeneous and unregulated structures. Moreover, these tasks become even harder since one cannot rely on the collaboration of individual routers and servers to directly measure network traffic. Even though the aggregative nature of possible network measurements gives rise to inverse problems, existing methods for solving inverse problems are usually computationally intractable or statistically inefficient. In this talk, we discuss a pseudo likelihood approach for solving a group of network tomography problems. It uses the principle of divide-and-conquer to achieve a good balance between the computational complexity and the statistical efficiency of the parameter estimation. Under some general regularity conditions, the consistency and asymptotic normality of the pseudo likelihood estimator are established.
The Statistics of Voice Over the Internet
Jin Cao, Bell Labs Lucent Technologies, and William S. Cleveland, Purdue University
Voice communication, long provided by the "public switched telephone network", is moving rapidly to the Internet under the name Voice over Internet Protocol (VoIP). But little empirical study of VoIP traffic characteristics has been carried out. Data collected from the Global Crossing network are used to study the statistics of call arrivals, call durations, the multiplexed packet arrival point process, silence suppression, and discrimination of attempted calls and connected calls. A critical question about traffic engineering is answered: how many calls can be put on an Internet link while maintaining quality-of-service criteria on jitter and delay?
Organizer: S. Stanley Young
Time: Thursday, August 11, 2005, 8.30-10.20am
This session will deal with the new area of multi-block analysis. This type of situation arises when, for example, there are two or more sets of medical data on patients (such as microarray data and blood chemistry data), but the number of patients is considerably smaller than the number of variables being measured.
Mining Systems Biology Data
Raymond Lam, GlaxoSmithKline
Understanding biological systems is becoming an increasingly important first step in drug discovery. Recent technological advances are making the measurement of biological activity at the molecular level possible. Experimental platforms generate large volumes of data that measure gene translation, proteins and metabolites. Identifying biomarkers specific to different drugs and diseases and understanding the association of these biomarkers will give us a better understanding of drug efficacy and safety and of the predictive nature of animal models.
PLS, GPA and Multi-block analysis methods
Michel Tenenhaus, HEC School of Management, France
A situation where J blocks of variables are observed on the same set of individuals is considered in this paper. A factor analysis logic is applied to tables instead of variables. The latent variables of each block should explain their own block well and, at the same time, the latent variables of the same order should be as positively correlated as possible to improve interpretation. In the first part of the paper we describe the hierarchical PLS path model and recall that it allows one to recover some usual multi-block analysis methods. In the second part, we suppose that the number of latent variables can be different from one block to another and that these latent variables are orthogonal. PLS regression, PLS path modeling and Generalized Procrustes Analysis are used for this situation. This approach is illustrated by an example from sensory analysis.
Multiblock Relationships in High Dimensions
Douglas M. Hawkins and Despina Stephan, University of Minnesota
The familiar canonical correlation formulation deals with the relationship between two vector-valued random variables, and has been extended to settings with three or more vector-valued variables. This formulation is of growing interest in a number of settings. For example, the members of a pharmaceutical chemical library may be described by a vector of observed clinical effects; a vector of Aspergillus mutagenicity measures; and a vector of molecular descriptors. This leads to the problem of finding linear (or nonlinear) functions of the separate vectors that show the commonality between the three types of measure. Such data sets are typically of very high dimension, however, and multiblock methods motivated by social science problems are unsatisfactory without adaptation to address problems such as apparent rank near-deficiencies, which arise in high dimension. We present some adaptations of this sort, and address the related problem of model diagnostics.
Ontology Enhanced Statistical Analysis
Jiajun Liu and Jacqueline M. Hughes-Oliver, North Carolina State University, and Alan Menius, Glaxo Wellcome Inc.
New systems biology technology platforms and techniques give scientists the ability to measure thousands of biomolecules including genes, proteins, lipids and metabolites. Analyses of these combined data are typically complex, resulting in hundreds of statistically significant findings. The potential for type I errors and lack of interpretability can greatly diminish the impact of these experiments. Our goal is to analyze gene expression data using classical statistical methods, guided by domain knowledge as captured in the Gene Ontology (GO). Methods combining existing domain knowledge with classical methods can yield more interpretable results or even improved analysis. Our research supports this conclusion through various simulations as well as analyses of real data sets.
Roundtable - "The role of statistical science in understanding climate change"
Douglas Nychka, National Center for Atmospheric Research
The influence of human activities on the earth's climate is by now largely undisputed and the potential for even greater changes in this century confronts us. The grand challenge for climate science is to understand the complex feedbacks that balance and amplify processes in the earth's climate system and to provide projections of what we might expect for the future. An important use of these climate projections is translating them into economic and societal impacts. Impacts of climate change are useful in making policy and planning decisions that can range from a utility company anticipating future power demand for heating and cooling, to a public health department evaluating changes in infectious disease patterns, to a country reducing greenhouse gas emissions. To progress from large scale chemical and physical models to results that are useful for a policy or decision maker one must consider a range of spatial and temporal scales and also substantial components of uncertainty. The progression from geophysical models to a finding suitable for policy might be characterized as an end-to-end analysis. At many steps in an end-to-end analysis statistical science has an important role not only in improving climate estimates but also in characterizing the uncertainty in such estimates.
Roundtable - "Designing Real Experiments: Tricks of the Trade"
George Milliken, Kansas State University
The process of designing an experiment for a real world situation involves several steps. The first step is getting the researcher to provide a detailed description of the objectives of the experiment. The second step is gaining an understanding of the process the researcher is going to follow to carry out the experiment. The researcher often questions my "need" for information about the study, but I always say "the more I understand about your study and process of carrying out the experiment, the more I can be helpful." The researcher may think he/she knows the type of designed experiment that is required, but I always obtain sufficient information to substantiate their thoughts. The discussion will involve the process of tailoring the design of the experiment and statistical analysis to incorporate the process the researcher must follow to carry out the experiment to provide answers to the stated objectives. Important issues are (1) recognizing the need for different sizes of experimental units, (2) whether blocking is useful or necessary, (3) the appropriate block size, (4) the expected distribution of the data, (5) how the data should be analyzed, and many others.
Roundtable - "Surviving in Industry: Advice for Newcomers"
Fred Hulting, General Mills, Inc.
You completed your degree and landed a great job as a statistician in business or industry...now what? We'll discuss what you can expect, what skills will be helpful, and how you can make the most of your new opportunity. You'll even learn what some seasoned industrial statisticians have in their survival kits.
Roundtable - "Statistics in Internetland"
James Marron, University of North Carolina, Chapel Hill
The analysis and modelling of Internet traffic is an area with a strong need for a wide range of new statistical methodologies. The challenges are particularly acute because the area abounds in non-standard pitfalls such as heavy tailed distributions and long range dependence, which render many standard techniques useless. Even time-honored classical notions such as random sampling need re-examination.