Online Program
Modeling Multiple Categorical Measurements Using Linear Latent Structure Analysis

Keywords: Multidimensional categorical data, demographic surveys, latent analysis, health state

Linear Latent Structure (LLS) analysis is used to analyze high-dimensional categorical data. Such data are abundant i) in behavioral science, especially in demographic surveys, and ii) in genetic studies (whole-genome microarray data). LLS analysis assumes that the measurements reflect a hidden property of the subjects that can be described by a low-dimensional random vector. This vector is interpreted as a set of explanatory variables that can shed light on the mutual correlations observed among the measured categorical variables. LLS analysis is used to discover this hidden property and to describe it as precisely as possible. In this report we discuss the formulation of the LLS model, its statistical properties, the algorithm used to estimate the model and its implementation, simulation studies, and an application of the LLS model to the National Long Term Care Survey (NLTCS) data. We also discuss the relationship between LLS and Grade of Membership analysis.

The basic steps of LLS analysis are i) determining the dimensionality of the explanatory vector, ii) identifying the linear subspace over which the explanatory vector ranges, iii) choosing a basis in that subspace using methods of cluster analysis and/or prior knowledge of the phenomenon of interest, iv) calculating empirical distributions of the so-called LLS scores, which represent individual responses in the linear subspace, and v) investigating properties of the LLS score distributions to capture population and individual effects (e.g., heterogeneity). Simulation studies demonstrate the quality of reconstruction of the major model components (i.e., the low-dimensional subspace and the distribution of the LLS scores).
Results of the simulation studies indicate that the quality of reconstruction is sufficient for typical sample sizes and demonstrate the potential of the methodology for analyzing survey datasets with 1000 or more questions. The methodology was applied to the 1994 and 1999 NLTCS datasets (5,000+ individuals), which contain responses to over 200 questions on behavioral factors, self-reported functional status, and comorbidities. We estimated the subspace that carries the latent vectors and interpreted its basis as “pure-type individuals” (e.g., healthy, strongly disabled, having chronic diseases). The estimated distribution of the LLS scores reveals the heterogeneity structure of the population. The components of the vectors of individual LLS scores are used to predict individual lifespans.
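
As a rough illustration of steps i) and ii) and of the kind of simulation study described above, the following minimal sketch simulates binary responses whose probabilities are mixtures of a few "pure-type" profiles, then recovers the dimensionality and the subspace. It is not the estimation algorithm discussed in the report: a plain singular value decomposition is swapped in as a stand-in. The sample sizes, the Dirichlet mixing weights, the restriction to yes/no questions, the noise threshold, and all variable names are assumptions made for the example. The `scores` computed at the end are simple projections onto the estimated subspace, not LLS scores as defined by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, loosely echoing the survey scale mentioned in the abstract.
N, J, K = 5000, 200, 3  # individuals, binary questions, latent dimension

# Assumed "pure-type" profiles: row k holds P(answer = yes) for each question.
Lambda = rng.uniform(0.05, 0.95, size=(K, J))

# Assumed individual mixing weights on the simplex (each row sums to 1).
G = rng.dirichlet(np.full(K, 0.5), size=N)

# Individual response probabilities lie in the K-dimensional span of Lambda.
P = G @ Lambda
X = (rng.random((N, J)) < P).astype(float)  # simulated yes/no answers

# Step i (stand-in): count singular values above a heuristic noise level.
# For independent mean-zero noise with variance at most 0.25, the largest
# singular value is roughly bounded by 0.5 * (sqrt(N) + sqrt(J)); the
# factor 1.1 is an arbitrary safety margin.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
threshold = 1.1 * 0.5 * (np.sqrt(N) + np.sqrt(J))
K_hat = int(np.sum(s > threshold))

# Step ii (stand-in): orthonormal basis of the estimated subspace.
basis = Vt[:K_hat]  # shape (K_hat, J)

# Quality of reconstruction: how much of the true profiles falls outside
# the estimated subspace, relative to their overall size.
residual = Lambda - (Lambda @ basis.T) @ basis
rel_residual = np.linalg.norm(residual) / np.linalg.norm(Lambda)

# Coordinates of each individual's response pattern in the subspace.
scores = X @ basis.T  # shape (N, K_hat)

print("estimated dimension:", K_hat)
print("relative residual of true profiles:", rel_residual)
```

In a sketch like this, a small relative residual means the true profiles lie close to the estimated subspace. Choosing an interpretable basis inside that subspace (step iii) would still require cluster analysis or prior knowledge, as the abstract notes.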
