JSM 2012 Online Program

The views expressed here are those of the individual authors and not necessarily those of the JSM sponsors, their officers, or their staff.

Abstract Details

Activity Number: 338
Type: Contributed
Date/Time: Tuesday, July 31, 2012 : 10:30 AM to 12:20 PM
Sponsor: Section on Statistical Learning and Data Mining
Abstract - #305924
Title: Text Classification and Big Data
Author(s): David Afshartous*+ and George Michailidis
Companies: Vanderbilt University and University of Michigan
Address: Department of Biostatistics, Nashville, TN, 37232-2158, United States
Keywords: machine learning; distributed computing; text analytics; data mining; text categorization
Abstract:

The problem of text classification is central to many businesses in the information age, where massive amounts of relevant data are readily available. Classic examples include e-mail spam filtering, customer sentiment analysis, diagnosis from electronic medical records (EMRs), and document classification for legal discovery. Text classification may be viewed as a multi-step process that begins with transforming unstructured data into a structured format, e.g., a document-term matrix in which rows represent documents and columns represent document features. Supervised learning methods may then be applied. The subsequent steps require many decisions, and guidance for those decisions often differs between research domains. Such decisions include feature selection, dimensionality reduction, training set size, algorithm selection, and error analysis. In this paper, we consider text classification from the perspective of big data, where data size is affected by both the number of documents and the number of features employed by the learning algorithm. We assess the impact of big data on each step of text classification and offer suggestions in the context of both open source and commercial software options.
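
To make the first step concrete, the following is a minimal sketch of building a document-term matrix, with rows as documents and columns as terms. It is not the authors' method: the function names (tokenize, document_term_matrix), the min_doc_freq threshold used as a crude stand-in for feature selection, and the toy corpus are all assumptions introduced for illustration, and only the Python standard library is used.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents, min_doc_freq=1):
    """Return (vocabulary, matrix): rows are documents, columns are terms.

    Terms that appear in fewer than min_doc_freq documents are dropped,
    which is a very simple (assumed) form of feature selection.
    """
    # Per-document term counts.
    counts = [Counter(tokenize(doc)) for doc in documents]
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for c in counts for term in c)
    vocabulary = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    # A Counter returns 0 for missing terms, so absent terms become zeros.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical toy corpus, invented for illustration only.
docs = [
    "Win money now",
    "Meeting agenda for Monday",
    "Win a free meeting with money experts",
]
vocab, dtm = document_term_matrix(docs, min_doc_freq=2)
print(vocab)
for row in dtm:
    print(row)
```

Both dimensions of the matrix drive data size: adding documents adds rows, while lowering min_doc_freq adds columns. At scale a sparse representation would normally replace the dense lists used here.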


The address information is for the authors that have a + after their name.
Authors who are presenting talks have a * after their name.
