Friday, February 21
CS07 Big Data in the Real World Fri, Feb 21, 11:00 AM - 12:30 PM
Bayshore I

Working with Complex Sizeable (i.e., Gigabyte) Data on a PC: A Case Study (302718)

*Pete Michael Sherick, Lubrizol Corporation 

Keywords: large data, applied statistics, statistical engineering, SAS, R, Lubrizol, regression

At the grandest scale, machine learning, data mining, and cloud/distributed computing techniques are changing the way we gather and process information. This is the Big Data revolution that is transforming marketing, health care, banking, industry, manufacturing, government, and countless other sectors. On a smaller scale, as the computational power and storage capacity of personal computers have increased over the past decade, so has the potential for ever-larger statistical analyses. However, analyzing even a few gigabytes of data presents a number of obstacles to extracting meaningful and usable information. This talk will feature an example analysis of kinematic viscosity results for more than 100,000 automotive lubricant formulations encompassing more than 10,000 components. I will discuss our current process, including retrieving, formatting, cleaning, and analyzing the data, and ultimately uploading the resulting model into user-accessible applications. I will describe issues statisticians may run into with data of this size and suggest possible solutions. The SAS and R software packages will be the primary focus, and their respective benefits and limitations will be discussed.