Impact of data analysis algorithm choice on RNA-seq gene expression estimation and downstream gene-based prediction
*May D. Wang, Emory-Georgia Tech Cancer Nanotechnology Center 


As RNA-seq technologies mature, the choice of data analysis pipeline has become a critical challenge in clinical applications. The FDA-led Sequencing Quality Control (SEQC) Consortium conducted a comprehensive investigation of 278 representative RNA-seq data analysis pipelines to determine how algorithm choice affects several aspects of gene expression output: reproducibility of expression estimates relative to qPCR reference data, repeatability of expression estimates across technical replicates, detection of low-expressing genes, detection of differentially expressed genes, and performance of RNA-seq-based predictive models in clinical settings. The results reveal that the quality of gene expression estimates and of downstream prediction varies significantly with pipeline components such as mapping, quantification, and normalization. The study establishes a general guideline for selecting safe RNA-seq data analysis pipelines, to assist clinicians and bioinformaticians in achieving improved biological utility, reproducibility, repeatability, and effectiveness in decision making.