JSDSE Focuses on Teaching Reproducibility,
Responsible Workflow

Aneta Piekut, Rohan Alexander, Colin Rundel, Micaela Parker, and Nicholas J. Horton

Many new principles and standards have been developed to foster cultural change toward reproducible research, but less has been done in teaching. Articles in the November 2022 issue of the American Statistical Association’s open-access Journal of Statistics and Data Science Education present how to integrate practices for achieving reproducibility into teaching data science and statistics. The 11 papers and accompanying editorial discuss how to teach reproducibility and responsible workflow from four perspectives:

  1. Refreshing and organizing teaching materials
  2. Providing guidelines for student work
  3. Engaging students in editorial work
  4. Revising curriculum at the program level

“The Growing Importance of Reproducibility and Responsible Workflow in the Data Science and Statistics Curriculum,” by Nicholas J. Horton, Rohan Alexander, Aneta Piekut, and Colin Rundel, motivates the issue by describing how teaching the data analysis cycle requires knowledge of reproducibility and workflow that has not historically been central to statistics and data science education, and it advocates for including these topics in the curriculum.

“An Invitation to Teaching Reproducible Research: Lessons from a Symposium,” by Richard Ball, Norm Medeiros, Nicholas W. Bussberg, and Aneta Piekut, summarizes key messages from the 2021 Project TIER symposium. The 10 talks showcased examples of students benefiting in multiple ways from being taught reproducible methods alongside their statistical training: improving skills in computation, data management, and documentation that transfer to research jobs and beyond; gaining confidence in analytical and interpretive skills; and broadening their intellectual development.

“Interdisciplinary Approaches and Strategies from Research Reproducibility 2020: Educating for Reproducibility,” by Melissa L. Rethlefsen, Hannah F. Norton, Sarah L. Meyer, Katherine A. MacWilkinson, Plato L. Smith II, and Hao Ye, reports on a virtual conference dedicated to teaching reproducible research. They thematically analyzed the conference content and identified trans/interdisciplinary themes, including lifelong learning, cultivating bottom-up change, “sneaking in” learning, just-in-time learning, targeting learners by career stage, learning by doing, learning how to learn, establishing communities of practice, librarians as interdisciplinary leaders, teamwork skills, rewards and incentives, and implementing top-down change, along with key lessons for each.

“Data Science Ethos Lifecycle: Interplay of Ethical Thinking and Data Science Practice,” by Margarita Boenig-Liptsin, Anissa Tanweer, and Ari Edmundson, notes that data science is part of the social world, with the potential to significantly affect (for better or worse) individuals and communities. Instructors, learners, and researchers are encouraged to consider the ethical dimensions of their practice. The Data Science Ethos Lifecycle tool was created to facilitate reflection on how social context interacts with data science work and what the social consequences of the final products might be. The authors conclude that a workflow is responsible only if ethical reflection is present at each stage of research.

“Opinionated Practices for Teaching Reproducibility: Motivation, Guided Instruction, and Practice,” by Joel Ostblom and Tiffany Timbers, observes that while it is relatively easy to engage statistics and data science students in data analysis and project tasks—as they are driven by curiosity to discover new patterns—it is more difficult to do so when teaching a reproducible workflow. The authors suggest addressing this through explicit motivation, guided instruction, and opportunities for deliberate practice.

“Tools and Recommendations for Reproducible Teaching,” by Mine Dogucu and Mine Çetinkaya-Rundel, argues that teaching materials (raw data, lecture slides, videos, exercises, etc.) should be clearly organized, with easy-to-follow workflows and links between management systems, and made available via a version control system and Markdown-based notebooks. They give an example of how students can document, share, and professionally report their work.

“Third Time’s a Charm: A Tripartite Approach for Teaching Project Organization to Students,” by Christina Mehta and Renee’ Moore, reflects on three iterations of a statistics course and how students are guided to collaborate. The foundation of successful collaboration is transparently and neatly organized data documentation—a transferable skill pointed to by many contributions in this issue.

“LUSTRE: An Online Data Management and Student Project Resource,” by John Towse, Rob Davies, Ellie Ball, Rebecca James, Ben Gooding, and Matthew Ivory, describes a system to engage students with best practices for open research by allowing them to experience different phases of reproducible research. They describe the LUSTRE package, which promotes good data management practices, enables the delivery of key concepts in open research, and organizes and showcases project work.

“Teaching for Large-Scale Reproducibility Verification,” by Lars Vilhuber, Hyuk Harry Son, Meredith Welch, David N. Wasser, and Michael Darisse, describes an innovative, research-led pedagogical approach in which students are involved in the editorial work of the journals published by the American Economic Association. Students check the completeness of replication materials and the computational reproducibility of the code. They also have a chance to work across many coding languages to understand the workflow.

“Collaborative Writing Workflows in the Data-Driven Classroom: A Conversation Starter,” by Sara Stoudt, reviews the use of reproducible tools (e.g., R Markdown and computational notebooks) that allow individual students to create reproducible research outputs, while noting that collaborative approaches are less often used. This is in stark contrast to how data science projects are done in real life. Stoudt discusses two workflow strategies that can be used in teaching reproducible research and that require students to delegate tasks (e.g., chunks of code), communicate about changes, and integrate their contributions.

“A Journey from Wild to Textbook Data to Reproducibly Refresh the Wages Data from the National Longitudinal Survey of Youth Database,” by Dewi Amaliah, Dianne Cook, Emi Tanaka, Kate Hyde, and Nicholas Tierney, motivates the preparation of reproducible materials as a way to refresh teaching examples built on data sets that are frequently updated. This approach is attractive in part because it can serve as a model for the reproducible standards expected in student work.

“Approachable Case Studies Support Learning and Reproducibility in Data Science: An Example from Evolutionary Biology,” by Luna L. Sanchez Reyes and Emily Jane McTavish, explores how we communicate open-access materials and how they relate to the real world outside of narrow data science silos. They find that even when code and data are published online, the language used in replication materials may be too complex to understand clearly. They identify barriers to the accessibility of research workflows and discuss how to make them more available to a general audience.