Reproducible workflows

Version Control and Computational Notebooks

John Little

Duke University Libraries

Center for Data & Visualization Sciences

2024-09-27

Article Production

Reproduction



Authoring and computation environment should enable the articulation of scholarship within a reproducible context

Reproducibility Pyramid.  CC BY John Little & Sophia Lafferty-Hess

Reproducibility Pyramid ● Little & Lafferty-Hess (2020)

Features

  • Support composable recombination
  • Accommodate multimedia expression
  • Provide rich reporting expressions
  • Support economical portability and degrade gracefully
  • Support extensibility
  • Ensure transparency
  • Support a documentary-style project history
  • Accommodate change and collaboration
  • Be citable

Three points


  1. Version Control (Git & GitHub)
  2. Notebooks (Literate Coding)
  3. Archiving & Publishing (Zenodo, Containers)


Preview



Version Control

Characteristics of version control


  • A system to manage projects (repo)
  • A system to track how computer files change over time
  • A system that supports collaborative revision
  • More than file synchronization
  • Assists in project back-ups

Git

  • Free and open source
  • Wildly successful; most broadly implemented
  • In use across the globe
  • Use it on any file system
  • Track any file
  • Use it in any environment

Scalable to project size

Project Repositories

Track change


Branches

GitHub

  • Profile (store and host) git repos
  • Enable collaboration across the globe, in public or in private
  • Editorial and fine-grain control

Git + GitHub

Hubs

  • GitHub
  • GitLab
  • BitBucket

Duke specific hubs

  • gitlab.oit.duke.edu (NetID)
  • PACE
  • Anywhere that data and coding happen.

File Distribution and Collaboration

Other project management features


Basic features

Git features implemented for distribution

  • Push
  • Public or Private
  • Clone / Fork
  • Pull Request
  • Pull
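The push and clone steps above can be sketched with Python's subprocess module driving the git command line. This is an illustration under stated assumptions: git must be installed, a local bare repository stands in for a hub such as GitHub or GitLab, and the file name, commit message, and author identity are hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command in cwd and return its standard output."""
    # A throwaway identity so the commit works on machines without git config.
    identity = ["-c", "user.name=Example Author", "-c", "user.email=author@example.org"]
    result = subprocess.run(
        ["git", *identity, *args],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return result.stdout


def demo_push_and_clone(workdir):
    """Commit a file, push it to a local stand-in hub, then clone it elsewhere."""
    workdir = Path(workdir)
    hub = workdir / "hub.git"      # stand-in for a GitHub or GitLab remote
    author = workdir / "author"    # the author's working copy
    reader = workdir / "reader"    # a collaborator's clone

    git("init", "--bare", str(hub), cwd=workdir)
    author.mkdir()
    git("init", cwd=author)
    (author / "analysis.txt").write_text("first draft\n")
    git("add", "analysis.txt", cwd=author)
    git("commit", "-m", "Add first draft", cwd=author)
    git("remote", "add", "origin", str(hub), cwd=author)
    git("push", "origin", "HEAD", cwd=author)         # Push: publish commits to the hub

    git("clone", str(hub), str(reader), cwd=workdir)  # Clone: copy the repo and its history
    return (reader / "analysis.txt").read_text()


if shutil.which("git"):
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_push_and_clone(tmp))
```

Forks and pull requests are features of the hub rather than of Git itself, so they are not shown here.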

Push

Clone

Fork / PR

Summary

  • Git is used to track changes to your repo
  • GitHub is used to distribute your git repo and facilitate collaboration

Notebooks

Reproducibility

  • Do everything with code!

    • Helps reduce repetition errors
    • Helps avoid copy/paste barriers
    • Orchestrate workflows

Computational Notebooks

  • Authoring environment

    • Code chunks interspersed with natural language
    • aka Literate Coding
  • Easy to read and compose

  • Graceful degradation

Reports and Expressions


Report expressions are rendered at code execution
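As a minimal sketch of what "rendered at code execution" means, the sentence below is built from the data every time the code runs, so the reported numbers cannot drift from the analysis. The measurements and the function name are hypothetical.

```python
from statistics import mean

# Hypothetical measurements; in a real notebook these come from the data-cleaning step.
response_times = [12.1, 9.8, 11.4, 10.7]


def summary_sentence(values):
    """Build the report sentence from the data each time the notebook is rendered."""
    return (
        f"We analyzed {len(values)} observations "
        f"with a mean of {mean(values):.1f} seconds."
    )


print(summary_sentence(response_times))
```

If a fifth observation arrives, re-rendering updates both the count and the mean without anyone retyping a number.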



Interactivity and web applications

  • Shiny
  • Flask
  • WebR
  • Plotly Dash
  • ObservableJS

Quarto Notebook in RStudio

Jupyter Notebooks

Quarto

  • A scientific publishing system
  • R, Python, ObservableJS
  • Compose with standard text editors, or basic IDEs
    • IDEs: RStudio, Jupyter, VSCode, Positron

Rendered Outputs

  • Artifacts that document a body of work
  • Are reproducible and modifiable when data or techniques change
  • Easy to update natural language explanations and re-render outputs
  • Schedule emails based on report parameters

Summary of benefits

  • Use natural language to clearly explain data, models, and workflows
  • Reduce dependencies on outside and undocumented steps
  • Ability to expose technical code chunks depending on audience focus
  • State of the art reproducibility
    • 21st century container for evidence-based, computationally-processed research

Analysis & Visualization

Analysis

Visualization

Use graphics tools predicated on the grammar of graphics

Reporting


Report expressions are rendered at code execution



Archive & Publish

GitHub example

Types of repositories

Archival

Posterity of milestones


Workflow

Versions / evolution of project


How

  • Generate report expressions from code
  • Combine GitHub releases with Zenodo to archive your milestones, and share the interactive computation through a BinderHub
  • Zenodo: general, open repository to deposit research papers, data sets, code, reports and related artifacts and connect to a citable DOI.
  • Binder: package and share reproducible computational environments
    • mybinder.org (public BinderHub portal)

Steps

  1. Make a GitHub Release at project milestone(s)
  2. Connect GitHub to Zenodo
    1. Mint a DOI for a GitHub Release (persistent identifier: citation; milestones)
    2. With DOI, link to ORCID
  3. Create a publicly launchable, fully functional computation container of your work
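When GitHub is connected to Zenodo, the metadata attached to each archived release can be supplied by a .zenodo.json file at the root of the repository. The sketch below assembles such metadata in Python. The field names (title, upload_type, description, creators) are assumed from Zenodo's deposit metadata and should be checked against the current Zenodo documentation; the description text is hypothetical.

```python
import json


def zenodo_metadata(title, creators, description):
    """Assemble metadata in the shape of a .zenodo.json file (field names assumed)."""
    return {
        "title": title,
        "upload_type": "presentation",  # assumed value; pick the type that fits the deposit
        "description": description,
        "creators": [{"name": name, "affiliation": aff} for name, aff in creators],
    }


metadata = zenodo_metadata(
    title="Reproducible workflows: Version Control and Computational Notebooks",
    creators=[("Little, John", "Duke University Libraries")],
    description="Slides on version control, notebooks, and archiving.",  # hypothetical
)

# Saving this text as .zenodo.json in the repository root is left to the author.
print(json.dumps(metadata, indent=2))
```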

End to End

  1. Project with version control
    • project folder with TIER organization
    • data (raw)
    • version control (git)
  2. Coding notebook
    • data cleaning
    • natural language explanations
    • analysis and modeling
    • visualization
    • generate report expressions from code
  3. Publish
    • workflow archived and collaboration enabled via Git; shared through GitHub / GitLab etc.
    • Milestones linked to GitHub releases; DOIs minted; Posterity archiving at archival repositories (e.g. Zenodo)
    • Informal: web, file sharing, etc.
      • Whitepapers, slides, dashboards, etc.
    • Formal: via peer-reviewed journal articles
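The first step of the outline above, a project folder with a predictable organization, can be sketched in a few lines of Python. The folder names here are illustrative assumptions, not the official TIER layout; consult the TIER Protocol for its recommended structure.

```python
import tempfile
from pathlib import Path

# Illustrative folder names only; not the official TIER layout.
FOLDERS = ["data/raw", "data/processed", "notebooks", "output"]


def create_project(root):
    """Create the project skeleton and return the directories it contains."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text("Project overview: data sources and how to render the notebook.\n")
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    print(create_project(tmp))
```

Keeping raw data in its own folder, untouched by code, is what lets the notebook regenerate everything downstream.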

Sharing your workspace


Your computation workspace (i.e. your laptop, desktop, cloud)

Give someone else your laptop so they can play around with your projects

  • the code, the data, the settings and configurations?
  • Good idea?

Now you can share a copy of your computational environment

BinderHub

  • Easiest: mybinder.org open and public
    • quarto use binder
  • Security demands may push you to use Singularity
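Sharing through mybinder.org comes down to sharing a launch link that points at a public repository. A minimal sketch of building that link for a GitHub repository follows; the user and repository names are hypothetical placeholders.

```python
from urllib.parse import quote


def binder_launch_url(user, repo, ref="HEAD"):
    """Build a mybinder.org launch link for a public GitHub repository."""
    # ref may be a branch, tag, or commit; slashes in branch names are percent-encoded.
    return (
        "https://mybinder.org/v2/gh/"
        f"{quote(user)}/{quote(repo)}/{quote(ref, safe='')}"
    )


print(binder_launch_url("example-user", "example-repo"))
```

Pointing the link at a release tag rather than HEAD ties the shared environment to the same milestone that was archived.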

Container Examples
