R for Lunch

Import data and install RStudio / Tidyverse

John Little

Duke University Libraries

Center for Data & Visualization Sciences

2024-09-12

Today’s topics

  • How to import data

  • Tour of RStudio IDE

  • Coding notebooks

Preceded by where to download RStudio and R

Housekeeping

  • Drew / Lauren / breakout rooms
  • CDVS
    • Themes
      • Data Management (Plans, Reproducibility, Repositories)

      • Data Science

      • Data Visualization

      • GIS and Spatial Analysis

      • Data Sources

Housekeeping continued

R for Lunch as a series

R for Lunch is a series that meets 8 times (through the end of October). After today it will meet regularly on Thursdays at noon.

  • Sign up for each workshop individually

  • Each episode has a unique Zoom link

Eat your own dog food


Model how R can work for practical reproducible workflows

Definitions

R/Tidyverse/Quarto


R/Tidyverse/Quarto represents the state of the art for practical reproducibility

R & RStudio


R is a data-first programming language


RStudio is an IDE

Reproducibility


  • Independently and transparently achieve reliable results with the same data and the same workflow
    • Transparency with reproducible workflows
  • The best workflow and ecosystem for reproducible work is to “do everything with code”
    • Import data, analyze, visualize, and publish/share
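
A minimal sketch of the "do everything with code" idea, assuming the dplyr and ggplot2 packages are installed. The built-in mtcars dataset stands in for imported data, and the object names and output file name are hypothetical choices for illustration.

```r
library(dplyr)
library(ggplot2)

# Import: a built-in dataset stands in for reading a file
cars <- as_tibble(mtcars, rownames = "model")

# Analyze: mean fuel economy for each cylinder count
mpg_by_cyl <- cars |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg), n = n())

# Visualize: one bar per cylinder count
p <- ggplot(mpg_by_cyl, aes(factor(cyl), mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean miles per gallon")

# Share: write the chart to a file (a temporary folder in this sketch)
out_file <- file.path(tempdir(), "mpg_by_cyl.png")
ggsave(out_file, plot = p, width = 5, height = 3)
```

Because every step is code, rerunning the script on the same data repeats the same analysis and regenerates the same chart.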

Tidyverse

  • An opinionated set of packages for data manipulation and analysis

  • A meta-package of core symbiotic packages (nine as of tidyverse 2.0.0, which added lubridate)

Packages

  • Extend R into your subject domain

  • And/or make it easier to accomplish a computational task

  • There are thousands

    • MetaCRAN, CRAN, BioConductor, GitHub
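
As a sketch of how a package gets onto your machine and into a session, the pattern below installs from CRAN only when the package is missing and then loads it. The guard with requireNamespace() is one common convention, not the only way to do this.

```r
# Install once per machine (requires an internet connection).
# The guard skips the download when the package is already present.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load once per R session
library(tidyverse)

# List the packages that belong to the tidyverse meta-package
tidyverse_packages()
```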

Quarto

Works with R and Python

  • A scientific publishing system (workflow)
    • dashboards, manuscripts, MSWord, slides, website, e-book, PDF
  • Coding notebooks: code chunks interspersed with explanatory text (natural language)
    • Render reproducible, shareable reports
  • A next-gen (or modern) Markdown

Quarto notebook


A side-by-side view of a Quarto editor and rendered report expression

Opinionated

 

Together, the Tidyverse and Quarto are the most practical and developed reproducible workflow available for scientific analysis and publishing.

Tidy data


  • Every row is a single observation
  • Every column is a variable
  • The cells are single data values

Wide data

Code
library(tidyverse)
library(gt)
library(gtExtras)

tidyr::relig_income |> 
  gt::gt_preview() |> 
  gtExtras::gt_theme_dark()
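|       | religion           | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k | Don't know/refused |
|-------|--------------------|-------|---------|---------|---------|---------|---------|----------|-----------|-------|--------------------|
| 1     | Agnostic           | 27    | 34      | 60      | 81      | 76      | 137     | 122      | 109       | 84    | 96                 |
| 2     | Atheist            | 12    | 27      | 37      | 52      | 35      | 70      | 73       | 59        | 74    | 76                 |
| 3     | Buddhist           | 27    | 21      | 30      | 34      | 33      | 58      | 62       | 39        | 53    | 54                 |
| 4     | Catholic           | 418   | 617     | 732     | 670     | 638     | 1116    | 949      | 792       | 633   | 1489               |
| 5     | Don’t know/refused | 15    | 14      | 15      | 11      | 10      | 35      | 21       | 17        | 18    | 116                |
| 6..17 |                    |       |         |         |         |         |         |          |           |       |                    |
| 18    | Unaffiliated       | 217   | 299     | 374     | 365     | 341     | 528     | 407      | 321       | 258   | 597                |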

Tall data

Code
relig_income |> 
  pivot_longer(cols = -religion, 
               names_to = "income_category", 
               values_to = "income") |> 
  gt::gt_preview() |> 
  gtExtras::gt_theme_dark()
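|        | religion     | income_category    | income |
|--------|--------------|--------------------|--------|
| 1      | Agnostic     | <$10k              | 27     |
| 2      | Agnostic     | $10-20k            | 34     |
| 3      | Agnostic     | $20-30k            | 60     |
| 4      | Agnostic     | $30-40k            | 81     |
| 5      | Agnostic     | $40-50k            | 76     |
| 6..179 |              |                    |        |
| 180    | Unaffiliated | Don't know/refused | 597    |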
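
The jump from 18 rows to 180 rows follows from the reshape: each of the 10 income columns becomes its own row for each of the 18 religions. A small check, assuming the tidyr package is installed (the names wide and tall are only illustrative):

```r
library(tidyr)

wide <- relig_income
tall <- wide |>
  pivot_longer(cols = -religion,
               names_to = "income_category",
               values_to = "income")

dim(wide)  # 18 rows, 11 columns
dim(tall)  # 180 rows, 3 columns

# Rows in the tall table = rows in the wide table x number of pivoted columns
nrow(tall) == nrow(wide) * (ncol(wide) - 1)
```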
Code
# The five religions that get their own panel; all others are lumped into "Other"
focus_religions <- c("Evangelical Prot", "Mainline Prot", "Catholic",
                     "Unaffiliated", "Historically Black Prot")

relig_income |> 
  # Reshape from wide to tall: one row per religion and income category
  pivot_longer(cols = -religion, 
               names_to = "income_category", 
               values_to = "income") |> 
  # Move the focus religions to the front so their panels appear first
  mutate(religion = fct_relevel(religion, focus_religions)) |> 
  # Keep income categories in data order, reversed so the lowest bracket is at the top
  mutate(income_category = fct_rev(as_factor(income_category))) |>
  ggplot(aes(income, income_category)) +
  geom_col(fill = "#eee8d5") +
  # One panel per focus religion, plus an "Other" panel
  facet_wrap(vars(fct_other(religion, keep = focus_religions))) +
  # Dark theme to match the slides
  theme(plot.background = element_rect(fill = "#002b36"),
        text = element_text(color = "#eee8d5"),
        axis.text = element_text(color = "#eee8d5"), 
        panel.background = element_rect(fill = "#002b36"),
        panel.grid = element_line(color = "#002b36"),
        strip.background = element_rect(fill = "#7b9c9f"))

Code

 

relig_income |> 
  pivot_longer(cols = -religion, names_to = "income_category") |> 
  ggplot(aes(value, income_category)) +
  geom_col() +
  facet_wrap(vars(religion))

Image Credit: apreshill | CC BY 4.0 | https://github.com/apreshill/teachthat/blob/master/pivot/pivot_longer_smaller.gif

Polls

Grammar (data and graphics)

By next week you’ll have the basic building blocks to

  • Leverage reproducible data workflows: import data, analyze data, and generate visualizations.

Along the way

  • Rendering reproducible reports (Quarto)

  • Practical techniques

  • Pro tips that build fluency in reproducible data analysis

We are here to help

  • askData@duke.edu

  • https://library.duke.edu/data

  • https://is.gd/littleconsult

Let’s do it

Three things for today

  • Tour of the RStudio IDE (Projects)

  • How to import data

  • Coding notebooks
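
As a preview of importing data, here is a minimal sketch that assumes the readr package (part of the Tidyverse) is installed. The file name and its contents are hypothetical; the sketch writes a tiny CSV to a temporary folder first so it runs on its own. In a real project you would point read_csv() at a file inside your project folder instead.

```r
library(readr)

# Hypothetical data: write a tiny CSV file so the example is self-contained
csv_path <- file.path(tempdir(), "penguins_demo.csv")
writeLines(c("species,mass_g",
             "Adelie,3750",
             "Gentoo,5000",
             "Chinstrap,3700"),
           csv_path)

# read_csv() returns a tibble and guesses a type for each column
demo <- read_csv(csv_path, show_col_types = FALSE)
demo
```

Here species is read as a character column and mass_g as a numeric column, without any manual type declarations.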

Exercises

  1. https://intro2r.library.duke.edu/ > Exercises > Link out > Green Code button > Download ZIP

  2. Then, unzip (i.e., expand) the folder on your local file system

  3. Then, double-click the rforlunch_exercises.Rproj file

  4. From the RStudio Files tab, open 00_import_answers.qmd

    • The answer file is in the RStudio rforlunch_exercises project > Files Tab > Answers folder

Closing

Pipes and Assignments

 

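| Operator    | Operator name | Keystroke shortcut | Mnemonic           |
|-------------|---------------|--------------------|--------------------|
| <-          | assignment    | Alt-dash           | “Gets value from”  |
| \|> or %>%  | pipe          | Ctrl-Shift-M       | “And then”         |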
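
A short sketch of both operators in base R (version 4.1 or later for the native pipe). The variable names are only illustrative.

```r
# Assignment: x "gets value from" the vector on the right
x <- c(1, 4, 9)

# Pipe: take x, "and then" take square roots, "and then" add them up
total <- x |>
  sqrt() |>
  sum()

total  # 6

# The same computation written as nested calls, read inside-out
sum(sqrt(x))
```

The piped version reads top to bottom in the order the steps happen, which is why the pipe is read as "and then".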

Citation management

 

RStudio > Quarto Notebook > Insert > Citation

Example DOI: 10.18637/jss.v059.i10

AI-paired coding

  • Data science concepts: Microsoft Copilot (“More precise” setting)

  • Code completion: GitHub Copilot and RStudio (IDE) or VS Code (IDE)

Bye for now