3 Hands-on: reshape

Reshaping Your Data

For this project you will take data in one format and reshape into a more optimal format for analysis. We’ll assume the analysis will be done in another tool. OpenRefine can export data in multiple formats after the data has been wrangled into the proper structure. In this lab we continue exploring how OpenRefine can import various data formats, this time Excel.

Goals
1. Import Excel worksheets
2. Faceting & Filtering for select data cleaning
3. Transforming: Adding columns and using fill-down to fill in data

The changes outline above are minor changes to the shape of the data. The content of the data remains unchanged. Basically you will transform the data to remove confusion in the first column. You can see how the data will change by looking at the before and after examples, below. In the before example (table 3.1) there are different variable concepts in the first column: team and player. In the after example (table 3.2) we move each concept (variable) into its own column. I encourage you to download the Excel spreadsheet and look through the raw data. 9 Doing so will help you see how the data transformation progresses.

The video demonstrates how the data will be transformed in this exercise.

Before

Table 3.1: Unprocessed: A selective 10 rows of the 2013-2014 Salary.xls data
Cartier Martin 876332 c$754,250, r1/7, s2/1-2/18, s2/21
Mike Scott 778872 minimum
Mike Muscala 57668 signed 2/27, released 3/23
James Nunnally 57668 signed 1/11, released 2/1
Dexter Pittman 53888 c $52,017, signed 2/22, rel 2/27
NA NA NA
—- NA NA
NA NA NA
Boston Celtics NA NA
Kris Humphries 12000000 NA

After

Table 3.2: Wrangled: same 10 rows of the 2013-2014 Salary.xls data
Player Team Salary Notes
Cartier Martin Atlanta Hawks 876332 c$754,250, r1/7, s2/1-2/18, s2/21
Mike Scott Atlanta Hawks 778872 minimum
Mike Muscala Atlanta Hawks 57668 signed 2/27, released 3/23
James Nunnally Atlanta Hawks 57668 signed 1/11, released 2/1
Dexter Pittman Atlanta Hawks 53888 c $52,017, signed 2/22, rel 2/27
Kris Humphries Boston Celtics 12000000 NA
Rajon Rondo Boston Celtics 12000000 NA
Gerald Wallace Boston Celtics 10105855 NA
Jeff Green Boston Celtics 8700000 NA
Brandon Bass Boston Celtics 6450000 NA

3.1 Ingest Excel data

  1. Import Data
    • Create Project > Web Addresses (URLs) > https://github.com/libjohn/openrefine/raw/master/data/salary.xlsx
    • Next >>
    • You many want to give your project a pretty title
    • Parse data as
      1. Worksheets to Import:
        • Check “2013-2014 666 rows”
        • UnCheck “2014-2015 599 rows”
      2. UnCheck “Store blank rows”
        • Notice lines 20 & 22 disappear
      3. UnCheck “Parse Next” 1 lines as column headers
        • Notice the “Atlanta Hawks” are no longer the column header for the first column
    • Project name = salary data > Create Project
  2. Rename Columns: “Player”, “Salary”, “Notes” 10
  3. Show as: rows to ‘25’ (notice row 21)

3.2 Facets

  1. Remove all rows where ‘—-’ exist in the Player column
    • Player > Facet > Text Facet** >
      • Sort by: count > click: ‘—-’ :
      • You should now have 29 matching rows that begin ‘—-’
    • All > Edit rows > Remove all matching rows
    • Click: Remove All in the Facet/Filter sidebar

Notice: in the next step, team names precede each team roster and are followed by two blank cells in the same row. Scroll through the screens (Click “next >”) a few times; return to the first screen

  1. Make a column for team name and fill it.
    • Isolate team-name rows using a facet on the blank cells in the Salary column
      1. Salary > Facet > Customized facets > Facet by blank
          • Why do you think this doesn’t work?
          • Those cells are actually filled with whitespace – invisible to the naked eye but legitimate characters for a computer: a space (\s). They’re not actually blank and only appear blank. So, regular expressions to the rescue!

      2. Close the facet
    • Salary > Text filter > check “regular expression” > ^\s
      • \s means “a space”; ^ means “must begin the line”
    • Mouseover the the “Cleveland Cavaliers” Salary cell:
      • edit > highlight all the text EXCEPT the first space: “Tot $66,611,520” > <<cut to clipboard>> > Apply
      • BE SURE to leave a blank space where the Salary data was
    • Edit the individual Notes cell for the “Cleveland Cavaliers” cell:
      • edit > <<paste from clipboard>> “Tot $66,611,520” > Apply
      • This time you do not need a leading blank space

3.3 Text Filter

  1. Add the team name as a new column for each player then remove team name from the Player Column
    1. Player > Edit column > Add column based on this column … > New column name = Team > OK
    2. Remove All facets
    3. Team > Edit cells > Fill down
    4. Salary > Text filter > check “regular expression” > ^\s
    5. All > Edit rows > Remove all matching rows
    6. Close (or “X out”) the Salary text filter

  1. The data, “Salary.xslx”, are NBA salary data.

  2. Column Header > Edit Column > Rename this column