Cleaning Data with OpenRefine

Before

Table 3.1: Unprocessed: a selection of 10 rows of the 2013-2014 Salary.xls data
Cartier Martin	876332	c$754,250, r1/7, s2/1-2/18, s2/21
Mike Scott	778872	minimum
Mike Muscala	57668	signed 2/27, released 3/23
James Nunnally	57668	signed 1/11, released 2/1
Dexter Pittman	53888	c $52,017, signed 2/22, rel 2/27
NA	NA	NA
—-	NA	NA
NA	NA	NA
Boston Celtics	NA	NA
Kris Humphries	12000000	NA

After

Table 3.2: Wrangled: same 10 rows of the 2013-2014 Salary.xls data
Player	Team	Salary	Notes
Cartier Martin	Atlanta Hawks	876332	c$754,250, r1/7, s2/1-2/18, s2/21
Mike Scott	Atlanta Hawks	778872	minimum
Mike Muscala	Atlanta Hawks	57668	signed 2/27, released 3/23
James Nunnally	Atlanta Hawks	57668	signed 1/11, released 2/1
Dexter Pittman	Atlanta Hawks	53888	c $52,017, signed 2/22, rel 2/27
Kris Humphries	Boston Celtics	12000000	NA
Rajon Rondo	Boston Celtics	12000000	NA
Gerald Wallace	Boston Celtics	10105855	NA
Jeff Green	Boston Celtics	8700000	NA
Brandon Bass	Boston Celtics	6450000	NA

3.1 Ingest Excel data

Import Data
- Create Project > Web Addresses (URLs) > https://github.com/libjohn/openrefine/raw/master/data/salary.xlsx
- Next >>
- You may want to give your project a descriptive title
- Parse data as
  1. Worksheets to Import:
    - Check “2013-2014 666 rows”
    - UnCheck “2014-2015 599 rows”
  2. UnCheck “Store blank rows”
    - Notice lines 20 & 22 disappear
  3. UnCheck “Parse next 1 line(s) as column headers”
    - Notice the “Atlanta Hawks” are no longer the column header for the first column
- Project name = salary data > Create Project
Rename Columns: “Player”, “Salary”, “Notes”
Show as: rows to ‘25’ (notice row 21)
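
For readers who think in code, the import options above can be sketched in Python. This is only an illustration, not what OpenRefine does internally: the sample text is a hypothetical tab-separated excerpt based on Table 3.1, with empty cells standing in for the NA values, and the names `RAW`, `COLUMNS`, and `records` are made up for this sketch.

```python
import csv
import io

# Hypothetical excerpt of the sheet (assumption: blanks are empty cells).
RAW = (
    "Dexter Pittman\t53888\tc $52,017, signed 2/22, rel 2/27\n"
    "\t\t\n"
    "—-\t\t\n"
    "\t\t\n"
    "Boston Celtics\t\t\n"
    "Kris Humphries\t12000000\t\n"
)

# No line is parsed as a header, so the column names are supplied by hand,
# mirroring the "Rename Columns" step.
COLUMNS = ["Player", "Salary", "Notes"]

reader = csv.reader(io.StringIO(RAW), delimiter="\t")
records = [
    dict(zip(COLUMNS, cells))
    for cells in reader
    if any(cell.strip() for cell in cells)  # skip rows that are entirely blank
]

for record in records:
    print(record)
```

The separator row and the team-name row survive this step on purpose; they are handled later with facets and filters.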

3.2 Facets

Remove all rows where ‘—-’ exist in the Player column
- Player > Facet > Text facet >
  - Sort by: count > click: ‘—-’ :
  - You should now have 29 matching rows that begin ‘—-’
- All > Edit rows > Remove all matching rows
- Click: Remove All in the Facet/Filter sidebar
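
As a rough analogy, a text facet counts the distinct values in a column, and "Remove all matching rows" drops the rows selected by that facet. The sketch below shows the idea in plain Python on a few hypothetical rows modeled on Table 3.1; `SEPARATOR`, `rows`, `facet`, and `cleaned` are illustrative names, not OpenRefine features.

```python
from collections import Counter

SEPARATOR = "—-"  # the value selected in the facet

# Hypothetical (Player, Salary, Notes) rows; None stands in for a blank cell.
rows = [
    ("Dexter Pittman", "53888", "c $52,017, signed 2/22, rel 2/27"),
    ("—-", None, None),
    ("Boston Celtics", None, None),
    ("Kris Humphries", "12000000", None),
]

# Text facet: how often does each distinct Player value occur?
facet = Counter(player for player, _salary, _notes in rows)

# Remove all matching rows: keep everything the facet selection did not match.
cleaned = [row for row in rows if row[0] != SEPARATOR]

print(facet[SEPARATOR], "separator row(s) removed;", len(cleaned), "rows remain")
```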

Notice: team names precede each team roster and are followed by two blank cells in the same row; the next step relies on this. Scroll through the screens (Click “next >”) a few times, then return to the first screen.

Make a column for team name and fill it.
- Isolate team-name rows using a facet on the blank cells in the Salary column
  1. Salary > Facet > Customized facets > Facet by blank
  2. Close the facet
- Salary > Text filter > check “regular expression” > ^\s
  - \s matches any whitespace character, such as a space; ^ anchors the match to the start of the cell value
- Mouse over the “Cleveland Cavaliers” Salary cell:
  - edit > highlight all the text EXCEPT the first space: “Tot $66,611,520” > <<cut to clipboard>> > Apply
  - BE SURE to leave a blank space where the Salary data was
- Edit the individual Notes cell for the “Cleveland Cavaliers” cell:
  - edit > <<paste from clipboard>> “Tot $66,611,520” > Apply
  - This time you do not need a leading blank space
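
To see what the ^\s filter selects, here is a small Python sketch of the same regular expression. It assumes, as the instruction to leave a blank space suggests, that team-name rows hold a single space in the Salary column; the cell values and variable names are hypothetical.

```python
import re

# ^ anchors at the start of the value; \s matches one whitespace character.
LEADING_WHITESPACE = re.compile(r"^\s")

# Hypothetical Salary cells (assumption: team-name rows contain a lone space).
salary_cells = ["876332", " ", " Tot $66,611,520", "12000000"]

# Cells the text filter would keep on screen.
matching = [cell for cell in salary_cells if LEADING_WHITESPACE.search(cell)]

# The manual fix for the Cleveland Cavaliers row: the leading space stays in
# Salary, and the remaining text moves to Notes.
stray = " Tot $66,611,520"
salary_fixed, notes_fixed = stray[:1], stray[1:]

print(matching)
print(repr(salary_fixed), repr(notes_fixed))
```

Keeping the leading space matters because the same filter is used again later to find the team-name rows.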

3.3 Text Filter

Add the team name as a new column for each player, then remove the team-name rows from the Player column
1. Player > Edit column > Add column based on this column … > New column name = Team > OK
2. Remove All facets
3. Team > Edit cells > Fill down
4. Salary > Text filter > check “regular expression” > ^\s
5. All > Edit rows > Remove all matching rows
6. Close (or “X out”) the Salary text filter
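
The six steps above can be sketched in Python to make the logic explicit. This is an illustration under assumptions, not OpenRefine's implementation: the rows are a hypothetical sample modeled on Table 3.2, team-name rows are assumed to carry a single space in Salary, and `is_team_row` is a helper invented for this sketch.

```python
import re

TEAM_ROW = re.compile(r"^\s")  # same pattern as the Salary text filter

# Hypothetical sample (assumption: team rows have a lone space in Salary).
rows = [
    {"Player": "Atlanta Hawks", "Salary": " ", "Notes": ""},
    {"Player": "Mike Scott", "Salary": "778872", "Notes": "minimum"},
    {"Player": "Mike Muscala", "Salary": "57668", "Notes": "signed 2/27, released 3/23"},
    {"Player": "Boston Celtics", "Salary": " ", "Notes": ""},
    {"Player": "Kris Humphries", "Salary": "12000000", "Notes": ""},
]


def is_team_row(row):
    """True when the Salary cell starts with whitespace."""
    return bool(TEAM_ROW.search(row["Salary"]))


# Step 1: with the filter active, only team rows receive a Team value.
for row in rows:
    row["Team"] = row["Player"] if is_team_row(row) else None

# Step 3: fill down, copying the last seen team into the blank cells below it.
current_team = None
for row in rows:
    if row["Team"] is None:
        row["Team"] = current_team
    else:
        current_team = row["Team"]

# Steps 4-5: filter on the team rows again and remove them.
players = [row for row in rows if not is_team_row(row)]

for row in players:
    print(row["Player"], "|", row["Team"], "|", row["Salary"])
```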
