3 Hands-on: reshape
Reshaping Your Data
For this project you will take data in one format and reshape into a more optimal format for analysis. We’ll assume the analysis will be done in another tool. OpenRefine can export data in multiple formats after the data has been wrangled into the proper structure. In this lab we continue exploring how OpenRefine can import various data formats, this time Excel.
Goals
1. Import Excel worksheets
2. Faceting & Filtering for select data cleaning
3. Transforming: Adding columns and using fill-down to fill in data
The changes outline above are minor changes to the shape of the data. The content of the data remains unchanged. Basically you will transform the data to remove confusion in the first column. You can see how the data will change by looking at the before and after examples, below. In the before example (table 3.1) there are different variable concepts in the first column: team and player. In the after example (table 3.2) we move each concept (variable) into its own column. I encourage you to download the Excel spreadsheet and look through the raw data. 9 Doing so will help you see how the data transformation progresses.
Before
Cartier Martin | 876332 | c$754,250, r1/7, s2/1-2/18, s2/21 |
Mike Scott | 778872 | minimum |
Mike Muscala | 57668 | signed 2/27, released 3/23 |
James Nunnally | 57668 | signed 1/11, released 2/1 |
Dexter Pittman | 53888 | c $52,017, signed 2/22, rel 2/27 |
NA | NA | NA |
—- | NA | NA |
NA | NA | NA |
Boston Celtics | NA | NA |
Kris Humphries | 12000000 | NA |
After
Player | Team | Salary | Notes |
---|---|---|---|
Cartier Martin | Atlanta Hawks | 876332 | c$754,250, r1/7, s2/1-2/18, s2/21 |
Mike Scott | Atlanta Hawks | 778872 | minimum |
Mike Muscala | Atlanta Hawks | 57668 | signed 2/27, released 3/23 |
James Nunnally | Atlanta Hawks | 57668 | signed 1/11, released 2/1 |
Dexter Pittman | Atlanta Hawks | 53888 | c $52,017, signed 2/22, rel 2/27 |
Kris Humphries | Boston Celtics | 12000000 | NA |
Rajon Rondo | Boston Celtics | 12000000 | NA |
Gerald Wallace | Boston Celtics | 10105855 | NA |
Jeff Green | Boston Celtics | 8700000 | NA |
Brandon Bass | Boston Celtics | 6450000 | NA |
3.1 Ingest Excel data
- Import Data
https://github.com/libjohn/openrefine/raw/master/data/salary.xlsx
- You many want to give your project a pretty title
- Parse data as
- Worksheets to Import:
- Check “2013-2014 666 rows”
- UnCheck “2014-2015 599 rows”
- UnCheck “Store blank rows”
- Notice lines 20 & 22 disappear
- Notice lines 20 & 22 disappear
- UnCheck “Parse Next” 1 lines as column headers
- Notice the “Atlanta Hawks” are no longer the column header for the first column
- Notice the “Atlanta Hawks” are no longer the column header for the first column
- Worksheets to Import:
- Project name =
salary data
- Rename Columns: “Player”, “Salary”, “Notes” 10
- Show as: rows to ‘25’ (notice row 21)
3.2 Facets
- Remove all rows where ‘—-’ exist in the Player column
- Sort by: count > click: ‘—-’ :
- You should now have 29 matching rows that begin ‘—-’
- Sort by: count > click: ‘—-’ :
- Click: Remove All in the Facet/Filter sidebar
Notice: in the next step, team names precede each team roster and are followed by two blank cells in the same row. Scroll through the screens (Click “next >”) a few times; return to the first screen
- Make a column for team name and fill it.
- Isolate team-name rows using a facet on the blank cells in the Salary column
- Why do you think this doesn’t work?
Those cells are actually filled with whitespace – invisible to the naked eye but legitimate characters for a computer: a space (
\s
). They’re not actually blank and only appear blank. So, regular expressions to the rescue!
- Close the facet
\s
means “a space”;^
means “must begin the line”
- Mouseover the the “Cleveland Cavaliers” Salary cell:
- edit > highlight all the text EXCEPT the first space: “Tot $66,611,520” > <<cut to clipboard>> > Apply
- BE SURE to leave a blank space where the Salary data was
- edit > highlight all the text EXCEPT the first space: “Tot $66,611,520” > <<cut to clipboard>> > Apply
- Edit the individual Notes cell for the “Cleveland Cavaliers” cell:
- edit > <<paste from clipboard>> “Tot $66,611,520” > Apply
- This time you do not need a leading blank space
- edit > <<paste from clipboard>> “Tot $66,611,520” > Apply
- Isolate team-name rows using a facet on the blank cells in the Salary column
3.3 Text Filter
- Add the team name as a new column for each player then remove team name from the Player Column
- Remove All facets
- Close (or “X out”) the Salary text filter