Cleaning Data with OpenRefine

1.1 Getting Started

We’ll use a subset⁴ of the Raleigh Building Permits data.

Launch OpenRefine from your computer (find and double-click the jewel icon).
- Installations / Start / Stop instructions
- Owen Stephens’s helpful video illustrating installation
- Remember: OpenRefine’s user interface runs in a web browser (Chrome or Firefox)
  - If your default browser is one of these, Refine will auto-launch to http://127.0.0.1:3333
  - If your default browser is IE, you’ll need to open the following URL http://127.0.0.1:3333 in Chrome or Firefox
Create Project > Web Addresses (URLs) > https://raw.githubusercontent.com/libjohn/openrefine/master/data/subset-RBP-narrow.csv
Click Next >>
Select: Columns are separated by “commas (CSV)”
Change the Project Name to Raleigh Building Permits and click Create Project >> (top-right)

1.2 Shutting Down OpenRefine

It’s IMPORTANT to shut down the application properly. OpenRefine automatically saves your project as you transform your data. However, in my experience your last operation may have to be saved manually by following the procedures below…

Windows: Press Control-C in the OpenRefine command window
Mac: Click the OpenRefine app in the Dock and choose Quit

NOTE: You may lose data if you don’t follow these rather unintuitive shutdown procedures. Better safe than sorry.

1.3 Facets & Cluster

Mass Editing

It’s important to understand that OpenRefine was designed to transform data in bulk. It is possible to edit single data cells, but it is not as convenient as in some other, more WYSIWYG, tools. This exercise will help you learn how to accomplish these kinds of mass data transformations.

Make a Text facet on the work_type_description column
There are two facets for new buildings: “NEW BUILDING” and “New Building”.
Select the “NEW BUILDING” facet, which limits the view to 3 matching rows. To the right of the “NEW BUILDING” facet, hover your mouse over “edit”; click “edit”, change the text to title case (“New Building”), and click Apply
Mass edit “OTHER” & “Other” so they have the same value
Mass edit “ALTERATIONS/REPAIRS” and “Alterations/repairs” so they have the same value
Click “Remove All” to remove the facet window
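
For readers who think in code, the mass edit above can be sketched in Python. This is only an illustration of the idea, not what OpenRefine runs internally. The sample values and the choice of which spelling to keep for “Other” and “Alterations/repairs” are assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample values standing in for the work_type_description column.
work_types = [
    "NEW BUILDING", "New Building", "OTHER", "Other",
    "ALTERATIONS/REPAIRS", "Alterations/repairs", "New Building",
]

# Variant spelling -> preferred spelling, mirroring the "edit" action on a
# facet entry. The preferred spellings here are an assumption for illustration.
replacements = {
    "NEW BUILDING": "New Building",
    "OTHER": "Other",
    "ALTERATIONS/REPAIRS": "Alterations/repairs",
}


def mass_edit(values, mapping):
    """Replace every value found in mapping and leave the others untouched."""
    return [mapping.get(value, value) for value in values]


cleaned = mass_edit(work_types, replacements)

# Comparable to the per-value counts that a text facet displays.
facet_counts = Counter(cleaned)
```

After the edit, the counts contain one entry per work type instead of two, which is what the facet window shows once the variants are merged.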

1.4 Split data in cells

address > Edit column > Split into several columns…
- Separator = ( > OK
address 2 > Edit column > Split into several columns…
- Separator = , (i.e. accept default and click) > OK
address 2 1 > Edit column > rename this column
- latitude
address 2 2 > Edit column > rename this column
- longitude
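
The two splits above can be sketched in Python as follows. The sample address is hypothetical and assumes the cell holds a street address followed by coordinates in parentheses; the real column values may differ.

```python
def split_address(address):
    """Split 'street (lat, lon)' into three raw pieces.

    Mirrors the two 'Split into several columns' steps: first on '(' and
    then on ','. Nothing is trimmed or cleaned yet.
    """
    street, coordinates = address.split("(", 1)
    latitude, longitude = coordinates.split(",", 1)
    return street, latitude, longitude


# Hypothetical cell value, for illustration only.
street, latitude, longitude = split_address("1234 EXAMPLE ST (35.7796, -78.6382)")
```

Note that the longitude piece still carries a leading space and the closing parenthesis, which is why further cleanup is needed later.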

(more data transformation could be done, but let’s move on for now…)

1.5 Concatenate cells together

square_feet > Edit column > Add column based on this column…
1. New column name = Full Description
2. Expression = value + cells["proposed_work"].value

The last step adds two columns together, but the preview screen is hard to read. Make it readable by using the next expression instead …

Expression = value + " sq ft. " + cells["proposed_work"].value > OK
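
As a rough Python equivalent of the two expressions above, assuming a row is represented as a dictionary (the sample values are hypothetical):

```python
# Hypothetical row; in OpenRefine, "value" is the cell of the column you
# started from (square_feet) and cells["proposed_work"].value is its neighbour.
row = {"square_feet": 250, "proposed_work": "ADD 3 SEASON ROOM"}

value = row["square_feet"]

# First expression: the two values run together and are hard to read.
# str() is needed in Python because square_feet is a number here.
unreadable = str(value) + row["proposed_work"]

# Second expression: a literal separator makes the result readable.
full_description = str(value) + " sq ft. " + row["proposed_work"]
```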

1.6 Search & Replace, Plus More

Looking at the latitude and longitude cells, one column appears in green text (indicating OpenRefine considers the data in those cells to be numbers) and one column appears in black with a closing parenthesis in the last position. Convert both columns to text, trim leading and trailing spaces, and then find and replace the parenthesis.

Convert Data Types

latitude > Edit cells > common transformations > To text

Remove Whitespace

longitude > Edit cells > common transformations > Trim leading and trailing whitespace

Search & Replace

Search and replace is commonly performed as a data transformation using the following function: value.replace("old text","new text"). In the example below we replace a closing parenthesis with nothing, effectively removing the trailing parenthesis. The example may look strange because the character being replaced is itself a parenthesis sitting inside the function’s parentheses. Remember: the text you are replacing is identified within the first set of quotation marks, and the replacement text within the second set. I’ve drawn red circles around the function, as well as the before and after text previews, to clarify how the process works.

longitude > Edit cells > Transform…
- Expression = value.replace(")","")
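
The three cleanup steps in this section (to text, trim, replace) have close analogues in Python. The sketch below uses hypothetical cell values and is meant only to show what each step does to the data.

```python
# Hypothetical cell values as they might look after the earlier split.
latitude = 35.7796          # shown in green in OpenRefine: stored as a number
longitude = " -78.6382)"    # text with a leading space and a trailing ")"

# "To text": turn the number into a string.
latitude_text = str(latitude)

# "Trim leading and trailing whitespace".
longitude_trimmed = longitude.strip()

# value.replace(")", ""): the first argument is the text to find and the
# second is what to put in its place (here, nothing).
longitude_clean = longitude_trimmed.replace(")", "")
```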

1.7 Web Scraping

Select a subset

We want to gather the FIPS code for a subset of the data. The government server returns data in a JSON format so we’ll parse the data after we retrieve it. First we’ll subset our dataset for expediency. This limits our waiting time during the workshop.

issue_date > Facet > Custom text facet…
- expression = value.slice(6,10)
select the “2014” facet
authorized_work > Facet > Text facet
select the “3 SEASON ROOM” facet

You should now have 6 matching rows.
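
The custom facet works by cutting characters 6 through 9 out of the date text. A minimal Python sketch of the same subsetting follows. It assumes issue_date is text in MM/DD/YYYY form, which is what value.slice(6,10) implies, and the rows are invented for illustration.

```python
# Hypothetical rows; only the two columns used for faceting are shown.
rows = [
    {"issue_date": "03/14/2014", "authorized_work": "3 SEASON ROOM"},
    {"issue_date": "07/02/2013", "authorized_work": "3 SEASON ROOM"},
    {"issue_date": "11/20/2014", "authorized_work": "DECK"},
]


def year_of(issue_date):
    """Equivalent of value.slice(6,10): characters at positions 6, 7, 8, 9."""
    return issue_date[6:10]


# Selecting both facets keeps only rows that match every selected value.
subset = [
    row for row in rows
    if year_of(row["issue_date"]) == "2014"
    and row["authorized_work"] == "3 SEASON ROOM"
]
```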

API

Now let’s fetch the data from an API made available via the National Broadband Map. This API returns a FIPS code if we give it a county name (or in this case, even a partial county name.)

Fetch JSON data from the National Broadband Map. We’ll use the documentation for the Geography by Name API, which returns the Census geography for a geography name (e.g. Durham)
- Documentation
  - The documentation informs us that the format of the URL we want to construct is as follows: http://www.broadbandmap.gov/broadbandmap/census/county/durh?format=json
  - Notice the data values in the “county” column. All we do is construct a URL that inserts the cell value from each row of the “county” column
- county > Edit column > Add column by fetching URLs…
- New column name = JSON data
- Throttle delay = 2000
- Expression =
  'https://www.broadbandmap.gov/broadbandmap/census/county/'+value+'?format=json'
- OK
  - Wait for the results. If you limited to the matching rows in the select subset section this will only take a few seconds.
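
The “Add column by fetching URLs” step can be pictured as the loop below. This is a sketch only: the fetch function is passed in as a parameter so that nothing here assumes the service is reachable, and the percent-encoding via quote() is an addition for safety that the expression above does not perform. The names build_url and fetch_column are hypothetical.

```python
import time
from urllib.parse import quote

BASE_URL = "https://www.broadbandmap.gov/broadbandmap/census/county/"


def build_url(county):
    """Same concatenation as the expression above, with the county name
    percent-encoded (an extra step not present in the original expression)."""
    return BASE_URL + quote(county) + "?format=json"


def fetch_column(counties, fetch, delay_seconds=2.0):
    """Call fetch(url) once per row, pausing between requests.

    delay_seconds=2.0 corresponds to a throttle delay of 2000 milliseconds.
    """
    results = []
    for index, county in enumerate(counties):
        if index > 0:
            time.sleep(delay_seconds)  # be polite to the server
        results.append(fetch(build_url(county)))
    return results


# Example with a stand-in fetcher that just echoes the URL it was given.
urls = fetch_column(["Durham", "Wake"], fetch=lambda url: url, delay_seconds=0)
```

To retrieve real data you would pass a function that performs the HTTP request in place of the stand-in fetcher.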

Parse

Now parse the value of the JSON data “fips” element; call the “fips” key when traversing the “county” objects from the Results set.

JSON data > Edit column > Add column based on this column …
- New column name = FIPS Code
- expression = value.parseJson().Results.county[0].fips
- Note the square-bracket notation ([0]) following the parseJson() call: it identifies the first element of the county array. It’s the first element because OpenRefine counts from zero (e.g. 0,1,2,3,4,5). The county array in the example below holds only one element (an object of four named key/value pairs, of which fips is one key). Since the JSON notation indicates county is an array, in this case of length 1, we identify its first element by the number ‘0’. See the JSON example below.

JSON Data Example

JSON⁵ (JavaScript Object Notation) is a text format for wrapping structured data. The API, in this case, returns its data in JSON format.

{
  "status": "OK",  
  "responseTime": 14,  
  "message": [  ],  
  "Results": {  
    "county": [  
      {  
        "geographyType": "COUNTY2010",  
        "stateFips": "37",  
        "fips": "37063",  
        "name": "Durham"  
      }  
    ]  
  }  
}
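
To see how value.parseJson().Results.county[0].fips walks through this structure, here is the same traversal sketched in Python using the response above. The function name extract_fips and the choice to return None for an empty county array are assumptions made for the example.

```python
import json

# The example response shown above, as the text the API would return.
response_text = """
{
  "status": "OK",
  "responseTime": 14,
  "message": [],
  "Results": {
    "county": [
      {
        "geographyType": "COUNTY2010",
        "stateFips": "37",
        "fips": "37063",
        "name": "Durham"
      }
    ]
  }
}
"""


def extract_fips(text):
    """Parse the JSON text and return the fips value of the first county.

    Returns None when the county array is empty (an assumption about how
    a no-match response should be handled).
    """
    data = json.loads(text)                # like value.parseJson()
    counties = data["Results"]["county"]   # like .Results.county
    if not counties:
        return None
    return counties[0]["fips"]             # like [0].fips


fips_code = extract_fips(response_text)
```

The [0] index picks the first (and here only) object in the county array, and the "fips" key then picks one of its four values.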