2 Hands-on: GREL

The goal of this project is to create custom facets and perform basic transformations, introduce you to GREL – the General Refine Expression Language – and develop practical skills in transforming and normalizing data. You’ve used GREL several times getting to this point in this workbook, but lets look a little deeper.

In this example you will open a new data set to practice importing a different type of data, this time comma separated value, or CSV. Last time you imported a tab separated value, or TSV. OpenRefine will important many types of data. Being aware of different data formats, and having the ability to open those data are valuable skills. In fact, one of the most frequent stumbling blocks when using new data tools is the simple problem of importing the data. Practice and learn…

2.1 Custom Facets

Download new Data then Create a new OpenRefine Project:

We’ll use a subset6 of data from the Raleigh Open Data portal: “Raleigh Police Incident Data 2005plus2014.csv”.

  1. Import Data
    • Create Project > Web Addresses (URLs) > https://github.com/libjohn/openrefine/raw/master/data/Raleigh%20Police%20Incident%20Data%202005plus2014.csv
    • Next >>
    • You many want to give your project a pretty title
    • Create Project >>
  2. Isolate data: you have incident data without location information
    • Location > Facet > Customized facets > Facet by blank
    • Click true (True = cell is blank ; False = cell has data)
      • How many matching rows have no location data?
      • 10,576

Explanation
In the step above, you limited your data set to find all the police incident data where there is no location information. That is, when the cell is blank there is no location information.

Explanation for the next step: You’ll create a custom facet (by Year) without altering any data. To do this, you’ll use a very simple expression which takes a slice of data based on fixed position, starting in position 6 (counting begins at 0) and ending four character positions later at 10* (6 + 4=10). Given the data, find the year value within the “INC DATETIME” field.

DATA:      12/31/2014 11:05:00 PM
Position:  0123  6    1        2
                      1        0

For Example…
- To take the year “2014”, use expression: value[6,10]7
- To take the hour, use value[11,13]
- To take PM (or AM), use value[20,22]

  1. Create a Custom Facet to Sort by Year
    • INC DATETIME > Facet > Custom text facet …
    • Expression = value[6,10] > OK
  1. Sort the facet by Count
    • Which year has the most missing data?
    • 2005

    • How many rows match that year?
    • 10,007 rows with missing location data

    • Click “Remove All” in the Facet/Filter sidebar
  1. Keep only the data with location values, i.e. delete all rows that have
    no location data…
    • Location > Facet > Customized facets > Facet by blank
    • Limit to “true
    • All > Edit rows > Remove all matching rows
    • Close (“X out”) the facets

2.2 GREL to Transform and Normalize

The General Refine Expression Language (GREL) is a powerful and extensible language to manipulate data. In these next steps we will learn GREL by using practical steps to improve the structure of the data.

  1. Split the LOCATION Column into two columns (Latitude and Longitude)
    • LOCATION > Edit column > Split into several columns… > OK (i.e. Accept the defaults)
    • Rename Columns
      • Location 1 > Edit column > Rename this column > Latitude
      • Location 2 > Edit column > Rename this column > Longitude
    • Remove parenthesis and trim whitespace
      • Latitude > Edit cells > Transform …
        1. Expression = value.replace("(","") > OK
        2. Latitude > Edit cells > Common transformations > Trim leading and trailing whitespace
      • Longitude > Edit cells > Transform …
        1. Expression = value.replace(")","") > OK
        2. Longitude > Edit cells > Common transformations > Trim leading and trailing whitespace

Regular Expression

Regular Expressions are very powerful and flexible codes used for matching patterns. Often there is more than one way to compose a regex pattern-match. Importantly for OpenRefine, much of Refine’s extensible and advanced power comes from regular expressions. Essentially the key to advanced level OpenRefine is regular expressions and looping. To learn more about regular expressions see my handout on regex. For now you can refer to these few commands and symbols summed up on this quick-sheet.

  1. Create New Column from Existing Column using a regular expression to match a pattern
    • INC DATETIME > Edit column > Add column based on this column …
      1. New column name = YEAR
      2. Expression = value.match(/.*\/(\d\d\d\d).*/)[0] > OK8

Filter Text

  1. Create a Text Filter: Explore your data to find how many incident descriptions involve bicycles
    1. LCR DESC >Text filter
    2. In the filter box, enter the term bicycle
    3. How many matching rows of data use the word “bicycle” in the description?
    4. 251

Facet

  1. Make a Text Facet on the LCR DESC column
    1. LCR DESC > Facet > Text facet
    2. How many Bicycle categories exist in the facet window?
    3. 6

    4. Which facet is most often used “LCR DESC” facet?
    5. LARCENY/BICYCLES ($50-$199) - 119

Cluster

  1. Look for Cluster variations in the FELONY Column
    • Clustering refers to the operation of finding groups of different values that might be alternative representations of the same thing. For example “Gödel” and “Godel”. This is a handy way to find spelling variations.
    • FELONY > Facet > Text facet
    • FELONY > Edit cells > Cluster and edit …
      • Method = “key collision” ; Key Function = “metaphone3”
    • For each row, check “Merge?” and change the “New Cell Value”
      1. to Felony in the first row
      2. to Not Felony in the second row
      3. click the “Merge Selected & Re-Cluster” button
      4. Try other “Methods” and “Keying Functions”. “Merge Selected & Re-Cluster” after each operation
      5. Close
        • How many “Not Felony” bicycle larcenies were recorded?
        • There are 234 “Not Felony” items after clustering using both “key collision” and “nearest neighbor” methods of clustering

Mass Editing via Facets

  1. Mass Edit cells in LCR DESC Facet: LCR DESC > Facet > Text facet
    • Mouseover & Edit “LARCENY/BICYCLE/FELONY/$200- …”
    • delete “/FELONY”
    • Add an “S” to ‘BICYCLE’ > Apply
    • How many rows match the “LARCENY/BICYCLES ($200-$1000)” Category
    • There are 115 observatins (or rows) matching “LARCENY/BICYCLES ($200-$1,000)” category

  2. Export as an Excel File
    • Export > Excel

  1. Original Data Source: Raleigh Open Data

  2. You may notice what seems to be either a math or logic error in how this second value is “counted.” The simple explanation is the first number counts from zero. Add to that number the string length counting from 1 to get the second value.

  3. Here’s another GREL example that captures the same data: value.match(/(\d\d)\/\d{2}\/(\d{4})\s.*\s([A|P][M])/)[1]