2 Hands-on: GREL
The goal of this project is to create custom facets and perform basic transformations, introduce you to GREL – the General Refine Expression Language – and develop practical skills in transforming and normalizing data. You’ve used GREL several times getting to this point in this workbook, but lets look a little deeper.
In this example you will open a new data set to practice importing a different type of data, this time comma separated value, or CSV. Last time you imported a tab separated value, or TSV. OpenRefine will important many types of data. Being aware of different data formats, and having the ability to open those data are valuable skills. In fact, one of the most frequent stumbling blocks when using new data tools is the simple problem of importing the data. Practice and learn…
2.1 Custom Facets
Download new Data then Create a new OpenRefine Project:
We’ll use a subset6 of data from the Raleigh Open Data portal: “Raleigh Police Incident Data 2005plus2014.csv”.
- Import Data
https://github.com/libjohn/openrefine/raw/master/data/Raleigh%20Police%20Incident%20Data%202005plus2014.csv
- You many want to give your project a pretty title
- Create Project >>
- Isolate data: you have incident data without location information
- Click true (True = cell is blank ; False = cell has data)
- How many matching rows have no location data?
10,576
Explanation
In the step above, you limited your data set to find all the police incident data where there is no location information. That is, when the cell is blank there is no location information.
Explanation for the next step: You’ll create a custom facet (by Year) without altering any data. To do this, you’ll use a very simple expression which takes a slice of data based on fixed position, starting in position 6 (counting begins at 0) and ending four character positions later at 10* (6 + 4=10). Given the data, find the year value within the “INC DATETIME” field.
DATA: 12/31/2014 11:05:00 PM Position: 0123 6 1 2 1 0
For Example…
- To take the year “2014”, use expression:value[6,10]
7
- To take the hour, usevalue[11,13]
- To take PM (or AM), usevalue[20,22]
- Create a Custom Facet to Sort by Year
- Expression =
value[6,10]
- Sort the facet by Count
- Which year has the most missing data?
- How many rows match that year?
- Click “Remove All” in the Facet/Filter sidebar
2005
10,007 rows with missing location data
- Keep only the data with location values, i.e. delete all rows that have
no location data…
- Limit to “true”
- Close (“X out”) the facets
2.2 GREL to Transform and Normalize
The General Refine Expression Language (GREL) is a powerful and extensible language to manipulate data. In these next steps we will learn GREL by using practical steps to improve the structure of the data.
- Split the LOCATION Column into two columns (Latitude and Longitude)
(i.e. Accept the defaults)- Rename Columns
- Remove parenthesis and trim whitespace
- Expression =
value.replace("(","")
- Expression =
- Expression =
value.replace(")","")
- Expression =
Regular Expression
Regular Expressions are very powerful and flexible codes used for matching patterns. Often there is more than one way to compose a regex pattern-match. Importantly for OpenRefine, much of Refine’s extensible and advanced power comes from regular expressions. Essentially the key to advanced level OpenRefine is regular expressions and looping. To learn more about regular expressions see my handout on regex. For now you can refer to these few commands and symbols summed up on this quick-sheet.
- Create New Column from Existing Column using a regular expression to match a pattern
- New column name =
YEAR
- Expression =
value.match(/.*\/(\d\d\d\d).*/)[0]
> 8
- New column name =
Filter Text
- Create a Text Filter: Explore your data to find how many incident descriptions involve bicycles
-
In the filter box, enter the term
bicycle
- How many matching rows of data use the word “bicycle” in the description?
251
Facet
- Make a Text Facet on the LCR DESC column
- How many Bicycle categories exist in the facet window?
- Which facet is most often used “LCR DESC” facet?
6
LARCENY/BICYCLES ($50-$199) - 119
Cluster
- Look for Cluster variations in the FELONY Column
- Clustering refers to the operation of finding groups of different values that might be alternative representations of the same thing. For example “Gödel” and “Godel”. This is a handy way to find spelling variations.
- Method = “key collision” ; Key Function = “metaphone3”
- Method = “key collision” ; Key Function = “metaphone3”
- For each row, check “Merge?” and change the “New Cell Value”
- to
Felony
in the first row
- to
Not Felony
in the second row
- click the “Merge Selected & Re-Cluster” button
- Try other “Methods” and “Keying Functions”. “Merge Selected & Re-Cluster” after each operation
- Close
- How many “Not Felony” bicycle larcenies were recorded?
There are 234 “Not Felony” items after clustering using both “key collision” and “nearest neighbor” methods of clustering
- to
- Clustering refers to the operation of finding groups of different values that might be alternative representations of the same thing. For example “Gödel” and “Godel”. This is a handy way to find spelling variations.
Mass Editing via Facets
- Mass Edit cells in LCR DESC Facet:
- Mouseover & Edit “LARCENY/BICYCLE/FELONY/$200- …”
- delete “/FELONY”
- Add an “S” to ‘BICYCLE’ > Apply
- Mouseover & Edit “LARCENY/BICYCLE/FELONY/$200- …”
- How many rows match the “LARCENY/BICYCLES ($200-$1000)” Category
There are 115 observatins (or rows) matching “LARCENY/BICYCLES ($200-$1,000)” category
- Export as an Excel File
Original Data Source: Raleigh Open Data↩
You may notice what seems to be either a math or logic error in how this second value is “counted.” The simple explanation is the first number counts from zero. Add to that number the string length counting from 1 to get the second value.↩
Here’s another GREL example that captures the same data:
value.match(/(\d\d)\/\d{2}\/(\d{4})\s.*\s([A|P][M])/)[1]
↩