Demonstration
Import Data
https://raw.githubusercontent.com/libjohn/openrefine/master/data/bicycle-subset-phm-collection.tsv
- Uncheck - “Quotation marks are used to enclose” (bottom-right)
- Project Name to “bicycle categories”
Facets
- Notice mashup of “pipe delimited” text facets
- Why are there Pipe delimiters?
- How many Category choices exist?
- Switch between the “name” and “count” sorting options.
- Export a CSV file
- How many rows will export: 4, 16, 72, 180?
Rows vs. Records
Row v Records 2
|
- Now how many Category choices exist?
- Click records to switch to the records view
- Click count What is the most popular Category term?
- Click name Limit to the “Juvenilia” facet; How many matching records?
separator =
Web Scrape & API
Scraping data from an API or Webpage
- New column name =
Web Data
- Notice Throttle delay
2000
milliseconds is good. Less than 2 seconds and you might get booted, blocked, cast out!
- Click OK and wait
- When done, you’ve got all the source data for each page associated with the link in the “Persistent Link” column
Parsing Data
A Brief Introduction
- Look fora big block of gobbly text in the “Web Data” column. That is HTML formatted data
- Typically, after fetching data, you’ll need to parse it
- You can use GREL 3 to parse data and isolate a specific bit of desired information
- GREL is the advanced power of OpenRefine. We will come back to this
- For example, you can transform the data in the retrieved cells with a regular expression to gather only the title of each object
- New Column Name =
Web Page Title
- Expression =
value.parseHtml().select("h1.object-page__title")[0].htmlText()
- New Column Name =
Split
remove faceting: click
in the facet sidebar- Seperator =
-
(i.e. “space dash space”)
- Seperator =
Concatenate
- GREL Expression =
value + " x " + cells["Width"].value
Find & Replace
- GREL Expression =
value.replace("mm","")
Facet by Blank
Text filter
- Clear all facets
- Title =
bmx
Custom Facet
- Production date
value.match(/(\d\d).*/)[0]