Demonstration

Import Data

  1. Create Project > Web Addresses (URLs) > https://raw.githubusercontent.com/libjohn/openrefine/master/data/bicycle-subset-phm-collection.tsv
  2. Uncheck - “Quotation marks are used to enclose” (bottom-right)
  3. Project Name to “bicycle categories”

Facets

  • Categories > Facet > Text facet
    1. Notice mashup of “pipe delimited” text facets
    2. Why are there Pipe delimiters?
    3. How many Category choices exist?
    4. Switch between the “name” and “count” sorting options.
  • Export a CSV file
  • Export > Comma-separated value
  • How many rows will export: 4, 16, 72, 180?

Rows vs. Records

  • Row v Records 2

  • Categories > edit cells > split multi-valued cells > separator = |
    1. Now how many Category choices exist?
    2. Click records to switch to the records view
    3. Click count What is the most popular Category term?
    4. Click name Limit to the “Juvenilia” facet; How many matching records?

Web Scrape & API

Scraping data from an API or Webpage

  • new_link > Edit column > Add column by fetching URLs…
  • New column name = Web Data
  • Notice Throttle delay
    1. 2000 milliseconds is good. Less than 2 seconds and you might get booted, blocked, cast out!
  • Click OK and wait
  • When done, you’ve got all the source data for each page associated with the link in the “Persistent Link” column

Parsing Data

A Brief Introduction

  • Look fora big block of gobbly text in the “Web Data” column. That is HTML formatted data
  • Typically, after fetching data, you’ll need to parse it
  • You can use GREL 3 to parse data and isolate a specific bit of desired information
  • For example, you can transform the data in the retrieved cells with a regular expression to gather only the title of each object
  • Web Data > Edit column > Add column based on this column…
    1. New Column Name = Web Page Title
    2. Expression = value.parseHtml().select("h1.object-page__title")[0].htmlText()

Split

  • remove faceting: click Remove All in the facet sidebar

  • Production Date > Edit column > Split into several columns…
    1. Seperator = - (i.e. “space dash space”)

Concatenate

  • Height > Edit column > Add column based on this column
  • GREL Expression = value + " x " + cells["Width"].value

Find & Replace

  • Dimension > Edit cells > Transform
  • GREL Expression = value.replace("mm","")

Facet by Blank

  • Dimension > Facet > Customized facet > facet by blank

Text filter

  • Clear all facets
  • Title = bmx

Custom Facet

  • Production date
  • Facet > custom text facet
  • value.match(/(\d\d).*/)[0]