7 Hands-on: HTML Parsing

Goal: Learn how to use OpenRefine’s HTML parsing capabilities by fetching some David Price press releases and then parsing the content.

  1. Import Data

    • Create Project > Web Addresses (URLs) > https://raw.githubusercontent.com/libjohn/openrefine/master/data/price-crawl-and-HTML-parse.csv
    • Next >>
    • You may want to give your project a descriptive title
    • Create Project >>

7.1 Fetch

Now let’s fetch the data by crawling a few links to Congressman Price’s press releases. This will return large amounts of raw HTML that can be hard to read. So, after fetching, we’ll parse the result.

  1. Fetch HTML

    • prlink-href > Edit column > Add column by fetching URLs…
    • New column name = raw HTML
    • Throttle delay = 2000 (milliseconds)
    • Expression =
      value
    • OK
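
As an optional check that is not part of the original exercise, you can confirm that every row received a response before moving on. The sketch below assumes a failed fetch leaves the raw HTML cell blank, which depends on the error options chosen in the fetch dialog. Open raw HTML > Facet > Custom text facet… and enter:

```grel
if(isBlank(value), "not fetched", "fetched")
```

The facet then groups rows under those two labels (both are arbitrary strings chosen for this example), so any rows worth re-fetching are easy to spot.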

7.2 Parse

Now parse the HTML data.

  1. raw HTML > Edit column > Add column based on this column …

    • New column name = HTML title
    • expression = value.parseHtml().select("title")[0].htmlText()
    • OK
  2. raw HTML > Edit column > Add column based on this column …

    • New column name = body title
    • expression = value.parseHtml().select("h1#page-title.title")[0].htmlText()
    • OK
  3. raw HTML > Edit column > Add column based on this column …

    • New column name = date2
    • expression = value.parseHtml().select("div.pane-content")[0].htmlText()
    • OK
  4. raw HTML > Edit column > Add column based on this column …

    • New column name = dateline
    • expression = value.parseHtml().select("div.field-item.even p strong")[0].htmlText()
    • OK
  5. raw HTML > Edit column > Add column based on this column …

    • New column name = links
    • expression =

      forEach(
        value.parseHtml().select("div#block-system-main")[0].select("a"),
        e,
        e.htmlAttr("href")
      ).join("|")
    • OK

  6. raw HTML > Edit column > Add column based on this column …

    • New column name = link text
    • expression =

      forEach(
        value.parseHtml().select("div#block-system-main")[0].select("a"),
        e,
        e.htmlText()
      ).join("|")
    • OK
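
The expressions above all take [0] from the result of select(). If a page lacks the matching element, there is nothing at position 0 and the expression cannot produce text for that row; the cell typically ends up blank or flagged as an error, depending on the dialog's "On error" setting. A more defensive variant can be sketched with GREL's with(), if(), and length() functions. The variable name hits is arbitrary, and the title selector is reused here only as an example:

```grel
with(
  value.parseHtml().select("title"),
  hits,
  if(hits.length() > 0, hits[0].htmlText(), null)
)
```

The same pattern should work for the other selectors in this section: store the select() result once, test its length, and only then read the first element.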

7.3 Inspect your work…

  1. raw HTML > View > Collapse this column

  2. Click the records link in the “Show as: rows records” section, above the column headers

  3. links > Edit cells > Split multi-valued cells…

    • Under the “by separator” option, enter a pipe in the Separator text box: |
    • Repeat this step for the link text column
  4. Look around. Scroll left to right and see what you’ve parsed.


  1. Note that the square-bracket notation ([0]) after select() picks out the first element of the array that select() returns. It is the first element because OpenRefine counts array positions from zero (e.g. 0, 1, 2, 3, 4, 5).
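
To see zero-based indexing on something smaller than a web page, the two expressions below can be tried one at a time in any expression dialog. The sample string is made up for illustration and uses the same pipe separator as the links column:

```grel
"alpha|beta|gamma".split("|")[0]
"alpha|beta|gamma".split("|")[1]
```

The first expression returns alpha and the second returns beta, which mirrors how [0] in the parsing expressions selects the first matching HTML element.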