7 Hands-on: HTML Parsing
Goal: Learn how to use OpenRefine’s HTML parsing capabilities by fetching some David Price press releases and then parsing the content.
Import Data
- https://raw.githubusercontent.com/libjohn/openrefine/master/data/price-crawl-and-HTML-parse.csv
- You may want to give your project a pretty title
- Create Project >>
7.1 Fetch
Now let’s fetch the data by crawling a few links to Congressman Price’s press releases. This will return large amounts of raw HTML that can be hard to read. So, after fetching, we’ll parse the result.
Fetch HTML
- New column name = raw HTML
- Throttle delay = 2000
- Expression = value
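To make the throttle delay concrete, here is a minimal Python sketch of what a throttled crawl amounts to conceptually. It is not OpenRefine's implementation. The names `fetch_all`, `fetch`, and `sleep` are hypothetical, and the fetcher is passed in as a parameter so the sketch runs without network access.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_ms: int = 2000,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing delay_ms milliseconds between requests."""
    results: Dict[str, str] = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Throttle: wait before every request after the first one.
            sleep(delay_ms / 1000)
        # Store the raw response, analogous to the "raw HTML" column.
        results[url] = fetch(url)
    return results
```

With a real fetcher (for example, one built on Python's `urllib.request.urlopen`), three URLs and a 2000 ms delay would take at least four seconds, which is why a longer delay is kinder to the server but slower for you.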
7.2 Parse
Now parse the HTML data.
- New column name = HTML title
  - expression = value.parseHtml().select("title")[0].htmlText() (see the note on [0] at the end of this section)
- New column name = body title
  - expression = value.parseHtml().select("h1#page-title.title")[0].htmlText()
- New column name = date2
  - expression = value.parseHtml().select("div.pane-content")[0].htmlText()
- New column name = dateline
  - expression = value.parseHtml().select("div.field-item.even p strong")[0].htmlText()
- New column name = links
  - expression = forEach( value.parseHtml().select("div#block-system-main")[0].select("a"), e, e.htmlAttr("href") ).join("|")
- New column name = link text
  - expression = forEach( value.parseHtml().select("div#block-system-main")[0].select("a"), e, e.htmlText() ).join("|")
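For readers who want to see the same idea outside OpenRefine, the following Python sketch mirrors three of the parse steps using only the standard library: it takes the first title, then collects every link's href and text and joins each list with a pipe. This is an illustration under assumptions, not a translation of the GREL expressions: `TitleAndLinks`, `parse_page`, and the sample HTML are hypothetical, and unlike the expressions above, the sketch collects links from the whole document rather than only from `div#block-system-main`.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class TitleAndLinks(HTMLParser):
    """Collect the first <title> text plus every link's href and text."""

    def __init__(self) -> None:
        super().__init__()
        self.title: Optional[str] = None
        self.links: List[Tuple[str, str]] = []  # (href, link text) pairs
        self._in_title = False
        self._title_parts: List[str] = []
        self._href: Optional[str] = None  # None means "not inside an <a>"
        self._text_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text_parts = []

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)
        if self._href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self.title = "".join(self._title_parts).strip()
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text_parts).strip()))
            self._href = None


def parse_page(html: str) -> Dict[str, Optional[str]]:
    """Return the title and the pipe-joined links and link text."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return {
        "HTML title": parser.title,
        "links": "|".join(href for href, _ in parser.links),
        "link text": "|".join(text for _, text in parser.links),
    }


# Hypothetical sample page, loosely shaped like the fetched press releases.
sample = (
    "<html><head><title>Press Release</title></head><body>"
    '<div id="block-system-main"><a href="/one">First</a> and '
    '<a href="/two">Second</a></div></body></html>'
)
result = parse_page(sample)
```

For this sample, `result["links"]` is `/one|/two` and `result["link text"]` is `First|Second`, the same pipe-joined shape that the `links` and `link text` columns hold.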
7.3 Inspect your work…
- Click the records link in the “Show as: rows records” section, above the column headers
- Split the multi-valued cells in the links column
  - for the by separator option, in the Separator textbox enter a pipe: |
  - repeat this step for the link text column
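The effect of splitting on the pipe separator can be sketched in Python as follows. This is a rough model under assumptions, not OpenRefine's code: `split_multivalued` is a hypothetical helper, and it assumes that only the first resulting row keeps the values of the other columns, which is what lets the group of rows read as a single record.

```python
from typing import Dict, List


def split_multivalued(row: Dict[str, str], column: str, sep: str = "|") -> List[Dict[str, str]]:
    """Split one cell on sep, producing one row per value.

    Only the first row keeps the other columns; later rows leave them
    blank, so the rows still belong to the same record.
    """
    values = row[column].split(sep)
    rows = []
    for i, value in enumerate(values):
        new_row = dict(row) if i == 0 else {key: "" for key in row}
        new_row[column] = value
        rows.append(new_row)
    return rows
```

Calling it on a row whose `links` cell is `/one|/two` gives two rows: the first keeps the title and holds `/one`, the second has a blank title and holds `/two`.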
Look around. Scroll left to right and see what you’ve parsed.
Note: the square-bracket notation ([0]) in the parseHtml() expressions identifies the first array element. It is the first element because in OpenRefine counting begins with zero (e.g. 0, 1, 2, 3, 4, 5).