7 Hands-on: HTML Parsing
Goal: Learn how to use OpenRefine’s HTML parsing capabilities by fetching some David Price press releases and then parsing the content.
Import Data
https://raw.githubusercontent.com/libjohn/openrefine/master/data/price-crawl-and-HTML-parse.csv
- You may want to give your project a descriptive title
- Create Project >>
7.1 Fetch
Now let’s fetch the data by crawling a few links to Congressman Price’s press releases. This will return large amounts of raw HTML that can be hard to read. So, after fetching, we’ll parse the result.
Fetch HTML
- New column name =
raw HTML
- Throttle delay =
2000
- Expression =
value
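To see what the throttle delay does, the fetch step can be approximated outside OpenRefine with a short Python sketch. This is an illustration only, not how OpenRefine is implemented: `fetch_url` and `fetch_all` are hypothetical helper names, and the sketch assumes the delay is given in milliseconds, as in the dialog above.

```python
import time
import urllib.request


def fetch_url(url):
    """Download one page and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, delay_ms=2000, fetch=fetch_url, sleep=time.sleep):
    """Fetch each URL in turn, pausing delay_ms milliseconds between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_ms / 1000)  # wait before every request after the first
        pages.append(fetch(url))
    return pages
```

The `fetch` and `sleep` parameters exist so the loop can be tried with stand-in functions, without touching the network or actually waiting.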
7.2 Parse
Now parse the HTML data.
- New column name =
HTML title
- expression =
value.parseHtml().select("title")[0].htmlText()
- New column name =
body title
- expression =
value.parseHtml().select("h1#page-title.title")[0].htmlText()
- New column name =
date2
- expression =
value.parseHtml().select("div.pane-content")[0].htmlText()
- New column name =
dateline
- expression =
value.parseHtml().select("div.field-item.even p strong")[0].htmlText()
- New column name =
links
- expression =
forEach( value.parseHtml().select("div#block-system-main")[0].select("a"), e, e.htmlAttr("href") ).join("|")
- New column name =
link text
- expression =
forEach( value.parseHtml().select("div#block-system-main")[0].select("a"), e, e.htmlText() ).join("|")
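The parsing expressions above follow one pattern: parse the raw HTML, pick out elements, then read their text or attributes. As a rough illustration of that pattern outside OpenRefine, here is a Python sketch using only the standard library's `html.parser`. It is not a port of the GREL: it matches tags by name rather than by CSS selector, and it collects every link on the page rather than only those inside `div#block-system-main`. `PageScraper`, `parse_page`, and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class PageScraper(HTMLParser):
    """Collect the <title> text plus the href and text of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []       # href of each <a>, in document order
        self.link_texts = []  # visible text of each <a>, same order
        self._in_title = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._in_link = True
            self.links.append(dict(attrs).get("href") or "")
            self.link_texts.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_link:
            self.link_texts[-1] += data


def parse_page(raw_html):
    """Return (title, links joined by '|', link texts joined by '|')."""
    scraper = PageScraper()
    scraper.feed(raw_html)
    scraper.close()
    return scraper.title.strip(), "|".join(scraper.links), "|".join(scraper.link_texts)


# Made-up sample page, standing in for one cell of the raw HTML column.
sample = (
    "<html><head><title>Press Release</title></head><body>"
    '<a href="/one">First</a> and <a href="/two">Second</a>'
    "</body></html>"
)
print(parse_page(sample))  # ('Press Release', '/one|/two', 'First|Second')
```

Joining the values with a pipe mirrors the `.join("|")` in the last two expressions: several values are packed into one cell so they can be split apart again later.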
7.3 Inspect your work…
Click the records link in the “Show as: rows records” section, above the column headers.
- Split the multi-valued cells in the links column (Edit cells > Split multi-valued cells…)
- for the by separator option, in the Separator textbox enter a pipe:
|
- repeat this step for the link text column
Look around. Scroll left to right and see what you’ve parsed.
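Splitting on the pipe turns one row with packed cells into a record that spans several rows. The Python sketch below approximates that reshaping so the result is easier to picture. It is an assumption-laden illustration, not OpenRefine's own logic: `split_multi_valued` is a hypothetical helper, the sample row is invented, and empty strings stand in for the blank cells OpenRefine shows on the extra rows of a record.

```python
def split_multi_valued(row, columns, separator="|"):
    """Expand one row into several rows, one per separator-delimited value."""
    parts = {col: row[col].split(separator) for col in columns}
    height = max(len(values) for values in parts.values())
    record = []
    for i in range(height):
        new_row = {}
        for col in row:
            if col in parts:
                values = parts[col]
                new_row[col] = values[i] if i < len(values) else ""
            else:
                # Columns that were not split keep their value on the first row only.
                new_row[col] = row[col] if i == 0 else ""
        record.append(new_row)
    return record


# Invented example row with two pipe-separated columns.
row = {"HTML title": "Press Release", "links": "/one|/two", "link text": "First|Second"}
for line in split_multi_valued(row, ["links", "link text"]):
    print(line)
```

Each link ends up next to its own link text because both columns were joined in the same order before being split.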
Note: the square-bracket notation ([0]) that follows select() in the parseHtml() expressions identifies the first element of the resulting array. It’s the first element because in OpenRefine counting begins with zero (e.g. 0,1,2,3,4,5).
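Zero-based counting works the same way in many languages. The small Python sketch below shows position zero returning the first match, along with a hypothetical helper, `first_or_none`, for the case where nothing matched at all. How OpenRefine itself reports an empty match is not covered here, so treat the empty-list handling as an assumption to check against your own data.

```python
def first_or_none(matches):
    """Return the first element of a list of matches, or None if it is empty."""
    return matches[0] if matches else None


titles = ["Press Release", "Second match"]
print(titles[0])            # position 0 is the first element: Press Release
print(titles[1])            # position 1 is the second element: Second match
print(first_or_none([]))    # nothing matched, so there is no position 0: None
```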