6 Hands-on: Web Scraping
Goal: We want to gather the FIPS code for some data by matching a county name to a government server which maintains a server that enables FIPS lookups. The government server returns data in a JSON format. After gathering the data we’ll parse the JSON format and isolate the particular data from the data wrapper.
- Import Data
https://raw.githubusercontent.com/libjohn/openrefine/master/data/subset-Raleigh-Building-Permits-for-API-JSON-parsing.csv
- You many want to give your project a pretty title
- Create Project >>
6.1 API
Now let’s fetch the data from a Web API made available via the National Broadband Map (NBM | NBM Developer). This API returns a FIPS code, given a county name (or in this case, even a partial county name.) We’ll use the API documentation for Geography by Name API14. In this case, simply use the County value
as the search string wrapped in the API query protocol.
- Fetch JSON data from the National Broadband Map.
- New column name =
JSON data
- Throttle delay =
2000
- Expression =
'https://www.broadbandmap.gov/broadbandmap/census/county/'+value+'?format=json'
6.2 Parse
Now parse the value of the JSON data “fips” element; call the “fips” key when traversing the “county” objects from the Results set.
- New column name =
FIPS Code
- expression =
value.parseJson().Results.county[0].fips
15
- New column name =
JSON Data Example
JSON 16 is JavaScript Object Notation a data wrapper. The API, in this case, returns the data in a JSON format.
{ "status": "OK", "responseTime": 14, "message": [ ], "Results": { "county": [ { "geographyType": "COUNTY2010", "stateFips": "37", "fips": "37063", "name": "Durham" } ] } }
6.3 Keys
Web API keys are typically issued by the API service to help identify the origin of the API request. Contrary to the above example, most API services require keys. In this next example you’ll practice getting and using a key. In this case the key is free. They are not always free.
- Go to the OMDB API key request page. Enter your email and choose the free key option.
- Check your email for your key. You’ll use this key appended to your search query. The documentation for this API can found at their Parameters section. It’s also helpful to have a look at the Examples section.
6.4 Exercise 2
You’ll need your API key from the preceding steps.
- Create a new project in OpenRefine
- Click the
Clipboard
option, under Get data from and paste in the following:
The Warriors
Rocky
Bambi
- Give your project a nice name (e.g. “Movies”) and
Fetch JSON data from the OMDB API
- New column name =
JSON data
- Throttle delay =
2000
- Expression =
'http://www.omdbapi.com/?t=' + escape(value, "url") + '&' + 'apikey=Yogi-and-BOO-BOO-BOY'
- Be sure to substitute your actual api key (see above) for
Yogi-and-BOO-BOO-BOY
Parse
- New column name =
Year
- expression =
value.parseJson().Year
- New column name =
- New column name =
Poster
- expression =
value.parseJson().Poster
- New column name =
The documentation informs us that the format of the URL we want to construct is as follows: http://www.broadbandmap.gov/broadbandmap/census/county/durh?format=json↩
Note the square-bracket (
[0]
) notation in theParseJson()
function denotes and identifies the first array element. It’s the first element because in OpenRefine counting begins with zero (e.g. 0,1,2,3,4,5). The county array in the example below consists of only 1 value element (consisting of four, named key/value pairs; of which fips is one key). Since the JSON notation indicates county is an array, in this case of quantity 1, we identify that first element of the array by the number ‘0’. See the JSON example.↩