6 Hands-on: Web Scraping

Goal: We want to gather the FIPS code for some data by matching a county name to a government server which maintains a server that enables FIPS lookups. The government server returns data in a JSON format. After gathering the data we’ll parse the JSON format and isolate the particular data from the data wrapper.

  1. Import Data
    • Create Project > Web Addresses (URLs) > https://raw.githubusercontent.com/libjohn/openrefine/master/data/subset-Raleigh-Building-Permits-for-API-JSON-parsing.csv
    • Next >>
    • You many want to give your project a pretty title
    • Create Project >>

6.1 API

Now let’s fetch the data from a Web API made available via the National Broadband Map (NBM | NBM Developer). This API returns a FIPS code, given a county name (or in this case, even a partial county name.) We’ll use the API documentation for Geography by Name API14. In this case, simply use the County value as the search string wrapped in the API query protocol.

  1. Fetch JSON data from the National Broadband Map.
    • county > Edit column > Add column by fetching URLs…
    • New column name = JSON data
    • Throttle delay = 2000
    • Expression =
      'https://www.broadbandmap.gov/broadbandmap/census/county/'+value+'?format=json'
    • OK

6.2 Parse

Now parse the value of the JSON data “fips” element; call the “fips” key when traversing the “county” objects from the Results set.

  1. JSON data > Edit column > Add column based on this column …
    • New column name = FIPS Code
    • expression = value.parseJson().Results.county[0].fips   15
    • OK

JSON Data Example

JSON 16 is JavaScript Object Notation a data wrapper. The API, in this case, returns the data in a JSON format.

{
  "status": "OK",  
  "responseTime": 14,  
  "message": [  ],  
  "Results": {  
    "county": [  
      {  
        "geographyType": "COUNTY2010",  
        "stateFips": "37",  
        "fips": "37063",  
        "name": "Durham"  
      }  
    ]  
  }  
}  

6.3 Keys

Web API keys are typically issued by the API service to help identify the origin of the API request. Contrary to the above example, most API services require keys. In this next example you’ll practice getting and using a key. In this case the key is free. They are not always free.

  1. Go to the OMDB API key request page. Enter your email and choose the free key option.
  2. Check your email for your key. You’ll use this key appended to your search query. The documentation for this API can found at their Parameters section. It’s also helpful to have a look at the Examples section.

6.4 Exercise 2

You’ll need your API key from the preceding steps.

  1. Create a new project in OpenRefine
  2. Click the Clipboard option, under Get data from and paste in the following:
The Warriors
Rocky
Bambi
  1. Give your project a nice name (e.g. “Movies”) and Create Project >>
  2. Fetch JSON data from the OMDB API

    • Column 1 > Edit column > Add column by fetching URLs…
    • New column name = JSON data
    • Throttle delay = 2000
    • Expression =
      'http://www.omdbapi.com/?t=' + escape(value, "url") + '&' + 'apikey=Yogi-and-BOO-BOO-BOY'
    • Be sure to substitute your actual api key (see above) for Yogi-and-BOO-BOO-BOY
    • OK

Parse

  1. JSON data > Edit column > Add column based on this column …

    • New column name = Year
    • expression = value.parseJson().Year
    • OK
  2. JSON data > Edit column > Add column based on this column …

    • New column name = Poster
    • expression = value.parseJson().Poster
    • OK

  1. The documentation informs us that the format of the URL we want to construct is as follows: http://www.broadbandmap.gov/broadbandmap/census/county/durh?format=json

  2. Note the square-bracket ([0]) notation in the ParseJson() function denotes and identifies the first array element. It’s the first element because in OpenRefine counting begins with zero (e.g. 0,1,2,3,4,5). The county array in the example below consists of only 1 value element (consisting of four, named key/value pairs; of which fips is one key). Since the JSON notation indicates county is an array, in this case of quantity 1, we identify that first element of the array by the number ‘0’. See the JSON example.

  3. Wikipedia entry for JSON