A workshop and workbook

Prerequisites

  • Bring YOUR Laptop

Outline

  • Data & Visualization Services
    • Who We Are
  • This workshop is inspired by the good work at Intersect in Australia and the excellent introduction at Wikipedia
  • Hands-on
  • Closing
    • Feedback Form
    • Go forth expressionful, finding patterns with alacrity

RegEx – an Introduction

Exercise 1 - Find Patterns

RegEx101

  • RegEx101.com
  • Exercises derived from Intersect Tutorial on Regular Expressions

Whole words

  • Find all literal words
  • Simple searches are just like traditional find & replace
  1. In a separate tab, open the example RFC text
    1. Paste the text into the RegEx101 Test String box
    2. In the Expression box, enter avian – 2 matches

    3. Use Avian – 12 matches
    4. Use avian – with the case-insensitive flag – 14 matches

    5. Clear the case-insensitive flag
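
The same literal search can be sketched outside RegEx101 with Python's built-in re module. The sample sentence below is made up for illustration (it is not the RFC text), so the counts differ from the ones listed above.

```python
import re

# Hypothetical sample text; the workshop itself uses the RFC text.
text = "Avian carriers are fast. An avian carrier may rest. AVIAN traffic is rare."

# A literal pattern is case-sensitive by default.
lowercase_hits = re.findall(r"avian", text)
capitalized_hits = re.findall(r"Avian", text)

# re.IGNORECASE plays the role of the case-insensitive flag in RegEx101.
any_case_hits = re.findall(r"avian", text, flags=re.IGNORECASE)

print(len(lowercase_hits), len(capitalized_hits), len(any_case_hits))
```

As in RegEx101, the flag changes the match count without changing the pattern.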

Capitalization & character classes

Cheatsheet

  1. Find only capitalized words
    1. [A-Z]\w*
    2. [A-Z]+ – match only “all caps” words. BUT this is not quite right. It doesn’t work. Do you know why?
    3. \b[A-Z]+\b – match on a word boundary using an anchor class: \b
    4. \b[A-Z]{2,}\b – Abbreviations are usually 2 or more upper case characters.

Note

  • Some characters don’t have a literal meaning. These are known as meta characters: e.g. . or [
  • character classes [ ] are denoted by square brackets
  • wildcard character is the period: .
  • multipliers include * , + , ?
  • Curly braces { } define a repetition count, e.g. {2,} means two or more
  • \b is an anchor denoting a word boundary
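
To see why [A-Z]+ alone is not enough, the three patterns can be compared side by side. This is a minimal sketch in Python's re module; the sample sentence is an assumption made up for the example, not the RFC text.

```python
import re

# Hypothetical sample sentence, not taken from the RFC text.
text = "The IETF published RFC 1149 on IP over Avian Carriers. A reply came."

# [A-Z]+ alone also matches the capital inside mixed-case words.
loose = re.findall(r"[A-Z]+", text)

# \b on both sides keeps only whole words made of capitals.
all_caps = re.findall(r"\b[A-Z]+\b", text)

# {2,} requires at least two capitals, which drops one-letter words like "A".
abbreviations = re.findall(r"\b[A-Z]{2,}\b", text)

print(loose)          # ['T', 'IETF', 'RFC', 'IP', 'A', 'C', 'A']
print(all_caps)       # ['IETF', 'RFC', 'IP', 'A']
print(abbreviations)  # ['IETF', 'RFC', 'IP']
```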

Last word

  1. Match the last words of sentences

    1. \w+. – This doesn’t work because an unescaped “.” matches any character
    2. \w+\. – We escape the period “.” with the escape character \
    3. \w+\.\s – More precise this time, matching 56 words. Requiring whitespace \s after the period stops the pattern from matching inside email addresses

Note

  • \w is a “word” character
  • \s is a “space” character
  • . is a meta character (introduced above)
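
The effect of escaping the period, and of adding \s, can be checked with a short Python sketch. The sample text and the address bob@example.com are assumptions chosen for the example.

```python
import re

# Hypothetical sample text with an email address in it.
text = "Birds fly home. Write to bob@example.com today. Then rest.\n"

# An unescaped "." matches any character, so nearly every word matches.
unescaped = re.findall(r"\w+.", text)

# "\." matches a literal period, including the one inside the email address.
escaped = re.findall(r"\w+\.", text)

# Requiring whitespace after the period skips "example.com".
sentence_ends = re.findall(r"\w+\.\s", text)

print(escaped)        # ['home.', 'example.', 'today.', 'rest.']
print(sentence_ends)  # ['home. ', 'today. ', 'rest.\n']
```

Note that the whitespace matched by \s is part of each result in the last list.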

Numbers

  1. Find all years

    1. \d\d\d\d – a lot of matches here
    2. \d{4} – more succinct but has the same meaning as above
    3. \b\d{4}\b – word boundaries \b help but there are still some false positives
    4. \b(19|20)\d\d\b – better, and works for the twentieth and twenty-first centuries

Note

  • the pipe character | for alternation, alternatives
  • ( ) grouping
  • { } multiplier
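
The year patterns can be tried in Python as a rough sketch. The sample sentence is hypothetical. One assumption to flag: the group is written as (?:19|20), a non-capturing group, because Python's re.findall returns only the captured group when a pattern contains a plain ( ) group.

```python
import re

# Hypothetical sample text.
text = "RFC 1149 appeared in 1990, was revisited in 1999, and port 20011 is unrelated."

# Any run of four digits, even inside a longer number.
four_digits = re.findall(r"\d{4}", text)

# Word boundaries reject "2001" inside "20011" but still accept "1149".
bounded = re.findall(r"\b\d{4}\b", text)

# (?:...) groups the alternatives without capturing, so findall
# returns the whole year rather than just "19" or "20".
years = re.findall(r"\b(?:19|20)\d\d\b", text)

print(four_digits)  # ['1149', '1990', '1999', '2001']
print(bounded)      # ['1149', '1990', '1999']
print(years)        # ['1990', '1999']
```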

Phone numbers

  1. Find a phone number

    1. \(\d{3}\) \d{3}-\d{4} – Very specific. This works as long as phone numbers are formatted consistently
    2. \(?\d+\)? ?[\d-]{5,}\d – more permissive
    3. \(?\d+(\)|\.)? ?[\d.-]{5,}\d – more permissive still. Allows . instead of - as a separator. The period is escaped inside the group so that it matches only a literal period

Note

  • escape the parenthesis \(
  • ? indicates optionality: the preceding item matches zero or one time
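
A small Python sketch shows how each phone pattern widens the net. The phone numbers are invented for the example. As an adaptation for Python's re module, the third pattern is written with an escaped period, a non-capturing group (?: ), and the hyphen placed last in the character class so that it is read as a literal hyphen.

```python
import re

# Hypothetical phone numbers in three formats.
text = "Call (919) 555-0100, 919-555-0101, or 919.555.0102 before noon."

# Very specific: parentheses, space, and hyphen must all be present.
strict = re.findall(r"\(\d{3}\) \d{3}-\d{4}", text)

# Optional parentheses and space; digits or hyphens in the remainder.
permissive = re.findall(r"\(?\d+\)? ?[\d-]{5,}\d", text)

# Also accepts a period as separator.
most_permissive = re.findall(r"\(?\d+(?:\)|\.)? ?[\d.-]{5,}\d", text)

print(strict)           # ['(919) 555-0100']
print(permissive)       # ['(919) 555-0100', '919-555-0101']
print(most_permissive)  # ['(919) 555-0100', '919-555-0101', '919.555.0102']
```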

Email addresses

  1. \w+@[\w\.]+

    1. left-hand sidebar of RegEx101.com
    2. click Regex Library
    3. search on the term ‘email address’
    4. From there, I found this pattern

      Wow! That saved a lot of time and thinking.
      You can rate useful contributions from the community and “privilege” useful contributions.

Note

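  • This rule is quite permissive. It’s likely to match some invalid email addresses, e.g. fred@invalid.net
  • It’s also likely to miss part of valid email addresses like vincent.zappa@intersect.org.au, because \w does not match the period before the @ (only zappa@intersect.org.au is matched)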
  • SUGGESTION: ask the RegEx101 community
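
The limits of the simple email pattern are easy to reproduce. This sketch runs it in Python against the two addresses mentioned in the notes; the surrounding sentence is made up.

```python
import re

# The two addresses come from the notes above; the sentence is hypothetical.
text = "Contact fred@invalid.net or vincent.zappa@intersect.org.au."

matches = re.findall(r"\w+@[\w\.]+", text)

print(matches)  # ['fred@invalid.net', 'zappa@intersect.org.au.']
```

The second match loses "vincent." at the front and picks up the sentence's final period at the end, since [\w\.]+ accepts any run of word characters and periods.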

Section headings

  1. Find section headings

    1. Using the Flags setting of RegEx101 (upper-right), turn on multiline – this makes the ^ and $ anchors match at the start and end of each line
    2. ^(\w+ ?)+$ – match repeating words + optional space
    3. Reset the multiline flag

Note

  • + can be applied to a group
  • ^ prior to a match pattern means begins with
  • $ following a match pattern means ends with
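
What the multiline flag changes can be sketched in Python, where the equivalent flag is re.MULTILINE. The excerpt below is hypothetical and kept short. re.finditer with group(0) is used so that whole matches are returned even though the pattern contains a group.

```python
import re

# Hypothetical document excerpt: two headings and two body lines.
text = "Overview\nThis memo is for fun.\nFrame Format\nSee RFC 1149, please.\n"

pattern = r"^(\w+ ?)+$"

# Without re.MULTILINE, ^ and $ only anchor at the ends of the whole string.
without_flag = [m.group(0) for m in re.finditer(pattern, text)]

# With re.MULTILINE, ^ and $ anchor at the start and end of every line.
with_flag = [m.group(0) for m in re.finditer(pattern, text, flags=re.MULTILINE)]

print(without_flag)  # []
print(with_flag)     # ['Overview', 'Frame Format']
```

Lines containing punctuation are rejected because \w and the optional space cannot match a comma or period before the $ anchor.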

Exercise 1

Exercise 2 - More Patterns

Get example text

  1. Open Othello text in a browser window
  2. Paste the text into the RegEx101 Test String box

Alternative spellings & case

  1. Turn on the case insensitive flag
  2. honour – 14 matches
  3. honou?r – optional “u” and still 14 matches
  4. hon(our|ourable|esty?) – honour, honourable, honest, honesty; for 66 matches
  5. Turn off case insensitive flag – 61 matches
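
The optional letter and the alternation can be tried in Python as a rough sketch. The sentence is invented for the example and is not quoted from the play.

```python
import re

# Hypothetical sentence, not taken from the Othello text.
text = "Honour is dear. An honest man keeps his honor, his honesty, and an honourable name."

# u? makes the "u" optional, so both spellings match.
spellings = re.findall(r"honou?r", text, flags=re.IGNORECASE)

# finditer with group(0) returns whole matches even though the
# pattern contains a capturing group.
family = [
    m.group(0)
    for m in re.finditer(r"hon(our|ourable|esty?)", text, flags=re.IGNORECASE)
]

print(spellings)  # ['Honour', 'honor', 'honour']
print(family)     # ['Honour', 'honest', 'honesty', 'honour']
```

Alternatives are tried left to right, so "honourable" is counted but reported as "honour". Listing ourable before our would capture the longer word in full.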

Alternative letters and multi-line

  1. Turn on multiline matching
  2. ^(ACT|SCENE) [IVXLCDM]+ – literal word, space, roman numerals; for 20 matches
  3. Turn off multiline matching
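
A minimal Python sketch of the same heading search follows. The excerpt is a hypothetical stand-in for the play text, and re.MULTILINE corresponds to the multiline flag.

```python
import re

# Hypothetical excerpt in the style of a play script.
text = (
    "ACT I\n"
    "SCENE I. A street.\n"
    "Enter two gentlemen\n"
    "SCENE II. Another street.\n"
    "ACT II\n"
)

# ^ anchors each match to the start of a line because of re.MULTILINE.
headings = [
    m.group(0)
    for m in re.finditer(r"^(ACT|SCENE) [IVXLCDM]+", text, flags=re.MULTILINE)
]

print(headings)  # ['ACT I', 'SCENE I', 'SCENE II', 'ACT II']
```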

Beginning and ending anchors

  1. Turn on multiline matching
  2. ^[A-Z]+$
  3. Turn off multiline matching
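
With both anchors in place, only lines made up entirely of capital letters match. A short Python sketch, using invented dialogue rather than the play text:

```python
import re

# Hypothetical script excerpt with speaker names on their own lines.
text = "IAGO\nWho goes there?\nRODERIGO\nA friend, I hope.\n"

# ^ and $ together require the whole line to consist of capitals.
speakers = re.findall(r"^[A-Z]+$", text, flags=re.MULTILINE)

print(speakers)  # ['IAGO', 'RODERIGO']
```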

Punctuation

  1. Turn on multiline matching
  2. ^.*\? – from start of line to question mark
  3. Turn off multiline matching
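
The question-mark pattern can be sketched in Python as follows; the lines are invented for the example.

```python
import re

# Hypothetical lines of dialogue.
text = "Who goes there? A friend.\nNo questions on this line.\nIs it so? Is it true?\n"

# .* is greedy, so each match runs to the LAST question mark on its line.
questions = re.findall(r"^.*\?", text, flags=re.MULTILINE)

print(questions)  # ['Who goes there?', 'Is it so? Is it true?']
```

The period does not match a newline by default, which is why each match stays within one line.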

Exercise 2

Exercise 3 - Replace

Preparation

  1. Generate a small list of random names from the random name generator
  2. Click the “List in text area” button, copy and paste the names list to your buffer
  3. Replace the text in the RegEx101 Test String box with the random names
  4. Click the Substitution caret at the bottom of the RegEx101 screen
  5. Notice the $0 code.

Capture Groups

  1. (\w+) (\w+) – in the Expression panel will highlight all names
  2. "$0" – in the Substitution pane, $0 reproduces the entire text matched by the pattern between the forward slashes (/ /) of the Expression pane
  3. - $2, $1 – swap the order of the first and last name and precede the whole name with a dash ‘-’
  4. <b>$2</b>, $1 – Bold the last name and add a comma
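
The same substitutions can be sketched with Python's re.sub. Two assumptions to note: the names are placeholders standing in for the random name generator output, and Python writes backreferences as \1 and \2 (and the whole match as \g<0>) where the Substitution pane uses $1, $2, and $0.

```python
import re

# Placeholder names standing in for the random name generator output.
names = "Jane Doe\nJohn Smith"

pattern = r"(\w+) (\w+)"

# \g<0> is the whole match, like $0 in the Substitution pane.
quoted = re.sub(pattern, r'"\g<0>"', names)

# \2 and \1 swap the two captured groups.
swapped = re.sub(pattern, r"- \2, \1", names)

# Wrap the last name in bold tags and add a comma.
bolded = re.sub(pattern, r"<b>\2</b>, \1", names)

print(swapped)
# - Doe, Jane
# - Smith, John
```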

Exercise 3

Exercise 4 - Special case

Preparation

  1. Open this Twitter stream data pre-loaded into RegExr.

Please note: this is actual Twitter stream data about a politician, and the tweets may be offensive

You can do this exercise in RegExr.com or copy the textbox data and paste it into RegEx101.com

Capture #hashtags, @twitter-handles

  1. #\w+ – #hashtag
  2. @\w+ – @twitter-handle
  3. @[A-Za-z]\w+ – avoids matching times, e.g. @kairos we’re meeting for drinks @10
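
The three patterns can be compared on a single tweet. This Python sketch extends the example sentence above with two made-up hashtags.

```python
import re

# Based on the example above; the hashtags are hypothetical additions.
tweet = "@kairos we're meeting for drinks @10 #thirsty #friday"

hashtags = re.findall(r"#\w+", tweet)

# @\w+ also picks up the time "@10".
handles_loose = re.findall(r"@\w+", tweet)

# Requiring a letter right after @ filters out the time.
handles = re.findall(r"@[A-Za-z]\w+", tweet)

print(hashtags)       # ['#thirsty', '#friday']
print(handles_loose)  # ['@kairos', '@10']
print(handles)        # ['@kairos']
```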

Exercise 4


Thank You for Attending


Shareable under CC BY-NC license

Data, presentation, and handouts are shareable under CC BY-NC license

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.