Intro to Regular Expressions (RegEx)

A workshop and workbook

Slides, exercises and answers
Part of the Data & Visualization Services – Workshop Series
Recording and workshop materials are available for this and all of our Past Workshops

Prerequisites

Bring YOUR Laptop

Outline

Data & Visualization Services
- Who We Are
This workshop is inspired by the good work at Intersect in Australia and the excellent introduction at Wikipedia
Hands-on
- Introduction to RegEx101.com
- Exercises
Closing
- Feedback Form
- Go forth expressionful, finding patterns with alacrity

RegEx – an Introduction

Slides

Exercise 1 - Find Patterns

RegEx101

RegEx101.com
Exercises derived from Intersect Tutorial on Regular Expressions

Whole words

Find all literal words
Simple searches are just like traditional find & replace

In a separate tab, open the example RFC text
1. Paste the text into the RegEx101 Test String box
2. In the Expression box, enter avian – 2 matches
  …
3. Use Avian – 12 matches
4. Use avian – with the case-insensitive flag – 14 matches
  … …
5. Clear the ignore case flag
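
The same literal search can be tried outside RegEx101. The following is a minimal sketch using Python's re module; the sample sentence is a made-up stand-in for the RFC text, so the counts differ from the ones above.

```python
import re

# Hypothetical sample text standing in for the RFC excerpt used in the workshop.
text = "Avian carriers are reliable. The avian carrier flies. AVIAN!"

exact = re.findall(r"avian", text)                    # literal, case-sensitive
capitalized = re.findall(r"Avian", text)              # different case, different matches
any_case = re.findall(r"avian", text, re.IGNORECASE)  # same idea as the case-insensitive flag

print(len(exact), len(capitalized), len(any_case))
```

re.IGNORECASE plays the role of the flag toggled in steps 4 and 5.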

Capitalization & character classes

Cheatsheet

Find only capitalized words
1. [A-Z]\w*
2. [A-Z]+ – match only “all caps” words. BUT this is not quite right. It doesn’t work. Do you know why?
3. \b[A-Z]+\b – match on a word boundary using an anchor class: \b
4. \b[A-Z]{2,}\b – Abbreviations are usually 2 or more upper case characters.

Note

Some characters don’t have a literal meaning; these are known as metacharacters, e.g. . or [
Character classes are denoted by square brackets: [ ]
The wildcard character is the period: .
Multipliers (quantifiers) include * , + , ?
Curly braces { } define a specific number of repetitions
\b is an anchor denoting a word boundary
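
To see why step 2 falls short, the four patterns can be compared side by side. This is a small sketch in Python, assuming an invented sample sentence rather than the workshop text.

```python
import re

# Hypothetical sample sentence (not from the workshop text).
text = "A memo from Bob says the IETF runs TCP over IP."

capitalized = re.findall(r"[A-Z]\w*", text)      # any word starting with a capital
upper_runs = re.findall(r"[A-Z]+", text)         # also grabs the "B" of "Bob"
whole_upper = re.findall(r"\b[A-Z]+\b", text)    # whole words made only of capitals
abbrevs = re.findall(r"\b[A-Z]{2,}\b", text)     # at least two capitals

for result in (capitalized, upper_runs, whole_upper, abbrevs):
    print(result)
```

Without the \b anchors, [A-Z]+ matches the capital letter inside an ordinary capitalized word, which is the flaw hinted at in step 2.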

Last word

Match the last words of sentences
1. \w+. – This doesn’t work because an unescaped “.” matches any character
2. \w+\. – We escape the period “.” with the escape character \
3. \w+\.\s – More precise this time, matching 56 words. Requiring whitespace \s after the period stops us from matching email addresses

Note

\w is a “word” character
\s is a “space” character
. is a meta character (introduced above)
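
The effect of escaping the period can be checked with a short Python sketch. The sample string and its email address are invented for illustration.

```python
import re

# Hypothetical sample; the address is invented.
text = "Send mail to bob@example.com today. It works.\n"

any_char = re.findall(r"\w+.", text)       # "." matches any character except a newline
escaped = re.findall(r"\w+\.", text)       # literal period, still hits the email domain
with_space = re.findall(r"\w+\.\s", text)  # period must be followed by whitespace

print(len(any_char))
print(escaped)
print(with_space)
```

Only the last pattern skips the period inside the email address, because that period is followed by a letter rather than whitespace.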

Numbers

Find all years
1. \d\d\d\d – a lot of matches here
2. \d{4} – more succinct but has the same meaning as above
3. \b\d{4}\b – word boundaries \b help but there are still some false positives
4. \b(19|20)\d\d\b – better, and works for the twentieth and twenty-first centuries

Note

The pipe character | denotes alternation (alternatives)
Parentheses ( ) group part of a pattern
Curly braces { } act as a multiplier
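
The narrowing from "any four digits" to "plausible years" can be sketched in Python as follows. The sample sentence is an assumption made up for this example. Because re.findall returns only the group contents when a pattern has a capturing group, the last pattern is run through re.finditer to get the whole match.

```python
import re

# Hypothetical sample sentence with years and non-year digit runs.
text = "RFC 1149 was published in 1990; call 555-0100 or see page 12345, updated 2011."

four_digits = re.findall(r"\d{4}", text)    # any run of four digits, even inside 12345
bounded = re.findall(r"\b\d{4}\b", text)    # exactly four digits between word boundaries
years = [m.group(0) for m in re.finditer(r"\b(19|20)\d\d\b", text)]

print(four_digits)
print(bounded)
print(years)
```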

Phone numbers

Find a phone number
1. \(\d{3}\) \d{3}-\d{4} – Very specific. This works as long as phone numbers are formatted consistently
2. \(?\d+\)? ?[\d-]{5,}\d – more permissive
3. \(?\d+(\)|\.)? ?[\d.-]{5,}\d – more permissive still. Allows for . instead of - as a separator

Note

Escape a literal parenthesis with a backslash: \( and \)
? indicates optionality, matching zero or one occurrence
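
The three levels of strictness can be compared in a short Python sketch. It assumes the patterns are meant to open with an escaped parenthesis \( as the note says, and the phone numbers are invented. In the third pattern the hyphen is placed last in the character class, [\d.-], because Python's re module rejects [\d-.] as a bad character range.

```python
import re

# Hypothetical phone numbers in three common formats.
text = "Call (919) 555-0123, 919-555-0123 or 919.555.0123."

strict = re.findall(r"\(\d{3}\) \d{3}-\d{4}", text)
permissive = re.findall(r"\(?\d+\)? ?[\d-]{5,}\d", text)
# This pattern has a capturing group, so take the whole match from finditer.
more_permissive = [
    m.group(0) for m in re.finditer(r"\(?\d+(\)|\.)? ?[\d.-]{5,}\d", text)
]

print(strict)
print(permissive)
print(more_permissive)
```

Each pattern accepts one more format than the one before it; the trailing \d keeps the sentence-ending period out of the last match.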

Email addresses

\w+@[\w\.]+
1. Open the left-hand sidebar of RegEx101.com
2. Click Regex Library
3. Search on the term ‘email address’
4. From there, I found this pattern
  
  Wow! That saved a lot of time and thinking.
  You can rate useful contributions from the community and “privilege” useful contributions.

Note

This rule is quite permissive. It’s likely to match some invalid email addresses. e.g. fred@invalid.net
It also fails to match some valid email addresses in full, e.g. vincent.zappa@intersect.org.au, where \w+ stops at the period and only zappa@intersect.org.au is captured
SUGGESTION: ask the RegEx101 community
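
The limits of the simple rule are easy to reproduce. In this Python sketch the second pattern, which also allows periods before the @, is only an illustrative assumption; it is not the community pattern from the Regex Library and is still far from a full email validator.

```python
import re

text = "Contact fred@invalid.net or vincent.zappa@intersect.org.au today"

simple = re.findall(r"\w+@[\w\.]+", text)
# Hypothetical variation: allow periods in the part before the @ as well.
broader = re.findall(r"[\w.]+@[\w.]+", text)

print(simple)
print(broader)
```

The simple pattern drops "vincent." from the second address because \w does not match a period.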

Section headings

1. Using the Flags setting of RegEx101 (upper-right), turn on multiline – this makes the ^ and $ anchors match at the start and end of every line
2. ^(\w+ ?)+$ – match repeating words + optional space
3. Reset the multiline flag

Note

+ can be applied to a group
^ before a pattern means “begins with”
$ after a pattern means “ends with”
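
The role of the multiline flag can be seen in a small Python sketch, where re.MULTILINE corresponds to the flag set in RegEx101. The sample lines are invented. The nested quantifier in (\w+ ?)+ can backtrack heavily on long lines that do not match, so the non-heading line here is kept short.

```python
import re

# Hypothetical document: three heading-like lines and one ordinary sentence.
text = "Overview\nFrame Format\nIt is slow, sadly.\nSecurity Considerations\n"
pattern = r"^(\w+ ?)+$"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
headings = [m.group(0) for m in re.finditer(pattern, text, re.MULTILINE)]
# Without it, the anchors refer to the whole string, which is not a single heading.
whole_string = re.search(pattern, text)

print(headings)
print(whole_string)
```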

Exercise 1


Exercise 2 - More Patterns

Get example text

Open Othello text in a browser window
Paste the text into the RegEx101 Test String box

Alternative spellings & case

Turn on the case insensitive flag
honour – 14 matches
honou?r – optional “u” and still 14 matches
hon(our|ourable|esty?) – honour, honourable, honest, honesty; for 66 matches
Turn off case insensitive flag – 61 matches
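
A short Python sketch shows how the optional letter and the alternation behave. The sample sentence is invented, not taken from Othello, so the counts are much smaller than the ones above.

```python
import re

# Hypothetical sentence mixing the spellings discussed above.
text = "Honour and honor; an honourable man of honesty, an honest Iago."
family_pattern = r"hon(our|ourable|esty?)"

optional_u = re.findall(r"honou?r", text, re.IGNORECASE)
family = [m.group(0) for m in re.finditer(family_pattern, text, re.IGNORECASE)]
case_sensitive = [m.group(0) for m in re.finditer(family_pattern, text)]

print(optional_u)
print(family)
print(case_sensitive)
```

Alternatives are tried from left to right, so in "honourable" the match stops at "honour": the shorter alternative our succeeds before ourable is tried.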

Alternative letters and multi-line

Turn on multiline matching
^(ACT|SCENE) [IVXLCDM]+ – literal word, space, roman numerals; for 20 matches
Turn off multiline matching
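
As a rough illustration of the same search in code, the following Python sketch runs the pattern with and without re.MULTILINE. The few lines of play text are a made-up stand-in for the full Othello file.

```python
import re

# Hypothetical excerpt in the layout of a play script.
text = "ACT I\nSCENE I. Venice. A street.\nEnter RODERIGO and IAGO\nACT II\nSCENE III. A hall."
pattern = r"^(ACT|SCENE) [IVXLCDM]+"

divisions = [m.group(0) for m in re.finditer(pattern, text, re.MULTILINE)]
first_only = [m.group(0) for m in re.finditer(pattern, text)]  # ^ matches only at the very start

print(divisions)
print(first_only)
```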

Beginning and ending anchors

Turn on multiline matching
^[A-Z]+$
Turn off multiline matching

Punctuation

Turn on multiline matching
^.*\? – from start of line to question mark
Turn off multiline matching
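
The two anchored patterns above can be tried on a few lines of dialogue. This Python sketch assumes a small invented sample in the style of the play.

```python
import re

# Hypothetical dialogue lines with speaker names on their own lines.
text = (
    "IAGO\n"
    "What say'st thou, noble heart?\n"
    "RODERIGO\n"
    "I will incontinently drown myself.\n"
    "Why? Is it so?\n"
)

speakers = re.findall(r"^[A-Z]+$", text, re.MULTILINE)  # lines made only of capitals
questions = re.findall(r"^.*\?", text, re.MULTILINE)    # line start up to a question mark

print(speakers)
print(questions)
```

Because .* is greedy, a line with two question marks is matched up to the last one.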

Exercise 2


Exercise 3 - Replace

Preparation

Generate a small list of random names from the random name generator
Click the “List in text area” button, copy and paste the names list to your buffer
Replace the text in the RegEx101 Test String box with the random names
Click the Substitution caret at the bottom of the RegEx101 screen
Notice the $0 code.

Capture Groups

(\w+) (\w+) – in the Expression pane, highlights all names
"$0" – in the Substitution pane, reproduces the text matched by the pattern between the forward slashes (Expression pane / /)
- $2, $1 – swap the order of the first and last name and precede the whole name with a dash ‘-’
<b>$2</b>, $1 – bold the last name and add a comma
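
The same substitutions can be sketched with Python's re.sub. Python writes group references as \1 and \2 (and \g<0> for the whole match) rather than $1, $2 and $0. The names are invented, and the double quotes in the first replacement are assumed to be literal characters.

```python
import re

# Hypothetical names, one per line.
names = "Jane Doe\nJohn Smith"
pattern = r"(\w+) (\w+)"

quoted = re.sub(pattern, r'"\g<0>"', names)        # whole match, wrapped in quotes
swapped = re.sub(pattern, r"- \2, \1", names)      # last name first, with a leading dash
bolded = re.sub(pattern, r"<b>\2</b>, \1", names)  # last name in bold tags

print(quoted)
print(swapped)
print(bolded)
```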

Exercise 3


Exercise 4 - Special case

Preparation

Open this Twitter stream data pre-loaded into RegExr.

Please note: this is actual Twitter stream data about a politician, and the tweets may be offensive.

You can do this exercise in RegExr.com, or copy the text box data and paste it into RegEx101.com

Capture #hashtags, @twitter-handles

#\w+ – #hashtag
@\w+ – @twitter-handle
@[A-Za-z]\w+ – avoids matching times, e.g. @kairos we’re meeting for drinks @10
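
The difference between the two handle patterns shows up on the example tweet. In this Python sketch the two hashtags are invented additions.

```python
import re

# Example tweet from above, with two hypothetical hashtags appended.
text = "@kairos we're meeting for drinks @10 #happyhour #2017"

hashtags = re.findall(r"#\w+", text)
mentions = re.findall(r"@\w+", text)         # also picks up the time "@10"
handles = re.findall(r"@[A-Za-z]\w+", text)  # first character must be a letter

print(hashtags)
print(mentions)
print(handles)
```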

Exercise 4


Thank You for Attending

Please complete the paper Feedback Form
Presenter
- John Little
- Data & Visualization Services
- Duke University Libraries

Shareable under CC BY-NC license

Data, presentation, and handouts are shareable under CC BY-NC license

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Intro to Regular Expressions (RegEx)

John Little

2017-01-27
