14 Intro to Regular Expressions (regex)

rm(list=ls())

14.1 Intro to Regular Expressions (Also Known As “regex”)

Regular expressions (AKA “regex”) are patterns that describe the text you want to find. They can be used to search through and slice and dice character (i.e. textual) data in a variety of ways. The best way to understand them is to dive right in, which we will do below.

Before we get started, it’s important to note that regular expressions are not specific to R. Rather, regular expressions are a concept that is available in most programming languages (e.g. Python, Java, etc.) as well as in many text editors and other technical environments (e.g. Bash). The syntax differs slightly from one environment to another. We will focus mainly on how to use regular expressions in R. However, most of this knowledge carries over to using regex in other languages and technical environments.

14.2 Some resources for learning regular expressions

The following is one of the many online tutorials.

The following R help pages are relevant to regular expressions. However, they are hard to understand without a little intro first. I recommend that you go through the material in this book first or in other tutorials. Then you can refer to the help pages when necessary.

Relevant R help pages

?regex     
?grep      
?strsplit  

R’s stringr package also includes several functions that can be used with regex. For now though, we will focus mainly on the “Base R” regex functions. Once you understand those, you will be well prepared to understand the functions in the stringr package that use regular expressions.

14.3 Data for examples

Before we start, let’s define some data to be used with examples in this file. (NOTE: I made up N. American apple and S. Korean Fig so that I can use them in some examples.)

14.3.1 fruit vector

fruit
 [1] "apple"             "N. American apple" "S. Korean Fig"    
 [4] "fig"               "star fruit"        "pear"             
 [7] "prickly pear"      "Beurre Hardy pear" "cherry"           
[10] "black cherry"      "peach"             "plum"             
[13] "kumquat"           "banana"            "blueberry"        
[16] "strawberry"        "honeydew"          "strawberries"     
[19] "yumberry"         
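
The code that created fruit is not shown above. If you want to follow along, a definition along these lines reproduces the printed values. The values are copied from the output above; the assignment itself is my reconstruction, not necessarily the original code.

```r
# Reconstructed from the printed output above (19 values).
fruit <- c("apple", "N. American apple", "S. Korean Fig",
           "fig", "star fruit", "pear",
           "prickly pear", "Beurre Hardy pear", "cherry",
           "black cherry", "peach", "plum",
           "kumquat", "banana", "blueberry",
           "strawberry", "honeydew", "strawberries",
           "yumberry")

length(fruit)   # number of entries in the vector
```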

14.3.2 addresses vector

addresses
 [1] "12345 Sesame Street"               "One Micro$oft Way"                
 [3] "3 Olive St."                       "Two 1st Ave."                     
 [5] "5678 Park Place"                   "Forty Five 2nd Street"            
 [7] "Ninety Nine Cone St. apartment 7"  "9 Main St. apt. 623"              
 [9] "Five Google Drive"                 "4\\2 Rechov Yafo"                 
[11] "Fifteen Watchamacallit Boulevard"  "Nineteen Watchamacallit Boulevard"
[13] "One Main Street Apt 12b"           "Two Main Street Apt 123c"         
[15] "Three Main Street Apt 12343"       "City Hall Lockport, NY"           
# show each address, one per line
cat(addresses, sep="\n")
12345 Sesame Street
One Micro$oft Way
3 Olive St.
Two 1st Ave.
5678 Park Place
Forty Five 2nd Street
Ninety Nine Cone St. apartment 7
9 Main St. apt. 623
Five Google Drive
4\2 Rechov Yafo
Fifteen Watchamacallit Boulevard
Nineteen Watchamacallit Boulevard
One Main Street Apt 12b
Two Main Street Apt 123c
Three Main Street Apt 12343
City Hall Lockport, NY
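
As with fruit, the code that created addresses is not shown. The sketch below reconstructs it from the printed output (the assignment is an assumption; only the values come from the text above). Note the entry with a backslash: in R source code a single backslash is typed as `\\`, which is why the printed vector shows `"4\\2 Rechov Yafo"` while cat() shows `4\2 Rechov Yafo`.

```r
# Reconstructed from the printed output above (16 values).
addresses <- c("12345 Sesame Street", "One Micro$oft Way",
               "3 Olive St.", "Two 1st Ave.",
               "5678 Park Place", "Forty Five 2nd Street",
               "Ninety Nine Cone St. apartment 7", "9 Main St. apt. 623",
               "Five Google Drive", "4\\2 Rechov Yafo",
               "Fifteen Watchamacallit Boulevard",
               "Nineteen Watchamacallit Boulevard",
               "One Main Street Apt 12b", "Two Main Street Apt 123c",
               "Three Main Street Apt 12343", "City Hall Lockport, NY")

# "4\\2" is three characters: a 4, ONE backslash and a 2.
nchar("4\\2")
```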

14.3.3 famousQuotes dataframe

famousQuotesDf
   Year                              Speaker
1   -47                        Julius Caesar
2  -399                             Socrates
3  -399                             Socrates
4  1597                        Francis Bacon
5  1603                  William Shakespeare
6  1637                       René Descartes
7  1759                             Voltaire
8  1775                        Patrick Henry
9  1789                     Marie Antoinette
10 1843                            Karl Marx
11 1848                            Karl Marx
12 1858                      Abraham Lincoln
13 1863                      Abraham Lincoln
14 1887                           Lord Acton
15 1901                   Theodore Roosevelt
16 1929 Sherlock Holmes (Arthur Conan Doyle)
17 1929                      Albert Einstein
18 1933                Franklin D. Roosevelt
19 1940                    Winston Churchill
20 1940                    Eleanor Roosevelt
21 1941                    Winston Churchill
22 1942                    Douglas MacArthur
23 1945                      Harry S. Truman
24 1947                       Mahatma Gandhi
25 1949                        George Orwell
26 1951                         Albert Camus
27 1961                      John F. Kennedy
28 1963               Martin Luther King Jr.
29 1964                         Muhammad Ali
30 1964                     Marshall McLuhan
31 1968               Martin Luther King Jr.
32 1969                       Neil Armstrong
33 1970                           Irina Dunn
34 1980                          John Lennon
35 1980                    Margaret Thatcher
36 1987                        Ronald Reagan
37 1988                    George H. W. Bush
38 2008                         Barack Obama
                                                                                                                       Quote
1                                                                                                 I came, I saw, I conquered
2                                                                                                 I know that I know nothing
3                                                                                    The unexamined life is not worth living
4                                                                                                         Knowledge is power
5                                                                                   To be or not to be, that is the question
6                                                                                                    I think, therefore I am
7                                          I disapprove of what you say, but I will defend to the death your right to say it
8                                                                                          Give me liberty, or give me death
9                                                                                                          Let them eat cake
10                                                                                       Religion is the opium of the people
11                                                                                              Workers of the world, unite!
12                                                                               A house divided against itself cannot stand
13                                                                   Government of the people, by the people, for the people
14                                                            Power tends to corrupt, and absolute power corrupts absolutely
15                                                                                        Speak softly and carry a big stick
16                                                                                                Elementary, my dear Watson
17                                                                              Imagination is more important than knowledge
18                                                                             The only thing we have to fear is fear itself
19                                                                                             We shall fight on the beaches
20                                                     The future belongs to those who believe in the beauty of their dreams
21                                                                                               Never, never, never give up
22                                                                                                            I shall return
23                                                                                                       The buck stops here
24                                                                                Be the change you wish to see in the world
25                                                                   War is peace. Freedom is slavery. Ignorance is strength
26 The only way to deal with an unfree world is to become so absolutely free that your very existence is an act of rebellion
27                                            Ask not what your country can do for you, ask what you can do for your country
28                                                                                                            I have a dream
29                                                                                  Float like a butterfly, sting like a bee
30                                                                                                 The medium is the message
31                                                                                              I've been to the mountaintop
32                                                                 That's one small step for man, one giant leap for mankind
33                                                                           A woman needs a man like a fish needs a bicycle
34                                                                 Life is what happens while you're busy making other plans
35                                                                                                   There is no alternative
36                                                                                        Mr. Gorbachev, tear down this wall
37                                                                                                Read my lips: no new taxes
38                                                                                                                Yes we can

14.4 The stringr package

14.4.1 stringr::words vector

The stringr package includes some data that we will use in the examples.

#-------------------------------------------------------------------------.
# NOTE - most of the examples in this file were created using the data above.
# The stringr package also contains some data that can be used to experiment
# with these functions. 
#-------------------------------------------------------------------------.

stringr::words
  [1] "a"           "able"        "about"       "absolute"    "accept"     
  [6] "account"     "achieve"     "across"      "act"         "active"     
 [11] "actual"      "add"         "address"     "admit"       "advertise"  
 [16] "affect"      "afford"      "after"       "afternoon"   "again"      
 [21] "against"     "age"         "agent"       "ago"         "agree"      
 [26] "air"         "all"         "allow"       "almost"      "along"      
 [31] "already"     "alright"     "also"        "although"    "always"     
 [36] "america"     "amount"      "and"         "another"     "answer"     
 [41] "any"         "apart"       "apparent"    "appear"      "apply"      
 [46] "appoint"     "approach"    "appropriate" "area"        "argue"      
 [51] "arm"         "around"      "arrange"     "art"         "as"         
 [56] "ask"         "associate"   "assume"      "at"          "attend"     
 [61] "authority"   "available"   "aware"       "away"        "awful"      
 [66] "baby"        "back"        "bad"         "bag"         "balance"    
 [71] "ball"        "bank"        "bar"         "base"        "basis"      
 [76] "be"          "bear"        "beat"        "beauty"      "because"    
 [81] "become"      "bed"         "before"      "begin"       "behind"     
 [86] "believe"     "benefit"     "best"        "bet"         "between"    
 [91] "big"         "bill"        "birth"       "bit"         "black"      
 [96] "bloke"       "blood"       "blow"        "blue"        "board"      
[101] "boat"        "body"        "book"        "both"        "bother"     
[106] "bottle"      "bottom"      "box"         "boy"         "break"      
[111] "brief"       "brilliant"   "bring"       "britain"     "brother"    
[116] "budget"      "build"       "bus"         "business"    "busy"       
[121] "but"         "buy"         "by"          "cake"        "call"       
[126] "can"         "car"         "card"        "care"        "carry"      
[131] "case"        "cat"         "catch"       "cause"       "cent"       
[136] "centre"      "certain"     "chair"       "chairman"    "chance"     
[141] "change"      "chap"        "character"   "charge"      "cheap"      
[146] "check"       "child"       "choice"      "choose"      "Christ"     
[151] "Christmas"   "church"      "city"        "claim"       "class"      
[156] "clean"       "clear"       "client"      "clock"       "close"      
[161] "closes"      "clothe"      "club"        "coffee"      "cold"       
[166] "colleague"   "collect"     "college"     "colour"      "come"       
[171] "comment"     "commit"      "committee"   "common"      "community"  
[176] "company"     "compare"     "complete"    "compute"     "concern"    
[181] "condition"   "confer"      "consider"    "consult"     "contact"    
[186] "continue"    "contract"    "control"     "converse"    "cook"       
[191] "copy"        "corner"      "correct"     "cost"        "could"      
[196] "council"     "count"       "country"     "county"      "couple"     
[201] "course"      "court"       "cover"       "create"      "cross"      
[206] "cup"         "current"     "cut"         "dad"         "danger"     
[211] "date"        "day"         "dead"        "deal"        "dear"       
[216] "debate"      "decide"      "decision"    "deep"        "definite"   
[221] "degree"      "department"  "depend"      "describe"    "design"     
[226] "detail"      "develop"     "die"         "difference"  "difficult"  
[231] "dinner"      "direct"      "discuss"     "district"    "divide"     
[236] "do"          "doctor"      "document"    "dog"         "door"       
[241] "double"      "doubt"       "down"        "draw"        "dress"      
[246] "drink"       "drive"       "drop"        "dry"         "due"        
[251] "during"      "each"        "early"       "east"        "easy"       
[256] "eat"         "economy"     "educate"     "effect"      "egg"        
[261] "eight"       "either"      "elect"       "electric"    "eleven"     
[266] "else"        "employ"      "encourage"   "end"         "engine"     
[271] "english"     "enjoy"       "enough"      "enter"       "environment"
[276] "equal"       "especial"    "europe"      "even"        "evening"    
[281] "ever"        "every"       "evidence"    "exact"       "example"    
[286] "except"      "excuse"      "exercise"    "exist"       "expect"     
[291] "expense"     "experience"  "explain"     "express"     "extra"      
[296] "eye"         "face"        "fact"        "fair"        "fall"       
[301] "family"      "far"         "farm"        "fast"        "father"     
[306] "favour"      "feed"        "feel"        "few"         "field"      
[311] "fight"       "figure"      "file"        "fill"        "film"       
[316] "final"       "finance"     "find"        "fine"        "finish"     
[321] "fire"        "first"       "fish"        "fit"         "five"       
[326] "flat"        "floor"       "fly"         "follow"      "food"       
[331] "foot"        "for"         "force"       "forget"      "form"       
[336] "fortune"     "forward"     "four"        "france"      "free"       
[341] "friday"      "friend"      "from"        "front"       "full"       
[346] "fun"         "function"    "fund"        "further"     "future"     
[351] "game"        "garden"      "gas"         "general"     "germany"    
[356] "get"         "girl"        "give"        "glass"       "go"         
[361] "god"         "good"        "goodbye"     "govern"      "grand"      
[366] "grant"       "great"       "green"       "ground"      "group"      
[371] "grow"        "guess"       "guy"         "hair"        "half"       
[376] "hall"        "hand"        "hang"        "happen"      "happy"      
[381] "hard"        "hate"        "have"        "he"          "head"       
[386] "health"      "hear"        "heart"       "heat"        "heavy"      
[391] "hell"        "help"        "here"        "high"        "history"    
[396] "hit"         "hold"        "holiday"     "home"        "honest"     
[401] "hope"        "horse"       "hospital"    "hot"         "hour"       
[406] "house"       "how"         "however"     "hullo"       "hundred"    
[411] "husband"     "idea"        "identify"    "if"          "imagine"    
[416] "important"   "improve"     "in"          "include"     "income"     
[421] "increase"    "indeed"      "individual"  "industry"    "inform"     
[426] "inside"      "instead"     "insure"      "interest"    "into"       
[431] "introduce"   "invest"      "involve"     "issue"       "it"         
[436] "item"        "jesus"       "job"         "join"        "judge"      
[441] "jump"        "just"        "keep"        "key"         "kid"        
[446] "kill"        "kind"        "king"        "kitchen"     "knock"      
[451] "know"        "labour"      "lad"         "lady"        "land"       
[456] "language"    "large"       "last"        "late"        "laugh"      
[461] "law"         "lay"         "lead"        "learn"       "leave"      
[466] "left"        "leg"         "less"        "let"         "letter"     
[471] "level"       "lie"         "life"        "light"       "like"       
[476] "likely"      "limit"       "line"        "link"        "list"       
[481] "listen"      "little"      "live"        "load"        "local"      
[486] "lock"        "london"      "long"        "look"        "lord"       
[491] "lose"        "lot"         "love"        "low"         "luck"       
[496] "lunch"       "machine"     "main"        "major"       "make"       
[501] "man"         "manage"      "many"        "mark"        "market"     
[506] "marry"       "match"       "matter"      "may"         "maybe"      
[511] "mean"        "meaning"     "measure"     "meet"        "member"     
[516] "mention"     "middle"      "might"       "mile"        "milk"       
[521] "million"     "mind"        "minister"    "minus"       "minute"     
[526] "miss"        "mister"      "moment"      "monday"      "money"      
[531] "month"       "more"        "morning"     "most"        "mother"     
[536] "motion"      "move"        "mrs"         "much"        "music"      
[541] "must"        "name"        "nation"      "nature"      "near"       
[546] "necessary"   "need"        "never"       "new"         "news"       
[551] "next"        "nice"        "night"       "nine"        "no"         
[556] "non"         "none"        "normal"      "north"       "not"        
[561] "note"        "notice"      "now"         "number"      "obvious"    
[566] "occasion"    "odd"         "of"          "off"         "offer"      
[571] "office"      "often"       "okay"        "old"         "on"         
[576] "once"        "one"         "only"        "open"        "operate"    
[581] "opportunity" "oppose"      "or"          "order"       "organize"   
[586] "original"    "other"       "otherwise"   "ought"       "out"        
[591] "over"        "own"         "pack"        "page"        "paint"      
[596] "pair"        "paper"       "paragraph"   "pardon"      "parent"     
[601] "park"        "part"        "particular"  "party"       "pass"       
[606] "past"        "pay"         "pence"       "pension"     "people"     
[611] "per"         "percent"     "perfect"     "perhaps"     "period"     
[616] "person"      "photograph"  "pick"        "picture"     "piece"      
[621] "place"       "plan"        "play"        "please"      "plus"       
[626] "point"       "police"      "policy"      "politic"     "poor"       
[631] "position"    "positive"    "possible"    "post"        "pound"      
[636] "power"       "practise"    "prepare"     "present"     "press"      
[641] "pressure"    "presume"     "pretty"      "previous"    "price"      
[646] "print"       "private"     "probable"    "problem"     "proceed"    
[651] "process"     "produce"     "product"     "programme"   "project"    
[656] "proper"      "propose"     "protect"     "provide"     "public"     
[661] "pull"        "purpose"     "push"        "put"         "quality"    
[666] "quarter"     "question"    "quick"       "quid"        "quiet"      
[671] "quite"       "radio"       "rail"        "raise"       "range"      
[676] "rate"        "rather"      "read"        "ready"       "real"       
[681] "realise"     "really"      "reason"      "receive"     "recent"     
[686] "reckon"      "recognize"   "recommend"   "record"      "red"        
[691] "reduce"      "refer"       "regard"      "region"      "relation"   
[696] "remember"    "report"      "represent"   "require"     "research"   
[701] "resource"    "respect"     "responsible" "rest"        "result"     
[706] "return"      "rid"         "right"       "ring"        "rise"       
[711] "road"        "role"        "roll"        "room"        "round"      
[716] "rule"        "run"         "safe"        "sale"        "same"       
[721] "saturday"    "save"        "say"         "scheme"      "school"     
[726] "science"     "score"       "scotland"    "seat"        "second"     
[731] "secretary"   "section"     "secure"      "see"         "seem"       
[736] "self"        "sell"        "send"        "sense"       "separate"   
[741] "serious"     "serve"       "service"     "set"         "settle"     
[746] "seven"       "sex"         "shall"       "share"       "she"        
[751] "sheet"       "shoe"        "shoot"       "shop"        "short"      
[756] "should"      "show"        "shut"        "sick"        "side"       
[761] "sign"        "similar"     "simple"      "since"       "sing"       
[766] "single"      "sir"         "sister"      "sit"         "site"       
[771] "situate"     "six"         "size"        "sleep"       "slight"     
[776] "slow"        "small"       "smoke"       "so"          "social"     
[781] "society"     "some"        "son"         "soon"        "sorry"      
[786] "sort"        "sound"       "south"       "space"       "speak"      
[791] "special"     "specific"    "speed"       "spell"       "spend"      
[796] "square"      "staff"       "stage"       "stairs"      "stand"      
[801] "standard"    "start"       "state"       "station"     "stay"       
[806] "step"        "stick"       "still"       "stop"        "story"      
[811] "straight"    "strategy"    "street"      "strike"      "strong"     
[816] "structure"   "student"     "study"       "stuff"       "stupid"     
[821] "subject"     "succeed"     "such"        "sudden"      "suggest"    
[826] "suit"        "summer"      "sun"         "sunday"      "supply"     
[831] "support"     "suppose"     "sure"        "surprise"    "switch"     
[836] "system"      "table"       "take"        "talk"        "tape"       
[841] "tax"         "tea"         "teach"       "team"        "telephone"  
[846] "television"  "tell"        "ten"         "tend"        "term"       
[851] "terrible"    "test"        "than"        "thank"       "the"        
[856] "then"        "there"       "therefore"   "they"        "thing"      
[861] "think"       "thirteen"    "thirty"      "this"        "thou"       
[866] "though"      "thousand"    "three"       "through"     "throw"      
[871] "thursday"    "tie"         "time"        "to"          "today"      
[876] "together"    "tomorrow"    "tonight"     "too"         "top"        
[881] "total"       "touch"       "toward"      "town"        "trade"      
[886] "traffic"     "train"       "transport"   "travel"      "treat"      
[891] "tree"        "trouble"     "true"        "trust"       "try"        
[896] "tuesday"     "turn"        "twelve"      "twenty"      "two"        
[901] "type"        "under"       "understand"  "union"       "unit"       
[906] "unite"       "university"  "unless"      "until"       "up"         
[911] "upon"        "use"         "usual"       "value"       "various"    
[916] "very"        "video"       "view"        "village"     "visit"      
[921] "vote"        "wage"        "wait"        "walk"        "wall"       
[926] "want"        "war"         "warm"        "wash"        "waste"      
[931] "watch"       "water"       "way"         "we"          "wear"       
[936] "wednesday"   "wee"         "week"        "weigh"       "welcome"    
[941] "well"        "west"        "what"        "when"        "where"      
[946] "whether"     "which"       "while"       "white"       "who"        
[951] "whole"       "why"         "wide"        "wife"        "will"       
[956] "win"         "wind"        "window"      "wish"        "with"       
[961] "within"      "without"     "woman"       "wonder"      "wood"       
[966] "word"        "work"        "world"       "worry"       "worse"      
[971] "worth"       "would"       "write"       "wrong"       "year"       
[976] "yes"         "yesterday"   "yet"         "you"         "young"      
head(stringr::words, 100)
  [1] "a"           "able"        "about"       "absolute"    "accept"     
  [6] "account"     "achieve"     "across"      "act"         "active"     
 [11] "actual"      "add"         "address"     "admit"       "advertise"  
 [16] "affect"      "afford"      "after"       "afternoon"   "again"      
 [21] "against"     "age"         "agent"       "ago"         "agree"      
 [26] "air"         "all"         "allow"       "almost"      "along"      
 [31] "already"     "alright"     "also"        "although"    "always"     
 [36] "america"     "amount"      "and"         "another"     "answer"     
 [41] "any"         "apart"       "apparent"    "appear"      "apply"      
 [46] "appoint"     "approach"    "appropriate" "area"        "argue"      
 [51] "arm"         "around"      "arrange"     "art"         "as"         
 [56] "ask"         "associate"   "assume"      "at"          "attend"     
 [61] "authority"   "available"   "aware"       "away"        "awful"      
 [66] "baby"        "back"        "bad"         "bag"         "balance"    
 [71] "ball"        "bank"        "bar"         "base"        "basis"      
 [76] "be"          "bear"        "beat"        "beauty"      "because"    
 [81] "become"      "bed"         "before"      "begin"       "behind"     
 [86] "believe"     "benefit"     "best"        "bet"         "between"    
 [91] "big"         "bill"        "birth"       "bit"         "black"      
 [96] "bloke"       "blood"       "blow"        "blue"        "board"      

14.5 grep function

# Show all words that
# "start with a p, end with a y (with anything in the middle)"
grep(stringr::words, pattern="^p.*y$", value=TRUE)
[1] "party"  "pay"    "play"   "policy" "pretty"
# Starts with a p, ends with a y, nothing in the middle.
# Only matches "py". 
# There are no words that match.
grep(stringr::words, pattern="^py$", value=TRUE)
character(0)
# match any word that starts with p, ends with y and has a single
# character between them
grep(stringr::words, pattern="^p.y$", value=TRUE)
[1] "pay"
# match any word that starts with p, ends with y and
# has exactly two characters between them.
grep(stringr::words, pattern="^p..y$", value=TRUE)
[1] "play"
# match any word that starts with p, ends with y and
# has exactly four characters between them.
grep(stringr::words, pattern="^p....y$", value=TRUE)
[1] "policy" "pretty"
# match any sequence of characters between the p and the y
grep(stringr::words, pattern="^p.*y$", value=TRUE)
[1] "party"  "pay"    "play"   "policy" "pretty"
# starts with a p
grep(stringr::words, pattern="^p", value=TRUE)
 [1] "pack"       "page"       "paint"      "pair"       "paper"     
 [6] "paragraph"  "pardon"     "parent"     "park"       "part"      
[11] "particular" "party"      "pass"       "past"       "pay"       
[16] "pence"      "pension"    "people"     "per"        "percent"   
[21] "perfect"    "perhaps"    "period"     "person"     "photograph"
[26] "pick"       "picture"    "piece"      "place"      "plan"      
[31] "play"       "please"     "plus"       "point"      "police"    
[36] "policy"     "politic"    "poor"       "position"   "positive"  
[41] "possible"   "post"       "pound"      "power"      "practise"  
[46] "prepare"    "present"    "press"      "pressure"   "presume"   
[51] "pretty"     "previous"   "price"      "print"      "private"   
[56] "probable"   "problem"    "proceed"    "process"    "produce"   
[61] "product"    "programme"  "project"    "proper"     "propose"   
[66] "protect"    "provide"    "public"     "pull"       "purpose"   
[71] "push"       "put"       
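
The patterns in this section rely on four metacharacters: ^ anchors the match at the start of the value, $ anchors it at the end, . matches any single character, and * means "zero or more of whatever came just before". Here is a small sketch that isolates each one. The vector w is made up for this illustration (it is not part of stringr::words), and it includes "py" to show that .* can also match nothing at all.

```r
# A small made-up vector, used only for this illustration.
w <- c("pay", "play", "party", "happy", "pie", "py")

starts_p   <- grep("^p",     w, value = TRUE)  # ^  : must start with p
ends_y     <- grep("y$",     w, value = TRUE)  # $  : must end with y
one_middle <- grep("^p.y$",  w, value = TRUE)  # .  : exactly one character between p and y
any_middle <- grep("^p.*y$", w, value = TRUE)  # .* : zero or more characters between p and y

starts_p     # "pay" "play" "party" "pie" "py"
ends_y       # "pay" "play" "party" "happy" "py"
one_middle   # "pay"
any_middle   # "pay" "play" "party" "py"
```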

14.6 grep and grepl

# When value=FALSE grep returns the positions in the vector of 
# values that matched
grep(stringr::words, pattern="^p.*y$", value=TRUE)
[1] "party"  "pay"    "play"   "policy" "pretty"
grep(stringr::words, pattern="^p.*y$", value=FALSE)
[1] 604 607 623 628 643
# default is value=FALSE, so leaving out the value argument gives the same result
grep(stringr::words, pattern="^p.*y$")
[1] 604 607 623 628 643
#@ grep and grepl
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@ grep stands for "Globally search for a Regular Expression and Print the result"
#@
#@ Grep will search through the entries in a character vector and display those
#@ entries that match a specified pattern (see examples below). These patterns
#@ are known as regular expressions or "regex".
#@
#@ The history of grep started with a command that was used on the Unix operating
#@ system. It has been adapted for use with many programming environments. R has
#@ its own version.
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


# grep ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# grep returns character values or the indexes (i.e. position numbers) 
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Find all fruit whose name contains the letter "h"
grep(pattern="h", x=fruit, value=TRUE)   # value=TRUE, show the actual values that match the pattern 
[1] "cherry"       "black cherry" "peach"        "honeydew"    
grep(pattern="h", x=fruit, value=FALSE)  # value=FALSE, show the index (i.e. position) of the values that match 
[1]  9 10 11 17
# grepl ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# grepl returns logical values (i.e. TRUE/FALSE vectors)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

grepl(pattern="h", x=fruit)    # find which values include an "h"
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
[13] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
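
The logical vector returned by grepl is mostly useful as an index. The sketch below shows how it relates to the two forms of grep. It uses a short made-up subset of the fruit data (fruit_sample is a name I introduced just for this example).

```r
# A short subset of the fruit data, used only for this illustration.
fruit_sample <- c("apple", "cherry", "peach", "plum", "honeydew")

has_h <- grepl("h", fruit_sample)   # one TRUE/FALSE per entry

values_with_h    <- fruit_sample[has_h]  # same values as grep(..., value=TRUE)
positions_with_h <- which(has_h)         # same positions as grep(..., value=FALSE)
count_with_h     <- sum(has_h)           # TRUE counts as 1, so this counts the matches

values_with_h      # "cherry" "peach" "honeydew"
positions_with_h   # 2 3 5
count_with_h       # 3
```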

14.7 Summary: 3 ways to use grep or grepl

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# - grep ( regexPattern , x , value=TRUE)  # returns the actual values that match
# - grep ( regexPattern , x , value=FALSE) # returns the index numbers of the values that match
# - grepl ( regexPattern , x )             # returns a logical vector that indicates which values match
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# For now, let's focus on grep(... , value=TRUE) as it is easier to understand the results. 


# The pattern is searched for in the entire entry ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# The pattern is considered "matched" if it appears anywhere in the data value.
# For example:   grep("h", fruit, value=TRUE)
#
# returns all fruit that contain an "h", no matter whether the h is at the 
# beginning, end or middle of the word.
#
# You can change this behavior with the ^ and $ metacharacters (see below)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
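
As a rough illustration of "matched anywhere", the sketch below uses made-up values (words is a hypothetical name) in which the "h" sits at the start, in the middle, and at the end. It uses regexpr, one of the more advanced base R functions, only to report where the first match was found.

```r
# Made-up values: "h" at the start, in the middle, at the end, and absent
words = c("honeydew", "cherry", "peach", "plum")

grepl("h", words)     # TRUE wherever an "h" appears anywhere in the value

# regexpr() reports the position of the first match in each value
# (-1 means that the pattern was not found)
positions = as.integer(regexpr("h", words))
positions
```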

14.8 Spaces are NOT ignored.

Spaces count as part of the pattern. They are NOT ignored.

# fruits that contain a space
grep(pattern=" ", x=fruit, value=TRUE) # all fruit that contain a space
[1] "N. American apple" "S. Korean Fig"     "star fruit"       
[4] "prickly pear"      "Beurre Hardy pear" "black cherry"     
# search for "k "  (i.e. k followed by a space)
grep("k ", fruit, value=TRUE) # "black cherry"
[1] "black cherry"
# search for "k"  (i.e. without a space - JUST a "k")
grep("k", fruit, value=TRUE) # "prickly pear"  "black cherry"  "kumquat"
[1] "prickly pear" "black cherry" "kumquat"     
# search for "ck" 
grep("ck", fruit, value=TRUE) # "prickly pear"  "black cherry"
[1] "prickly pear" "black cherry"
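
One practical consequence is that an accidental space in a pattern changes what matches. The following sketch uses made-up values (items is a hypothetical name) to compare the same letter with and without a neighboring space.

```r
# Made-up values for illustration
items = c("black cherry", "blackberry", "kumquat")

anyK        = grepl("k",  items)   # k anywhere in the value
kThenSpace  = grepl("k ", items)   # k immediately followed by a space
spaceThenK  = grepl(" k", items)   # a space immediately followed by k

anyK
kThenSpace
spaceThenK
```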

14.9 regex patterns do NOT understand “numbers”

Digits are NOT treated as numbers. They are treated the same as any other character. Therefore grep("12", SOME_VECTOR) will match any value that contains a 1 immediately followed by a 2, including "123" and "34321234".

addresses # show all the addresses
 [1] "12345 Sesame Street"               "One Micro$oft Way"                
 [3] "3 Olive St."                       "Two 1st Ave."                     
 [5] "5678 Park Place"                   "Forty Five 2nd Street"            
 [7] "Ninety Nine Cone St. apartment 7"  "9 Main St. apt. 623"              
 [9] "Five Google Drive"                 "4\\2 Rechov Yafo"                 
[11] "Fifteen Watchamacallit Boulevard"  "Nineteen Watchamacallit Boulevard"
[13] "One Main Street Apt 12b"           "Two Main Street Apt 123c"         
[15] "Three Main Street Apt 12343"       "City Hall Lockport, NY"           
grep("23", addresses, value=TRUE)  # matches anything that contains 23
[1] "12345 Sesame Street"         "9 Main St. apt. 623"        
[3] "Two Main Street Apt 123c"    "Three Main Street Apt 12343"
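
If the goal is to compare values as numbers rather than as text, a regular expression is usually not the right tool. The sketch below, which uses a made-up vector (ids is a hypothetical name), contrasts the regex search with a numeric comparison and with an exact text comparison.

```r
# Made-up values that are stored as text
ids = c("12", "123", "34321234", "21")

containing12 = grep("12", ids, value=TRUE)   # every value that contains a 1 followed by a 2
containing12

# To compare the values as numbers, convert them first ...
numericMatch = ids[as.numeric(ids) == 12]
numericMatch

# ... or, to require that the whole value is exactly "12", compare the text directly
exactMatch = ids[ids == "12"]
exactMatch
```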

14.10 case sensitivity

By default, R’s version of grep is case sensitive.

There are a few different approaches for changing the default behavior to instead search case-INsensitively.

14.10.1 (a) case INsensitive searches - use ignore.case argument

The first way - use ignore.case = TRUE. See the code below.

grep("H",fruit, value=TRUE)  # contains a capital "H"
[1] "Beurre Hardy pear"
grep("h",fruit, value=TRUE)  # contains a lowercase "h"
[1] "cherry"       "black cherry" "peach"        "honeydew"    
grep("h", fruit, value=TRUE, ignore.case=TRUE) # contains AnY h
[1] "Beurre Hardy pear" "cherry"            "black cherry"     
[4] "peach"             "honeydew"         
grep("H", fruit, value=TRUE, ignore.case=TRUE) # same thing
[1] "Beurre Hardy pear" "cherry"            "black cherry"     
[4] "peach"             "honeydew"         

14.10.2 (b) case INsensitive searches - character classes - e.g. [aA]

Another way is to use a character class that lists both the CAPITAL and lowercase versions of a letter. For example, [hH] matches either h or H. We will describe the exact meaning of the [square brackets] in much more detail below.

grep("[hH]", fruit, value=TRUE)
[1] "Beurre Hardy pear" "cherry"            "black cherry"     
[4] "peach"             "honeydew"         

14.10.3 (c) case INsensitive searches - use toupper() and tolower()

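A third way is to convert the data to a single case with R’s toupper or tolower functions before searching. Note that with value=TRUE the values returned are the converted values, not the originals.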

msg = "She said 'Hello' to Joe."
msg
[1] "She said 'Hello' to Joe."
toupper(msg)
[1] "SHE SAID 'HELLO' TO JOE."
tolower(msg)
[1] "she said 'hello' to joe."
grep("h", tolower(fruit), value=TRUE)
[1] "beurre hardy pear" "cherry"            "black cherry"     
[4] "peach"             "honeydew"         
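
The three approaches should find the same positions. They differ only in the text that value=TRUE returns for approach (c). The following is a small sketch with made-up values (produce and the by... names are hypothetical) that checks this and shows one way to get the original capitalization back.

```r
# Made-up values for illustration
produce = c("Hardy pear", "cherry", "plum")

byArg   = grep("h", produce, ignore.case=TRUE)   # (a) ignore.case argument
byClass = grep("[hH]", produce)                  # (b) character class
byLower = grep("h", tolower(produce))            # (c) convert to lowercase first

identical(byArg, byClass) && identical(byClass, byLower)

# With value=TRUE, approach (c) returns the lowercased text ...
grep("h", tolower(produce), value=TRUE)

# ... so use the logical vector to subset the ORIGINAL values when the case matters
originals = produce[grepl("h", tolower(produce))]
originals
```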

14.11 str_view() from the stringr package

The str_view function from the stringr package can be very helpful when you’re trying to understand a regular expression. str_view shows exactly what parts of a string match the pattern. See the example below.

# str_view is part of the stringr package
library(stringr)

greetings = c("hi there", "yo dude", "shalom", "bon jour")
cat(greetings, sep="\n")
hi there
yo dude
shalom
bon jour
# match the letter h in each greeting
str_view(greetings, "h")
[1] │ <h>i t<h>ere
[3] │ s<h>alom
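
If stringr is not available, a rough base R approximation of this display can be sketched with gsub (described in the next section) by wrapping each match in angle brackets. This is only an illustration, not how str_view itself is implemented. Unlike str_view, it also shows the values that have no match.

```r
greetings = c("hi there", "yo dude", "shalom", "bon jour")

# Wrap every "h" in angle brackets; values without a match are left unchanged
marked = gsub(pattern="h", replacement="<h>", x=greetings)
cat(marked, sep="\n")
```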

14.12 sub and gsub functions

The sub and gsub functions are used to “substitute” the text that was matched by a regular expression with other text.

The difference between sub() and gsub() is that, within each character value, sub() replaces only the first match of the regex. By contrast, gsub() replaces EVERY match of the regex in each value (the “g” in “gsub” stands for “global”). See the examples below.

IMPORTANT - both sub and gsub return the ENTIRE vector, with only the values that matched the regex being changed. This is different from grep(..., value=TRUE), which returns only those entries in the vector that matched the regular expression.

#@ sub and gsub functions ####
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@
#@ sub (SOME_REGEX_PATTERN, REPLACEMENT, SOME_VECTOR)
#@    sub returns a new vector. The return value is the same as SOME_VECTOR
#@    except that the FIRST match of the pattern in each entry of SOME_VECTOR
#@    is replaced with REPLACEMENT - see the examples below.
#@
#@ gsub (SOME_REGEX_PATTERN, REPLACEMENT, SOME_VECTOR)
#@    same as sub but ALL matches of the pattern are replaced (not just the
#@    first in each entry of the vector) - see the examples below
#@
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# replace the first letter "e" that appears in any fruit with the letter "X"
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# ANSWER
sub(pattern="e", replacement="X", x=fruit)  # "applX"    "N. AmXrican apple"     etc
 [1] "applX"             "N. AmXrican apple" "S. KorXan Fig"    
 [4] "fig"               "star fruit"        "pXar"             
 [7] "prickly pXar"      "BXurre Hardy pear" "chXrry"           
[10] "black chXrry"      "pXach"             "plum"             
[13] "kumquat"           "banana"            "bluXberry"        
[16] "strawbXrry"        "honXydew"          "strawbXrries"     
[19] "yumbXrry"         
# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# replace ALL of the "e"s that appear in any fruit with the letter "X"
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# ANSWER
gsub(pattern="e", replacement="X", fruit)   # "applX"    "N. AmXrican applX"     etc
 [1] "applX"             "N. AmXrican applX" "S. KorXan Fig"    
 [4] "fig"               "star fruit"        "pXar"             
 [7] "prickly pXar"      "BXurrX Hardy pXar" "chXrry"           
[10] "black chXrry"      "pXach"             "plum"             
[13] "kumquat"           "banana"            "bluXbXrry"        
[16] "strawbXrry"        "honXydXw"          "strawbXrriXs"     
[19] "yumbXrry"         
# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# remove all spaces from the addresses
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# ANSWER
gsub(pattern=" ", replacement="", addresses)   # "12345SesameStreet"   "OneMicro$oftWay"   etc.
 [1] "12345SesameStreet"               "OneMicro$oftWay"                
 [3] "3OliveSt."                       "Two1stAve."                     
 [5] "5678ParkPlace"                   "FortyFive2ndStreet"             
 [7] "NinetyNineConeSt.apartment7"     "9MainSt.apt.623"                
 [9] "FiveGoogleDrive"                 "4\\2RechovYafo"                 
[11] "FifteenWatchamacallitBoulevard"  "NineteenWatchamacallitBoulevard"
[13] "OneMainStreetApt12b"             "TwoMainStreetApt123c"           
[15] "ThreeMainStreetApt12343"         "CityHallLockport,NY"            
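
The contrast between sub, gsub and grep can be seen more compactly on a short vector. The following is a minimal sketch with made-up values (snacks, firstOnly, everyMatch and onlyMatches are hypothetical names).

```r
# Made-up values for illustration
snacks = c("banana", "plum", "papaya")

firstOnly  = sub("a", "A", snacks)    # only the first "a" in each value is replaced
everyMatch = gsub("a", "A", snacks)   # every "a" in each value is replaced
firstOnly
everyMatch

# sub and gsub keep every entry, including "plum", which has no match
length(firstOnly) == length(snacks)

# grep(value=TRUE), by contrast, drops the entries that do not match
onlyMatches = grep("a", snacks, value=TRUE)
onlyMatches
```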

We will revisit sub and gsub later with more complex examples …

14.13 strsplit function

strsplit() is used to split a string based on a “delimiter” that appears between the different values. This “delimiter” can be a regular expression. We’ll come back to strsplit later, but let’s introduce it here.

sentences
[1] "He said hi. She said bye. We went to the park."
[2] "I like ice cream! Do you? Sue likes pizza."    
#------------------------------------------------------------------------.
# QUESTION - 
# Use strsplit to split the values in the sentences vector by 
# splitting based on spaces. Assign the result to the variable "sentenceWords".
#
# Write code to get the 3rd "word" from the 1st entry in the sentences 
# vector.
#------------------------------------------------------------------------.

# ANSWER
sentenceWords = strsplit(sentences, split=" ")
sentenceWords
[[1]]
 [1] "He"    "said"  "hi."   "She"   "said"  "bye."  "We"    "went"  "to"   
[10] "the"   "park."

[[2]]
[1] "I"      "like"   "ice"    "cream!" "Do"     "you?"   "Sue"    "likes" 
[9] "pizza."
# Notice that the result is a LIST:
str(sentenceWords)
List of 2
 $ : chr [1:11] "He" "said" "hi." "She" ...
 $ : chr [1:9] "I" "like" "ice" "cream!" ...
# Show the 3rd word in the 1st sentence
sentenceWords[[1]][3]
[1] "hi."
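
Since the result is a list with one character vector per sentence, functions such as lengths and sapply are a natural fit for working with every entry at once. The following sketch redefines the two sentences shown above so that it can be run on its own; the names wordCounts, thirdWords and lastWords are hypothetical.

```r
sentences = c("He said hi. She said bye. We went to the park.",
              "I like ice cream! Do you? Sue likes pizza.")
sentenceWords = strsplit(sentences, split=" ")

# number of "words" in each sentence
wordCounts = lengths(sentenceWords)
wordCounts

# the 3rd "word" of every sentence
thirdWords = sapply(sentenceWords, function(words) words[3])
thirdWords

# the last "word" of every sentence
lastWords = sapply(sentenceWords, function(words) words[length(words)])
lastWords
```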

— practice —

#------------------------------------------------------------------------.
# QUESTION - split each entry in the sentences variable into individual 
# sentences. 
# 
# WARNING - the value of the split argument is interpreted as a
# regular expression pattern. Be careful.
#------------------------------------------------------------------------.

# ANSWER

# 1st attempt - doesn't work. 
strsplit(sentences, ".")
[[1]]
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[26] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

[[2]]
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[26] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
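# This doesn't work since the 2nd argument is a regular expression, and in a
# regular expression a period matches ANY single character. Every character
# is therefore treated as a delimiter, which leaves only empty strings.
# Escaping the period, as in the following, will split based on actual periods.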
sentences
[1] "He said hi. She said bye. We went to the park."
[2] "I like ice cream! Do you? Sue likes pizza."    
strsplit(sentences, "\\.")
[[1]]
[1] "He said hi"           " She said bye"        " We went to the park"

[[2]]
[1] "I like ice cream! Do you? Sue likes pizza"
# Use a "regular expression" to instead split on any of a period, 
# question mark, or exclamation point.
sentences
[1] "He said hi. She said bye. We went to the park."
[2] "I like ice cream! Do you? Sue likes pizza."    
strsplit(sentences, "[.?!]")  # split on any one of .?!
[[1]]
[1] "He said hi"           " She said bye"        " We went to the park"

[[2]]
[1] "I like ice cream" " Do you"          " Sue likes pizza"
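
Notice that most of the pieces above start with a leftover space. Two possible ways to deal with that are sketched below, along with the fixed=TRUE argument, which tells strsplit to treat the delimiter as plain text instead of as a regex. The pattern "[.?!] *" relies on the * quantifier (zero or more of the preceding character), which has not been introduced at this point, so treat this as a preview. The names consumed and trimmed are hypothetical.

```r
sentences = c("He said hi. She said bye. We went to the park.",
              "I like ice cream! Do you? Sue likes pizza.")

# Option 1: let the delimiter also consume any spaces that follow the punctuation
consumed = strsplit(sentences, "[.?!] *")
consumed

# Option 2: split on the punctuation only, then trim each piece with trimws()
trimmed = lapply(strsplit(sentences, "[.?!]"), trimws)
trimmed

# To split on a literal period without escaping it, turn regex off with fixed=TRUE
strsplit("a.b.c", ".", fixed=TRUE)
```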

14.14 Other functions: regmatches, regexec, regexpr, gregexpr

The following are other functions in Base R that use regex. These are a little more advanced. It’s probably better to try researching these functions after first understanding the material presented in this section.

You can search online or see the R documentation for more info about these functions.

  • regmatches
  • regexec
  • regexpr
  • gregexpr

14.15 stringr functions

The stringr package includes many functions for use with character vectors. One example is str_length, which is very similar to the nchar() function in Base R.

# The str_length function is part of the stringr package.
# To use it you must install stringr (or install tidyverse, which is a 
# collection of packages one of which is stringr)
str_length(c("abc", "hello", "I like ice cream!"))
[1]  3  5 17
# This function is very similar to the
# nchar function that is built into base R.
nchar(c("abc", "hello", "I like ice cream!"))  
[1]  3  5 17

The stringr package also includes numerous functions that make use of regular expressions. The following is a table of the stringr functions and the Base R functions that can be used to accomplish similar things.

stringr                            Base R                   Description
str_detect()                       grepl()                  Returns TRUE/FALSE if pattern is found
str_extract(), str_extract_all()   regmatches()             Extract matching patterns
str_match(), str_match_all()       regexec(), regmatches()  Extract matched groups
str_replace(), str_replace_all()   sub(), gsub()            Replace matched patterns
str_split()                        strsplit()               Split string on pattern
str_subset()                       grep(value = TRUE)       Keep strings matching pattern
str_locate(), str_locate_all()     regexpr(), gregexpr()    Find positions of matches
str_count()                        lengths(regmatches())    Count pattern occurrences
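
The last row of the table is abbreviated: regmatches needs to be told where the matches are, which is what gregexpr provides. The following sketch spells out one way to count matches in base R, using made-up values (words, matches and counts are hypothetical names). The stringr comparison runs only if stringr happens to be installed.

```r
# Made-up values for illustration
words = c("apple", "blueberry", "fig")

# gregexpr() finds every match, regmatches() extracts them into a list,
# and lengths() counts how many were extracted for each value
matches = regmatches(words, gregexpr("e", words))
counts  = lengths(matches)
counts

# Optional comparison with stringr, skipped if the package is not installed
if (requireNamespace("stringr", quietly=TRUE)) {
  print(stringr::str_count(words, "e"))
}
```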