14 Intro to Regular Expressions (regex)
rm(list=ls())
14.1 Intro to Regular Expressions (Also Known As “regex”)
Regular expressions (AKA “regex”) are patterns that are used to identify specific types of data. They can be used to search through and slice and dice character (i.e. textual) data in a variety of ways. The best way to understand is just to dive right in, which we will do below.
Before we get started, it’s important to note that regular expressions are not specific to R. Rather, regular expressions are a concept that is available in most programming languages (e.g. Python, Java, etc.) as well as in many text editors and other technical environments (e.g. Bash). Regular expression syntax differs slightly between environments. We will focus mainly on how to use regular expressions in R. However, most of this knowledge carries over to using regex in other languages and technical environments.
14.2 Some resources for learning regular expressions
The following is one of the many online tutorials.
The following R help pages are relevant to regular expressions. However, they are hard to understand without a little intro first. I recommend that you go through the material in this book first or in other tutorials. Then you can refer to the help pages when necessary.
Relevant R help pages
?regex
?grep
?strsplit
R’s stringr package also includes several functions that can be used with regex. For now though, we will focus mainly on the “Base R” regex functions. Once you understand those, you will be well prepared to understand the functions in the stringr package that use regular expressions.
14.3 Data for examples
Before we start, let’s define some data to be used with examples in this file. (NOTE: I made up N. American apple and S. Korean Fig so that I can use them in some examples.)
14.3.1 fruit vector
fruit
[1] "apple" "N. American apple" "S. Korean Fig"
[4] "fig" "star fruit" "pear"
[7] "prickly pear" "Beurre Hardy pear" "cherry"
[10] "black cherry" "peach" "plum"
[13] "kumquat" "banana" "blueberry"
[16] "strawberry" "honeydew" "strawberries"
[19] "yumberry"
14.3.2 addresses vector
addresses
[1] "12345 Sesame Street" "One Micro$oft Way"
[3] "3 Olive St." "Two 1st Ave."
[5] "5678 Park Place" "Forty Five 2nd Street"
[7] "Ninety Nine Cone St. apartment 7" "9 Main St. apt. 623"
[9] "Five Google Drive" "4\\2 Rechov Yafo"
[11] "Fifteen Watchamacallit Boulevard" "Nineteen Watchamacallit Boulevard"
[13] "One Main Street Apt 12b" "Two Main Street Apt 123c"
[15] "Three Main Street Apt 12343" "City Hall Lockport, NY"
# show each address, one per line
cat(addresses, sep="\n")
12345 Sesame Street
One Micro$oft Way
3 Olive St.
Two 1st Ave.
5678 Park Place
Forty Five 2nd Street
Ninety Nine Cone St. apartment 7
9 Main St. apt. 623
Five Google Drive
4\2 Rechov Yafo
Fifteen Watchamacallit Boulevard
Nineteen Watchamacallit Boulevard
One Main Street Apt 12b
Two Main Street Apt 123c
Three Main Street Apt 12343
City Hall Lockport, NY
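One detail in the two listings above is easy to miss: the printed vector shows "4\\2 Rechov Yafo", while cat() shows 4\2 Rechov Yafo. The sketch below uses just that one address to illustrate what is probably going on: in R source code a backslash inside a string must itself be escaped, so the two characters \\ stand for a single backslash in the data. The variable name addr is made up for this illustration.

```r
# One address from the vector above.
# "\\" in R source code is ONE backslash in the actual data.
addr <- "4\\2 Rechov Yafo"

print(addr)       # print() shows the escaped form, in quotes
cat(addr, "\n")   # cat() shows the raw characters
nchar(addr)       # 15: the backslash is counted once
```

This distinction matters later in the chapter, because regular expressions also use backslashes.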
14.3.3 famousQuotes dataframe
famousQuotesDf
Year Speaker
1 -47 Julius Caesar
2 -399 Socrates
3 -399 Socrates
4 1597 Francis Bacon
5 1603 William Shakespeare
6 1637 René Descartes
7 1759 Voltaire
8 1775 Patrick Henry
9 1789 Marie Antoinette
10 1843 Karl Marx
11 1848 Karl Marx
12 1858 Abraham Lincoln
13 1863 Abraham Lincoln
14 1887 Lord Acton
15 1901 Theodore Roosevelt
16 1929 Sherlock Holmes (Arthur Conan Doyle)
17 1929 Albert Einstein
18 1933 Franklin D. Roosevelt
19 1940 Winston Churchill
20 1940 Eleanor Roosevelt
21 1941 Winston Churchill
22 1942 Douglas MacArthur
23 1945 Harry S. Truman
24 1947 Mahatma Gandhi
25 1949 George Orwell
26 1951 Albert Camus
27 1961 John F. Kennedy
28 1963 Martin Luther King Jr.
29 1964 Muhammad Ali
30 1964 Marshall McLuhan
31 1968 Martin Luther King Jr.
32 1969 Neil Armstrong
33 1970 Irina Dunn
34 1980 John Lennon
35 1980 Margaret Thatcher
36 1987 Ronald Reagan
37 1988 George H. W. Bush
38 2008 Barack Obama
Quote
1 I came, I saw, I conquered
2 I know that I know nothing
3 The unexamined life is not worth living
4 Knowledge is power
5 To be or not to be, that is the question
6 I think, therefore I am
7 I disapprove of what you say, but I will defend to the death your right to say it
8 Give me liberty, or give me death
9 Let them eat cake
10 Religion is the opium of the people
11 Workers of the world, unite!
12 A house divided against itself cannot stand
13 Government of the people, by the people, for the people
14 Power tends to corrupt, and absolute power corrupts absolutely
15 Speak softly and carry a big stick
16 Elementary, my dear Watson
17 Imagination is more important than knowledge
18 The only thing we have to fear is fear itself
19 We shall fight on the beaches
20 The future belongs to those who believe in the beauty of their dreams
21 Never, never, never give up
22 I shall return
23 The buck stops here
24 Be the change you wish to see in the world
25 War is peace. Freedom is slavery. Ignorance is strength
26 The only way to deal with an unfree world is to become so absolutely free that your very existence is an act of rebellion
27 Ask not what your country can do for you, ask what you can do for your country
28 I have a dream
29 Float like a butterfly, sting like a bee
30 The medium is the message
31 I've been to the mountaintop
32 That's one small step for man, one giant leap for mankind
33 A woman needs a man like a fish needs a bicycle
34 Life is what happens while you're busy making other plans
35 There is no alternative
36 Mr. Gorbachev, tear down this wall
37 Read my lips: no new taxes
38 Yes we can
14.4 The stringr package
14.4.1 stringr::words vector
The stringr package includes some data that we will use in the examples.
#-------------------------------------------------------------------------.
# NOTE - most of the examples in this file were created using the data above.
# The stringr package also contains some data that can be used to experiment
# with these functions.
#-------------------------------------------------------------------------.
stringr::words
[1] "a" "able" "about" "absolute" "accept"
[6] "account" "achieve" "across" "act" "active"
[11] "actual" "add" "address" "admit" "advertise"
[16] "affect" "afford" "after" "afternoon" "again"
[21] "against" "age" "agent" "ago" "agree"
[26] "air" "all" "allow" "almost" "along"
[31] "already" "alright" "also" "although" "always"
[36] "america" "amount" "and" "another" "answer"
[41] "any" "apart" "apparent" "appear" "apply"
[46] "appoint" "approach" "appropriate" "area" "argue"
[51] "arm" "around" "arrange" "art" "as"
[56] "ask" "associate" "assume" "at" "attend"
[61] "authority" "available" "aware" "away" "awful"
[66] "baby" "back" "bad" "bag" "balance"
[71] "ball" "bank" "bar" "base" "basis"
[76] "be" "bear" "beat" "beauty" "because"
[81] "become" "bed" "before" "begin" "behind"
[86] "believe" "benefit" "best" "bet" "between"
[91] "big" "bill" "birth" "bit" "black"
[96] "bloke" "blood" "blow" "blue" "board"
[101] "boat" "body" "book" "both" "bother"
[106] "bottle" "bottom" "box" "boy" "break"
[111] "brief" "brilliant" "bring" "britain" "brother"
[116] "budget" "build" "bus" "business" "busy"
[121] "but" "buy" "by" "cake" "call"
[126] "can" "car" "card" "care" "carry"
[131] "case" "cat" "catch" "cause" "cent"
[136] "centre" "certain" "chair" "chairman" "chance"
[141] "change" "chap" "character" "charge" "cheap"
[146] "check" "child" "choice" "choose" "Christ"
[151] "Christmas" "church" "city" "claim" "class"
[156] "clean" "clear" "client" "clock" "close"
[161] "closes" "clothe" "club" "coffee" "cold"
[166] "colleague" "collect" "college" "colour" "come"
[171] "comment" "commit" "committee" "common" "community"
[176] "company" "compare" "complete" "compute" "concern"
[181] "condition" "confer" "consider" "consult" "contact"
[186] "continue" "contract" "control" "converse" "cook"
[191] "copy" "corner" "correct" "cost" "could"
[196] "council" "count" "country" "county" "couple"
[201] "course" "court" "cover" "create" "cross"
[206] "cup" "current" "cut" "dad" "danger"
[211] "date" "day" "dead" "deal" "dear"
[216] "debate" "decide" "decision" "deep" "definite"
[221] "degree" "department" "depend" "describe" "design"
[226] "detail" "develop" "die" "difference" "difficult"
[231] "dinner" "direct" "discuss" "district" "divide"
[236] "do" "doctor" "document" "dog" "door"
[241] "double" "doubt" "down" "draw" "dress"
[246] "drink" "drive" "drop" "dry" "due"
[251] "during" "each" "early" "east" "easy"
[256] "eat" "economy" "educate" "effect" "egg"
[261] "eight" "either" "elect" "electric" "eleven"
[266] "else" "employ" "encourage" "end" "engine"
[271] "english" "enjoy" "enough" "enter" "environment"
[276] "equal" "especial" "europe" "even" "evening"
[281] "ever" "every" "evidence" "exact" "example"
[286] "except" "excuse" "exercise" "exist" "expect"
[291] "expense" "experience" "explain" "express" "extra"
[296] "eye" "face" "fact" "fair" "fall"
[301] "family" "far" "farm" "fast" "father"
[306] "favour" "feed" "feel" "few" "field"
[311] "fight" "figure" "file" "fill" "film"
[316] "final" "finance" "find" "fine" "finish"
[321] "fire" "first" "fish" "fit" "five"
[326] "flat" "floor" "fly" "follow" "food"
[331] "foot" "for" "force" "forget" "form"
[336] "fortune" "forward" "four" "france" "free"
[341] "friday" "friend" "from" "front" "full"
[346] "fun" "function" "fund" "further" "future"
[351] "game" "garden" "gas" "general" "germany"
[356] "get" "girl" "give" "glass" "go"
[361] "god" "good" "goodbye" "govern" "grand"
[366] "grant" "great" "green" "ground" "group"
[371] "grow" "guess" "guy" "hair" "half"
[376] "hall" "hand" "hang" "happen" "happy"
[381] "hard" "hate" "have" "he" "head"
[386] "health" "hear" "heart" "heat" "heavy"
[391] "hell" "help" "here" "high" "history"
[396] "hit" "hold" "holiday" "home" "honest"
[401] "hope" "horse" "hospital" "hot" "hour"
[406] "house" "how" "however" "hullo" "hundred"
[411] "husband" "idea" "identify" "if" "imagine"
[416] "important" "improve" "in" "include" "income"
[421] "increase" "indeed" "individual" "industry" "inform"
[426] "inside" "instead" "insure" "interest" "into"
[431] "introduce" "invest" "involve" "issue" "it"
[436] "item" "jesus" "job" "join" "judge"
[441] "jump" "just" "keep" "key" "kid"
[446] "kill" "kind" "king" "kitchen" "knock"
[451] "know" "labour" "lad" "lady" "land"
[456] "language" "large" "last" "late" "laugh"
[461] "law" "lay" "lead" "learn" "leave"
[466] "left" "leg" "less" "let" "letter"
[471] "level" "lie" "life" "light" "like"
[476] "likely" "limit" "line" "link" "list"
[481] "listen" "little" "live" "load" "local"
[486] "lock" "london" "long" "look" "lord"
[491] "lose" "lot" "love" "low" "luck"
[496] "lunch" "machine" "main" "major" "make"
[501] "man" "manage" "many" "mark" "market"
[506] "marry" "match" "matter" "may" "maybe"
[511] "mean" "meaning" "measure" "meet" "member"
[516] "mention" "middle" "might" "mile" "milk"
[521] "million" "mind" "minister" "minus" "minute"
[526] "miss" "mister" "moment" "monday" "money"
[531] "month" "more" "morning" "most" "mother"
[536] "motion" "move" "mrs" "much" "music"
[541] "must" "name" "nation" "nature" "near"
[546] "necessary" "need" "never" "new" "news"
[551] "next" "nice" "night" "nine" "no"
[556] "non" "none" "normal" "north" "not"
[561] "note" "notice" "now" "number" "obvious"
[566] "occasion" "odd" "of" "off" "offer"
[571] "office" "often" "okay" "old" "on"
[576] "once" "one" "only" "open" "operate"
[581] "opportunity" "oppose" "or" "order" "organize"
[586] "original" "other" "otherwise" "ought" "out"
[591] "over" "own" "pack" "page" "paint"
[596] "pair" "paper" "paragraph" "pardon" "parent"
[601] "park" "part" "particular" "party" "pass"
[606] "past" "pay" "pence" "pension" "people"
[611] "per" "percent" "perfect" "perhaps" "period"
[616] "person" "photograph" "pick" "picture" "piece"
[621] "place" "plan" "play" "please" "plus"
[626] "point" "police" "policy" "politic" "poor"
[631] "position" "positive" "possible" "post" "pound"
[636] "power" "practise" "prepare" "present" "press"
[641] "pressure" "presume" "pretty" "previous" "price"
[646] "print" "private" "probable" "problem" "proceed"
[651] "process" "produce" "product" "programme" "project"
[656] "proper" "propose" "protect" "provide" "public"
[661] "pull" "purpose" "push" "put" "quality"
[666] "quarter" "question" "quick" "quid" "quiet"
[671] "quite" "radio" "rail" "raise" "range"
[676] "rate" "rather" "read" "ready" "real"
[681] "realise" "really" "reason" "receive" "recent"
[686] "reckon" "recognize" "recommend" "record" "red"
[691] "reduce" "refer" "regard" "region" "relation"
[696] "remember" "report" "represent" "require" "research"
[701] "resource" "respect" "responsible" "rest" "result"
[706] "return" "rid" "right" "ring" "rise"
[711] "road" "role" "roll" "room" "round"
[716] "rule" "run" "safe" "sale" "same"
[721] "saturday" "save" "say" "scheme" "school"
[726] "science" "score" "scotland" "seat" "second"
[731] "secretary" "section" "secure" "see" "seem"
[736] "self" "sell" "send" "sense" "separate"
[741] "serious" "serve" "service" "set" "settle"
[746] "seven" "sex" "shall" "share" "she"
[751] "sheet" "shoe" "shoot" "shop" "short"
[756] "should" "show" "shut" "sick" "side"
[761] "sign" "similar" "simple" "since" "sing"
[766] "single" "sir" "sister" "sit" "site"
[771] "situate" "six" "size" "sleep" "slight"
[776] "slow" "small" "smoke" "so" "social"
[781] "society" "some" "son" "soon" "sorry"
[786] "sort" "sound" "south" "space" "speak"
[791] "special" "specific" "speed" "spell" "spend"
[796] "square" "staff" "stage" "stairs" "stand"
[801] "standard" "start" "state" "station" "stay"
[806] "step" "stick" "still" "stop" "story"
[811] "straight" "strategy" "street" "strike" "strong"
[816] "structure" "student" "study" "stuff" "stupid"
[821] "subject" "succeed" "such" "sudden" "suggest"
[826] "suit" "summer" "sun" "sunday" "supply"
[831] "support" "suppose" "sure" "surprise" "switch"
[836] "system" "table" "take" "talk" "tape"
[841] "tax" "tea" "teach" "team" "telephone"
[846] "television" "tell" "ten" "tend" "term"
[851] "terrible" "test" "than" "thank" "the"
[856] "then" "there" "therefore" "they" "thing"
[861] "think" "thirteen" "thirty" "this" "thou"
[866] "though" "thousand" "three" "through" "throw"
[871] "thursday" "tie" "time" "to" "today"
[876] "together" "tomorrow" "tonight" "too" "top"
[881] "total" "touch" "toward" "town" "trade"
[886] "traffic" "train" "transport" "travel" "treat"
[891] "tree" "trouble" "true" "trust" "try"
[896] "tuesday" "turn" "twelve" "twenty" "two"
[901] "type" "under" "understand" "union" "unit"
[906] "unite" "university" "unless" "until" "up"
[911] "upon" "use" "usual" "value" "various"
[916] "very" "video" "view" "village" "visit"
[921] "vote" "wage" "wait" "walk" "wall"
[926] "want" "war" "warm" "wash" "waste"
[931] "watch" "water" "way" "we" "wear"
[936] "wednesday" "wee" "week" "weigh" "welcome"
[941] "well" "west" "what" "when" "where"
[946] "whether" "which" "while" "white" "who"
[951] "whole" "why" "wide" "wife" "will"
[956] "win" "wind" "window" "wish" "with"
[961] "within" "without" "woman" "wonder" "wood"
[966] "word" "work" "world" "worry" "worse"
[971] "worth" "would" "write" "wrong" "year"
[976] "yes" "yesterday" "yet" "you" "young"
head(stringr::words, 100)
[1] "a" "able" "about" "absolute" "accept"
[6] "account" "achieve" "across" "act" "active"
[11] "actual" "add" "address" "admit" "advertise"
[16] "affect" "afford" "after" "afternoon" "again"
[21] "against" "age" "agent" "ago" "agree"
[26] "air" "all" "allow" "almost" "along"
[31] "already" "alright" "also" "although" "always"
[36] "america" "amount" "and" "another" "answer"
[41] "any" "apart" "apparent" "appear" "apply"
[46] "appoint" "approach" "appropriate" "area" "argue"
[51] "arm" "around" "arrange" "art" "as"
[56] "ask" "associate" "assume" "at" "attend"
[61] "authority" "available" "aware" "away" "awful"
[66] "baby" "back" "bad" "bag" "balance"
[71] "ball" "bank" "bar" "base" "basis"
[76] "be" "bear" "beat" "beauty" "because"
[81] "become" "bed" "before" "begin" "behind"
[86] "believe" "benefit" "best" "bet" "between"
[91] "big" "bill" "birth" "bit" "black"
[96] "bloke" "blood" "blow" "blue" "board"
14.5 grep function
# Show all words that
# "start with a p, end with a y (with anything in the middle)"
grep(stringr::words, pattern="^p.*y$", value=TRUE)
[1] "party" "pay" "play" "policy" "pretty"
# Starts with a p, ends with a y, nothing in the middle.
# Only matches "py".
# There are no words that match.
grep(stringr::words, pattern="^py$", value=TRUE)
character(0)
# match any word that starts with p, ends with y and has a single
# character between them
grep(stringr::words, pattern="^p.y$", value=TRUE)
[1] "pay"
# match any word that starts with p, ends with y and
# has exactly two characters between them.
grep(stringr::words, pattern="^p..y$", value=TRUE)
[1] "play"
# match any word that starts with p, ends with y and
# has exactly four characters between them.
grep(stringr::words, pattern="^p....y$", value=TRUE)
[1] "policy" "pretty"
# match any sequence of characters between the p and the y
grep(stringr::words, pattern="^p.*y$", value=TRUE)
[1] "party" "pay" "play" "policy" "pretty"
# starts with a p
grep(stringr::words, pattern="^p", value=TRUE)
[1] "pack" "page" "paint" "pair" "paper"
[6] "paragraph" "pardon" "parent" "park" "part"
[11] "particular" "party" "pass" "past" "pay"
[16] "pence" "pension" "people" "per" "percent"
[21] "perfect" "perhaps" "period" "person" "photograph"
[26] "pick" "picture" "piece" "place" "plan"
[31] "play" "please" "plus" "point" "police"
[36] "policy" "politic" "poor" "position" "positive"
[41] "possible" "post" "pound" "power" "practise"
[46] "prepare" "present" "press" "pressure" "presume"
[51] "pretty" "previous" "price" "print" "private"
[56] "probable" "problem" "proceed" "process" "produce"
[61] "product" "programme" "project" "proper" "propose"
[66] "protect" "provide" "public" "pull" "purpose"
[71] "push" "put"
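The patterns in this section rely on four metacharacters: ^ (start of the value), $ (end of the value), . (any single character) and * (zero or more of the preceding item). The following is a minimal sketch of each one on a small made-up vector; the name toyWords and its contents are assumptions chosen only for illustration.

```r
# toyWords is a made-up vector used only for this illustration.
toyWords <- c("pay", "play", "happy", "pylon", "py")

grep("^p",     toyWords, value=TRUE)  # starts with p:  "pay" "play" "pylon" "py"
grep("y$",     toyWords, value=TRUE)  # ends with y:    "pay" "play" "happy" "py"
grep("^p.y$",  toyWords, value=TRUE)  # p, exactly one character, y:   "pay"
grep("^p.*y$", toyWords, value=TRUE)  # p, zero or more characters, y: "pay" "play" "py"
```

Note that "py" matches the last pattern because .* is allowed to match zero characters.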
14.6 grep and grepl
# When value=TRUE, grep returns the values that matched.
# When value=FALSE, grep returns the positions in the vector of the
# values that matched.
grep(stringr::words, pattern="^p.*y$", value=TRUE)
[1] "party" "pay" "play" "policy" "pretty"
grep(stringr::words, pattern="^p.*y$", value=FALSE)
[1] 604 607 623 628 643
# The default is value=FALSE, so the argument can be left out.
grep(stringr::words, pattern="^p.*y$")
[1] 604 607 623 628 643
#@ grep and grepl
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@ grep stands for "Globally search for a Regular Expression and Print the result"
#@
#@ Grep will search through the entries in a character vector and display those
#@ entries that match a specified pattern (see examples below). These patterns
#@ are known as regular expressions or "regex".
#@
#@ The history of grep started with a command that was used on the Unix operating
#@ system. It has been adapted for use with many programming environments. R has
#@ its own version.
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# grep ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# grep returns character values or the indexes (i.e. position numbers)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Find all fruit whose name contains the letter "h"
grep(pattern="h", x=fruit, value=TRUE) # value=TRUE, show the actual values that match the pattern
[1] "cherry" "black cherry" "peach" "honeydew"
grep(pattern="h", x=fruit, value=FALSE) # value=FALSE, show the index (i.e. position) of the values that match
[1] 9 10 11 17
# grepl ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# grepl returns logical values (i.e. TRUE/FALSE vectors)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
grepl(pattern="h", x=fruit) # find which values include an "h"
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
[13] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
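A logical vector is mostly useful as an index. The sketch below, on a four-item subset of the fruit vector (the names someFruit and hasH are made up for this illustration), shows how the result of grepl relates to the two forms of grep.

```r
# A four-item subset of the fruit vector, so the sketch runs on its own.
someFruit <- c("apple", "cherry", "peach", "plum")

hasH <- grepl("h", someFruit)   # logical vector, one entry per fruit

someFruit[hasH]    # same values as grep("h", someFruit, value=TRUE)
which(hasH)        # same positions as grep("h", someFruit)
sum(hasH)          # how many entries matched: 2
someFruit[!hasH]   # entries that do NOT contain an "h": "apple" "plum"
```

The last line is a common reason to prefer grepl: negating a logical vector is simpler than negating a vector of positions.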
14.7 Summary: 3 ways to use grep or grepl
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# - grep ( regexPattern , SOME_VECTOR , value=TRUE) # returns the actual values that match
# - grep ( regexPattern , SOME_VECTOR , value=FALSE) # returns the index numbers of the values that match
# - grepl ( regexPattern , SOME_VECTOR ) # returns a logical vector that indicates which values match
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# For now, let's focus on grep(... , value=TRUE) as it is easier to understand the results.
# The pattern is searched for in the entire entry ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# The pattern is considered "matched" if it appears anywhere in the data value.
# For example: grep("h", fruit)
#
# returns all fruit that contain an "h", no matter whether the h is at the
# beginning, end or middle of the word.
#
# You can change this behavior with the ^ and $ metacharacters (see below)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~14.8 Spaces are NOT ignored.
Spaces count as part of the pattern. They are NOT ignored.
# fruits that contain a space
grep(pattern=" ", x=fruit, value=TRUE) # all fruit that contain a space
[1] "N. American apple" "S. Korean Fig" "star fruit"
[4] "prickly pear" "Beurre Hardy pear" "black cherry"
# search for "k " (i.e. k followed by a space)
grep("k ", fruit, value=TRUE) # "black cherry"
[1] "black cherry"
# search for "k" (i.e. without a space - JUST a "k")
grep("k", fruit, value=TRUE) # "prickly pear" "black cherry" "kumquat"
[1] "prickly pear" "black cherry" "kumquat"
# search for "ck"
grep("ck", fruit, value=TRUE) # "prickly pear" "black cherry"
[1] "prickly pear" "black cherry"
14.9 regex patterns do NOT understand “numbers”
Digits are NOT treated as numbers. They are treated the same as any other character. Therefore grep("12", SOME_VECTOR) will match any value that contains a 1 immediately followed by a 2, including “123” and “34321234”.
addresses # show all the addresses
[1] "12345 Sesame Street" "One Micro$oft Way"
[3] "3 Olive St." "Two 1st Ave."
[5] "5678 Park Place" "Forty Five 2nd Street"
[7] "Ninety Nine Cone St. apartment 7" "9 Main St. apt. 623"
[9] "Five Google Drive" "4\\2 Rechov Yafo"
[11] "Fifteen Watchamacallit Boulevard" "Nineteen Watchamacallit Boulevard"
[13] "One Main Street Apt 12b" "Two Main Street Apt 123c"
[15] "Three Main Street Apt 12343" "City Hall Lockport, NY"
grep("23", addresses, value=TRUE) # matches anything that contains 23
[1] "12345 Sesame Street" "9 Main St. apt. 623"
[3] "Two Main Street Apt 123c" "Three Main Street Apt 12343"
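If the goal is to find a standalone 23 rather than any value that happens to contain those two digits, one possible approach is the \b "word edge" marker described in ?regex. This is a preview of syntax not yet covered in this section, and the vector apts below is made up for illustration.

```r
# Made-up values for illustration.
apts <- c("Apt 23", "Apt 123", "Apt 623", "23 Main St.")

# Plain "23": every value contains a 2 followed by a 3.
grepl("23", apts)         # TRUE TRUE TRUE TRUE

# "\\b" matches at the edge of a word, so "123" and "623" no longer match.
# (The backslash is doubled because it is written inside an R string.)
grepl("\\b23\\b", apts)   # TRUE FALSE FALSE TRUE
```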
14.10 case sensitivity
By default, R’s version of grep is case sensitive.
There are a few different approaches for changing the default behavior to instead search case-INsensitively.
14.10.1 (a) case INsensitive searches - use ignore.case argument
The first way - use ignore.case = TRUE. See the code below.
grep("H",fruit, value=TRUE) # contains a capital "H"
[1] "Beurre Hardy pear"
grep("h",fruit, value=TRUE) # contains a lowercase "h"
[1] "cherry" "black cherry" "peach" "honeydew"
grep("h", fruit, value=TRUE, ignore.case=TRUE) # contains AnY h
[1] "Beurre Hardy pear" "cherry" "black cherry"
[4] "peach" "honeydew"
grep("H", fruit, value=TRUE, ignore.case=TRUE) # same thing
[1] "Beurre Hardy pear" "cherry" "black cherry"
[4] "peach" "honeydew"
14.10.2 (b) case INsensitive searches - character classes - e.g. [aA]
Another way to search for both CAPITAL and lowercase characters is to use a character class such as [hH], which indicates that either h or H is valid to be matched. We will describe the exact meaning of the [square brackets] in a lot more detail below.
grep("[hH]", fruit, value=TRUE)
[1] "Beurre Hardy pear" "cherry" "black cherry"
[4] "peach" "honeydew"
14.10.3 (c) case INsensitive searches - use toupper() and tolower()
A third way is to convert the data to a single case with R’s toupper or tolower functions before searching.
msg = "She said 'Hello' to Joe."
msg
[1] "She said 'Hello' to Joe."
toupper(msg)
[1] "SHE SAID 'HELLO' TO JOE."
tolower(msg)
[1] "she said 'hello' to joe."
grep("h", tolower(fruit), value=TRUE)
[1] "beurre hardy pear" "cherry" "black cherry"
[4] "peach" "honeydew"
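The three approaches find the same entries, but they do not all return the same text. In the output above, the tolower approach returned "beurre hardy pear" in lowercase, because grep searched (and returned values from) the converted copy. The sketch below compares the approaches on a small subset of the fruit vector and shows one possible way to keep the original spelling; the variable names are made up for this illustration.

```r
# A small subset of the fruit vector, so the sketch runs on its own.
someFruit <- c("Beurre Hardy pear", "cherry", "plum")

viaIgnoreCase <- grep("h", someFruit, value=TRUE, ignore.case=TRUE)
viaClass      <- grep("[hH]", someFruit, value=TRUE)
viaTolower    <- grep("h", tolower(someFruit), value=TRUE)  # values come back lowercased
viaIndex      <- someFruit[grepl("h", tolower(someFruit))]  # original spelling preserved

identical(viaIgnoreCase, viaClass)    # TRUE
identical(viaIgnoreCase, viaIndex)    # TRUE
identical(viaIgnoreCase, viaTolower)  # FALSE: "Beurre Hardy pear" vs "beurre hardy pear"
```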
14.11 str_view() from the stringr package
The str_view function from the stringr package can be very helpful when you’re trying to understand a regular expression. str_view shows exactly what parts of a string match the pattern. See the example below.
# str_view is part of the stringr package
library(stringr)
greetings = c("hi there", "yo dude", "shalom", "bon jour")
cat(greetings, sep="\n")
hi there
yo dude
shalom
bon jour
# match the letter h in each greeting
str_view(greetings, "h")
[1] │ <h>i t<h>ere
[3] │ s<h>alom
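If stringr is not installed, a rough imitation of this display is possible in Base R for a simple literal pattern like "h". The sketch below is only an approximation of what str_view shows, not how str_view works internally; it uses gsub, which is introduced in the next section, and the name marked is made up.

```r
greetings <- c("hi there", "yo dude", "shalom", "bon jour")

# Wrap every "h" in angle brackets.
marked <- gsub("h", "<h>", greetings)

# Like the str_view output above, keep only the greetings that had a match.
marked[grepl("h", greetings)]   # "<h>i t<h>ere" "s<h>alom"
```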
14.12 sub and gsub functions
The sub and gsub functions are used to “substitute” the text that was matched by a regular expression with other text.
The difference between sub() and gsub() is that, within a single character value, sub() replaces only the first part of the value that matches the regex. By contrast, gsub() replaces EVERY part of the character value that matches the regex (the “g” in “gsub” stands for “global”). See the examples below.
IMPORTANT - both sub and gsub return the ENTIRE vector, with only the values that matched the regex being changed. This is different from grep with value=TRUE, which returns only those entries in the vector that matched the regular expression.
#@ sub and gsub functions ####
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@
#@ sub (SOME_REGEX_PATTERN, REPLACEMENT, SOME_VECTOR)
#@ sub returns a new vector. The return value is the same as SOME_VECTOR
#@ except that the FIRST match of the pattern in each entry of SOME_VECTOR
#@ is replaced with REPLACEMENT - see the examples below.
#@
#@ gsub (SOME_REGEX_PATTERN, REPLACEMENT, SOME_VECTOR)
#@ same as sub but ALL matches of the pattern are replaced (not just the
#@ first in each entry of the vector) - see the examples below
#@
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# replace the first letter "e" that appears in any fruit with the letter "X"
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ANSWER
sub(pattern="e", replacement="X", x=fruit) # "applX" "N. AmXrican apple" etc
[1] "applX" "N. AmXrican apple" "S. KorXan Fig"
[4] "fig" "star fruit" "pXar"
[7] "prickly pXar" "BXurre Hardy pear" "chXrry"
[10] "black chXrry" "pXach" "plum"
[13] "kumquat" "banana" "bluXberry"
[16] "strawbXrry" "honXydew" "strawbXrries"
[19] "yumbXrry"
# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# replace ALL of the "e"s that appear in any fruit with the letter "X"
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ANSWER
gsub(pattern="e", replacement="X", fruit) # "applX" "N. AmXrican applX" etc
[1] "applX" "N. AmXrican applX" "S. KorXan Fig"
[4] "fig" "star fruit" "pXar"
[7] "prickly pXar" "BXurrX Hardy pXar" "chXrry"
[10] "black chXrry" "pXach" "plum"
[13] "kumquat" "banana" "bluXbXrry"
[16] "strawbXrry" "honXydXw" "strawbXrriXs"
[19] "yumbXrry"
# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# remove all spaces from the addresses
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ANSWER
gsub(pattern=" ", replacement="", addresses) # "12345SesameStreet" "OneMicro$oftWay" etc.
[1] "12345SesameStreet" "OneMicro$oftWay"
[3] "3OliveSt." "Two1stAve."
[5] "5678ParkPlace" "FortyFive2ndStreet"
[7] "NinetyNineConeSt.apartment7" "9MainSt.apt.623"
[9] "FiveGoogleDrive" "4\\2RechovYafo"
[11] "FifteenWatchamacallitBoulevard" "NineteenWatchamacallitBoulevard"
[13] "OneMainStreetApt12b" "TwoMainStreetApt123c"
[15] "ThreeMainStreetApt12343" "CityHallLockport,NY"
We will revisit sub and gsub later with more complex examples …
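As a compact recap of the two points made in this section, here is a minimal sketch on a made-up two-item vector x. It shows the first-match versus every-match difference, and that entries without a match come back unchanged rather than being dropped.

```r
x <- c("banana", "fig")   # made-up values; "fig" contains no "a"

sub("a", "X", x)    # "bXnana" "fig"   (only the first "a" is replaced)
gsub("a", "X", x)   # "bXnXnX" "fig"   (every "a" is replaced)

# Neither function changes x itself; assign the result to keep it.
xNew <- gsub("a", "X", x)
length(xNew) == length(x)   # TRUE: the whole vector is returned
```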
14.13 strsplit function
strsplit() is used to split a string based on a “delimiter” that appears between the different values. This “delimiter” can be a regular expression. We’ll come back to strsplit later, but let’s introduce it here.
sentences
[1] "He said hi. She said bye. We went to the park."
[2] "I like ice cream! Do you? Sue likes pizza."
#------------------------------------------------------------------------.
# QUESTION -
# Use strsplit to split the values in the sentences vector by
# splitting based on spaces. Assign the result to the variable "sentenceWords".
#
# Write code to get the 3rd "word" from the 1st entry in the sentences
# vector.
#------------------------------------------------------------------------.
# ANSWER
sentenceWords = strsplit(sentences, split=" ")
sentenceWords
[[1]]
[1] "He" "said" "hi." "She" "said" "bye." "We" "went" "to"
[10] "the" "park."
[[2]]
[1] "I" "like" "ice" "cream!" "Do" "you?" "Sue" "likes"
[9] "pizza."
# Notice that the result is a LIST:
str(sentenceWords)
List of 2
$ : chr [1:11] "He" "said" "hi." "She" ...
$ : chr [1:9] "I" "like" "ice" "cream!" ...
# Show the 3rd word in the 1st sentence
sentenceWords[[1]][3]
[1] "hi."
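Because strsplit returns a list, getting the same piece from every entry takes one more step. The sketch below shows one possible way using sapply and lengths. The definition of sentences is reconstructed from the printed output above, since the original definition is not shown.

```r
# Reconstructed from the printed output above.
sentences <- c("He said hi. She said bye. We went to the park.",
               "I like ice cream! Do you? Sue likes pizza.")
sentenceWords <- strsplit(sentences, split=" ")

# Number of "words" in each entry
lengths(sentenceWords)                            # 11 9

# The 3rd "word" of EVERY entry, returned as a character vector
sapply(sentenceWords, function(words) words[3])   # "hi." "ice"
```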
— practice —
#------------------------------------------------------------------------.
# QUESTION - split each entry in the sentences variable into individual
# sentences.
#
# WARNING - the value of the split argument is interpreted as a
# regular expression pattern. Be careful.
#------------------------------------------------------------------------.
# ANSWER
# 1st attempt - doesn't work.
strsplit(sentences, ".")
[[1]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[26] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[[2]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[26] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
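# This doesn't work since the 2nd argument is a regular expression, and in
# a regular expression a period matches ANY single character. Every character
# is therefore treated as a delimiter, which leaves only empty strings.
# Escaping the period with a double backslash makes it match an actual period.
# The following will split based on periods.
sentences
[1] "He said hi. She said bye. We went to the park."
[2] "I like ice cream! Do you? Sue likes pizza."
strsplit(sentences, "\\.")
[[1]]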
[1] "He said hi" " She said bye" " We went to the park"
[[2]]
[1] "I like ice cream! Do you? Sue likes pizza"
# Use a "regular expression" to instead split on any of a period,
# question mark, or exclamation point.
sentences
[1] "He said hi. She said bye. We went to the park."
[2] "I like ice cream! Do you? Sue likes pizza."
strsplit(sentences, "[.?!]") # split on any one of .?!
[[1]]
[1] "He said hi" " She said bye" " We went to the park"
[[2]]
[1] "I like ice cream" " Do you" " Sue likes pizza"
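Two follow-ups to the examples above, offered as a sketch rather than the only way to do it. First, strsplit has a fixed=TRUE argument that treats the delimiter as plain text instead of a regular expression, which is an alternative to escaping the period. Second, the pieces above start with a leading space, which Base R's trimws function can remove. The definition of sentences is reconstructed from the printed output, and the other variable names are made up.

```r
# Reconstructed from the printed output above.
sentences <- c("He said hi. She said bye. We went to the park.",
               "I like ice cream! Do you? Sue likes pizza.")

# Escaping the period and using fixed=TRUE give the same result.
viaEscape <- strsplit(sentences, "\\.")
viaFixed  <- strsplit(sentences, ".", fixed=TRUE)
identical(viaEscape, viaFixed)   # TRUE

# Split on . ? or ! and then strip the leading spaces from every piece.
pieces  <- strsplit(sentences, "[.?!]")
cleaned <- lapply(pieces, trimws)
cleaned[[2]]   # "I like ice cream" "Do you" "Sue likes pizza"
```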
14.14 Other functions: regmatches, regexec, regexpr, gregexpr
The following are other functions in Base R that use regex. These are a little more advanced. It’s probably better to research these functions after first understanding the material presented in this section.
You can search online or see the R documentation for more info about these functions.
- regmatches
- regexec
- regexpr
- gregexpr
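To give a rough idea of how these functions fit together, the sketch below pulls the numbers out of one of the addresses. It is only a preview: the pattern [0-9]+ (one or more digits) uses syntax that has not been explained yet, and the variable names are made up for this illustration.

```r
addr <- "9 Main St. apt. 623"   # one of the addresses above

# regexpr finds the position and length of the FIRST match.
firstMatch <- regexpr("[0-9]+", addr)
as.integer(firstMatch)              # 1  (the match starts at character 1)
attr(firstMatch, "match.length")    # 1  (the match is one character long)

# gregexpr does the same for ALL matches and returns a list.
allMatches <- gregexpr("[0-9]+", addr)

# regmatches turns those positions into the matched text.
regmatches(addr, firstMatch)        # "9"
regmatches(addr, allMatches)[[1]]   # "9" "623"
```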
14.15 stringr functions
The stringr package includes many functions for use with character vectors. One example is str_length, which is very similar to the nchar() function in Base R.
# The str_length function is part of the stringr package.
# To use it you must install stringr (or install tidyverse, which is a
# collection of packages, one of which is stringr) and load it with
# library(stringr).
str_length(c("abc", "hello", "I like ice cream!"))
[1] 3 5 17
# This function is very similar to the
# nchar function that is built into base R.
nchar(c("abc", "hello", "I like ice cream!"))
[1] 3 5 17
The stringr package also includes numerous functions that make use of regular expressions. The following is a table of the stringr functions and the Base R functions that can be used to accomplish similar things.
| stringr | Base R | Description |
|---|---|---|
| str_detect() | grepl() | Returns TRUE/FALSE if pattern is found |
| str_extract(), str_extract_all() | regmatches() | Extract matching patterns |
| str_match(), str_match_all() | regexec(), regmatches() | Extract matched groups |
| str_replace(), str_replace_all() | sub(), gsub() | Replace matched patterns |
| str_split() | strsplit() | Split string on pattern |
| str_subset() | grep(value = TRUE) | Keep strings matching pattern |
| str_locate(), str_locate_all() | regexpr(), gregexpr() | Find positions of matches |
| str_count() | lengths(regmatches()) | Count pattern occurrences |
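The last row of the table is abbreviated, so here is a sketch of how the Base R side of a few rows might look when written out in full. Only Base R functions are run; the stringr names in the comments are the counterparts listed in the table. The vector x and the name baseCount are made up for this illustration.

```r
x <- c("apple", "cherry", "strawberries", "plum")   # made-up values

grepl("e", x)              # counterpart of str_detect(x, "e")
grep("e", x, value=TRUE)   # counterpart of str_subset(x, "e")
gsub("e", "X", x)          # counterpart of str_replace_all(x, "e", "X")

# Counting matches: gregexpr locates every match, regmatches extracts
# the matched text, and lengths counts the pieces for each entry.
baseCount <- lengths(regmatches(x, gregexpr("e", x)))
baseCount                  # 1 1 2 0
```

Entries with no match, such as "plum", get a count of 0 because regmatches returns an empty character vector for them.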