17  (a) word boundaries \b (b) PATTERN1|PATTERN2 (c) paste0 to build long patterns

17.1 WORD BOUNDARIES: \b

\b matches a “word boundary” (remember in R use \\b)

  • You can use \b before a pattern, e.g. \bSOME_PATTERN (or in R code use double backslash: \\bSOME_PATTERN) to mean that the pattern must match at the beginning of a word that is in the text - see examples below.

  • You can also use SOME_PATTERN\b to mean that the pattern must come at the end of a word - see examples below.

  • You can use \bSOME_PATTERN\b to mean that SOME_PATTERN must match an entire word (i.e. there is a word boundary both before and after the pattern).

Note

A “word boundary” is not a particular character such as a space or comma, but rather is a position in the text.
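One way to see that \b is a position rather than a character is to compare it with a pattern that contains an actual space. The following is a small base-R sketch; the vector txt is made up for illustration and is not part of the data used elsewhere in this chapter.

```r
# Hypothetical sample text (an assumption for this illustration only)
txt <- c("by the seashore", "seashore")

# " sea" requires a real space character, so it fails at the start of a string
with_space <- grepl(" sea", txt)

# "\\bsea" only requires a word boundary, which also exists at the start
with_boundary <- grepl("\\bsea", txt)

# The boundary is not part of the matched text: only "sea" is returned
matched <- regmatches(txt, regexpr("\\bsea", txt))

print(with_space)
print(with_boundary)
print(matched)
```

The space pattern should miss the second string, while the \b pattern should find both, and in both cases the extracted text is just "sea" with no space attached.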

Note

A “word” may include letters, digits or underscores (see the discussion of the \w and \W metacharacters above.)
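Because digits and underscores count as word characters, they can remove a boundary you might expect to be there. A minimal sketch, using made-up strings (not from the chapter's data):

```r
# Hypothetical strings chosen to show which characters extend a word
x <- c("he said", "he-said", "he_said", "he2")

# "he" is a whole word only when the next character is NOT a letter,
# digit or underscore (or when the text ends)
whole_word <- grepl("\\bhe\\b", x)

print(whole_word)
```

The space and the hyphen end the word, so the first two strings should match. The underscore and the digit continue the word, so the last two should not.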

Note

Some students confuse \bSOME_PATTERN with ^SOME_PATTERN (caret). They differ in the following way:

  • ^SOME_PATTERN (with a leading caret, ^) means that SOME_PATTERN must match at the beginning of the entire text.

  • \bSOME_PATTERN (with a leading \b) means that SOME_PATTERN must match at the beginning of a word anywhere within the text.
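The difference between the two can be checked on a pair of short strings. This is only a sketch with made-up text; grepl returns TRUE or FALSE for each string.

```r
# Hypothetical sample text (an assumption for this illustration only)
x <- c("he said hello", "say hello to her")

# Caret: "he" must be the first two characters of the whole text
starts_text <- grepl("^he", x)

# Word boundary: "he" may start ANY word in the text
starts_word <- grepl("\\bhe", x)

print(starts_text)
print(starts_word)
```

Only the first string begins with "he", but both strings contain a word that begins with "he" ("hello", "her").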

17.2 Examples using word boundaries (i.e. \b)

# str_view comes from the stringr package
library(stringr)

# Some data for our examples
stuff = c("She, sells",
          "grey seashells of all shades",
          "by the seashore",
          "the man with gray hair said hello",
          "why must he smile",
          "no one want heartache but a little happens sometimes",
          "joe laughed (hehe) at his own thoughts"
          )

# Match text that contains a word that STARTS with "he"
str_view(stuff, "\\bhe")
[4] │ the man with gray hair said <he>llo
[5] │ why must <he> smile
[6] │ no one want <he>artache but a little happens sometimes
[7] │ joe laughed (<he>he) at his own thoughts
# Match text that contains a word that ENDS with "he"
str_view(stuff, "he\\b")
[1] │ S<he>, sells
[3] │ by t<he> seashore
[4] │ t<he> man with gray hair said hello
[5] │ why must <he> smile
[6] │ no one want heartac<he> but a little happens sometimes
[7] │ joe laughed (he<he>) at his own thoughts
# Match text that contains the word "he"
str_view(stuff, "\\bhe\\b")
[5] │ why must <he> smile

17.3 Practice

17.3.1 QUESTION

Write a regex that matches words such as “heartache” and “headache” that start and end with “he”. However, the word “he” should not match.

In other words, the letters, “he” must appear twice - once in the beginning, and once at the end of the word (therefore, the word “he” should NOT be considered a match).

# ANSWER
str_view(stuff, "\\bhe[A-Za-z]*he\\b")
[6] │ no one want <heartache> but a little happens sometimes
[7] │ joe laughed (<hehe>) at his own thoughts

17.3.2 QUESTION

Modify the previous answer so that the regex also matches “he”.

# ANSWER 
str_view(stuff, "\\bhe([A-Za-z]*he)*\\b")
[5] │ why must <he> smile
[6] │ no one want <heartache> but a little happens sometimes
[7] │ joe laughed (<hehe>) at his own thoughts
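A possible variation on the answer above makes the second part optional with ? (zero or one time) instead of * (zero or more times). The sketch below checks it with base R rather than str_view; the names pattern_optional, has_match and words are made up for this example, and perl = TRUE is a choice made here, not something the answer above requires.

```r
# Same data as the stuff vector used in the examples above
stuff <- c("She, sells",
           "grey seashells of all shades",
           "by the seashore",
           "the man with gray hair said hello",
           "why must he smile",
           "no one want heartache but a little happens sometimes",
           "joe laughed (hehe) at his own thoughts")

# "he", optionally followed by more letters that end in "he"
pattern_optional <- "\\bhe([A-Za-z]*he)?\\b"

# Which strings contain a match?
has_match <- grepl(pattern_optional, stuff, perl = TRUE)

# Extract the first matching word from each string that has one
words <- regmatches(stuff, regexpr(pattern_optional, stuff, perl = TRUE))

print(has_match)
print(words)
```

The same three strings should match as with the * version, and the extracted words should be "he", "heartache" and "hehe".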

17.4 The | (i.e. or) symbol

17.4.1 QUESTION

Write a regex to match words that have “he” at the beginning or at the end. Do not match words that only have “he” in the middle of a word.

# ANSWER

# Hint - use the | symbol to allow for multiple patterns
str_view(stuff, "\\bhe|he\\b")
[1] │ S<he>, sells
[3] │ by t<he> seashore
[4] │ t<he> man with gray hair said <he>llo
[5] │ why must <he> smile
[6] │ no one want <he>artac<he> but a little happens sometimes
[7] │ joe laughed (<he><he>) at his own thoughts

17.4.2 Order of operations in R with |

The answer to the previous question also works with (parentheses) around each regex option. However, as shown above, the (parentheses) are not necessary in this case, because | already separates everything on its left from everything on its right.

str_view(stuff, "(\\bhe)|(he\\b)")
[1] │ S<he>, sells
[3] │ by t<he> seashore
[4] │ t<he> man with gray hair said <he>llo
[5] │ why must <he> smile
[6] │ no one want <he>artac<he> but a little happens sometimes
[7] │ joe laughed (<he><he>) at his own thoughts

Parentheses become necessary when you want to use | in the middle of a larger pattern. For example:

18 QUESTION

Write a regex that matches the words “grey” or “gray” with either spelling. Keep the regex as short as possible.

# ANSWER

sentences = c("blue skies are ahead",
              "it's going to be a grey day",
              "oragne you glad i didn't say orange",
              "the man with gray hair said hello")


# Hint - use the | symbol to allow for multiple patterns
str_view(sentences, "gr(e|a)y")
[2] │ it's going to be a <grey> day
[4] │ the man with <gray> hair said hello
# Of course this works also but the answer above is shorter
str_view(sentences, "grey|gray")
[2] │ it's going to be a <grey> day
[4] │ the man with <gray> hair said hello
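To see why the parentheses matter in this answer, it may help to compare it with the same pattern written without them. The sketch below uses base-R grepl on the sentences vector from above. The character class version gr[ea]y is an additional alternative, not part of the original answer.

```r
sentences <- c("blue skies are ahead",
               "it's going to be a grey day",
               "oragne you glad i didn't say orange",
               "the man with gray hair said hello")

# With parentheses: "gr", then e or a, then "y"
grouped <- grepl("gr(e|a)y", sentences)

# Without parentheses: "gre" OR "ay" anywhere in the text
ungrouped <- grepl("gre|ay", sentences)

# A character class is another way to say "e or a" for a single character
char_class <- grepl("gr[ea]y", sentences)

print(grouped)
print(ungrouped)
print(char_class)
```

Without the parentheses the third sentence also matches, because "say" contains "ay".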

18.1 More \b examples

# fruits that have a word that starts with H or h
grep ("\\b[Hh]", fruit, value=TRUE)  
[1] "Beurre Hardy pear" "honeydew"         
# fruits that have a word that ends with a vowel
grep ("[aeiouAEIOU]\\b", fruit, value=TRUE)  
[1] "apple"             "N. American apple" "Beurre Hardy pear"
[4] "banana"           

18.2 “pattern1|pattern2” matches pattern1 OR pattern2

# "pattern1|pattern2"  matches pattern1 OR pattern2 ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Combining patterns
#
#    PATTERN1|PATTERN2  matches if either PATTERN1 or PATTERN2 is found
#
#    (PATTERN)          you may surround patterns with (parentheses) if necessary
#
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

grep("black|blue|green", fruit, value=TRUE) # contains black,blue or green
[1] "black cherry" "blueberry"   
grep("^(1|One)", addresses, value=TRUE, ignore.case=TRUE) # 1 or One at beginning 
[1] "12345 Sesame Street"     "One Micro$oft Way"      
[3] "One Main Street Apt 12b"
grep("(^1|^One)", addresses, value=TRUE) # same thing 
[1] "12345 Sesame Street"     "One Micro$oft Way"      
[3] "One Main Street Apt 12b"
grep("^1|^One", addresses, value=TRUE) # same thing 
[1] "12345 Sesame Street"     "One Micro$oft Way"      
[3] "One Main Street Apt 12b"
grep("[0-9]", addresses, value=TRUE)
 [1] "12345 Sesame Street"              "3 Olive St."                     
 [3] "Two 1st Ave."                     "5678 Park Place"                 
 [5] "Forty Five 2nd Street"            "Ninety Nine Cone St. apartment 7"
 [7] "9 Main St. apt. 623"              "4\\2 Rechov Yafo"                
 [9] "One Main Street Apt 12b"          "Two Main Street Apt 123c"        
[11] "Three Main Street Apt 12343"     
grep("0|1|2|3|4|5|6|7|8|9", addresses, value=TRUE)  # Same as [0-9]
 [1] "12345 Sesame Street"              "3 Olive St."                     
 [3] "Two 1st Ave."                     "5678 Park Place"                 
 [5] "Forty Five 2nd Street"            "Ninety Nine Cone St. apartment 7"
 [7] "9 Main St. apt. 623"              "4\\2 Rechov Yafo"                
 [9] "One Main Street Apt 12b"          "Two Main Street Apt 123c"        
[11] "Three Main Street Apt 12343"     
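The patterns "^(1|One)" and "^1|^One" above behave the same way because the caret applies to both options. If the caret is written only once and without parentheses, it applies only to the first option. A small sketch with made-up addresses (the vector x below is hypothetical and is not the addresses vector used above):

```r
# Hypothetical addresses for illustration only
x <- c("12345 Sesame Street",
       "One Main Street",
       "Apt One, 5 Elm St",
       "21 Cone Rd")

# The caret applies to both options: 1 or One at the beginning
both_anchored <- grepl("^(1|One)", x)

# The caret applies only to 1: "1 at the beginning" OR "One anywhere"
first_anchored <- grepl("^1|One", x)

print(both_anchored)
print(first_anchored)
```

The third address contains "One" in the middle of the text, so it should match only the second pattern.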

18.3 breaking up long patterns with paste0

# breaking up long patterns with paste0 ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Do NOT include extra whitespace in patterns!!!
# 
# For long patterns you can use paste0 to break up the pattern
# so it is more readable in the code.
#
# See examples below.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
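The warning about extra whitespace matters because a space inside a pattern is a literal character that must appear in the text. A minimal sketch, using two made-up one-word strings:

```r
# Hypothetical one-word strings (an assumption for this illustration only)
x <- c("one", "two")

# The spaces around | are part of the pattern: this asks for
# "one followed by a space" OR "a space followed by two"
spaced <- grepl("\\bone\\b | \\btwo\\b", x)

# No extra whitespace: the word "one" OR the word "two"
tight <- grepl("\\bone\\b|\\btwo\\b", x)

print(spaced)
print(tight)
```

Neither string contains a space, so the spaced pattern should match nothing, while the pattern without spaces should match both.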

18.3.1 example

# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Show those addresses that contain one of the numbers 1-9 spelled out in words,
# e.g. "one", "two", etc
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# ANSWER - note that writing the pattern using paste0, which joins its
# arguments with no separator, makes the pattern easy to understand. You can
# also comment on individual portions of the pattern.

pattern = paste0("\\bone\\b|",     # match the word "one"
                 "\\btwo\\b|",     # match the word "two"
                 "\\bthree\\b|",   # etc.
                 "\\bfour\\b|",
                 "\\bfive\\b|",
                 "\\bsix\\b|",
                 "\\bseven\\b|",
                 "\\beight\\b|",
                 "\\bnine\\b")

grep(pattern, addresses, value=TRUE, ignore.case = TRUE)
[1] "One Micro$oft Way"                "Two 1st Ave."                    
[3] "Forty Five 2nd Street"            "Ninety Nine Cone St. apartment 7"
[5] "Five Google Drive"                "One Main Street Apt 12b"         
[7] "Two Main Street Apt 123c"         "Three Main Street Apt 12343"     
# Note that the following also works but is
#   - MUCH harder to read 
#   - MUCH harder to check for errors and
#   - cannot be commented on for different parts of the pattern

grep(
  "\\bone\\b|\\btwo\\b|\\bthree\\b|\\bfour\\b|\\bfive\\b|\\bsix\\b|\\bseven\\b|\\beight\\b|\\bnine\\b",
  addresses, value=TRUE, ignore.case = TRUE)
[1] "One Micro$oft Way"                "Two 1st Ave."                    
[3] "Forty Five 2nd Street"            "Ninety Nine Cone St. apartment 7"
[5] "Five Google Drive"                "One Main Street Apt 12b"         
[7] "Two Main Street Apt 123c"         "Three Main Street Apt 12343"     
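When every piece of the pattern has the same shape, another option is to keep the words in a vector and let paste0 assemble the pattern with its collapse argument. This is a sketch of that alternative, not the approach used in the answer above. The names number_words, pattern_from_vector and pattern_by_hand are made up for this example.

```r
# The words to search for, kept as ordinary data
number_words <- c("one", "two", "three", "four", "five",
                  "six", "seven", "eight", "nine")

# Wrap each word in \\b...\\b, then join the pieces with "|"
pattern_from_vector <- paste0("\\b", number_words, "\\b", collapse = "|")

# The same pattern typed out as one long string
pattern_by_hand <-
  "\\bone\\b|\\btwo\\b|\\bthree\\b|\\bfour\\b|\\bfive\\b|\\bsix\\b|\\bseven\\b|\\beight\\b|\\bnine\\b"

print(pattern_from_vector)
print(identical(pattern_from_vector, pattern_by_hand))
```

The trade-off is that this form is shorter and harder to mistype, but it no longer leaves room for a separate comment on each piece of the pattern.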

18.3.2 example

# A more complex example ####

# QUESTION 
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Display those addresses that contain a number that is exactly one
# digit long. For example: 
#
#   addresses =
#    c("3 Olive St.",              # should match (because of 3)
#      "Forty Five 2nd Street",    # should match (because of 2nd)
#      "Ninety Nine Cone St. apartment 7",
#                                  # should match (because of 7)
#      "7",                        # should match
#
#      "12345 Sesame Street",      # should NOT match (12345 is five digits)
#      "One main Street Apt 12b",  # should NOT match (12 is two digits)
#      "Two Main St. Apt 99",      # should NOT match (99 is two digits) 
#      "45")                       # should NOT match
#
#   > YOUR COMMAND GOES HERE
#   [1] "3 Olive St."
#   [2] "Forty Five 2nd Street"
#   [3] "Ninety Nine Cone St. apartment 7"
#   [4] "7"
#
# NOTE: the pattern "[0-9]" will NOT work as it will match every one of 
# values above
#
# NOTE: the pattern "\\b[0-9]\\b" is a good try but will not match
# "Forty Five 2nd Street" as the 2 in "2nd" is NOT followed by a word boundary.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# ANSWER

# Note - there are 4 possible ways for a single digit to appear in the text:
#
# (a) The digit can appear at the very beginning of the text and be followed by
#     a non-digit, e.g. "3 Olive St.". 
#
#     The pattern: "^[0-9][^0-9]"     
#     matches "3 Olive St."
#     but doesn't match "Forty Five 2nd Street" (since the 2 is not at the
#                                                beginning of the text).
#
# Similarly, each of the following patterns will match a single digit for
# some texts but not for others. 
#
# (b) [^0-9][0-9][^0-9] : NONdigit digit NONdigit anywhere in the text
# (c) [^0-9][0-9]$      : last two characters are a NONdigit followed by a single digit
# (d) ^[0-9]$           : whole thing is JUST one digit
#
# For actual addresses you probably don't have to worry about the last
# case, but for other types of data you might.
#
# You can write a pattern that deals with all of these cases by
# separating the different "sub-patterns" from each other with "|" symbols.
# For example, the following answers the question, but the pattern is VERY
# hard to read. (see below for a better way to write this code.)

grep("^[0-9][^0-9]|[^0-9][0-9][^0-9]|[^0-9][0-9]$|^[0-9]$", addresses, value=TRUE)
[1] "3 Olive St."                      "Two 1st Ave."                    
[3] "Forty Five 2nd Street"            "Ninety Nine Cone St. apartment 7"
[5] "9 Main St. apt. 623"              "4\\2 Rechov Yafo"                
# we can use paste0 to make this easier to read

pattern <- 
  paste0 ( "^[0-9][^0-9]" ,     # starts with digit followed by a NONdigit
         "|[^0-9][0-9][^0-9]",  # NONdigit digit NONdigit anywhere in the text 
         "|[^0-9][0-9]$",       # ends with a NONdigit followed by a single digit
         "|^[0-9]$")            # whole thing is JUST one digit 

grep(pattern, addresses, value=TRUE)
[1] "3 Olive St."                      "Two 1st Ave."                    
[3] "Forty Five 2nd Street"            "Ninety Nine Cone St. apartment 7"
[5] "9 Main St. apt. 623"              "4\\2 Rechov Yafo"                
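The four cases can also be expressed in a single piece with lookarounds, which are a different technique from the one used in the answer above and require perl = TRUE in base R. The sketch below compares the two on the sample addresses listed in the question; the names addresses_small, pattern_paste0 and pattern_lookaround are made up for this example.

```r
# The sample addresses from the question's comment
addresses_small <- c("3 Olive St.",
                     "Forty Five 2nd Street",
                     "Ninety Nine Cone St. apartment 7",
                     "7",
                     "12345 Sesame Street",
                     "One main Street Apt 12b",
                     "Two Main St. Apt 99",
                     "45")

# The pattern from the answer above
pattern_paste0 <- paste0("^[0-9][^0-9]",        # digit then NONdigit at start
                         "|[^0-9][0-9][^0-9]",  # NONdigit digit NONdigit
                         "|[^0-9][0-9]$",       # NONdigit digit at end
                         "|^[0-9]$")            # JUST one digit

# Alternative: a digit that is not preceded by a digit, (?<![0-9]),
# and not followed by a digit, (?![0-9])
pattern_lookaround <- "(?<![0-9])[0-9](?![0-9])"

by_paste0     <- grepl(pattern_paste0, addresses_small)
by_lookaround <- grepl(pattern_lookaround, addresses_small, perl = TRUE)

print(by_paste0)
print(by_lookaround)
```

On this sample both patterns should select the first four addresses and reject the last four. The paste0 version needs no special options, while the lookaround version is shorter but depends on perl = TRUE.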

18.3.3 example

# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Write a command that displays all addresses that contain the
# number "one" or 1.
#
# Notice that the following will NOT work. This gets "Cone" and 12345 too:
#
#   grep("one|1", addresses, value=TRUE, ignore.case=TRUE) # NO - matches Cone and 12345
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# ANSWER

# The following is the way to do it. Regular expressions require 
# a very thoughtful approach!
#
# The word "one" or the number "1" (not including the number 123)
pattern = paste0("^one[^a-z]|",       # one at beginning
                 "[^a-z]one[^a-z]|",  # one in middle
                 "[^a-z]one$|",       # one at end
                 "^one$|",            # ONLY the word "one"
                 "^1[^0-9]|",         # 1 at beginning
                 "[^0-9]1[^0-9]|",    # 1 in middle
                 "[^0-9]1$|",         # 1 at end
                 "^1$")               # ONLY the number 1


pattern
[1] "^one[^a-z]|[^a-z]one[^a-z]|[^a-z]one$|^one$|^1[^0-9]|[^0-9]1[^0-9]|[^0-9]1$|^1$"
grep(pattern, addresses, value=TRUE, ignore.case=TRUE)
[1] "One Micro$oft Way"       "Two 1st Ave."           
[3] "One Main Street Apt 12b"
# Same thing but MUCH harder to read!!!
# You should break up long patterns with paste0 and comment them as shown above.

grep(
  "^one[^a-z]|[^a-z]one[^a-z]|[^a-z]one$|^one$|^1[^0-9]|[^0-9]1[^0-9]|[^0-9]1$|^1$",
  addresses, value=TRUE, ignore.case=TRUE)
[1] "One Micro$oft Way"       "Two 1st Ave."           
[3] "One Main Street Apt 12b"
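It is tempting to shorten this pattern with \b, but the result is not quite the same. The sketch below compares the two on four of the addresses shown in this chapter. The name pattern_boundary and the shorter pattern itself are assumptions for this comparison, not part of the answer above.

```r
x <- c("One Micro$oft Way",
       "Two 1st Ave.",
       "Ninety Nine Cone St. apartment 7",
       "12345 Sesame Street")

# The pattern from the answer above
pattern_classes <- paste0("^one[^a-z]|",       # one at beginning
                          "[^a-z]one[^a-z]|",  # one in middle
                          "[^a-z]one$|",       # one at end
                          "^one$|",            # ONLY the word "one"
                          "^1[^0-9]|",         # 1 at beginning
                          "[^0-9]1[^0-9]|",    # 1 in middle
                          "[^0-9]1$|",         # 1 at end
                          "^1$")               # ONLY the number 1

# A shorter attempt using word boundaries
pattern_boundary <- "\\bone\\b|\\b1\\b"

by_classes  <- grepl(pattern_classes,  x, ignore.case = TRUE)
by_boundary <- grepl(pattern_boundary, x, ignore.case = TRUE)

print(by_classes)
print(by_boundary)
```

Both versions reject "Cone" and "12345", but they disagree on "Two 1st Ave.": the letters in "1st" are word characters, so there is no word boundary after the 1. Which behavior is right depends on whether "1st" should count as containing the number 1.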

18.3.4 example

# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# df is a dataframe. Write a command that shows all rows from df
# for which the 2nd character in the first column is "x". 
# 
# Hints: 
#   a. Access a dataframe as you normally would but use grep or grepl to
#      return either the row numbers or TRUE/FALSE
#      values that identify the rows to be displayed.
# 
#   b. Remember that you are NOT told what the column names are. Therefore you
#      must use a number to stipulate the first column and NOT a column name.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Use this example data
df = data.frame( partNum = c("ax4321", "az12", "bx1234", "bw987"),
                 partName = c("widget","thingie","gadget","gizmo"),
                 price =    c(0.50, 0.60, 1.70, 0.80),
                 stringsAsFactors = FALSE)
df
  partNum partName price
1  ax4321   widget   0.5
2    az12  thingie   0.6
3  bx1234   gadget   1.7
4   bw987    gizmo   0.8
# Show the rows that contain "x" as the 2nd character in the partNum


# One answer - using grep (df[ , 1] selects the first column by position,
# as the hint requires, rather than by its name)
df[ grep ( "^.x", df[ , 1] , ignore.case=TRUE ) ,   ]
  partNum partName price
1  ax4321   widget   0.5
3  bx1234   gadget   1.7
# Another answer - using grepl (again selecting the first column by position)
df[ grepl ( "^.x", df[ , 1] , ignore.case=TRUE ) ,   ]
  partNum partName price
1  ax4321   widget   0.5
3  bx1234   gadget   1.7
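grep and grepl select the same rows here, but they can differ if you later want the rows that do NOT match. The sketch below shows one case worth knowing about: a pattern that matches nothing. The pattern "^.q" and the variable names are made up for this illustration.

```r
df <- data.frame(partNum  = c("ax4321", "az12", "bx1234", "bw987"),
                 partName = c("widget", "thingie", "gadget", "gizmo"),
                 price    = c(0.50, 0.60, 1.70, 0.80),
                 stringsAsFactors = FALSE)

# No partNum has "q" as its 2nd character, so grep returns integer(0)
no_match_idx <- grep("^.q", df[ , 1])

# Negating an empty set of row numbers still gives an empty set,
# so this selects NO rows instead of ALL rows
dropped_with_grep <- df[ -no_match_idx , ]

# Negating TRUE/FALSE values with ! works as intended
dropped_with_grepl <- df[ !grepl("^.q", df[ , 1]) , ]

print(nrow(dropped_with_grep))
print(nrow(dropped_with_grepl))
```

For this reason grepl with ! is often the safer choice when excluding rows.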

18.3.5 example

# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Using the same data as above, show only those rows whose partNum has an "x"
# as its 2nd character and whose price is also less than 1.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
df
  partNum partName price
1  ax4321   widget   0.5
2    az12  thingie   0.6
3  bx1234   gadget   1.7
4   bw987    gizmo   0.8
df[ grepl ( "^.x", df$partNum , ignore.case=TRUE ) & df$price < 1 ,   ]
  partNum partName price
1  ax4321   widget   0.5