11 stringr tutorial

This is an overview of the stringr package, which is part of the “tidyverse” family of packages. The material in this section comes from this YouTube playlist: https://www.youtube.com/watch?v=oIu5jK8DeX8&list=PLiC1doDIe9rDwsUhd3FtN1XGCV2ES1xZ2

See these resources for more info about the entire tidyverse family of packages.

See these links for more info about the stringr package.

See these links for more info about other related tidyverse packages.

11.1 stringr: Basic String Manipulation

chr_data <- c("Data", "Daft", "YouTube", "channel",
              "learn", "and", "have", "FUN!")
# Check the length of a string
str_length("Hi there! How are you?")
[1] 22
str_length(chr_data)
[1] 4 4 7 7 5 3 4 4
# Convert string letters to uppercase
str_to_upper(chr_data)
[1] "DATA"    "DAFT"    "YOUTUBE" "CHANNEL" "LEARN"   "AND"     "HAVE"   
[8] "FUN!"   
# Convert string letters to lowercase
str_to_lower(chr_data)
[1] "data"    "daft"    "youtube" "channel" "learn"   "and"     "have"   
[8] "fun!"   
# Convert string to title case (first letter of each word uppercase)
str_to_title(chr_data)
[1] "Data"    "Daft"    "Youtube" "Channel" "Learn"   "And"     "Have"   
[8] "Fun!"   
# Convert string to sentence case (only the first letter of the first word uppercase)
str_to_sentence("make me into a SENTENCE!")
[1] "Make me into a sentence!"
# Trim whitespace
str_trim("  Trim Me!   ")
[1] "Trim Me!"
# Pad strings with whitespace
str_pad("Pad Me!", width = 15, side="both")
[1] "    Pad Me!    "
# Truncate strings to a given length
str_trunc("If you have a long string, you might want to truncate it!", 
          width = 50)
[1] "If you have a long string, you might want to tr..."

11.2 stringr: Split and Join Strings

# Split strings
str_split("Split Me!", pattern = " ")
[[1]]
[1] "Split" "Me!"  
food <- c(
  "apples and oranges and pears and bananas",
  "pineapples and mangos and guavas"
)

stringr::str_split(food, " and ")
[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    
# Join strings (equivalent to base R paste())
str_c("Join", "Me!", sep="_")
[1] "Join_Me!"
# Join vectors element-wise (also like base R paste())
str_c(c("Join", "vectors"), c("Me!", "too!"), sep="_")
[1] "Join_Me!"     "vectors_too!"
# Collapse a vector of strings into a single string
str_c(c("Turn", "me", "into", "one", "string!"), collapse= " ")
[1] "Turn me into one string!"
# Convert NA values in character vector to string "NA"
str_replace_na(c("Make", NA, "strings!"))
[1] "Make"     "NA"       "strings!"

11.3 stringr: Sorting Strings

sort_data <- c("sort", "me", "please!")

# Get vector of indices that would sort the strings alphabetically
str_order(sort_data)
[1] 2 3 1
# Use discovered ordering to extract data in sorted order
sort_data[str_order(sort_data)]
[1] "me"      "please!" "sort"   
# Directly extract sorted strings
str_sort(sort_data)
[1] "me"      "please!" "sort"   
# Extract in reverse sorted order
str_sort(sort_data, decreasing = TRUE)
[1] "sort"    "please!" "me"     

11.4 stringr: String Interpolation

first <- c("Luke", "Han", "Jean-Luc")
last <- c("Skywalker", "Solo", "Picard")

# Interpolate (insert variable values) into strings with str_glue()
str_glue("My name is {first}. {first} {last}.")
My name is Luke. Luke Skywalker.
My name is Han. Han Solo.
My name is Jean-Luc. Jean-Luc Picard.
minimum_age <- 18
over_minimum <- c(5, 17, 33)

# Interpolate the result of an expression into a string
str_glue("{first} {last} is {minimum_age + over_minimum} years old.")
Luke Skywalker is 23 years old.
Han Solo is 35 years old.
Jean-Luc Picard is 51 years old.
num <- c(1:5)

# Interpolate the result of function calls
str_glue("The square root of {num} is {round(sqrt(num), 3)}.")
The square root of 1 is 1.
The square root of 2 is 1.414.
The square root of 3 is 1.732.
The square root of 4 is 2.
The square root of 5 is 2.236.
fuel_efficiency <- 30

# Interpolate strings using data from a data frame
mtcars %>%
  rownames_to_column("Model") %>%
  filter(mpg > fuel_efficiency) %>%
  str_glue_data("The {Model} gets {mpg} mpg.")
The Fiat 128 gets 32.4 mpg.
The Honda Civic gets 30.4 mpg.
The Toyota Corolla gets 33.9 mpg.
The Lotus Europa gets 30.4 mpg.
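
Because braces delimit interpolation, a literal brace is written by doubling it (this follows the underlying glue convention); a quick sketch:

# Doubled braces are printed literally rather than interpolated
str_glue("Glue uses {{curly braces}} to interpolate, e.g. {first} {last}.")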

11.5 stringr: String Matching

# Preview the comment data used in this section (columns: author, score, body)
head(data, 8)
       author score
1  butt_ghost     3
2 buntaro_pup     1
3  iidealized     2
4   [deleted]     1
5   stathibus     6
6 soulslicer0     2
7 swiftsecond     1
                                                                                                                                                                                                                       body
1                                                                                      Hdf5. It's structured, it's easy to get data in and out, and it's fast. Plus it will scale if you ever get up there in dataset size.
2                                                                                                                                                                                                          yep, good point.
3                                                            Google must have done (and is doing) serious internal research in ranking. I've heard they're pretty good at that and they've even made some money doing it :P
4                                                                                                                                                                                                                 [deleted]
5                                                                                                   Sebastian Thrun's book, Probabilistic Robotics, goes through this in great detail. Get it, read it, make it your bible.
6 This. Such a legendary book. Kalman filters, particle filters, recursive Bayesian filters and a whole bunch of other stuff. I learnt so much. Read these 3 for starts from the book, then come back and ask the questions
7                                                                                                                                                                                                   Do you still need help?
# Detect the presence of a pattern in strings
str_detect(data$body[1:100], pattern="deep")
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA    NA    NA    NA    NA
 [13]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [25]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [37]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [49]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [61]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [73]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [85]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [97]    NA    NA    NA    NA
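
The NA values above are not failed matches; they appear because some body entries in this slice are themselves NA, and str_detect() returns NA for NA inputs. One way to treat missing text as a non-match is to replace the NAs first; a minimal sketch:

# Treat NA text as an empty string so the result is only TRUE/FALSE
str_detect(str_replace_na(data$body[1:100], ""), pattern = "deep")
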
# Get the indices of matched strings
str_inds <- str_which(data$body[1:100], pattern="deep")
str_inds
integer(0)
# Extract matched strings using detected indices
data$body[str_inds]
character(0)
# Count the number of matches
str_count(data$body[1:100], "deep")
  [1]  0  0  0  0  0  0  0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# Get the position of matches
str_locate_all(data$body[1], "deep")
[[1]]
     start end
# Get the first match found in each string as a character vector
str_extract(data$body[1:3], "deep|the|and")
[1] "and" NA    "and"
# Get the first match found in each string as a matrix
str_match(data$body[1:3], "deep|the|and")
     [,1] 
[1,] "and"
[2,] NA   
[3,] "and"
# Get all matches found in each string as a list of matrices
str_match_all(data$body[1:3], "deep|the|and")
[[1]]
     [,1] 
[1,] "and"
[2,] "and"
[3,] "the"

[[2]]
     [,1]

[[3]]
     [,1] 
[1,] "and"
[2,] "the"
[3,] "and"
[4,] "the"

11.6 stringr: Subset and Replace Strings

head(data, 8)
       author score
1  butt_ghost     3
2 buntaro_pup     1
3  iidealized     2
4   [deleted]     1
5   stathibus     6
6 soulslicer0     2
7 swiftsecond     1
                                                                                                                                                                                                                       body
1                                                                                      Hdf5. It's structured, it's easy to get data in and out, and it's fast. Plus it will scale if you ever get up there in dataset size.
2                                                                                                                                                                                                          yep, good point.
3                                                            Google must have done (and is doing) serious internal research in ranking. I've heard they're pretty good at that and they've even made some money doing it :P
4                                                                                                                                                                                                                 [deleted]
5                                                                                                   Sebastian Thrun's book, Probabilistic Robotics, goes through this in great detail. Get it, read it, make it your bible.
6 This. Such a legendary book. Kalman filters, particle filters, recursive Bayesian filters and a whole bunch of other stuff. I learnt so much. Read these 3 for starts from the book, then come back and ask the questions
7                                                                                                                                                                                                   Do you still need help?
# Get a string subset based on character position
str_sub(data$body[1], start=1, end=100)
[1] "Hdf5. It's structured, it's easy to get data in and out, and it's fast. Plus it will scale if you ev"
# Get a string subset based on words
word(data$body[1], start=1, end=10)
[1] "Hdf5. It's structured, it's easy to get data in and"
# Get the strings that contain a certain pattern
str_subset(data$body[1:100], pattern="deep")
character(0)
# Replace a substring with a new string by substring position
str_sub(data$body[1], start=1, end=100) <- str_to_upper(str_sub(data$body[1], 
                                                                start=1, 
                                                                end=100))
str_sub(data$body[1], start=1, end=100)
[1] "HDF5. IT'S STRUCTURED, IT'S EASY TO GET DATA IN AND OUT, AND IT'S FAST. PLUS IT WILL SCALE IF YOU EV"
# Replace first occurrence of a substring with a new string by matching
str_replace(data$body[1], pattern="deep|DEEP", replacement="multi-layer")
[1] "HDF5. IT'S STRUCTURED, IT'S EASY TO GET DATA IN AND OUT, AND IT'S FAST. PLUS IT WILL SCALE IF YOU EVer get up there in dataset size."
# Replace all occurrences of a substring with a new string by matching
str_replace_all(data$body[1], pattern="deep|DEEP", replacement="multi-layer")
[1] "HDF5. IT'S STRUCTURED, IT'S EASY TO GET DATA IN AND OUT, AND IT'S FAST. PLUS IT WILL SCALE IF YOU EVer get up there in dataset size."

11.7 stringr: Viewing Strings

# Basic printing
print(data$body[1:10])
 [1] "HDF5. IT'S STRUCTURED, IT'S EASY TO GET DATA IN AND OUT, AND IT'S FAST. PLUS IT WILL SCALE IF YOU EVer get up there in dataset size."                                                                                     
 [2] "yep, good point."                                                                                                                                                                                                         
 [3] "Google must have done (and is doing) serious internal research in ranking. I've heard they're pretty good at that and they've even made some money doing it :P"                                                           
 [4] "[deleted]"                                                                                                                                                                                                                
 [5] "Sebastian Thrun's book, Probabilistic Robotics, goes through this in great detail. Get it, read it, make it your bible."                                                                                                  
 [6] "This. Such a legendary book. Kalman filters, particle filters, recursive Bayesian filters and a whole bunch of other stuff. I learnt so much. Read these 3 for starts from the book, then come back and ask the questions"
 [7] "Do you still need help?"                                                                                                                                                                                                  
 [8] NA                                                                                                                                                                                                                         
 [9] NA                                                                                                                                                                                                                         
[10] NA                                                                                                                                                                                                                         
# Collect the posts that mention "deep learning"
deep_learning_posts <- data$body[str_which(data$body, "deep learning")]

# View strings in HTML format with the first occurrence of a pattern highlighted
str_view(deep_learning_posts, pattern="deep")
# View strings in HTML format with all occurrences of a pattern highlighted
str_view_all(deep_learning_posts, pattern="deep")
Warning: `str_view_all()` was deprecated in stringr 1.5.0.
ℹ Please use `str_view()` instead.
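
In stringr 1.5.0 and later, str_view() highlights every occurrence of the pattern, so the deprecated call above can simply be replaced; the match argument additionally restricts the output to elements that contain a match.

# Modern equivalent: highlight all matches, showing only matching elements
str_view(deep_learning_posts, pattern = "deep", match = TRUE)
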
# Format strings into paragraphs of a given width with str_wrap()
wrapped <- str_wrap(data$body[str_which(data$body, "deep learning")][1], 
                    width = 50)
wrapped 
[1] NA
# Print wrapped string with output obeying newlines
wrapped %>% cat()
NA
# Display wrapped paragraph as HTML, inserting paragraph breaks
str_wrap(data$body[str_which(data$body, "deep learning")][1], width = 50) %>%
  str_replace_all("\n", "<br>") %>%
  str_view_all(pattern = "deep")
[1] │ NA