2424. Practice - Searching and Cleaning Data (with regex in R)
24.1 Introduction - about the data
In this assignment, you’ll practice using regular expressions to identify and clean messy data. The dataset below contains employee expense records with various inconsistencies in formatting. Some of the inconsistencies are described below the data.
24.1.1 Employee expenses data
There are a total of 49 rows. The following are just the first 10 rows.
employee date amount category comments
1 Sue Smith 1/19/2023 59.99 food lunch with client
2 schwartz, joe 01/19/2023 $27.00 office supplies paper for printer
3 mike david harris 2023-01-19 25 Office Stuff <NA>
4 Dr. A. Davis 19/1/2023 five hundred FOOD NA
5 Dr Jones 1/19/23 1,234.56 office suppl. chairs
6 S. Jones Jr 19/01/23 1000 Office supplies desk
7 Conway, Ella Sr. Jan 19, 2023 $35.23 LUNCH ---
8 Brown, Tom 2/15/2023 42.50 food dinner meeting
9 Williams, Kate 02/15/2023 $89.99 office supplies toner cartridge
10 robert james lee 2023/02/15 75 Office Stuff
Click on the link below to see the full dataset.
click here to see the full data set
Copy the following code into R and run it to recreate the full dataset.
some names are in “first last” (e.g. “Sue Smith”) format and some names are in “last, first” format (e.g. “Brown, Tom”)
some names are capitalized, some aren’t
some names have middle include middle names, some don’t
some names include titles, some don’t
some names include a title, e.g. Dr. , while some don’t
some of the titles contain a period after the title, some don’t
some names include a suffix (e.g. Jr), some don’t
some names are missing first names
some names use periods (e.g. J. Thompson), some don’t
possibly other inconsistencies …
Inconsistencies with date column
some dates are in mm/dd/yyyy format (USA style)
some dates are in dd/mm/yyyy format (European style)
some dates are in yyyy-mm-dd format (standard for many computer systems)
some dates use 4 digit years, some use 2 digit years
some dates have a leading zero for single digit days or months
possibly other inconsistencies …
Inconsistencies in the amount column
some values have $ signs, some don’t,
some have commas (e.g. 1,234.56) some don’t
some are missing the cents (e.g. “75”)
some are written in words (e.g. “five hundred”)
possibly other inconsistencies …
Inconsistencies in the category column
the same categories are written differently (e.g. “food” vs “FOOD”, “office supplies” vs “Office suppl.” vs “OFFICE STUFF”)
possibly other inconsistencies …
Inconsistencies in the comments column
blanks are expressed differently: NA vs “NA” vs “—” vs ““.
Note that in this column, other differences are more forgivable since comments are by their very nature written by people differently and are not expected to be in a standard form. However, there should be a standard way to express that there are no comments for a given row.
24.2 Questions - PART I (writing code)
24.2.1 Instructions
Your task is to write R code using regular expressions to identify different patterns in the data.
Each question starts with a fresh copy of the data
For each question, assume that you start with a fresh copy of the data. In other words, if question #5 asks you to change the data, assume that question #6 starts again with the original data shown above.
Focus on regex code, but use other R code if necessary
You may use any of the regex functions we covered in class:
grep(): Find patterns and return matching indices or values
grepl(): Find patterns and return TRUE/FALSE for each element
sub(): Replace the first occurrence of a pattern
gsub(): Replace all occurrences of a pattern
strsplit(): Split strings based on a pattern
In addition you can use other R code to help.
Advice on approaching the questions
For each question:
First try to solve it yourself
If you get stuck, try to break down the pattern you’re looking for into smaller pieces
Test your solution with the sample data to make sure it catches all cases
The answer must work even if the data changes.
You solution MUST work, even if the actual data changes. For example
QUESTION: Display all entries in the comments column that contain the word “for”.
CORRECT ANSWER: The following answer uses the \b metacharacter to indicate the start and the end of the word.
grep("\\bfor\\b", expenses$comments, value=TRUE)
[1] "paper for printer"
WRONG ANSWER: The following is obviously a wrong answer, even though it produces the same results. The answer refers to a specific row. The output would be different if the order of the rows changed. The answer must assume no knowledge of the actual data in the dataframe.
expenses[2,]
employee date amount category comments
2 schwartz, joe 01/19/2023 $27.00 office supplies paper for printer
ANOTHER WRONG ANSWER: The following is not totally correct either even though the output looks right. Can you see why?
grep("for", expenses$comments, value=TRUE)
[1] "paper for printer"
# THE ANSWER IS WRONG BECAUSE ...## Suppose someone were to add a new row that contained the comment # "printer forms", the code## grep("for", expenses$comments, value=TRUE) ## simply searches for the # letters "for" anywhere in the comment and would incorrectly match# the "for" in the word "forms". The question specifically said that # the comment must contain the word for. Even though none of the words# "form" or "forms" or "format" or "before" or "information", etc.# appear in the data, the question asked for the word "for" and you # answer should only pick up the word "for".## Let's try itnewExpense =data.frame(employee ="Joe Smith",date ="1/1/2026",amount ="$100.00",category ="office supplies",comments ="printer forms")expenses =rbind(expenses, newExpense)grep("for", expenses$comments, value=TRUE)
[1] "paper for printer" "printer forms"
THERE COULD BE MORE THAN ONE CORRECT ANSWER: Suppse you forgot about the \b metacharacter (which indicates a word break), you could have answered the question using the following code:
pattern =paste0("^for$", # "for" is the only text in the comment"|^for[^A-Za-z]", # or "for" is at the beginning and followed by a non-letter "|[^A-Za-z]for$", # or "for" is at the end and preceded by a non-letter"|[^A-Za-z]for[^A-Za-z]"# or "for" is in the middle and preceded and followed by non-letters )grep(pattern, expenses$comments, value=TRUE)
[1] "paper for printer"
24.2.2 Warm up Questions
Question 1a
Write R code that outputs all entries in the employee column where the very last character in the entry is a lowercase “i” (Do not show the entry if the ending is i followed by a space or a period, etc)
Given the data above, your output should show the following:
[1] "Chen, Wei"
Question 1b
Modify your answer to the previous part so that the code outputs all entries where the last LETTER is lowercase “i” (even if it might be followed by digits or spaces or periods, etc)
Given the data above, your output should show the following:
[1] "Chen, Wei" "Mike Turchini "
Click here for answer
Question 2a
Show all names that have whitespace (spaces, tabs, newlines) at the beginning or end of the entry.
Given the data above, your output should show the following:
:::
[1] " schwartz, joe" " Williams, Kate " " Williams, Kate " "Mike Turchini "
Question 2b
Use regular expressions to remove all whitespace that appears before or after an entry in the employee column. You can do this in more than one command if you like.
Prove that your code worked by comparing the names that you found in previous question to the new versions from this question.
Click here for answer
############################ Some preparation############################ Get the rows that contain leading or trailing whitespacerowNumbers =grep("^\\s+|\\s+$", expenses$employee)rowNumbers
[1] 2 9 34 49
# Take a copy of the original data in order to modify it without affecting the original datanewExpenses = expenses############################# Modify the new dataframe############################# Remove any spaces BEFORE the namesnewExpenses$employee =sub("^\\s+", "", expenses$employee)# Remove any spaces AFTER the namesnewExpenses$employee =sub("\\s+$", "", expenses$employee)################################################################################ Compare the rows that were changed in the old dataframe and the new dataframe###############################################################################data.frame ( oldEntries =paste0('"' , expenses$employee[rowNumbers] , '"'),fixedEntries =paste0('"' , newExpenses$employee[rowNumbers] , '"') )
Write R code using regex to identify all names that are in “last, first” format (e.g., “Schwartz, Joe”, including those with suffixes like “Sr.” or “Jr.”). Make sure that you allow for any number of spaces between the comma and the first name (including if there aren’t any spaces).
Given the data above, your output should show the following:
Conway, Ella Sr.
Brown, Tom
Martinez, Ana Sr.
Johnson,Mary
Cooper, Sam
Rodriguez, Eva Sr.
Chen, Wei
Garcia, Maria Jr.
Brown, Tom
Martinez, Ana Sr.
Johnson,Mary
Cooper, Sam
Rodriguez, Eva Sr.
Conway, Ella Sr.
Brown, Tom
Martinez, Ana Sr.
Johnson,Mary
Cooper, Sam
Rodriguez, Eva Sr.
Chen, Wei
Garcia, Maria Jr.
Brown, Tom
Martinez, Ana Sr.
Johnson,Mary
Cooper, Sam
Rodriguez, Eva Sr.
Question 1b
Modify your answer to show only names that are strictly in
last, first form (without a suffix)
Given the data above, your output should show the following:
Brown, Tom
Johnson,Mary
Cooper, Sam
Chen, Wei
Brown, Tom
Johnson,Mary
Cooper, Sam
Brown, Tom
Johnson,Mary
Cooper, Sam
Chen, Wei
Brown, Tom
Johnson,Mary
Cooper, Sam
Question 2
Write R code using regex to display only names that begin with “Dr” (with or without a period). Make sure that you don’t display names (e.g. “Drake Wilson”) that begin with Dr but who aren’t doctors.
Given the data above, your output should show the following:
Dr. A. Davis
Dr Jones
Dr. B. Wilson
Dr Smith
Dr. C. Taylor
Dr Chen
Dr. Evans
Dr. B. Wilson
Dr Smith
Dr. C. Taylor
Dr Chen
Dr. A. Davis
Dr Jones
Dr. B. Wilson
Dr Smith
Dr. C. Taylor
Dr Chen
Dr. Evans
Dr. B. Wilson
Dr Smith
Dr. C. Taylor
Dr Chen
Question 3
Write R code using regex to identify all names that are completely lowercase (no capitalization).
Given the data above, your output should show the following:
schwartz, joe
mike david harris
robert james lee
mike green jr.
peter michael wu
jennifer ann martinez
robert james lee
mike green jr.
peter michael wu
schwartz, joe
mike david harris
robert james lee
mike green jr.
peter michael wu
jennifer ann martinez
robert james lee
mike green jr.
peter michael wu
Question 4
Write R code using regex to identify all names that include a middle name or middle initial.
Given the data above, your output should show the following:
mike david harris
Dr. A. Davis
S. Jones Jr
Conway, Ella Sr.
robert james lee
mike green jr.
Dr. B. Wilson
R. Brown Jr
Martinez, Ana Sr.
peter michael wu
Dr. C. Taylor
T. Wilson Jr
Rodriguez, Eva Sr.
jennifer ann martinez
Ms. B. Kim
R. Patel Ph.D
Garcia, Maria Jr.
robert james lee
mike green jr.
Dr. B. Wilson
R. Brown Jr
Martinez, Ana Sr.
peter michael wu
Dr. C. Taylor
T. Wilson Jr
Rodriguez, Eva Sr.
mike david harris
Dr. A. Davis
S. Jones Jr
Conway, Ella Sr.
robert james lee
mike green jr.
Dr. B. Wilson
R. Brown Jr
Martinez, Ana Sr.
peter michael wu
Dr. C. Taylor
T. Wilson Jr
Rodriguez, Eva Sr.
jennifer ann martinez
Ms. B. Kim
R. Patel Ph.D
Garcia, Maria Jr.
robert james lee
mike green jr.
Dr. B. Wilson
R. Brown Jr
Martinez, Ana Sr.
peter michael wu
Dr. C. Taylor
T. Wilson Jr
Rodriguez, Eva Sr.
Question 5
Write R code using regex to identify all names that have a suffix (Sr. or Jr., with or without the period).
Given the data above, your output should show the following:
S. Jones Jr
Conway, Ella Sr.
R. Brown Jr
Martinez, Ana Sr.
T. Wilson Jr
Rodriguez, Eva Sr.
Garcia, Maria Jr.
R. Brown Jr
Martinez, Ana Sr.
T. Wilson Jr
Rodriguez, Eva Sr.
S. Jones Jr
Conway, Ella Sr.
R. Brown Jr
Martinez, Ana Sr.
T. Wilson Jr
Rodriguez, Eva Sr.
Garcia, Maria Jr.
R. Brown Jr
Martinez, Ana Sr.
T. Wilson Jr
Rodriguez, Eva Sr.
24.2.4 Date Format Questions
Question 6
Write R code using regex to identify dates in the MM/DD/YYYY format (with or without leading zeros), ensuring valid months (1-12) and days (1-31).
Given the data above, your output should show the following:
Write R code using regex to identify comments that are either NA, “NA”, or consist entirely of dashes (e.g “—” and “—–”) or are the empty string, i.e. ““.
Given the data above, your output should show the following:
lunch with client
dinner meeting
team lunch
team dinner
client lunch
team lunch
staff lunch
team outing
24.2.8 Challenge Questions
Question 18
Write R code using regex to extract just the last name from any name format in the employee column. Your solution should work for all name formats in the dataset.
Given the data above, your output should show the following:
Smith
schwartz
harris
Davis
Jones
Jones
Conway
Brown
Williams
lee
green
Wilson
Smith
Brown
Martinez
Johnson
Cooper
wu
Taylor
Wilson
Chen
Wilson
Rodriguez
Williams
Chen
martinez
Kim
Thompson
Patel
Garcia
Evans
Thompson
Brown
Williams
lee
green
Wilson
Smith
Brown
Martinez
Johnson
Cooper
wu
Taylor
Wilson
Chen
Wilson
Rodriguez
Turchini
Click here for answer
name =trimws(expenses$employee)# Strip trailing suffixes (Jr, Sr, Ph.D, III, II, IV) so "last word" is the surname (case-insensitive)suffix_pat ="\\s*(Jr\\.?|Sr\\.?|Ph\\.?D\\.?|III|IV|II)\\s*$"name_clean =trimws(sub(suffix_pat, "", name, ignore.case =TRUE))last_name =ifelse(grepl("^[^,]+,", name),trimws(sub("^([^,]+),.*", "\\1", name)),sub(".*\\s+([^ ]+)$", "\\1", name_clean))# Remove any suffix from last_name (e.g. "Smith Jr" in comma format)answer =trimws(sub(suffix_pat, "", last_name, ignore.case =TRUE))
cat(answer, sep="\n")
Smith
schwartz
harris
Davis
Jones
Jones
Conway
Brown
Williams
lee
green
Wilson
Smith
Brown
Martinez
Johnson
Cooper
wu
Taylor
Wilson
Chen
Wilson
Rodriguez
Williams
Chen
martinez
Kim
Thompson
Patel
Garcia
Evans
Thompson
Brown
Williams
lee
green
Wilson
Smith
Brown
Martinez
Johnson
Cooper
wu
Taylor
Wilson
Chen
Wilson
Rodriguez
Turchini
Question 19
Write R code using regex to convert all dates to YYYY-MM-DD format, regardless of their original format. Your solution should handle all date formats in the dataset.
Given the data above, your output should show the following:
2023-01-19
2023-01-19
NA
NA
23-01-19
NA
NA
2023-02-15
2023-02-15
NA
NA
NA
23-02-15
NA
NA
2023-03-22
2023-03-22
NA
2023-02-03
24-03-14
23-03-02
NA
NA
2023-02-20
NA
NA
NA
23-02-21
NA
NA
2023-03-01
2023-03-05
2023-03-10
2023-03-01
NA
NA
NA
23-03-01
NA
NA
2023-03-25
2023-03-25
NA
2023-03-10
24-04-01
23-03-10
NA
NA
NA
2023-01-19
2023-01-19
NA
NA
23-01-19
NA
NA
2023-02-15
2023-02-15
NA
NA
NA
23-02-15
NA
NA
2023-03-22
2023-03-22
NA
2023-02-03
24-03-14
23-03-02
NA
NA
2023-02-20
NA
NA
NA
23-02-21
NA
NA
2023-03-01
2023-03-05
2023-03-10
2023-03-01
NA
NA
NA
23-03-01
NA
NA
2023-03-25
2023-03-25
NA
2023-03-10
24-04-01
23-03-10
NA
NA
NA
Question 20
Write R code using regex to standardize all amounts to a decimal number format (e.g., “1234.56”), removing dollar signs, commas, and converting written amounts to numbers.
Given the data above, your output should show the following:
Write a function named cleanup that takes the dataframe as an argument and returns a cleaned dataframe. Each column should have a consistent format for all columns. Do not lose any data in the process.
Split the employee column into the following fields (and remove the original single employee column):
employeeFullName — a paste of all the name components below (title, first, middle, last, suffix)
employeeTitle — e.g. Dr, Ms (no period)
employeeFirst — first name or initial
employeeMiddle — middle name or initial (NA if none)
employeeLast — last name or initial
employeeSuffix — e.g. Sr, Jr (no period; NA if none)
Leave out periods from all name parts (e.g. J. becomes J, Dr. becomes Dr).
All other columns (date, amount, category, comments) should be cleaned to a consistent structure as before.
Given the data above, your output should show the following:
cleanup(expenses)
employeeFullName employeeTitle employeeFirst employeeMiddle employeeLast employeeSuffix date amount category comments
1 Sue Smith <NA> Sue <NA> Smith <NA> 2023-01-19 59.99 food lunch with client
2 Joe Schwartz <NA> Joe <NA> Schwartz <NA> 2023-01-19 27.00 office supplies paper for printer
3 Mike David Harris <NA> Mike David Harris <NA> 2023-01-19 25.00 office supplies <NA>
4 Dr A Davis Dr A <NA> Davis <NA> 2023-01-19 500.00 food <NA>
5 Dr Jones Dr <NA> <NA> Jones <NA> 2023-01-19 1234.56 office supplies chairs
6 S Jones Jr <NA> S <NA> Jones Jr 2023-01-19 1000.00 office supplies desk
7 Ella Conway Sr <NA> Ella <NA> Conway Sr 2023-01-19 35.23 food <NA>
8 Tom Brown <NA> Tom <NA> Brown <NA> 2023-02-15 42.50 food dinner meeting
9 Kate Williams <NA> Kate <NA> Williams <NA> 2023-02-15 89.99 office supplies toner cartridge
10 Robert James Lee <NA> Robert James Lee <NA> 2023-02-15 75.00 office supplies <NA>
11 Mike Green Jr <NA> Mike <NA> Green Jr 2023-02-15 500.00 office supplies <NA>
12 Dr B Wilson Dr B <NA> Wilson <NA> 2023-02-15 200.00 food <NA>
13 Dr Smith Dr <NA> <NA> Smith <NA> 2023-02-15 2345.67 office supplies monitor
14 R Brown Jr <NA> R <NA> Brown Jr 2023-03-02 750.00 office supplies filing cabinet
15 Ana Martinez Sr <NA> Ana <NA> Martinez Sr 2023-02-15 48.75 food <NA>
16 Mary Johnson <NA> Mary <NA> Johnson <NA> 2023-03-22 63.45 food team lunch
17 Sam Cooper <NA> Sam <NA> Cooper <NA> 2023-03-22 156.00 office supplies office decoration
18 Peter Michael Wu <NA> Peter Michael Wu <NA> 2023-03-22 50.00 office supplies <NA>
19 Dr C Taylor Dr C <NA> Taylor <NA> 2023-02-03 1000.00 food <NA>
20 Drake Wilson <NA> Drake <NA> Wilson <NA> 2024-03-14 500.00 office supplies printer
21 Dr Chen Dr <NA> <NA> Chen <NA> 2023-03-02 3456.78 office supplies laptop
22 T Wilson Jr <NA> T <NA> Wilson Jr <NA> 500.00 office supplies printer
23 Eva Rodriguez Sr <NA> Eva <NA> Rodriguez Sr 2023-03-22 92.45 food <NA>
24 Pat Williams <NA> Pat <NA> Williams <NA> 2023-02-20 150.00 travel train ticket
25 Wei Chen <NA> Wei <NA> Chen <NA> 2023-02-20 89.50 travel hotel
26 Jennifer Ann Martinez <NA> Jennifer Ann Martinez <NA> 2023-02-21 42.00 food <NA>
27 Ms B Kim Ms B <NA> Kim <NA> 2023-02-21 12.50 office supplies stapler
28 J Thompson <NA> J <NA> Thompson <NA> 2023-02-21 2500.00 travel flight
29 R Patel <NA> R <NA> Patel <NA> 2023-02-21 75.00 office supplies notebooks
30 Maria Garcia Jr <NA> Maria <NA> Garcia Jr 2023-02-22 120.00 food team dinner
31 Dr Evans Dr <NA> <NA> Evans <NA> 2023-03-01 95.99 office supplies supplies
32 Thompson III <NA> <NA> <NA> Thompson III 2023-03-05 200.00 travel conference
33 Tom Brown <NA> Tom <NA> Brown <NA> 2023-03-10 65.00 food client lunch
34 Kate Williams <NA> Kate <NA> Williams <NA> 2023-03-01 45.00 office supplies paper restock
35 Robert James Lee <NA> Robert James Lee <NA> 2023-03-01 25.00 office supplies mouse pad
36 Mike Green Jr <NA> Mike <NA> Green Jr 2023-03-05 150.00 office supplies desk lamp
37 Dr B Wilson Dr B <NA> Wilson <NA> 2023-02-20 85.00 food team lunch
38 Dr Smith Dr <NA> <NA> Smith <NA> 2023-03-01 800.00 office supplies tablet
39 R Brown Jr <NA> R <NA> Brown Jr 2023-03-10 300.00 office supplies office chair
40 Ana Martinez Sr <NA> Ana <NA> Martinez Sr 2023-02-20 30.00 food coffee meeting
41 Mary Johnson <NA> Mary <NA> Johnson <NA> 2023-03-25 22.00 food snacks
42 Sam Cooper <NA> Sam <NA> Cooper <NA> 2023-03-25 75.00 office supplies whiteboard markers
43 Peter Michael Wu <NA> Peter Michael Wu <NA> 2023-03-25 35.00 office supplies mouse
44 Dr C Taylor Dr C <NA> Taylor <NA> 2023-03-10 120.00 food staff lunch
45 Drake Wilson <NA> Drake <NA> Wilson <NA> 2024-04-01 250.00 office supplies scanner
46 Dr Chen Dr <NA> <NA> Chen <NA> 2023-03-10 900.00 office supplies monitor
47 T Wilson Jr <NA> T <NA> Wilson Jr <NA> 200.00 office supplies router
48 Eva Rodriguez Sr <NA> Eva <NA> Rodriguez Sr 2023-04-01 55.00 food team outing
49 Mike Turchini <NA> Mike <NA> Turchini <NA> 2024-01-31 100.00 food <NA>
Click here for one possible solution
# =============================================================================# STRING HELPERS# =============================================================================# strip_periods(text)# Arguments: text = character vector (e.g. "Dr.", "J. Smith")# Returns: character vector of same length with periods removed and# leading/trailing spaces trimmed.# Example: strip_periods(" J. ") -> "J"strip_periods <-function(text) {gsub("\\.", "", trimws(text))}# to_sentence_case(text)# Arguments: text = single string (e.g. "mike david", "O'BRIEN")# Returns: string with first letter of each word uppercase, rest lowercase (Sentence case).# Returns NA unchanged if text is NA; empty string returned as-is.# Example: to_sentence_case("mike david harris") -> "Mike David Harris"to_sentence_case <-function(text) {if (is.na(text) ||nchar(trimws(text)) ==0) return(text) words <-strsplit(trimws(text), "\\s+")[[1]] capped <-paste0(toupper(substring(words, 1, 1)), tolower(substring(words, 2)))paste(capped, collapse =" ")}# =============================================================================# NAME PARSING — split into small functions by format# =============================================================================# has_suffix(namePart)# Arguments: namePart = single string, no leading/trailing title (e.g. "Jones Jr")# Returns: TRUE if name ends with a suffix like Jr, Sr, III, II, IV.# Example: has_suffix("Jones Jr") -> TRUE# Suffix pattern: one of Jr/Sr/III/II/IV at end. Require at least one space before suffix# so that "Thompson III" correctly captures "III" (R's regex can otherwise match "II" inside "III").suffix_pattern <-"(Jr\\.?|Sr\\.?|III|II|IV)"suffix_at_end_regex <-paste0("\\s+", suffix_pattern, "\\s*$")has_suffix <-function(namePart) {grepl(suffix_at_end_regex, namePart, ignore.case =TRUE)}# extract_suffix(namePart)# Arguments: namePart = string that may end with a suffix# Returns: list of two: suffix (string or NA), and nameWithSuffixRemoved (string).# Example: extract_suffix("Conway, Ella Sr.") -> list(suffix = "Sr", nameWithSuffixRemoved = "Conway, Ella")# Uses two groups (name)(suffix) so the suffix is captured correctly; space before suffix required.extract_suffix <-function(namePart) {if (!has_suffix(namePart)) {return(list(suffix =NA, nameWithSuffixRemoved =trimws(namePart))) }# Two groups: everything before the suffix, and the suffix. Requiring \s+ before suffix fixes III vs II. nameWithoutSuffix <-sub(paste0("^(.*)\\s+", suffix_pattern, "\\s*$"), "\\1", namePart, ignore.case =TRUE) suffix <-sub(paste0("^.*\\s+", suffix_pattern, "\\s*$"), "\\1", namePart, ignore.case =TRUE)list(suffix =strip_periods(suffix), nameWithSuffixRemoved =trimws(nameWithoutSuffix))}# normalize_suffix(suffix)# Arguments: suffix = string (e.g. "jr", "SR", "iii") or NA# Returns: correctly cased suffix: "Jr", "Sr", "II", "III", "IV", or NA if not recognized.# Example: normalize_suffix("jr") -> "Jr"normalize_suffix <-function(suffix) {if (is.na(suffix) ||trimws(suffix) =="") return(NA) s <-tolower(trimws(suffix))if (s %in%c("jr", "jr.")) return("Jr")if (s %in%c("sr", "sr.")) return("Sr")if (s =="ii") return("II")if (s =="iii") return("III")if (s =="iv") return("IV") suffix}# has_trailing_title(namePart)# Arguments: namePart = string that might end with "Ph.D" or "PhD"# Returns: TRUE if it ends with that.trailing_phd_regex <-paste0("\\s+", # space before Ph.D"Ph\\.?D\\.?", # Ph.D or PhD (periods optional)"\\s*$"# optional spaces at end)has_trailing_title <-function(namePart) {grepl(trailing_phd_regex, namePart, ignore.case =TRUE)}# extract_trailing_phd(namePart)# Returns: list(titlePart = "PhD" or NA, nameWithTitleRemoved = string).extract_trailing_phd <-function(namePart) {if (!has_trailing_title(namePart)) {return(list(titlePart =NA, nameWithTitleRemoved =trimws(namePart))) } tit <-sub(paste0(".*", trailing_phd_regex), "\\1", namePart, ignore.case =TRUE) nameWithout <-sub(trailing_phd_regex, "", namePart, ignore.case =TRUE)list(titlePart =strip_periods(tit), nameWithTitleRemoved =trimws(nameWithout))}# has_leading_title(namePart)# TRUE if name starts with Dr, Mr, Ms, Mrs (with optional period).leading_title_regex <-paste0("^(Dr\\.?|Mr\\.?|Ms\\.?|Mrs\\.?)", # title at start"\\s+"# at least one space after)has_leading_title <-function(namePart) {grepl(leading_title_regex, namePart, ignore.case =TRUE)}# extract_leading_title(namePart)# Returns: list(titlePart = "Dr" or NA, nameWithTitleRemoved = string).extract_leading_title <-function(namePart) {if (!has_leading_title(namePart)) {return(list(titlePart =NA, nameWithTitleRemoved =trimws(namePart))) } tit <-sub(paste0(leading_title_regex, ".*"), "\\1", namePart, ignore.case =TRUE) nameWithout <-sub(leading_title_regex, "", namePart, ignore.case =TRUE)list(titlePart =strip_periods(tit), nameWithTitleRemoved =trimws(nameWithout))}# parse_name_last_comma_first(namePart)# For names like "Last, First" or "Last, First Middle".# Arguments: namePart = string containing exactly one comma# Returns: list(first =, middle =, last =). Periods stripped from each.parse_name_last_comma_first <-function(namePart) { parts <-strsplit(namePart, ",\\s*")[[1]] last <-strip_periods(trimws(parts[1])) firstAndMiddle <-strsplit(trimws(parts[2]), "\\s+")[[1]] first <-strip_periods(firstAndMiddle[1]) middle <-if (length(firstAndMiddle) >=2) strip_periods(firstAndMiddle[2]) elseNAlist(first = first, middle = middle, last = last)}# parse_name_first_last(namePart)# For names like "First Last", "First Middle Last", or just "Last".# Arguments: namePart = string with no comma# Returns: list(first =, middle =, last =).parse_name_first_last <-function(namePart) { words <-strsplit(trimws(namePart), "\\s+")[[1]]if (length(words) ==1) {list(first =NA, middle =NA, last =strip_periods(words[1])) } elseif (length(words) ==2) {list(first =strip_periods(words[1]), middle =NA, last =strip_periods(words[2])) } else { first <-strip_periods(words[1]) middle <-strip_periods(words[2]) last <-paste(strip_periods(words[3:length(words)]), collapse =" ")list(first = first, middle = middle, last = last) }}# parse_one_employee(employeeName)# Arguments: employeeName = one raw employee string (e.g. "Dr. A. Davis", "Schwartz, Joe")# Returns: list with employeeTitle, employeeFirst, employeeMiddle, employeeLast,# employeeSuffix, employeeFullName. All name parts without periods.# Example: parse_one_employee("Conway, Ella Sr.") -> first=Ella, last=Conway, suffix=Sr, title=NAparse_one_employee <-function(employeeName) { raw <-trimws(employeeName) outSuffix <-extract_suffix(raw) raw <- outSuffix$nameWithSuffixRemoved outPhd <-extract_trailing_phd(raw) title <- outPhd$titlePart raw <- outPhd$nameWithTitleRemoved outLead <-extract_leading_title(raw)if (!is.na(outLead$titlePart)) { title <-if (is.na(title) || title =="") outLead$titlePart elsepaste0(outLead$titlePart, " ", title) } raw <- outLead$nameWithTitleRemovedif (is.na(title) || title =="") title <-NAif (length(strsplit(raw, ",\\s*")[[1]]) >=2) { parsed <-parse_name_last_comma_first(raw) } else { parsed <-parse_name_first_last(raw) } first <-to_sentence_case(parsed$first) middle <-to_sentence_case(parsed$middle) last <-to_sentence_case(parsed$last) suffix <-normalize_suffix(outSuffix$suffix) fullParts <-c(title, first, middle, last, suffix) fullParts <- fullParts[!is.na(fullParts) & fullParts !=""] employeeFullName <-paste(fullParts, collapse =" ")list(employeeTitle = title, employeeFirst = first, employeeMiddle = middle,employeeLast = last, employeeSuffix = suffix, employeeFullName = employeeFullName)}# parse_employee_column(employeeNames)# Arguments: employeeNames = character vector of raw employee strings# Returns: data frame with columns employeeFullName, employeeTitle, employeeFirst,# employeeMiddle, employeeLast, employeeSuffix.parse_employee_column <-function(employeeNames) { rows <-lapply(employeeNames, parse_one_employee) out <-as.data.frame(do.call(rbind, lapply(rows, unlist)), stringsAsFactors =FALSE) out <- out[c("employeeFullName", "employeeTitle", "employeeFirst", "employeeMiddle", "employeeLast", "employeeSuffix")] out}# =============================================================================# DATE PARSING# =============================================================================# is_iso_date(dateStr)# Returns TRUE if string is already yyyy-mm-dd (e.g. "2023-01-19").iso_date_regex <-paste0("^[0-9]{4}", # 4 digits at start (year)"-","[0-9]{1,2}", # 1 or 2 digits (month)"-","[0-9]{1,2}$"# 1 or 2 digits at end (day))is_iso_date <-function(dateStr) {grepl(iso_date_regex, trimws(dateStr))}# is_month_name_date(dateStr)# Returns TRUE for strings like "Jan 19, 2023" or "January 19, 2023".month_name_date_regex <-paste0("^[A-Za-z]+", # month name (letters)"\\s+","[0-9]{1,2}", # day",\\s*","[0-9]{2,4}$"# 2- or 4-digit year)is_month_name_date <-function(dateStr) {grepl(month_name_date_regex, trimws(dateStr))}# parse_one_date(dateStr)# Arguments: dateStr = one date string in any of the supported formats# Returns: string "yyyy-mm-dd" or NA if unparseable.# Example: parse_one_date("1/19/2023") -> "2023-01-19"parse_one_date <-function(dateStr) { s <-trimws(dateStr)if (is_iso_date(s)) return(s)if (is_month_name_date(s)) { d <-as.Date(s, format ="%b %d, %Y")if (is.na(d)) d <-as.Date(s, format ="%B %d, %Y")return(format(d, "%Y-%m-%d")) } date_parts <-strsplit(s, "[/-]")[[1]]if (length(date_parts) !=3) return(NA) n1 <-as.integer(date_parts[1]) n2 <-as.integer(date_parts[2]) n3 <-as.integer(date_parts[3]) len1 <-nchar(date_parts[1]) len3 <-nchar(date_parts[3])if (len1 ==4) { y <- n1; m <- n2; d <- n3 } elseif (len3 ==4) { y <- n3if (n1 >12) { d <- n1; m <- n2 } elseif (n2 >12) { m <- n1; d <- n2 } else { m <- n1; d <- n2 } } else { y <-2000+ (n3 -2000) %%100if (n1 >12) { d <- n1; m <- n2 } elseif (n2 >12) { m <- n1; d <- n2 } else { m <- n1; d <- n2 } }sprintf("%04d-%02d-%02d", y, m, d)}# parse_dates(dateStrings)# Arguments: dateStrings = character vector of date strings# Returns: character vector of same length, each element "yyyy-mm-dd" or NA.parse_dates <-function(dateStrings) {sapply(trimws(dateStrings), parse_one_date)}# =============================================================================# AMOUNT, CATEGORY, COMMENTS# =============================================================================# word_to_number(word)# Arguments: word = string with a number in words (e.g. "five hundred", "twenty-one",# "one thousand two hundred thirty-four"). Hyphens and "and" are allowed.# Returns: numeric value, or NA if the string is not a recognized number phrase.# Examples: word_to_number("five hundred") -> 500; word_to_number("twenty-one") -> 21;# word_to_number("one thousand two hundred thirty four") -> 1234.# Supports: zero–nine, ten–nineteen, twenty–ninety, hundred, thousand, million.word_to_number <-function(word) { w <-tolower(trimws(word)) w <-gsub("-", " ", w) w <-gsub("\\band\\b", " ", w) words <-strsplit(w, "\\s+")[[1]] words <- words[words !=""]if (length(words) ==0) return(NA) ones <-c(zero =0, one =1, two =2, three =3, four =4, five =5, six =6, seven =7, eight =8, nine =9) teens <-c(ten =10, eleven =11, twelve =12, thirteen =13, fourteen =14, fifteen =15,sixteen =16, seventeen =17, eighteen =18, nineteen =19) tens <-c(twenty =20, thirty =30, forty =40, fifty =50, sixty =60, seventy =70, eighty =80, ninety =90) total <-0 current <-0 i <-1while (i <=length(words)) { x <- words[i]if (x %in%names(ones)) { current <- current + ones[x] i <- i +1 } elseif (x %in%names(teens)) { current <- current + teens[x] i <- i +1 } elseif (x %in%names(tens)) { current <- current + tens[x] i <- i +1 } elseif (x =="hundred") {if (current ==0) current <-1 current <- current *100 i <- i +1 } elseif (x =="thousand") {if (current ==0) current <-1 total <- total + current *1000 current <-0 i <- i +1 } elseif (x =="million") {if (current ==0) current <-1 total <- total + current *1000000 current <-0 i <- i +1 } else {return(NA) } } total + current}# clean_amounts(amountStrings)# Arguments: amountStrings = character vector (e.g. "$27.00", "1,234.56", "five hundred")# Returns: numeric vector. Removes $ and commas; converts word amounts when possible.clean_amounts <-function(amountStrings) { x <-trimws(amountStrings) result <-numeric(length(x))for (i inseq_along(x)) { s <-gsub("[$,]", "", x[i]) s <-trimws(s)if (grepl("^[0-9.]+$", s)) { result[i] <-as.numeric(s) } else { result[i] <-word_to_number(x[i]) } } result}# clean_categories(categoryStrings)# Arguments: categoryStrings = character vector (e.g. "FOOD", "office suppl.")# Returns: character vector, lowercase, with abbreviations normalized# (e.g. "office suppl." and "Office Stuff" -> "office supplies", "LUNCH" -> "food").office_suppl_regex <-paste0("\\boffice\\s+suppl\\.?\\s*$"# "office suppl" or "office suppl." at end of string)office_stuff_regex <-"\\boffice\\s+stuff\\s*$"lunch_regex <-"\\blunch\\s*$"clean_categories <-function(categoryStrings) { x <-tolower(trimws(categoryStrings)) x <-sub(office_suppl_regex, "office supplies", x) x <-sub(office_stuff_regex, "office supplies", x) x <-sub(lunch_regex, "food", x) x}# clean_comments(commentStrings)# Arguments: commentStrings = character vector; can contain placeholders like "---", "NA"# Returns: character vector; placeholders and empty strings replaced with NA.placeholder_regex <-"^(---|NA|n/a)$"clean_comments <-function(commentStrings) { x <-trimws(as.character(commentStrings)) x[grepl(placeholder_regex, x, ignore.case =TRUE)] <-NA x[x ==""] <-NA x}# =============================================================================# MAIN CLEANUP# =============================================================================# cleanUp(expensesDF)# Arguments: expensesDF = data frame with columns employee, date, amount, category, comments# Returns: data frame with employee replaced by employeeFullName, employeeTitle,# employeeFirst, employeeMiddle, employeeLast, employeeSuffix; and with# date, amount, category, comments cleaned to consistent formats.# Example: cleanUp(expenses) -> wide table with cleaned columnscleanup <-function(expensesDF) { df <-as.data.frame(expensesDF, stringsAsFactors =FALSE) emp <-parse_employee_column(df$employee) df$employee <-NULL df <-cbind(emp, df) df$date <-parse_dates(df$date) df$amount <-clean_amounts(df$amount) df$category <-clean_categories(df$category) df$comments <-clean_comments(df$comments) df}cleanup(expenses)
employeeFullName employeeTitle employeeFirst employeeMiddle employeeLast employeeSuffix date amount category comments
1 Sue Smith <NA> Sue <NA> Smith <NA> 2023-01-19 59.99 food lunch with client
2 Joe Schwartz <NA> Joe <NA> Schwartz <NA> 2023-01-19 27.00 office supplies paper for printer
3 Mike David Harris <NA> Mike David Harris <NA> 2023-01-19 25.00 office supplies <NA>
4 Dr A Davis Dr A <NA> Davis <NA> 2023-01-19 500.00 food <NA>
5 Dr Jones Dr <NA> <NA> Jones <NA> 2023-01-19 1234.56 office supplies chairs
6 S Jones Jr <NA> S <NA> Jones Jr 2023-01-19 1000.00 office supplies desk
7 Ella Conway Sr <NA> Ella <NA> Conway Sr 2023-01-19 35.23 food <NA>
8 Tom Brown <NA> Tom <NA> Brown <NA> 2023-02-15 42.50 food dinner meeting
9 Kate Williams <NA> Kate <NA> Williams <NA> 2023-02-15 89.99 office supplies toner cartridge
10 Robert James Lee <NA> Robert James Lee <NA> 2023-02-15 75.00 office supplies <NA>
11 Mike Green Jr <NA> Mike <NA> Green Jr 2023-02-15 500.00 office supplies <NA>
12 Dr B Wilson Dr B <NA> Wilson <NA> 2023-02-15 200.00 food <NA>
13 Dr Smith Dr <NA> <NA> Smith <NA> 2023-02-15 2345.67 office supplies monitor
14 R Brown Jr <NA> R <NA> Brown Jr 2023-03-02 750.00 office supplies filing cabinet
15 Ana Martinez Sr <NA> Ana <NA> Martinez Sr 2023-02-15 48.75 food <NA>
16 Mary Johnson <NA> Mary <NA> Johnson <NA> 2023-03-22 63.45 food team lunch
17 Sam Cooper <NA> Sam <NA> Cooper <NA> 2023-03-22 156.00 office supplies office decoration
18 Peter Michael Wu <NA> Peter Michael Wu <NA> 2023-03-22 50.00 office supplies <NA>
19 Dr C Taylor Dr C <NA> Taylor <NA> 2023-02-03 1000.00 food <NA>
20 Drake Wilson <NA> Drake <NA> Wilson <NA> 2024-03-14 500.00 office supplies printer
21 Dr Chen Dr <NA> <NA> Chen <NA> 2023-03-02 3456.78 office supplies laptop
22 T Wilson Jr <NA> T <NA> Wilson Jr <NA> 500.00 office supplies printer
23 Eva Rodriguez Sr <NA> Eva <NA> Rodriguez Sr 2023-03-22 92.45 food <NA>
24 Pat Williams <NA> Pat <NA> Williams <NA> 2023-02-20 150.00 travel train ticket
25 Wei Chen <NA> Wei <NA> Chen <NA> 2023-02-20 89.50 travel hotel
26 Jennifer Ann Martinez <NA> Jennifer Ann Martinez <NA> 2023-02-21 42.00 food <NA>
27 Ms B Kim Ms B <NA> Kim <NA> 2023-02-21 12.50 office supplies stapler
28 J Thompson <NA> J <NA> Thompson <NA> 2023-02-21 2500.00 travel flight
29 R Patel <NA> R <NA> Patel <NA> 2023-02-21 75.00 office supplies notebooks
30 Maria Garcia Jr <NA> Maria <NA> Garcia Jr 2023-02-22 120.00 food team dinner
31 Dr Evans Dr <NA> <NA> Evans <NA> 2023-03-01 95.99 office supplies supplies
32 Thompson III <NA> <NA> <NA> Thompson III 2023-03-05 200.00 travel conference
33 Tom Brown <NA> Tom <NA> Brown <NA> 2023-03-10 65.00 food client lunch
34 Kate Williams <NA> Kate <NA> Williams <NA> 2023-03-01 45.00 office supplies paper restock
35 Robert James Lee <NA> Robert James Lee <NA> 2023-03-01 25.00 office supplies mouse pad
36 Mike Green Jr <NA> Mike <NA> Green Jr 2023-03-05 150.00 office supplies desk lamp
37 Dr B Wilson Dr B <NA> Wilson <NA> 2023-02-20 85.00 food team lunch
38 Dr Smith Dr <NA> <NA> Smith <NA> 2023-03-01 800.00 office supplies tablet
39 R Brown Jr <NA> R <NA> Brown Jr 2023-03-10 300.00 office supplies office chair
40 Ana Martinez Sr <NA> Ana <NA> Martinez Sr 2023-02-20 30.00 food coffee meeting
41 Mary Johnson <NA> Mary <NA> Johnson <NA> 2023-03-25 22.00 food snacks
42 Sam Cooper <NA> Sam <NA> Cooper <NA> 2023-03-25 75.00 office supplies whiteboard markers
43 Peter Michael Wu <NA> Peter Michael Wu <NA> 2023-03-25 35.00 office supplies mouse
44 Dr C Taylor Dr C <NA> Taylor <NA> 2023-03-10 120.00 food staff lunch
45 Drake Wilson <NA> Drake <NA> Wilson <NA> 2024-04-01 250.00 office supplies scanner
46 Dr Chen Dr <NA> <NA> Chen <NA> 2023-03-10 900.00 office supplies monitor
47 T Wilson Jr <NA> T <NA> Wilson Jr <NA> 200.00 office supplies router
48 Eva Rodriguez Sr <NA> Eva <NA> Rodriguez Sr 2023-04-01 55.00 food team outing
49 Mike Turchini <NA> Mike <NA> Turchini <NA> 2024-01-31 100.00 food <NA>
24.2.7 Comment Format Questions
Question 16
Write R code using regex to identify comments that are either NA, “NA”, or consist entirely of dashes (e.g “—” and “—–”) or are the empty string, i.e. ““.
Given the data above, your output should show the following:
Question 17
Write R code using regex to identify comments that describe meetings (containing words like “lunch”, “dinner”, “client”, “team”).
Given the data above, your output should show the following: