21 21. “Cleaning” data

Often data comes to us in a “messy” format. One of the “less glamorous” but very important jobs of anyone working with data is to know how to “clean up” “messy” data. In other words to put the data into a common format.

For example, suppose that employees in a company submit expense reports every month for travel and other business expenses. It would be best if the data entry system (e.g. a website) enforced a standard format for the data. However, this doesn’t always happen. Sometimes the system tries to enforce a specific format but people “misuse” the system leading to inconsistencies between different entries. Sometimes data from different data entry systems are combined into a single data set.

The data below represents expenses from various employees. There are many inconsistencies between different entries. For example, some names are in “first last” format and some names are in “last, first” format. There are many other formatting inconsistencies that are listed below.

library(tibble)
expenses = tribble(
  ~employee,           ~date,          ~amount,          ~category,         ~comments,
  "Sue Smith",         "1/19/2023",    "59.99",         "food",            "lunch with client",
  "Schwartz, Joe",     "01/19/2023",   "$27.00",        "office supplies", "paper for printer",
  "mike david harris", "2023-01-19",   "25",            "Office Stuff",    NA,
  "Dr. A. Davis",      "19/1/2023",    "five hundred",  "FOOD",            "NA",
  "Dr Jones",          "1/19/23",      "1,234.56",      "office suppl.",   "chairs",
  "S. Jones Jr",       "19/1/23",      "1000",          "Office supplies", "desk",
  "Conway, Ella Sr.",  "Jan 19, 2023", "$35.23",        "LUNCH",           "---",
)

as.data.frame(expenses)

           employee         date       amount        category          comments
1         Sue Smith    1/19/2023        59.99            food lunch with client
2     Schwartz, Joe   01/19/2023       $27.00 office supplies paper for printer
3 mike david harris   2023-01-19           25    Office Stuff              <NA>
4      Dr. A. Davis    19/1/2023 five hundred            FOOD                NA
5          Dr Jones      1/19/23     1,234.56   office suppl.            chairs
6       S. Jones Jr      19/1/23         1000 Office supplies              desk
7  Conway, Ella Sr. Jan 19, 2023       $35.23           LUNCH               ---

The following are many of the inconsistencies

Inconsistencies with the name entries
- some names are in “first last” format and some names are in “last, first” format.
- some names are capitalized, some aren’t
- some names have middle include middle names, some don’t
- some names include titles, some don’t
- some names include a title, e.g. Dr. , while some don’t
- some of the titles contain a period after the title, some don’t
- some names include a suffix (e.g. Jr), some don’t
Inconsistencies with dates
- some dates are in mm/dd/yyyy format (USA style)
- some dates are in dd/mm/yyyy format (European style)
- some dates are in yyyy-mm-dd format (standard for many computer systems)
- some dates use 4 digit years, some use 2 digit years
- some dates have a leading zero for single digit days or months
There are additional inconsistencies with the amount, category and comments columns

22 What you need to do

Write a function named cleanUp that takes the dataframe as an argument and returns a cleaned dataframe where all of the values in each column follow a consistent structure.