25  25. dplyr verbs - select, filter, etc.

25.1 select()

Use the select function to get just the specified columns.

The first argument to the select function is named .data (notice the period before .data). It is used to specify which tibble or dataframe you are working with. The other arguments are used to specify which columns you’d like to see from the dataframe/tibble. Note that there are NO quotes around the column names (this differs from the base-r way of specifying column names in a dataframe).

# Show just the specified columns from the sales tibble
select(.data=sales, first, last, salesInThousands)
Error in select(.data = sales, first, last, salesInThousands): could not find function "select"

The columns appear in whatever order you specified in the select function. If you want to bring a few columns to the front followed by the rest of the columns, you can use the everything() function.

# Show the specified columns first, followed by the other columns.
select(.data=sales, first, last, everything())
Error in select(.data = sales, first, last, everything()): could not find function "select"

25.2 filter()

The filter function is used to retrieve the rows from a tibble that match a condition.

  • The first argument of the filter function is also .data. Just as with the select function this argument for the filter function is expected to be a tibble or a dataframe.

  • The next argument to filter is a logical expression. You can use the names of the columns of the tibble/dataframe in the expression. filter will examine each row of the tibble/dataframe and only return those rows whose values match the stated condition.

The following code displays only those rows whose salesInThousands is less than 100.

filter(.data=sales, salesInThousands < 100)
Error in filter(.data = sales, salesInThousands < 100): unused argument (.data = sales)

25.3 Using the dplyr functions in a “pipeline”

The tidyverse functions are designed so that you can be apply them one after the other. The clearest way to do that is to “pipe” the output of one function into the input of another.

The following pipeline first applies select to generate a new tibble that only contains the specified columns. Then the filter function then chooses only those rows that meet the specified condition.

sales |> 
  select(first, last, salesInThousands) |>
  filter(salesInThousands < 100)
Error in select(sales, first, last, salesInThousands): could not find function "select"
# In this example we could apply the filter before select.
# It technically doesn't matter which order you apply the functions in 
# as long as you realize that the output of one function is fed into 
# the input of the next function.

sales |> 
  filter(salesInThousands < 100) |>
  select(first, last, salesInThousands) 
Error in select(filter(sales, salesInThousands < 100), first, last, salesInThousands): could not find function "select"
# You can save the results in a variable as follows
lowSales = 
  sales |> 
    filter(salesInThousands < 100) |>
    select(first, last, salesInThousands) 
Error in select(filter(sales, salesInThousands < 100), first, last, salesInThousands): could not find function "select"
# Now the lowSales variable contains the results
lowSales
Error: object 'lowSales' not found
# You can continue to use the new variable for further processing
lowSales |> filter(salesInThousands > 80)
Error: object 'lowSales' not found

The dplyr package contains many different functions. A small group of these functions are known as the dplyr “verbs” and constitue the primary functions in the package. There are many other “helper” functions in the package.

Each “verb” function is used to manipulate tibbles (or dataframes) in a very specific way. When the different verb functions are combined together in a pipeline you can engineer very sophisticated manipulations of the data in the tibble/dataframe.

The main “verb functions” in dplyr are:

NOTE - The concept of “verb” functions was inspired greatly by the “clauses” the SQL SELECT statement. We will study SQL later.

25.4 arrange()

The arrange function reorders the rows based on specified sorting rules.

# Arrange all of the rows based on salesInThousands. 
sales |> 
  arrange(salesInThousands)
Error in arrange(sales, salesInThousands): could not find function "arrange"
# Select the first, last names and salesInThousands
# for only those rows for which salesInThousands is less than 100.
# Arrange the rows so that they appear in order of the salesInThousands.
sales |> 
  select(first, last, salesInThousands) |>
  filter(salesInThousands < 100) |>
  arrange(salesInThousands)
Error in arrange(filter(select(sales, first, last, salesInThousands), : could not find function "arrange"

25.4.1 arrange by more than one column

If you pass more than one column name to the arrange function

  • the rows are arranged by the first specified column

  • all rows that have the same value of the first column passed to arrange are further arranged within that cluster of rows by the 2nd column name that was passed to arrange

  • You can specify as many column names as you like in the arrange function. Column names that appear later in the call to arrange only effect the final order for those rows that share the same value for the earlier column names in the arrange function.

See the examples below.

# show the rows in alphabetical order based on the names of the salespeople.
sales |>
  arrange(last,first) |>
  print(n=Inf)
Error in arrange(sales, last, first): could not find function "arrange"
# Arrange by title, then by last names and finally by first names
sales |>
  arrange(title, last, first) |>
  print(n=Inf)
Error in arrange(sales, title, last, first): could not find function "arrange"
# It might help to read if you display the columns you are sorting by first.
# To do so, simply add a select to the pipeline. 
#
# In this example, it makes no difference
# if the select appears before or after the arrange function.
#
# Notice that the names are alphabetically arranged only with a group of
# rows that all have the same title.

sales |>
  select(title, last, first, everything()) |>
  arrange(title, last, first) |>
  print(n=Inf)
Error in arrange(select(sales, title, last, first, everything()), title, : could not find function "arrange"
# We can get even more fine grained. This time we will arrange
# by title, then by region, then by last names and finally by first names.
#
# Notice that the names are alphabetically arranged only with a group of
# rows that all have the same title and region.

sales |>
  select(title, region, last, first, everything()) |>
  arrange(title, region, last, first) |>
  print(n=Inf)
Error in arrange(select(sales, title, region, last, first, everything()), : could not find function "arrange"

25.4.2 desc(SOME_COLUMN) - arrange in descending order

Use the desc function to specify “descending” order. See the examples.

# In the following, the rows with the highest values
# of salesInThousands are at the top.

sales |> 
  filter(salesInThousands < 100) |>
  select(first, last, salesInThousands) |>
  arrange(desc(salesInThousands))  |>
  print(n=Inf)
Error in arrange(select(filter(sales, salesInThousands < 100), first, : could not find function "arrange"
# Arrange alphabetically by the region.
# Within each region, show the sales in decreasing order.
sales |>
  select(region, baseInThousands, everything()) |>
  arrange(region,  desc(baseInThousands)) |>
  print(n=Inf)
Error in arrange(select(sales, region, baseInThousands, everything()), : could not find function "arrange"
# Show the names, titles and salesInThousands for
# people who sell to the usa and for people who sell to europe.
#
# Arrange the results so that all titles are grouped together.
# Within the rows of a particular title sort the results in 
# descending order based on the salesInThousands so that the 
# rows with greater salesInThousands appear earlier.
#
# Only show salespeople for the usa and europe

sales |>
  select(first, last, title, salesInThousands, region) |>
  filter(region %in% c("usa", "europe")) |>
  arrange(title, desc(salesInThousands))
Error in arrange(filter(select(sales, first, last, title, salesInThousands, : could not find function "arrange"
# Careful - if you do not include region in the select , it won't work
# ERROR - object 'region' not found

sales |>
  select(first, last, title, salesInThousands) |>  # region is missing here
  filter(region %in% c("usa", "europe")) |>        # but you need it here
  arrange(title, desc(salesInThousands))
Error in arrange(filter(select(sales, first, last, title, salesInThousands), : could not find function "arrange"
# Rearrange the order of the function calls to get it to work

sales |>
  arrange(title, desc(salesInThousands)) |>   
  filter(region %in% c("usa", "europe")) |>       # we use region here
  select(first, last, title, salesInThousands)    # we can now exclude region
Error in select(filter(arrange(sales, title, desc(salesInThousands)), : could not find function "select"

25.5 mutate - add new columns, modify columns, delete columns

Use the mutate function to add new columns. These new columns can be based on existing data. See the examples below.

# All salespeople get a commission equal to 10% of their sales.
# Create a new column with the name commission that shows the value of their commission.

# The following code creates the new column as the last column in the tibble.
sales |>
  mutate(commission=0.10*salesInThousands, takeHome=baseInThousands + 0.10*salesInThousands ) |>
  print(width=Inf)
Error in mutate(sales, commission = 0.1 * salesInThousands, takeHome = baseInThousands + : could not find function "mutate"

25.5.1 .before and .after

Use the .before and .after argument to mutate to specify where the new columns should appear in the tibble. See the examples below.

# Place the newly created column at the beginning of the tibble,
# i.e. .before the 1st column.

sales |>
  mutate(commission=0.10*salesInThousands, .before=1)
Error in mutate(sales, commission = 0.1 * salesInThousands, .before = 1): could not find function "mutate"
# Place the newly created column as the 3rd column
# i.e. .after the 2nd column.
# (.before=4 would also work)

sales |>
  mutate(commission=0.10*salesInThousands, .after=2)
Error in mutate(sales, commission = 0.1 * salesInThousands, .after = 2): could not find function "mutate"
# .before and .after can also refer to specific columns
# For example the following also places the commission after the last name column
sales |>
  mutate(commission=0.10*salesInThousands, .after=last)
Error in mutate(sales, commission = 0.1 * salesInThousands, .after = last): could not find function "mutate"

25.5.2 a new column that depends on other new columns

The following code creates two new columns.

The commission column is created as in the last example.

The takeHome column is the total take home pay for the salesperson, i.e.  their baseInThousands plus the commission.

If you create both columns in a single call to mutate, you will need to repeat the calculation for the commission twice (see examples below). An alternative is to call the mutate function twice. For tibbles with many rows this can take longer to process but is less error prone when you write the code.

# All salespeople get a commission equal to 10% of their sales.
# Create a new column with the name commission that shows the value of their commission.
# Create another column called takeHome which has their total takehome pay.

# We can do it all in one call to mutate.
# However, that require repeating the code for calculating for the commission.
sales |>
  mutate(commission=0.10*salesInThousands, takeHome=baseInThousands + 0.10*salesInThousands, .before=1 ) |>
  print(width=Inf)
Error in mutate(sales, commission = 0.1 * salesInThousands, takeHome = baseInThousands + : could not find function "mutate"
# By separating the code into two calls to mutate, we can refer to the 
# newly created "commission" column when calculating the takeHome column.
# Note that the 2nd call to mutate below receives a tibble that already has a
# commission column as the first column. That is why .before=2 (and not .before=1)
sales |>
  mutate(commission=0.10*salesInThousands, .before=1) |>
  mutate(takeHome=baseInThousands + commission, .before=2) |>
  print(width=Inf)
Error in mutate(mutate(sales, commission = 0.1 * salesInThousands, .before = 1), : could not find function "mutate"

25.5.3 modifying existing columns

You can modify existing columns by specifying an existing column name in mutate.

25.5.4 deleting columns with mutate

You can delete existing columns by setting the value of the column to NULL.

25.6 distinct()

The distinct function eliminates any duplicate rows from a tibble. Only rows for which ALL of the values are equal to ALL the values in another row will be eliminated.

See the examples.

# Show the just the region for each row in the tibble
sales |>
  select (region) |>
  print(n=Inf)
Error in select(sales, region): could not find function "select"
# Same thing using distinct()
# Show just the distinct (ie. different) regions
sales |>
  select (region) |>
  distinct() |>       # show just the distinct rows
  print(n=Inf)
Error in distinct(select(sales, region)): could not find function "distinct"
###############################################################
# NOTE - all values in the row must be exactly the same
# for the distinct() function to eliminate the duplicate rows.
###############################################################

# Show the region and title for rows 
# whose region is either europe or asia.
# Arrange the output so they come up sorted.

sales |>
  select(region, title) |>
  filter(region %in% c('europe', 'asia')) |>
  arrange(region, title) |>
  print(n=Inf)
Error in arrange(filter(select(sales, region, title), region %in% c("europe", : could not find function "arrange"
# Same thing using distinct()
# Show just the distinct rows from the previous command

sales |>
  select(region, title) |>
  filter(region %in% c('europe', 'asia')) |>
  distinct() |>                                 # eliminate duplicate rows
  arrange(region, title) |>
  print(n=Inf)
Error in arrange(distinct(filter(select(sales, region, title), region %in% : could not find function "arrange"
# Notice that in the previous output, there were multiple
# rows that contained "europe", multiple rows that contained "asia"
# as well as multiple rows that contained each of the different titles.
# However - there were NO rows that contained exact duplicates of both
# the region and the title.
#
# The moral of the story is that distinct() will only elminate rows 
# that are EXACT duplicates in EVERY column.
sales |>
  select(region, title, everything()) |>
  arrange(region, title)
Error in arrange(select(sales, region, title, everything()), region, title): could not find function "arrange"
sales |>
  select(region, title) |>
  arrange(region, title)
Error in arrange(select(sales, region, title), region, title): could not find function "arrange"
sales |>
  select(region,title) |>
  arrange(region, title) |>
  distinct()
Error in distinct(arrange(select(sales, region, title), region, title)): could not find function "distinct"

25.7 summarize and group_by

The summarize() and group_by() functions work together. We’ll explain exactly how below …

25.7.1 aggregate functions

The summarize function is used to condense the contents of several rows of data into a single row of data. It does this by using functions such as min(), max(), mean(), etc. These functions can take several values and return a single value. Such functions are known as aggregate functions.

25.7.2 summarize (without group_by)

The summarize function is often preceded by a call to the group_by() function (which we will cover in the next section). group_by() is used to contol which rows of data will be affected by a subsequent call to the summarize() function. This will be explained in more detail below. For now, we will just focus on how summarize() works when it is NOT preceded by a call to group_by().

When summarize() is NOT preceded by a call to group_by() then the job of the summarize() function is to return a single row of summary by applying aggregate functions (such as min, max, mean, etc) to columns of the tibble.

See the examples below.

# mean salesInThousands (for all rows)
# Note that this returns a tibble with exactly one row and one column.
sales |>
 summarize(mean(salesInThousands))
Error in summarize(sales, mean(salesInThousands)): could not find function "summarize"
# We can change the column name as shown below
sales |>
 summarize(averageSales = mean(salesInThousands))
Error in summarize(sales, averageSales = mean(salesInThousands)): could not find function "summarize"
# We can filter the rows before calculating the mean results (or use any other of the dplyr functions)
#
# Show the mean sales for just the USA.
sales |>
  filter(region == "usa") |>
  summarize(usaSales = mean(salesInThousands))
Error in summarize(filter(sales, region == "usa"), usaSales = mean(salesInThousands)): could not find function "summarize"
# We can get more than one column in our summary tibble
sales |>
  summarize(meanSales = mean(salesInThousands), maxSales=max(salesInThousands), minSales=min(salesInThousands),
            maxBase = max(baseInThousands), minBase=min(baseInThousands))
Error in summarize(sales, meanSales = mean(salesInThousands), maxSales = max(salesInThousands), : could not find function "summarize"

25.7.3 The n() function

The n() function returns the number of rows.

The n_distinct() function returns the number of rows that have distinct (i.e. different) values.

# Show all of the rows ordered by region and title
sales |>
  select(region, title, everything()) |>
  print(n=Inf)
Error in select(sales, region, title, everything()): could not find function "select"
# n() - returns the number of rows in the tibble
# n_distinct(COL1, COL2, ...) - returns the number of rows for which 
#   the specified columns taken all together are distinct among all other rows.
sales |>
  select(region, title, everything()) |>
  summarize(n(), n_distinct(region), n_distinct(title), n_distinct(region, title))
Error in summarize(select(sales, region, title, everything()), n(), n_distinct(region), : could not find function "summarize"

25.8 group_by and summarize (or summarise)

The examples above for summarize all return a single row of data. This is because they are summarizing all of the rows for the tibble that summarize received. Note that this is true even if the rows have been filtered first. For example:

# The following returns just one row.
# The summarize function is summarizing ALL of the rows 
# for the tibble that it was given.

sales %>% 
  filter(region == "usa") |>
  summarize(numberOfRows=n(), maxUsaSales = max(salesInThousands), minUsaSales=min(salesInThousands))
Error in summarize(sales %>% filter(region == "usa"), numberOfRows = n(), : could not find function "summarize"

The functions, group_by and summarize functions are designed to work together. The job of the group_by function is to separate the rows of data into different “groups” that will later be processed separately by the summarize function.

If you use a group_by function in general it is followed a call to summarize()

To understand the examples below, it is helpful to first see the rows sorted by region and title

sales |>
  select(region, title, everything()) |>
  arrange(region, title)
Error in arrange(select(sales, region, title, everything()), region, title): could not find function "arrange"

Notice how the rows that have the same region can be thought of as a “group” of rows for that region.

See the examples below

# Show summary information for each of the different regions.
# Each row in the output corresponds only to the rows for the region shown on that row of output.
sales |> 
  group_by(region) |>    # create a different group for each region
  summarize(numberOfRows=n(), mean(baseInThousands), mean(salesInThousands))
Error in summarize(group_by(sales, region), numberOfRows = n(), mean(baseInThousands), : could not find function "summarize"

Similarly we can group the rows by those with the same value in the title column.

# Show summary information for each of the different titles.
# Each row in the output corresponds only to the rows for the title shown on that row of output.
sales |> 
  group_by(title) |>    # create a different group for each title
  summarize(numberOfRows=n(), mean(baseInThousands), mean(salesInThousands))
Error in summarize(group_by(sales, title), numberOfRows = n(), mean(baseInThousands), : could not find function "summarize"

25.8.1 group_by without summarize has no noticeable effect

Note that on its own, the group_by function doesn’t appear to do anything. It only has a tangible effect when a summarize function is called after the group_by function

# group_by without a subsequent call to summarize doesn't appear to do anything special
sales |> 
  group_by(title)   
Error in group_by(sales, title): could not find function "group_by"

25.8.2 group_by more than one column

A call to group_by that has more than one column creates a separate group for the rows that have the same values for all of the specified columns. To understand this look at the following output:

asia_europe_sales = 
  sales |>
    select(region, title, baseInThousands) |>
    filter(region %in% c('europe', 'asia')) |>
    arrange(region, title)
Error in arrange(filter(select(sales, region, title, baseInThousands), : could not find function "arrange"
asia_europe_sales  |>
    print(n=Inf)
Error: object 'asia_europe_sales' not found

We can treat the rows that have the same values for both the region and title as a single group of rows. For example we can treat all of the rows for “asia Salesperson” as a group and the rows for “eroupe Manager” as a separate group. The output of the following command returns one row for each group.

# create groups based on region and sales
asia_europe_sales |> 
  group_by(region, title) |>
  summarize(n(), max(baseInThousands), mean(baseInThousands))|>
  print(n=Inf)
Error in summarize(group_by(asia_europe_sales, region, title), n(), max(baseInThousands), : could not find function "summarize"
Note

You might be wondering about the somewhat cryptic message from the previous command:

summarise() has grouped output by ‘region’. You can override using the .groups argument.

When group_by is called with more than one column name the subsequent summarize call will generate a message similar to the one shown above. You can ignore this message without any issue. If you want to understand it better - look at this.

25.8.3 order of columns in group_by is irrelevant

Note that it makes no difference which order you specify the column names in the call to group by. The following command reverses the order of the columns in the call to group_by but returns the exact same result as the previous command.

# create groups based on region and sales
asia_europe_sales |> 
  group_by(title, region) |>   # the order of the columns in group_by makes no difference
  summarize(n(), max(baseInThousands), mean(baseInThousands))|>
  print(n=Inf)
Error in summarize(group_by(asia_europe_sales, title, region), n(), max(baseInThousands), : could not find function "summarize"

25.8.4 some other aggregate functions

As mentioned above, an “aggregate function” is a function that can take several values as input but returns a single value. It is these types of functions that are used by summarize()

The following are some other commonly used aggregate functions that can be used in summarize. There are many, many others.

  • Statistics: mean(), median(), sd(), IQR(), mad()

  • Range: min(), max(),

  • Position: first(), last(), nth(),

  • Counting: n(), n_distinct()

  • Logical: any(), all()