Use the select function to get just the specified columns.
The first argument to the select function is named .data (notice the period before .data). It is used to specify which tibble or dataframe you are working with. The other arguments are used to specify which columns you’d like to see from the dataframe/tibble. Note that there are NO quotes around the column names (this differs from the base-r way of specifying column names in a dataframe).
library(dplyr, warn.conflicts=FALSE, quietly=TRUE)# Show just the specified columns from the sales tibbleselect(.data=sales, first, last, salesInThousands)
# A tibble: 24 × 3
first last salesInThousands
<chr> <chr> <dbl>
1 Joe Smith 100
2 Sam Lee 200
3 Sue Sallinger 150
4 Barb Brown 79
5 Jack Martin 40
6 Kate Carey 69
7 Emy Parsons 87
8 Monica Oja 98
9 Jim Kuruzovich 140
10 Zeke Cohen 250
# ℹ 14 more rows
The columns appear in whatever order you specified in the select function. If you want to bring a few columns to the front followed by the rest of the columns, you can use the everything() function.
# Show the specified columns first, followed by the other columns.select(.data=sales, first, last, everything())
# A tibble: 24 × 7
first last title region baseInThousands salesInThousands yearsWithCompany
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Joe Smith Mana… usa 40 100 10
2 Sam Lee Sale… usa 40 200 6
3 Sue Sallin… Sale… asia 30 150 10
4 Barb Brown Sr. … asia 35 79 4
5 Jack Martin Mana… europe 20 40 3
6 Kate Carey Sr. … europe 20 69 2
7 Emy Parsons Sale… africa 25 87 3
8 Monica Oja Sale… africa 35 98 10
9 Jim Kuruzo… Sr. … usa 35 140 3
10 Zeke Cohen Mana… usa 30 250 10
# ℹ 14 more rows
26.2 filter()
The filter function is used to retrieve the rows from a tibble that match a condition.
The first argument of the filter function is also .data. Just as with the select function this argument for the filter function is expected to be a tibble or a dataframe.
The next argument to filter is a logical expression. You can use the names of the columns of the tibble/dataframe in the expression. filter will examine each row of the tibble/dataframe and only return those rows whose values match the stated condition.
The following code displays only those rows whose salesInThousands is less than 100.
filter(.data=sales, salesInThousands <100)
# A tibble: 5 × 7
first last title region baseInThousands salesInThousands yearsWithCompany
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Barb Brown Sr. S… asia 35 79 4
2 Jack Martin Manag… europe 20 40 3
3 Kate Carey Sr. S… europe 20 69 2
4 Emy Parsons Sales… africa 25 87 3
5 Monica Oja Sales… africa 35 98 10
26.3 Using the dplyr functions in a “pipeline”
The tidyverse functions are designed so that you can be apply them one after the other. The clearest way to do that is to “pipe” the output of one function into the input of another.
The following pipeline first applies select to generate a new tibble that only contains the specified columns. Then the filter function then chooses only those rows that meet the specified condition.
sales |>select(first, last, salesInThousands) |>filter(salesInThousands <100)
# A tibble: 5 × 3
first last salesInThousands
<chr> <chr> <dbl>
1 Barb Brown 79
2 Jack Martin 40
3 Kate Carey 69
4 Emy Parsons 87
5 Monica Oja 98
# In this example we could apply the filter before select.# It technically doesn't matter which order you apply the functions in # as long as you realize that the output of one function is fed into # the input of the next function.sales |>filter(salesInThousands <100) |>select(first, last, salesInThousands)
# A tibble: 5 × 3
first last salesInThousands
<chr> <chr> <dbl>
1 Barb Brown 79
2 Jack Martin 40
3 Kate Carey 69
4 Emy Parsons 87
5 Monica Oja 98
# You can save the results in a variable as followslowSales = sales |>filter(salesInThousands <100) |>select(first, last, salesInThousands) # Now the lowSales variable contains the resultslowSales
# A tibble: 5 × 3
first last salesInThousands
<chr> <chr> <dbl>
1 Barb Brown 79
2 Jack Martin 40
3 Kate Carey 69
4 Emy Parsons 87
5 Monica Oja 98
# You can continue to use the new variable for further processinglowSales |>filter(salesInThousands >80)
# A tibble: 2 × 3
first last salesInThousands
<chr> <chr> <dbl>
1 Emy Parsons 87
2 Monica Oja 98
The dplyr package contains many different functions. A small group of these functions are known as the dplyr “verbs” and constitue the primary functions in the package. There are many other “helper” functions in the package.
Each “verb” function is used to manipulate tibbles (or dataframes) in a very specific way. When the different verb functions are combined together in a pipeline you can engineer very sophisticated manipulations of the data in the tibble/dataframe.
group_by(), ungroup(), sumarise() (same as summarize() with a “z”) - these work together with summary functions such as mean(), max(), min(), n() and others to summarize the data in the columns.
NOTE - The concept of “verb” functions was inspired greatly by the “clauses” the SQL SELECT statement. We will study SQL later.
26.4 arrange()
The arrange function reorders the rows based on specified sorting rules.
# Arrange all of the rows based on salesInThousands. sales |>arrange(salesInThousands)
# A tibble: 24 × 7
first last title region baseInThousands salesInThousands yearsWithCompany
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Jack Martin Mana… europe 20 40 3
2 Kate Carey Sr. … europe 20 69 2
3 Barb Brown Sr. … asia 35 79 4
4 Emy Parsons Sale… africa 25 87 3
5 Monica Oja Sale… africa 35 98 10
6 Joe Smith Mana… usa 40 100 10
7 Jack Aames Sale… usa 43 105 4
8 Larry Green Sr. … europe 20 113 4
9 Zeke Smith Sale… asia 20 120 3
10 Jim Kuruzo… Sr. … usa 35 140 3
# ℹ 14 more rows
# Select the first, last names and salesInThousands# for only those rows for which salesInThousands is less than 100.# Arrange the rows so that they appear in order of the salesInThousands.sales |>select(first, last, salesInThousands) |>filter(salesInThousands <100) |>arrange(salesInThousands)
# A tibble: 5 × 3
first last salesInThousands
<chr> <chr> <dbl>
1 Jack Martin 40
2 Kate Carey 69
3 Barb Brown 79
4 Emy Parsons 87
5 Monica Oja 98
26.4.1 arrange by more than one column
If you pass more than one column name to the arrange function
the rows are arranged by the first specified column
all rows that have the same value of the first column passed to arrange are further arranged within that cluster of rows by the 2nd column name that was passed to arrange
You can specify as many column names as you like in the arrange function. Column names that appear later in the call to arrange only effect the final order for those rows that share the same value for the earlier column names in the arrange function.
See the examples below.
# show the rows in alphabetical order based on the names of the salespeople.sales |>arrange(last,first) |>print(n=Inf)
# A tibble: 24 × 7
first last title region baseInThousands salesInThousands yearsWithCompany
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Barb Aames Sale… usa 21 255 7
2 Jack Aames Sale… usa 43 105 4
3 Sue Aames Sr. … africa 35 600 10
4 Hugh Black Sr. … africa 40 261 9
5 Barb Brown Sr. … asia 35 79 4
6 Jim Brown Sale… europe 50 167 2
7 Kate Carey Sr. … europe 20 69 2
8 Zeke Cohen Mana… usa 30 250 10
9 Larry Green Sr. … europe 20 113 4
10 Jim Kuruzo… Sr. … usa 35 140 3
11 Sam Lee Sale… usa 40 200 6
12 Sam Lincoln Mana… europe 30 500 2
13 Amy Markov… Sale… asia 46 340 3
14 Jack Martin Mana… europe 20 40 3
15 Monica Oja Sale… africa 35 98 10
16 Emy Parsons Sale… africa 25 87 3
17 Sue Sallin… Sale… asia 30 150 10
18 Joe Smith Mana… usa 40 100 10
19 Zeke Smith Sale… asia 20 120 3
20 Joe Washin… Sr. … europe 33 370 2
21 Laura White Mana… africa 20 281 8
22 Emy Zeitch… Mana… asia 34 166 4
23 Kate Zeitch… Sr. … usa 50 187 4
24 Monica Zeitch… Sale… asia 23 184 1
# Arrange by title, then by last names and finally by first namessales |>arrange(title, last, first) |>print(n=Inf)
# A tibble: 24 × 7
first last title region baseInThousands salesInThousands yearsWithCompany
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Zeke Cohen Mana… usa 30 250 10
2 Sam Lincoln Mana… europe 30 500 2
3 Jack Martin Mana… europe 20 40 3
4 Joe Smith Mana… usa 40 100 10
5 Laura White Mana… africa 20 281 8
6 Emy Zeitch… Mana… asia 34 166 4
7 Barb Aames Sale… usa 21 255 7
8 Jack Aames Sale… usa 43 105 4
9 Jim Brown Sale… europe 50 167 2
10 Sam Lee Sale… usa 40 200 6
11 Amy Markov… Sale… asia 46 340 3
12 Monica Oja Sale… africa 35 98 10
13 Emy Parsons Sale… africa 25 87 3
14 Sue Sallin… Sale… asia 30 150 10
15 Zeke Smith Sale… asia 20 120 3
16 Monica Zeitch… Sale… asia 23 184 1
17 Sue Aames Sr. … africa 35 600 10
18 Hugh Black Sr. … africa 40 261 9
19 Barb Brown Sr. … asia 35 79 4
20 Kate Carey Sr. … europe 20 69 2
21 Larry Green Sr. … europe 20 113 4
22 Jim Kuruzo… Sr. … usa 35 140 3
23 Joe Washin… Sr. … europe 33 370 2
24 Kate Zeitch… Sr. … usa 50 187 4
# It might help to read if you display the columns you are sorting by first.# To do so, simply add a select to the pipeline. ## In this example, it makes no difference# if the select appears before or after the arrange function.## Notice that the names are alphabetically arranged only with a group of# rows that all have the same title.sales |>select(title, last, first, everything()) |>arrange(title, last, first) |>print(n=Inf)
# A tibble: 24 × 7
title last first region baseInThousands salesInThousands yearsWithCompany
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Manager Cohen Zeke usa 30 250 10
2 Manager Linc… Sam europe 30 500 2
3 Manager Mart… Jack europe 20 40 3
4 Manager Smith Joe usa 40 100 10
5 Manager White Laura africa 20 281 8
6 Manager Zeit… Emy asia 34 166 4
7 Salespe… Aames Barb usa 21 255 7
8 Salespe… Aames Jack usa 43 105 4
9 Salespe… Brown Jim europe 50 167 2
10 Salespe… Lee Sam usa 40 200 6
11 Salespe… Mark… Amy asia 46 340 3
12 Salespe… Oja Moni… africa 35 98 10
13 Salespe… Pars… Emy africa 25 87 3
14 Salespe… Sall… Sue asia 30 150 10
15 Salespe… Smith Zeke asia 20 120 3
16 Salespe… Zeit… Moni… asia 23 184 1
17 Sr. Sal… Aames Sue africa 35 600 10
18 Sr. Sal… Black Hugh africa 40 261 9
19 Sr. Sal… Brown Barb asia 35 79 4
20 Sr. Sal… Carey Kate europe 20 69 2
21 Sr. Sal… Green Larry europe 20 113 4
22 Sr. Sal… Kuru… Jim usa 35 140 3
23 Sr. Sal… Wash… Joe europe 33 370 2
24 Sr. Sal… Zeit… Kate usa 50 187 4
# We can get even more fine grained. This time we will arrange# by title, then by region, then by last names and finally by first names.## Notice that the names are alphabetically arranged only with a group of# rows that all have the same title and region.sales |>select(title, region, last, first, everything()) |>arrange(title, region, last, first) |>print(n=Inf)
# A tibble: 24 × 7
title region last first baseInThousands salesInThousands yearsWithCompany
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Manager africa White Laura 20 281 8
2 Manager asia Zeit… Emy 34 166 4
3 Manager europe Linc… Sam 30 500 2
4 Manager europe Mart… Jack 20 40 3
5 Manager usa Cohen Zeke 30 250 10
6 Manager usa Smith Joe 40 100 10
7 Salespe… africa Oja Moni… 35 98 10
8 Salespe… africa Pars… Emy 25 87 3
9 Salespe… asia Mark… Amy 46 340 3
10 Salespe… asia Sall… Sue 30 150 10
11 Salespe… asia Smith Zeke 20 120 3
12 Salespe… asia Zeit… Moni… 23 184 1
13 Salespe… europe Brown Jim 50 167 2
14 Salespe… usa Aames Barb 21 255 7
15 Salespe… usa Aames Jack 43 105 4
16 Salespe… usa Lee Sam 40 200 6
17 Sr. Sal… africa Aames Sue 35 600 10
18 Sr. Sal… africa Black Hugh 40 261 9
19 Sr. Sal… asia Brown Barb 35 79 4
20 Sr. Sal… europe Carey Kate 20 69 2
21 Sr. Sal… europe Green Larry 20 113 4
22 Sr. Sal… europe Wash… Joe 33 370 2
23 Sr. Sal… usa Kuru… Jim 35 140 3
24 Sr. Sal… usa Zeit… Kate 50 187 4
26.4.2 desc(SOME_COLUMN) - arrange in descending order
Use the desc function to specify “descending” order. See the examples.
# In the following, the rows with the highest values# of salesInThousands are at the top.sales |>filter(salesInThousands <100) |>select(first, last, salesInThousands) |>arrange(desc(salesInThousands)) |>print(n=Inf)
# A tibble: 5 × 3
first last salesInThousands
<chr> <chr> <dbl>
1 Monica Oja 98
2 Emy Parsons 87
3 Barb Brown 79
4 Kate Carey 69
5 Jack Martin 40
# Arrange alphabetically by the region.# Within each region, show the sales in decreasing order.sales |>select(region, baseInThousands, everything()) |>arrange(region, desc(baseInThousands)) |>print(n=Inf)
# A tibble: 24 × 7
region baseInThousands first last title salesInThousands yearsWithCompany
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 africa 40 Hugh Black Sr. … 261 9
2 africa 35 Monica Oja Sale… 98 10
3 africa 35 Sue Aames Sr. … 600 10
4 africa 25 Emy Parsons Sale… 87 3
5 africa 20 Laura White Mana… 281 8
6 asia 46 Amy Markov… Sale… 340 3
7 asia 35 Barb Brown Sr. … 79 4
8 asia 34 Emy Zeitch… Mana… 166 4
9 asia 30 Sue Sallin… Sale… 150 10
10 asia 23 Monica Zeitch… Sale… 184 1
11 asia 20 Zeke Smith Sale… 120 3
12 europe 50 Jim Brown Sale… 167 2
13 europe 33 Joe Washin… Sr. … 370 2
14 europe 30 Sam Lincoln Mana… 500 2
15 europe 20 Jack Martin Mana… 40 3
16 europe 20 Kate Carey Sr. … 69 2
17 europe 20 Larry Green Sr. … 113 4
18 usa 50 Kate Zeitch… Sr. … 187 4
19 usa 43 Jack Aames Sale… 105 4
20 usa 40 Joe Smith Mana… 100 10
21 usa 40 Sam Lee Sale… 200 6
22 usa 35 Jim Kuruzo… Sr. … 140 3
23 usa 30 Zeke Cohen Mana… 250 10
24 usa 21 Barb Aames Sale… 255 7
# Show the names, titles and salesInThousands for# people who sell to the usa and for people who sell to europe.## Arrange the results so that all titles are grouped together.# Within the rows of a particular title sort the results in # descending order based on the salesInThousands so that the # rows with greater salesInThousands appear earlier.## Only show salespeople for the usa and europesales |>select(first, last, title, salesInThousands, region) |>filter(region %in%c("usa", "europe")) |>arrange(title, desc(salesInThousands))
# A tibble: 13 × 5
first last title salesInThousands region
<chr> <chr> <chr> <dbl> <chr>
1 Sam Lincoln Manager 500 europe
2 Zeke Cohen Manager 250 usa
3 Joe Smith Manager 100 usa
4 Jack Martin Manager 40 europe
5 Barb Aames Salesperson 255 usa
6 Sam Lee Salesperson 200 usa
7 Jim Brown Salesperson 167 europe
8 Jack Aames Salesperson 105 usa
9 Joe Washington Sr. Salesperson 370 europe
10 Kate Zeitchik Sr. Salesperson 187 usa
11 Jim Kuruzovich Sr. Salesperson 140 usa
12 Larry Green Sr. Salesperson 113 europe
13 Kate Carey Sr. Salesperson 69 europe
# Careful - if you do not include region in the select , it won't work# ERROR - object 'region' not foundsales |>select(first, last, title, salesInThousands) |># region is missing herefilter(region %in%c("usa", "europe")) |># but you need it herearrange(title, desc(salesInThousands))
Error in `filter()`:
ℹ In argument: `region %in% c("usa", "europe")`.
Caused by error:
! object 'region' not found
# Rearrange the order of the function calls to get it to worksales |>arrange(title, desc(salesInThousands)) |>filter(region %in%c("usa", "europe")) |># we use region hereselect(first, last, title, salesInThousands) # we can now exclude region
# A tibble: 13 × 4
first last title salesInThousands
<chr> <chr> <chr> <dbl>
1 Sam Lincoln Manager 500
2 Zeke Cohen Manager 250
3 Joe Smith Manager 100
4 Jack Martin Manager 40
5 Barb Aames Salesperson 255
6 Sam Lee Salesperson 200
7 Jim Brown Salesperson 167
8 Jack Aames Salesperson 105
9 Joe Washington Sr. Salesperson 370
10 Kate Zeitchik Sr. Salesperson 187
11 Jim Kuruzovich Sr. Salesperson 140
12 Larry Green Sr. Salesperson 113
13 Kate Carey Sr. Salesperson 69
26.5 mutate - add new columns, modify columns, delete columns
Use the mutate function to add new columns. These new columns can be based on existing data. See the examples below.
# All salespeople get a commission equal to 10% of their sales.# Create a new column with the name commission that shows the value of their commission.# The following code creates the new column as the last column in the tibble.sales |>mutate(commission=0.10*salesInThousands, takeHome=baseInThousands +0.10*salesInThousands ) |>print(width=Inf)
# A tibble: 24 × 9
first last title region baseInThousands salesInThousands
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Joe Smith Manager usa 40 100
2 Sam Lee Salesperson usa 40 200
3 Sue Sallinger Salesperson asia 30 150
4 Barb Brown Sr. Salesperson asia 35 79
5 Jack Martin Manager europe 20 40
6 Kate Carey Sr. Salesperson europe 20 69
7 Emy Parsons Salesperson africa 25 87
8 Monica Oja Salesperson africa 35 98
9 Jim Kuruzovich Sr. Salesperson usa 35 140
10 Zeke Cohen Manager usa 30 250
yearsWithCompany commission takeHome
<dbl> <dbl> <dbl>
1 10 10 50
2 6 20 60
3 10 15 45
4 4 7.9 42.9
5 3 4 24
6 2 6.9 26.9
7 3 8.7 33.7
8 10 9.8 44.8
9 3 14 49
10 10 25 55
# ℹ 14 more rows
26.5.1 .before and .after
Use the .before and .after argument to mutate to specify where the new columns should appear in the tibble. See the examples below.
# Place the newly created column at the beginning of the tibble,# i.e. .before the 1st column.sales |>mutate(commission=0.10*salesInThousands, .before=1)
# A tibble: 24 × 8
commission first last title region baseInThousands salesInThousands
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 10 Joe Smith Manager usa 40 100
2 20 Sam Lee Salespe… usa 40 200
3 15 Sue Sallinger Salespe… asia 30 150
4 7.9 Barb Brown Sr. Sal… asia 35 79
5 4 Jack Martin Manager europe 20 40
6 6.9 Kate Carey Sr. Sal… europe 20 69
7 8.7 Emy Parsons Salespe… africa 25 87
8 9.8 Monica Oja Salespe… africa 35 98
9 14 Jim Kuruzovich Sr. Sal… usa 35 140
10 25 Zeke Cohen Manager usa 30 250
# ℹ 14 more rows
# ℹ 1 more variable: yearsWithCompany <dbl>
# Place the newly created column as the 3rd column# i.e. .after the 2nd column.# (.before=4 would also work)sales |>mutate(commission=0.10*salesInThousands, .after=2)
# A tibble: 24 × 8
first last commission title region baseInThousands salesInThousands
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 Joe Smith 10 Manager usa 40 100
2 Sam Lee 20 Salespe… usa 40 200
3 Sue Sallinger 15 Salespe… asia 30 150
4 Barb Brown 7.9 Sr. Sal… asia 35 79
5 Jack Martin 4 Manager europe 20 40
6 Kate Carey 6.9 Sr. Sal… europe 20 69
7 Emy Parsons 8.7 Salespe… africa 25 87
8 Monica Oja 9.8 Salespe… africa 35 98
9 Jim Kuruzovich 14 Sr. Sal… usa 35 140
10 Zeke Cohen 25 Manager usa 30 250
# ℹ 14 more rows
# ℹ 1 more variable: yearsWithCompany <dbl>
# .before and .after can also refer to specific columns# For example the following also places the commission after the last name columnsales |>mutate(commission=0.10*salesInThousands, .after=last)
# A tibble: 24 × 8
first last commission title region baseInThousands salesInThousands
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 Joe Smith 10 Manager usa 40 100
2 Sam Lee 20 Salespe… usa 40 200
3 Sue Sallinger 15 Salespe… asia 30 150
4 Barb Brown 7.9 Sr. Sal… asia 35 79
5 Jack Martin 4 Manager europe 20 40
6 Kate Carey 6.9 Sr. Sal… europe 20 69
7 Emy Parsons 8.7 Salespe… africa 25 87
8 Monica Oja 9.8 Salespe… africa 35 98
9 Jim Kuruzovich 14 Sr. Sal… usa 35 140
10 Zeke Cohen 25 Manager usa 30 250
# ℹ 14 more rows
# ℹ 1 more variable: yearsWithCompany <dbl>
26.5.2 a new column that depends on other new columns
The following code creates two new columns.
The commission column is created as in the last example.
The takeHome column is the total take home pay for the salesperson, i.e. their baseInThousands plus the commission.
If you create both columns in a single call to mutate, you will need to repeat the calculation for the commission twice (see examples below). An alternative is to call the mutate function twice. For tibbles with many rows this can take longer to process but is less error prone when you write the code.
# All salespeople get a commission equal to 10% of their sales.# Create a new column with the name commission that shows the value of their commission.# Create another column called takeHome which has their total takehome pay.# We can do it all in one call to mutate.# However, that require repeating the code for calculating for the commission.sales |>mutate(commission=0.10*salesInThousands, takeHome=baseInThousands +0.10*salesInThousands, .before=1 ) |>print(width=Inf)
# A tibble: 24 × 9
commission takeHome first last title region baseInThousands
<dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 10 50 Joe Smith Manager usa 40
2 20 60 Sam Lee Salesperson usa 40
3 15 45 Sue Sallinger Salesperson asia 30
4 7.9 42.9 Barb Brown Sr. Salesperson asia 35
5 4 24 Jack Martin Manager europe 20
6 6.9 26.9 Kate Carey Sr. Salesperson europe 20
7 8.7 33.7 Emy Parsons Salesperson africa 25
8 9.8 44.8 Monica Oja Salesperson africa 35
9 14 49 Jim Kuruzovich Sr. Salesperson usa 35
10 25 55 Zeke Cohen Manager usa 30
salesInThousands yearsWithCompany
<dbl> <dbl>
1 100 10
2 200 6
3 150 10
4 79 4
5 40 3
6 69 2
7 87 3
8 98 10
9 140 3
10 250 10
# ℹ 14 more rows
# By separating the code into two calls to mutate, we can refer to the # newly created "commission" column when calculating the takeHome column.# Note that the 2nd call to mutate below receives a tibble that already has a# commission column as the first column. That is why .before=2 (and not .before=1)sales |>mutate(commission=0.10*salesInThousands, .before=1) |>mutate(takeHome=baseInThousands + commission, .before=2) |>print(width=Inf)
# A tibble: 24 × 9
commission takeHome first last title region baseInThousands
<dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 10 50 Joe Smith Manager usa 40
2 20 60 Sam Lee Salesperson usa 40
3 15 45 Sue Sallinger Salesperson asia 30
4 7.9 42.9 Barb Brown Sr. Salesperson asia 35
5 4 24 Jack Martin Manager europe 20
6 6.9 26.9 Kate Carey Sr. Salesperson europe 20
7 8.7 33.7 Emy Parsons Salesperson africa 25
8 9.8 44.8 Monica Oja Salesperson africa 35
9 14 49 Jim Kuruzovich Sr. Salesperson usa 35
10 25 55 Zeke Cohen Manager usa 30
salesInThousands yearsWithCompany
<dbl> <dbl>
1 100 10
2 200 6
3 150 10
4 79 4
5 40 3
6 69 2
7 87 3
8 98 10
9 140 3
10 250 10
# ℹ 14 more rows
26.5.3 modifying existing columns
You can modify existing columns by specifying an existing column name in mutate.
26.5.4 deleting columns with mutate
You can delete existing columns by setting the value of the column to NULL.
26.6 distinct()
The distinct function eliminates any duplicate rows from a tibble. Only rows for which ALL of the values are equal to ALL the values in another row will be eliminated.
See the examples.
# Show the just the region for each row in the tibblesales |>select (region) |>print(n=Inf)
# A tibble: 24 × 1
region
<chr>
1 usa
2 usa
3 asia
4 asia
5 europe
6 europe
7 africa
8 africa
9 usa
10 usa
11 asia
12 asia
13 europe
14 europe
15 africa
16 usa
17 usa
18 usa
19 asia
20 asia
21 europe
22 europe
23 africa
24 africa
# Same thing using distinct()# Show just the distinct (ie. different) regionssales |>select (region) |>distinct() |># show just the distinct rowsprint(n=Inf)
# A tibble: 4 × 1
region
<chr>
1 usa
2 asia
3 europe
4 africa
################################################################ NOTE - all values in the row must be exactly the same# for the distinct() function to eliminate the duplicate rows.################################################################ Show the region and title for rows # whose region is either europe or asia.# Arrange the output so they come up sorted.sales |>select(region, title) |>filter(region %in%c('europe', 'asia')) |>arrange(region, title) |>print(n=Inf)
# A tibble: 12 × 2
region title
<chr> <chr>
1 asia Manager
2 asia Salesperson
3 asia Salesperson
4 asia Salesperson
5 asia Salesperson
6 asia Sr. Salesperson
7 europe Manager
8 europe Manager
9 europe Salesperson
10 europe Sr. Salesperson
11 europe Sr. Salesperson
12 europe Sr. Salesperson
# Same thing using distinct()# Show just the distinct rows from the previous commandsales |>select(region, title) |>filter(region %in%c('europe', 'asia')) |>distinct() |># eliminate duplicate rowsarrange(region, title) |>print(n=Inf)
# A tibble: 6 × 2
region title
<chr> <chr>
1 asia Manager
2 asia Salesperson
3 asia Sr. Salesperson
4 europe Manager
5 europe Salesperson
6 europe Sr. Salesperson
# Notice that in the previous output, there were multiple# rows that contained "europe", multiple rows that contained "asia"# as well as multiple rows that contained each of the different titles.# However - there were NO rows that contained exact duplicates of both# the region and the title.## The moral of the story is that distinct() will only elminate rows # that are EXACT duplicates in EVERY column.
# A tibble: 24 × 7
region title first last baseInThousands salesInThousands yearsWithCompany
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 africa Manager Laura White 20 281 8
2 africa Salespe… Emy Pars… 25 87 3
3 africa Salespe… Moni… Oja 35 98 10
4 africa Sr. Sal… Sue Aames 35 600 10
5 africa Sr. Sal… Hugh Black 40 261 9
6 asia Manager Emy Zeit… 34 166 4
7 asia Salespe… Sue Sall… 30 150 10
8 asia Salespe… Zeke Smith 20 120 3
9 asia Salespe… Amy Mark… 46 340 3
10 asia Salespe… Moni… Zeit… 23 184 1
# ℹ 14 more rows
# A tibble: 24 × 2
region title
<chr> <chr>
1 africa Manager
2 africa Salesperson
3 africa Salesperson
4 africa Sr. Salesperson
5 africa Sr. Salesperson
6 asia Manager
7 asia Salesperson
8 asia Salesperson
9 asia Salesperson
10 asia Salesperson
# ℹ 14 more rows
# A tibble: 12 × 2
region title
<chr> <chr>
1 africa Manager
2 africa Salesperson
3 africa Sr. Salesperson
4 asia Manager
5 asia Salesperson
6 asia Sr. Salesperson
7 europe Manager
8 europe Salesperson
9 europe Sr. Salesperson
10 usa Manager
11 usa Salesperson
12 usa Sr. Salesperson
26.7 summarize and group_by
The summarize() and group_by() functions work together. We’ll explain exactly how below …
26.7.1 aggregate functions
The summarize function is used to condense the contents of several rows of data into a single row of data. It does this by using functions such as min(), max(), mean(), etc. These functions can take several values and return a single value. Such functions are known as aggregate functions.
26.7.2 summarize (without group_by)
The summarize function is often preceded by a call to the group_by() function (which we will cover in the next section). group_by() is used to contol which rows of data will be affected by a subsequent call to the summarize() function. This will be explained in more detail below. For now, we will just focus on how summarize() works when it is NOT preceded by a call to group_by().
When summarize() is NOT preceded by a call to group_by() then the job of the summarize() function is to return a single row of summary by applying aggregate functions (such as min, max, mean, etc) to columns of the tibble.
See the examples below.
# mean salesInThousands (for all rows)# Note that this returns a tibble with exactly one row and one column.sales |>summarize(mean(salesInThousands))
# A tibble: 1 × 1
`mean(salesInThousands)`
<dbl>
1 203.
# We can change the column name as shown belowsales |>summarize(averageSales =mean(salesInThousands))
# A tibble: 1 × 1
averageSales
<dbl>
1 203.
# We can filter the rows before calculating the mean results (or use any other of the dplyr functions)## Show the mean sales for just the USA.sales |>filter(region =="usa") |>summarize(usaSales =mean(salesInThousands))
# A tibble: 1 × 1
usaSales
<dbl>
1 177.
# We can get more than one column in our summary tibblesales |>summarize(meanSales =mean(salesInThousands), maxSales=max(salesInThousands), minSales=min(salesInThousands),maxBase =max(baseInThousands), minBase=min(baseInThousands))
The n_distinct() function returns the number of rows that have distinct (i.e. different) values.
# Show all of the rows ordered by region and titlesales |>select(region, title, everything()) |>print(n=Inf)
# A tibble: 24 × 7
region title first last baseInThousands salesInThousands yearsWithCompany
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 usa Manager Joe Smith 40 100 10
2 usa Salespe… Sam Lee 40 200 6
3 asia Salespe… Sue Sall… 30 150 10
4 asia Sr. Sal… Barb Brown 35 79 4
5 europe Manager Jack Mart… 20 40 3
6 europe Sr. Sal… Kate Carey 20 69 2
7 africa Salespe… Emy Pars… 25 87 3
8 africa Salespe… Moni… Oja 35 98 10
9 usa Sr. Sal… Jim Kuru… 35 140 3
10 usa Manager Zeke Cohen 30 250 10
11 asia Salespe… Zeke Smith 20 120 3
12 asia Salespe… Amy Mark… 46 340 3
13 europe Sr. Sal… Joe Wash… 33 370 2
14 europe Manager Sam Linc… 30 500 2
15 africa Sr. Sal… Sue Aames 35 600 10
16 usa Salespe… Barb Aames 21 255 7
17 usa Salespe… Jack Aames 43 105 4
18 usa Sr. Sal… Kate Zeit… 50 187 4
19 asia Manager Emy Zeit… 34 166 4
20 asia Salespe… Moni… Zeit… 23 184 1
21 europe Salespe… Jim Brown 50 167 2
22 europe Sr. Sal… Larry Green 20 113 4
23 africa Manager Laura White 20 281 8
24 africa Sr. Sal… Hugh Black 40 261 9
# n() - returns the number of rows in the tibble# n_distinct(COL1, COL2, ...) - returns the number of rows for which # the specified columns taken all together are distinct among all other rows.sales |>select(region, title, everything()) |>summarize(n(), n_distinct(region), n_distinct(title), n_distinct(region, title))
The examples above for summarize all return a single row of data. This is because they are summarizing all of the rows for the tibble that summarize received. Note that this is true even if the rows have been filtered first. For example:
# The following returns just one row.# The summarize function is summarizing ALL of the rows # for the tibble that it was given.sales %>%filter(region =="usa") |>summarize(numberOfRows=n(), maxUsaSales =max(salesInThousands), minUsaSales=min(salesInThousands))
The functions, group_by and summarize functions are designed to work together. The job of the group_by function is to separate the rows of data into different “groups” that will later be processed separately by the summarize function.
If you use a group_by function in general it is followed a call to summarize()
To understand the examples below, it is helpful to first see the rows sorted by region and title
# A tibble: 24 × 7
region title first last baseInThousands salesInThousands yearsWithCompany
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 africa Manager Laura White 20 281 8
2 africa Salespe… Emy Pars… 25 87 3
3 africa Salespe… Moni… Oja 35 98 10
4 africa Sr. Sal… Sue Aames 35 600 10
5 africa Sr. Sal… Hugh Black 40 261 9
6 asia Manager Emy Zeit… 34 166 4
7 asia Salespe… Sue Sall… 30 150 10
8 asia Salespe… Zeke Smith 20 120 3
9 asia Salespe… Amy Mark… 46 340 3
10 asia Salespe… Moni… Zeit… 23 184 1
# ℹ 14 more rows
Notice how the rows that have the same region can be thought of as a “group” of rows for that region.
See the examples below
# Show summary information for each of the different regions.# Each row in the output corresponds only to the rows for the region shown on that row of output.sales |>group_by(region) |># create a different group for each regionsummarize(numberOfRows=n(), mean(baseInThousands), mean(salesInThousands))
# A tibble: 4 × 4
region numberOfRows `mean(baseInThousands)` `mean(salesInThousands)`
<chr> <int> <dbl> <dbl>
1 africa 5 31 265.
2 asia 6 31.3 173.
3 europe 6 28.8 210.
4 usa 7 37 177.
Similarly we can group the rows by those with the same value in the title column.
# Show summary information for each of the different titles.# Each row in the output corresponds only to the rows for the title shown on that row of output.sales |>group_by(title) |># create a different group for each titlesummarize(numberOfRows=n(), mean(baseInThousands), mean(salesInThousands))
26.8.1 group_by without summarize has no noticeable effect
Note that on its own, the group_by function doesn’t appear to do anything. It only has a tangible effect when a summarize function is called after the group_by function
# group_by without a subsequent call to summarize doesn't appear to do anything specialsales |>group_by(title)
# A tibble: 24 × 7
# Groups: title [3]
first last title region baseInThousands salesInThousands yearsWithCompany
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Joe Smith Mana… usa 40 100 10
2 Sam Lee Sale… usa 40 200 6
3 Sue Sallin… Sale… asia 30 150 10
4 Barb Brown Sr. … asia 35 79 4
5 Jack Martin Mana… europe 20 40 3
6 Kate Carey Sr. … europe 20 69 2
7 Emy Parsons Sale… africa 25 87 3
8 Monica Oja Sale… africa 35 98 10
9 Jim Kuruzo… Sr. … usa 35 140 3
10 Zeke Cohen Mana… usa 30 250 10
# ℹ 14 more rows
26.8.2 group_by more than one column
A call to group_by that has more than one column creates a separate group for the rows that have the same values for all of the specified columns. To understand this look at the following output:
# A tibble: 12 × 3
region title baseInThousands
<chr> <chr> <dbl>
1 asia Manager 34
2 asia Salesperson 30
3 asia Salesperson 20
4 asia Salesperson 46
5 asia Salesperson 23
6 asia Sr. Salesperson 35
7 europe Manager 20
8 europe Manager 30
9 europe Salesperson 50
10 europe Sr. Salesperson 20
11 europe Sr. Salesperson 33
12 europe Sr. Salesperson 20
We can treat the rows that have the same values for both the region and title as a single group of rows. For example we can treat all of the rows for “asia Salesperson” as a group and the rows for “eroupe Manager” as a separate group. The output of the following command returns one row for each group.
# create groups based on region and salesasia_europe_sales |>group_by(region, title) |>summarize(n(), max(baseInThousands), mean(baseInThousands))|>print(n=Inf)
`summarise()` has grouped output by 'region'. You can override using the
`.groups` argument.
# A tibble: 6 × 5
# Groups: region [2]
region title `n()` `max(baseInThousands)` `mean(baseInThousands)`
<chr> <chr> <int> <dbl> <dbl>
1 asia Manager 1 34 34
2 asia Salesperson 4 46 29.8
3 asia Sr. Salesperson 1 35 35
4 europe Manager 2 30 25
5 europe Salesperson 1 50 50
6 europe Sr. Salesperson 3 33 24.3
Note
You might be wondering about the somewhat cryptic message from the previous command:
summarise() has grouped output by ‘region’. You can override using the .groups argument.
When group_by is called with more than one column name thesubsequent summarize call will generate a message similar to the one shownabove. You can ignore this message without any issue. If you want to understandit better - look at this.
26.8.3 order of columns in group_by is irrelevant
Note that it makes no difference which order you specify the column names in the call to group by. The following command reverses the order of the columns in the call to group_by but returns the exact same result as the previous command.
# create groups based on region and salesasia_europe_sales |>group_by(title, region) |># the order of the columns in group_by makes no differencesummarize(n(), max(baseInThousands), mean(baseInThousands))|>print(n=Inf)
`summarise()` has grouped output by 'title'. You can override using the
`.groups` argument.
# A tibble: 6 × 5
# Groups: title [3]
title region `n()` `max(baseInThousands)` `mean(baseInThousands)`
<chr> <chr> <int> <dbl> <dbl>
1 Manager asia 1 34 34
2 Manager europe 2 30 25
3 Salesperson asia 4 46 29.8
4 Salesperson europe 1 50 50
5 Sr. Salesperson asia 1 35 35
6 Sr. Salesperson europe 3 33 24.3
26.8.4 some other aggregate functions
As mentioned above, an “aggregate function” is a function that can take several values as input but returns a single value. It is these types of functions that are used by summarize()
The following are some other commonly used aggregate functions that can be used in summarize. There are many, many others.