24  24. dplyr - getting and viewing the data

The dplyr R package is part of the tidyverse family of packages. It is used to “slice and dice” tibbles to get exactly the data you want, in the form that you want it.

The dplyr R package performs the same kinds of tasks that the “SQL select statement” performs. We’ll learn more about SQL later.

Sources for info about the dplyr package:

24.1 Get the data

The data we are using in this section contains information about salespeople who are employees of a company. Each row of the data contains info about one salesperson. The salespeople get paid some “base pay” as well as a commission that is a percent of the total dollar amount of sales they make.

24.1.1 Load the R packages we’ll neeed

if(!require(tibble)){install.packages("tibble");require(tibble);}
if(!require(dplyr)){install.packages("dplyr");require(dplyr);}
if(!require(readr)){install.packages("readr");require(readr);}

24.1.2 Download the file

The data is contained in a csv file. If you’d like to follow along with this tutorial on your own computer you can download the .csv file we are using by clicking here

24.1.3 dplyr is best used with tibbles

The majority of the dplyr functions work with tibbles or dataframes. However, the tidyverse in general and dplyr specifically is designed to work better with tibbles.

If you need to, you can always convert a tibble to a dataframe or a dataframe to a tibble.

24.1.4 Import the data by clicking on some buttons …

  • The code in the next section below uses read_csv function to read the data into R. If you are not comfortable with R, I recommend that you instead, follow the instructions starting in the next bullet to import the data into R.

  • If you are not familiar with R, you may have some trouble running the read_csv() code shown below. Instead, I recommend that you follow the following instructions to import the file into R.

    To do so, click on “Import Dataset” button on the “Environment” tab (usually found in the upper-right-hand window pane in RStudio).

    Choose “From Text (base)” and locate your file. You should see something like this:

    Make sure to change the “Name” portion (see circled section in picture) to read “sales”, then press the “import” button. This will open up a new tab in RStudio that shows the contents of the file. You can safely navigate away from this tab or close the tab and the data will remain imported and can be seen by typing “sales” (without the “quotes”).

  • Finally, in order to ensure that you are working with a tibble and not with a data.frame you can run the following R code:

    sales = as_tibble(sales)

24.1.5 Import the data by typing some R code …

The following code reads the data into R. Alternatively, you can follow the instructions above to click on some buttons to import the data.

# Read in the data into a tibble
#
# Note that the following code uses the readr::read_csv function from the readr
# package which is part of the tidyverse collection of packages. 
# This function is similar to the base-R read.csv function.
# 
# read_csv returns a tibble, which is the data structure that the 
# tidyverse packages use in lieu of dataframes. A tibble is basically
# a dataframe with extra features.
# By contrast, the base-r read.csv function returns a dataframe.
 
sales = read_csv("salespeople-v002.csv", na=c("","NULL"), show_col_types=FALSE)

24.2 Display the data

Since the data is in a tibble, by default, only the first 10 rows are displayed. In addition, only the columns that fit on the screen will be displayed. If the rows are too wide for the screen, then some columns may not be displayed and/or the contents of some columns may be shortened.

# Display the first few rows and columns of the sales data
sales
# A tibble: 24 × 7
   first  last    title region baseInThousands salesInThousands yearsWithCompany
   <chr>  <chr>   <chr> <chr>            <dbl>            <dbl>            <dbl>
 1 Joe    Smith   Mana… usa                 40              100               10
 2 Sam    Lee     Sale… usa                 40              200                6
 3 Sue    Sallin… Sale… asia                30              150               10
 4 Barb   Brown   Sr. … asia                35               79                4
 5 Jack   Martin  Mana… europe              20               40                3
 6 Kate   Carey   Sr. … europe              20               69                2
 7 Emy    Parsons Sale… africa              25               87                3
 8 Monica Oja     Sale… africa              35               98               10
 9 Jim    Kuruzo… Sr. … usa                 35              140                3
10 Zeke   Cohen   Mana… usa                 30              250               10
# ℹ 14 more rows

24.3 Display more rows/columns of the tibble

24.4 Convert between tibbles and dataframes

The majority of the dplyr functions work with tibbles or dataframes. However, the tidyverse in general and dplyr specifically is designed to work better with tibbles.

If you need to, you can always convert a tibble to a dataframe or a dataframe to a tibble.

# Convert the tibble to a data.frame
dfSales = as.data.frame(sales)

# Convert a dataframe to a tibble
tblSales = as_tibble(dfSales)

# A tibble is also a dataframe.
# This is similar to how a square is also a rectangle.
# You can see that from the class of the variable. 
# The class of a tibble contains "tbl_df" "tbl" as well as "data.frame"
class(tblSales)
[1] "tbl_df"     "tbl"        "data.frame"
# However, a dataframe is not necessarily a tibble.
# This is similar ot how a rectangle is not necessarily a square.
# Notice how the class of dfSales is only "data.frame"
class(dfSales)
[1] "data.frame"
# The head function is designed to work with dataframes
head(dfSales, 3)
  first      last       title region baseInThousands salesInThousands
1   Joe     Smith     Manager    usa              40              100
2   Sam       Lee Salesperson    usa              40              200
3   Sue Sallinger Salesperson   asia              30              150
  yearsWithCompany
1               10
2                6
3               10
# Therefore the head function also works with tibbles
head(tblSales, 3)
# A tibble: 3 × 7
  first last      title region baseInThousands salesInThousands yearsWithCompany
  <chr> <chr>     <chr> <chr>            <dbl>            <dbl>            <dbl>
1 Joe   Smith     Mana… usa                 40              100               10
2 Sam   Lee       Sale… usa                 40              200                6
3 Sue   Sallinger Sale… asia                30              150               10
#######################################################################
# However, features of some functions that are designed to
# work with tibbles might not work with data.frames.
#######################################################################
# This works fine - display the first 3 rows of the tibble.
print(tblSales, n=3) 
# A tibble: 24 × 7
  first last      title region baseInThousands salesInThousands yearsWithCompany
  <chr> <chr>     <chr> <chr>            <dbl>            <dbl>            <dbl>
1 Joe   Smith     Mana… usa                 40              100               10
2 Sam   Lee       Sale… usa                 40              200                6
3 Sue   Sallinger Sale… asia                30              150               10
# ℹ 21 more rows
# ERROR - n argument is not designed to work with dataframes that aren't tibbles
print(dfSales, n=3) 
Error in print.default(m, ..., quote = quote, right = right, max = max): invalid 'na.print' specification

24.5 getting “slices” of the rows

Among some of the many functions included in dplyr are the various “slice_…” functions. These functions return a new tibble with just the specified rows. The new tibble can be saved in a variable. This is different from how the print function works (see above).

The functions can take several arguments that modify their behavior. The following are just a few examples. See the ?slice for more info about these functions and their arguments.

24.5.1 slice_sample()

slice_sample() chooses a random sample of the rows.

# see 3 randomly chosen rows
sales |> slice_sample(n=3)  
# A tibble: 3 × 7
  first last    title   region baseInThousands salesInThousands yearsWithCompany
  <chr> <chr>   <chr>   <chr>            <dbl>            <dbl>            <dbl>
1 Emy   Parsons Salesp… africa              25               87                3
2 Hugh  Black   Sr. Sa… africa              40              261                9
3 Kate  Carey   Sr. Sa… europe              20               69                2
# randomly chose 10% of the rows to view
sales |> slice_sample(prop=0.10)  
# A tibble: 2 × 7
  first last      title region baseInThousands salesInThousands yearsWithCompany
  <chr> <chr>     <chr> <chr>            <dbl>            <dbl>            <dbl>
1 Barb  Aames     Sale… usa                 21              255                7
2 Sue   Sallinger Sale… asia                30              150               10

24.5.2 slice_min() and slice_max()

To see the rows that correspond to the min or max values of a specified column use slice_min()/slice_max().

# see rows with 3 smallest salesInThousands
sales |> slice_min(order_by=salesInThousands, n=3)  
# A tibble: 3 × 7
  first last   title    region baseInThousands salesInThousands yearsWithCompany
  <chr> <chr>  <chr>    <chr>            <dbl>            <dbl>            <dbl>
1 Jack  Martin Manager  europe              20               40                3
2 Kate  Carey  Sr. Sal… europe              20               69                2
3 Barb  Brown  Sr. Sal… asia                35               79                4
# see rows with 3 smallest salesInThousands
sales |> slice_max(order_by=salesInThousands, n=3)  
# A tibble: 3 × 7
  first last      title region baseInThousands salesInThousands yearsWithCompany
  <chr> <chr>     <chr> <chr>            <dbl>            <dbl>            <dbl>
1 Sue   Aames     Sr. … africa              35              600               10
2 Sam   Lincoln   Mana… europe              30              500                2
3 Joe   Washingt… Sr. … europe              33              370                2

24.5.3 slice() to see specific rows

You don’t need c() to specify the rows. This is typical of the style of functions in the tidyverse.

# get rows 1,5,7 in a new dataframe
x |> slice(1,5,7)
# A tibble: 3 × 7
  first last    title   region baseInThousands salesInThousands yearsWithCompany
  <chr> <chr>   <chr>   <chr>            <dbl>            <dbl>            <dbl>
1 Joe   Smith   Manager usa                 40              100               10
2 Jack  Martin  Manager europe              20               40                3
3 Emy   Parsons Salesp… africa              25               87                3
# get all rows except for rows 5 through 22
x |> slice(-(5:22))
# A tibble: 6 × 7
  first last      title region baseInThousands salesInThousands yearsWithCompany
  <chr> <chr>     <chr> <chr>            <dbl>            <dbl>            <dbl>
1 Joe   Smith     Mana… usa                 40              100               10
2 Sam   Lee       Sale… usa                 40              200                6
3 Sue   Sallinger Sale… asia                30              150               10
4 Barb  Brown     Sr. … asia                35               79                4
5 Laura White     Mana… africa              20              281                8
6 Hugh  Black     Sr. … africa              40              261                9

24.5.4 slice_head() and slice_tail()

The slice_head() and slice_tail() are very similar to the base-R head() and tail() functions. They return the first few (slice_head) or last few (slice_tail) rows from a tibble.

# get just first 3 rows
sales |> slice_head(n=3)
# A tibble: 3 × 7
  first last      title region baseInThousands salesInThousands yearsWithCompany
  <chr> <chr>     <chr> <chr>            <dbl>            <dbl>            <dbl>
1 Joe   Smith     Mana… usa                 40              100               10
2 Sam   Lee       Sale… usa                 40              200                6
3 Sue   Sallinger Sale… asia                30              150               10

print(n=3) vs slice_head(3)

# get just first 3 rows
sales |> slice_head(n=3)
# A tibble: 3 × 7
  first last      title region baseInThousands salesInThousands yearsWithCompany
  <chr> <chr>     <chr> <chr>            <dbl>            <dbl>            <dbl>
1 Joe   Smith     Mana… usa                 40              100               10
2 Sam   Lee       Sale… usa                 40              200                6
3 Sue   Sallinger Sale… asia                30              150               10
# save that in a variable
x = sales |> slice_head(n=3)

# x only contains 3 rows
x
# A tibble: 3 × 7
  first last      title region baseInThousands salesInThousands yearsWithCompany
  <chr> <chr>     <chr> <chr>            <dbl>            <dbl>            <dbl>
1 Joe   Smith     Mana… usa                 40              100               10
2 Sam   Lee       Sale… usa                 40              200                6
3 Sue   Sallinger Sale… asia                30              150               10
# print appears to do the same thing
# but print only affects what is displayed to the screen, not 
# what is "returned" and can be saved in a variable.
y = sales |> print(n=3)
# A tibble: 24 × 7
  first last      title region baseInThousands salesInThousands yearsWithCompany
  <chr> <chr>     <chr> <chr>            <dbl>            <dbl>            <dbl>
1 Joe   Smith     Mana… usa                 40              100               10
2 Sam   Lee       Sale… usa                 40              200                6
3 Sue   Sallinger Sale… asia                30              150               10
# ℹ 21 more rows
# y still has all of the rows
y
# A tibble: 24 × 7
   first  last    title region baseInThousands salesInThousands yearsWithCompany
   <chr>  <chr>   <chr> <chr>            <dbl>            <dbl>            <dbl>
 1 Joe    Smith   Mana… usa                 40              100               10
 2 Sam    Lee     Sale… usa                 40              200                6
 3 Sue    Sallin… Sale… asia                30              150               10
 4 Barb   Brown   Sr. … asia                35               79                4
 5 Jack   Martin  Mana… europe              20               40                3
 6 Kate   Carey   Sr. … europe              20               69                2
 7 Emy    Parsons Sale… africa              25               87                3
 8 Monica Oja     Sale… africa              35               98               10
 9 Jim    Kuruzo… Sr. … usa                 35              140                3
10 Zeke   Cohen   Mana… usa                 30              250               10
# ℹ 14 more rows

base-R head() vs dplyr::slice_head()

When used alone, slice_head() and head() are very similar.

However, slice_head() and slice_tail(), as well as the other slice_* functions are designed to work well in conjunction with the dplyr::group_by() function (we will cover group_by() later). The head() and tail() functions are not.

The general rule is that if you are using several dplyr functions in a pipeline (i.e. %>%> or |>) you should opt for the dplyr functions instead of the base-R functions that might seem similar.

24.6 Other ways of viewing the data

# To see the entire tibble in a separate tab in RStudio
# you can use the following command
#
#View(sales)

# Show all column names and datatypes plus some actual data
glimpse(sales)

# Show just the column names (could be useful for a quick reminder when you need it)
colnames(sales)