if(!require(tibble)){install.packages("tibble");require(tibble);}
if(!require(dplyr)){install.packages("dplyr");require(dplyr);}
if(!require(readr)){install.packages("readr");require(readr);}
if(!require(sqldf)){install.packages("sqldf");require(sqldf);}
29 29. Intro to SQL Select (assumes some knowledge of dplyr)
29.1 NOTE - this page is for “IDS2040 - Data Management” (not “IDS1020 - Intro to IDS”)
NOTE - If you are in my “IDS1020 - Intro to IDS” class, this is NOT the page you should be reading. Rather, jump to the next section entitled “Relational Databases and SQL” and start from there.
This page is intended for my “IDS460 - Data Management” class. The info below assumes that you have some knowledge of the dplyr package in R (which was covered in the previous section). After you get a basic idea of what the SQL SELECT statement does and how it is very similar to the dplyr functions, you can move onto the next section entitled “Relational Databases and SQL” and continue with that information.
29.2 Get the data
The data we are using in this section contains information about salespeople who are employees of a company. Each row of the data contains info about one salesperson. The salespeople get paid some “base pay” as well as a commission that is a percent of the total dollar amount of sales they make.
Follow the instructions below to get the data and start using it in R.
29.2.1 Load the R packages we’ll neeed
29.2.2 Download the file
The data is contained in a csv file. If you’d like to follow along with this tutorial on your own computer you can download the .csv file we are using by clicking here.
29.2.4 Import the data by typing some R code …
The following code reads the data into R. Alternatively, you can follow the instructions above to click on some buttons to import the data.
# Read in the data into a tibble
#
# Note that the following code uses the readr::read_csv function from the readr
# package which is part of the tidyverse collection of packages.
# This function is similar to the base-R read.csv function.
#
# read_csv returns a tibble, which is the data structure that the
# tidyverse packages use in lieu of dataframes. A tibble is basically
# a dataframe with extra features.
# By contrast, the base-r read.csv function returns a dataframe.
= read_csv("salespeople-v002.csv", na=c("","NULL"), show_col_types=FALSE) sales
29.3 Databases and SQL
A database is an organized collection of data that is designed to be accessed and manipulated by computer programs. We’ll cover a little more about exactly what a database is and how it differs from Excel, R and other programs in the next section of this book. Below is just a brief explanation to give you some background before we start getting into the details of the SQL language that is used to communicate with databases.
A “relational database” is a type of database in which data is arranged in “tables” that are organized into “rows” and “columns”. A “relational database table” is very similar to an R dataframe or tibble. Relational databases are controlled and managed with software known as “Relational DataBase Management System” (RDBMS) software. Relational database technology dates back to the 1970s and has been around long before R has been around. Relational databases are used by just about every major company all around the world.
“Structured Query Language” (or SQL for short - pronounced “sequel”) is the standard language that is used to communicate with a DBMS to manage and manipulate the data in a Relational Database. SQL has many different commands. One of the most important SQL commands for data analysts to know is the SQL “SELECT” command. It is this command that is used to extract data from a database and organize the data into a desired form. We will focus in this section on an intro to the SQL SELECT statement.
29.4 Using SQL in R
As we said above, R is NOT a Database Management System. However SQL is a very popular language. Many technologists come to R, already having a deep knowledge of the SQL language. Therefore it is nice to know that the SQL language can also be used to manipulate R dataframes. This is possible because R dataframes are very, very similar in structure to Relational Database “tables” (i.e. they have rows, columns, column names and specific datatypes for each column).
The R “sqldf” package includes a function named “sqldf” which takes a SQL command as its argument. Wherever the SQL command refers to the name of a relational database “table”, sqldf runs the SQL command using the R dataframe with that name.
There are other more advanced ways for R to connect to SQL databases (DB) and database management systems (DBMS). We will cover the more advanced ways later. For now, sqldf will suit our purposes for this intro.
29.5 dplyr is very similar to the SQL SELECT command
The designers of dplyr got inspiration for many of their ideas from SQL. Therefore once you know the basics of dplyr it should be very easy to transfer that knowledge to learning the SQL SELECT statement.
The following examples show how the concepts you learned in the previous section about dplyr carry over to the SQL SELECT statement.
The English word “query” means a question. A SQL SELECT statement is often referred to as a database “query”. In essence a SQL SELECT statement in essence asks the database a question and gets back an answer.
29.6 SQL SELECT “clauses” compared with dplyr “functions”
While dplyr uses different “functions”, the SQL SELECT statement is comprised of different “clauses”. The clauses in the SQL SELECT statement are listed below. We will elaborate on the details of these clauses in the sections below.
SELECT
This is used to “select” the columns you want - similar to dplyr select function.
FROM
Used to specify which “tables” contain the information you will be working with. In dplyr this is accomplished via the first argument, .data, of each of the dplyr “verb” functions that we learned about in the previous section.
WHERE
Choose the rows you want. Directly analogous to the dplyr “filter” function.
GROUP BY
Directly analogous to the dplyr “group_by” function.
HAVING
This is also similar to the dplyr “filter” function. We will learn later how this differs from the SQL SELECT WHERE clause.
ORDER BY
Directly analogous to the dplyr “arrange” function
LIMIT
Directly analogous to the dplyr print(n=…) function or the dplyr slice_head(n=…) function.
29.7 Order of the clauses is important
A SQL SELECT statement may contain some or all of the above clauses. Only the SELECT clause is absolutely required. If the SQL SELECT statement contains more than one clause then whichever clauses do appear must appear in the order listed above.
29.8 Is SQL case sensitive? yes and no
Different SQL products treat case sensitivity differently. In general the names of tables and columns ARE case sensitive. The “keywords” of the language are generally NOT case sensitive. However, it is often the convention of many SQL books and references to show SQL keywords in UPPER CASE. We have done that below but it is not strictly necessary.
29.9 Whitespace in SQL commands is ignored
A SQL command may be written entirely on one line. However, extra whitespace (i.e. spaces, tabs, newlines) may be added to make the code more readable. It is common to start each SQL SELECT statement clause on a new line and to add extra whitespace to make the code more readable. In general we tried to follow that practice in the code.
29.10 Intro to ERDs - i.e. “Entity Relationship Diagrams”
An “Entity Relationship Diagram” (ERD) is a diagram that highlights the structure of the tables in a database (in database terminology tables are also known as “entity sets” or “entity types” - see below). The ERD does NOT show the data in the tables, just the structure of the tables. The following is an ERD for the sales data we have been using.
Note that so far we’ve only been using one table. An ERD gets more complicated when a database contains several tables. We will revisit ERDs again later.
erDiagram SALES { first character "employee first name" last character "employee last name" title character "employee title" region character "region that employee sells to" baseInThousands numeric "employee's base pay (in thousands)" salesInThousands numeric "total sales dollars (in thousands)" yearsWithCompany numeric "number years employee with company" }
- The table name is shown at the top
- Under the table name
- The 1st column contains the names of the table’s fields (i.e. columns)
- The 2nd column contains the datatype of the fields
- The 3rd column contains a description of the field
Some Entity Relationship Diagrams don’t contain as much information. Often an ERD will NOT contain a description of the fields. This is still a “valid” ERD, just not as descriptive.
erDiagram SALES { first character last character title character region character baseInThousands numeric salesInThousands numeric yearsWithCompany numeric }
In the extreme case, an ERD may only contain the names of the tables. This type of ERD is only really useful when the database contains multiple tables.
erDiagram SALES
29.11 Database terminology - fields, entities, etc
There are different terms used to refer to the rows and columns of a table, dataframe or tibble. Statisticians often refer to each row of a table as an observation and each column of a table as a variable.
In the world of databases, a column can be referred to as a “column” or a “field”. Rows are referred to as just “rows” but sometimes are referred to as “records” or “entities”. Often an entire table is referred to as an “entity” (technically a table, which is a collection of rows, is an “entity set” or an “entity type”).
The truth is that not many people use the word “entity” or “entity set”. However, an “Entity Relationship Diagram” is used to show the tables in a database and how the tables are “related” to each other.
In this section we are just focusing on a single table. In a later sections, we’ll learn more about “relationships” between multiple tables and how the relationships are displayed on an Entity Relationship Diagram.
29.12 SELECT and FROM clauses
# The SELECT clause specifies which columns you want.
# The FROM clause specifies the table (or tables) that contain the data.
sqldf("SELECT title, first, last
FROM sales")
title first last
1 Manager Joe Smith
2 Salesperson Sam Lee
3 Salesperson Sue Sallinger
4 Sr. Salesperson Barb Brown
5 Manager Jack Martin
6 Sr. Salesperson Kate Carey
7 Salesperson Emy Parsons
8 Salesperson Monica Oja
9 Sr. Salesperson Jim Kuruzovich
10 Manager Zeke Cohen
11 Salesperson Zeke Smith
12 Salesperson Amy Markovic
13 Sr. Salesperson Joe Washington
14 Manager Sam Lincoln
15 Sr. Salesperson Sue Aames
16 Salesperson Barb Aames
17 Salesperson Jack Aames
18 Sr. Salesperson Kate Zeitchik
19 Manager Emy Zeitchik
20 Salesperson Monica Zeitchik
21 Salesperson Jim Brown
22 Sr. Salesperson Larry Green
23 Manager Laura White
24 Sr. Salesperson Hugh Black
29.13 SELECT * FROM …
# SELECT * in SQL is similar to select(everything()) in dplyr.
# * is an abbreviation for all of the column names.
sqldf("SELECT *
FROM sales")
first last title region baseInThousands salesInThousands yearsWithCompany
1 Joe Smith Manager usa 40 100 10
2 Sam Lee Salesperson usa 40 200 6
3 Sue Sallinger Salesperson asia 30 150 10
4 Barb Brown Sr. Salesperson asia 35 79 4
5 Jack Martin Manager europe 20 40 3
6 Kate Carey Sr. Salesperson europe 20 69 2
7 Emy Parsons Salesperson africa 25 87 3
8 Monica Oja Salesperson africa 35 98 10
9 Jim Kuruzovich Sr. Salesperson usa 35 140 3
10 Zeke Cohen Manager usa 30 250 10
11 Zeke Smith Salesperson asia 20 120 3
12 Amy Markovic Salesperson asia 46 340 3
13 Joe Washington Sr. Salesperson europe 33 370 2
14 Sam Lincoln Manager europe 30 500 2
15 Sue Aames Sr. Salesperson africa 35 600 10
16 Barb Aames Salesperson usa 21 255 7
17 Jack Aames Salesperson usa 43 105 4
18 Kate Zeitchik Sr. Salesperson usa 50 187 4
19 Emy Zeitchik Manager asia 34 166 4
20 Monica Zeitchik Salesperson asia 23 184 1
21 Jim Brown Salesperson europe 50 167 2
22 Larry Green Sr. Salesperson europe 20 113 4
23 Laura White Manager africa 20 281 8
24 Hugh Black Sr. Salesperson africa 40 261 9
# Unfortunately, the SQL SELECT * is not as smart as the everything() function in dplyr.
# In dplyr, everything() does not include columns that you already typed.
# In SQL, the * always includes ALL columns. Therefore the following
# query displays the title and region columns a 2nd time due to the *
sqldf("SELECT title, region, *
FROM sales")
title region first last title region baseInThousands salesInThousands yearsWithCompany
1 Manager usa Joe Smith Manager usa 40 100 10
2 Salesperson usa Sam Lee Salesperson usa 40 200 6
3 Salesperson asia Sue Sallinger Salesperson asia 30 150 10
4 Sr. Salesperson asia Barb Brown Sr. Salesperson asia 35 79 4
5 Manager europe Jack Martin Manager europe 20 40 3
6 Sr. Salesperson europe Kate Carey Sr. Salesperson europe 20 69 2
7 Salesperson africa Emy Parsons Salesperson africa 25 87 3
8 Salesperson africa Monica Oja Salesperson africa 35 98 10
9 Sr. Salesperson usa Jim Kuruzovich Sr. Salesperson usa 35 140 3
10 Manager usa Zeke Cohen Manager usa 30 250 10
11 Salesperson asia Zeke Smith Salesperson asia 20 120 3
12 Salesperson asia Amy Markovic Salesperson asia 46 340 3
13 Sr. Salesperson europe Joe Washington Sr. Salesperson europe 33 370 2
14 Manager europe Sam Lincoln Manager europe 30 500 2
15 Sr. Salesperson africa Sue Aames Sr. Salesperson africa 35 600 10
16 Salesperson usa Barb Aames Salesperson usa 21 255 7
17 Salesperson usa Jack Aames Salesperson usa 43 105 4
18 Sr. Salesperson usa Kate Zeitchik Sr. Salesperson usa 50 187 4
19 Manager asia Emy Zeitchik Manager asia 34 166 4
20 Salesperson asia Monica Zeitchik Salesperson asia 23 184 1
21 Salesperson europe Jim Brown Salesperson europe 50 167 2
22 Sr. Salesperson europe Larry Green Sr. Salesperson europe 20 113 4
23 Manager africa Laura White Manager africa 20 281 8
24 Sr. Salesperson africa Hugh Black Sr. Salesperson africa 40 261 9
# The only thing you can do in a select statement without a FROM clause
# is to perform calculations without data from a table.
sqldf("SELECT 3+2, 100*5")
3+2 100*5
1 5 500
29.14 add new columns in SELECT clause
# In dplyr, to add new columns you use the mutate function.
# In the SQL SELECT command this is accomplished as part of the SELECT clause.
# To add new columns that are calculated from other columns
# simply add the calculations to the select clause.
sqldf("SELECT first, last, salesInThousands, 0.1 * salesInThousands
FROM sales
")
first last salesInThousands 0.1 * salesInThousands
1 Joe Smith 100 10.0
2 Sam Lee 200 20.0
3 Sue Sallinger 150 15.0
4 Barb Brown 79 7.9
5 Jack Martin 40 4.0
6 Kate Carey 69 6.9
7 Emy Parsons 87 8.7
8 Monica Oja 98 9.8
9 Jim Kuruzovich 140 14.0
10 Zeke Cohen 250 25.0
11 Zeke Smith 120 12.0
12 Amy Markovic 340 34.0
13 Joe Washington 370 37.0
14 Sam Lincoln 500 50.0
15 Sue Aames 600 60.0
16 Barb Aames 255 25.5
17 Jack Aames 105 10.5
18 Kate Zeitchik 187 18.7
19 Emy Zeitchik 166 16.6
20 Monica Zeitchik 184 18.4
21 Jim Brown 167 16.7
22 Larry Green 113 11.3
23 Laura White 281 28.1
24 Hugh Black 261 26.1
# You can give the new column a unique name by following the definition
# of the column with "AS columnName".
sqldf("SELECT first, last, salesInThousands, 0.1 * salesInThousands as commission
FROM sales
")
first last salesInThousands commission
1 Joe Smith 100 10.0
2 Sam Lee 200 20.0
3 Sue Sallinger 150 15.0
4 Barb Brown 79 7.9
5 Jack Martin 40 4.0
6 Kate Carey 69 6.9
7 Emy Parsons 87 8.7
8 Monica Oja 98 9.8
9 Jim Kuruzovich 140 14.0
10 Zeke Cohen 250 25.0
11 Zeke Smith 120 12.0
12 Amy Markovic 340 34.0
13 Joe Washington 370 37.0
14 Sam Lincoln 500 50.0
15 Sue Aames 600 60.0
16 Barb Aames 255 25.5
17 Jack Aames 105 10.5
18 Kate Zeitchik 187 18.7
19 Emy Zeitchik 166 16.6
20 Monica Zeitchik 184 18.4
21 Jim Brown 167 16.7
22 Larry Green 113 11.3
23 Laura White 281 28.1
24 Hugh Black 261 26.1
# The word "AS" is actually optional. The following adds two new columns
# but does not use the word AS. All you need is the definition of the column
# followed by a space followed by the name of the new column.
sqldf("SELECT first, last, baseInThousands,
salesInThousands, 0.1 * salesInThousands commission,
salesInThousands * 0.1 + baseInThousands takeHome
FROM sales
WHERE region='africa'
")
first last baseInThousands salesInThousands commission takeHome
1 Emy Parsons 25 87 8.7 33.7
2 Monica Oja 35 98 9.8 44.8
3 Sue Aames 35 600 60.0 95.0
4 Laura White 20 281 28.1 48.1
5 Hugh Black 40 261 26.1 66.1
29.15 use aggregate functions in SELECT to create summary rows
# The dplyr summarize function is used to summarize (or aggregate) info from
# several rows into a single row.
#
# In the SQL SELECT statement, this is accomplished by simply using
# aggregate functions in the select clause.
#
# SQL has several built in standard aggregate functions
#
# count(*) - similar to n() in dplyr - we'll discuss why the * is there later
# avg(SOME_COLUMN)
# max(SOME_COLUMN)
# min(SOME_COLUMN)
sqldf("select count(*), avg(baseInThousands), max(baseInThousands)
FROM sales
ORDER BY region ASC, salesInThousands DESC")
count(*) avg(baseInThousands) max(baseInThousands)
1 24 32.29167 50
# As shown above we can assign names to the new columns.
# Again, as we said above, the word "AS" is optional. The following statement
# would work exactly the same way if we did not have the word "AS"
sqldf("SELECT count(*) as numberOfEmployees,
avg(baseInThousands) AS averagebaseInThousands,
max(baseInThousands) AS maxbaseInThousands
FROM sales
ORDER BY region ASC, salesInThousands DESC")
numberOfEmployees averagebaseInThousands maxbaseInThousands
1 24 32.29167 50
# This is the same query as above.
# This version does not have the word "AS".
# The results are exactly the same.
sqldf("SELECT count(*) numberOfEmployees,
avg(baseInThousands) averagebaseInThousands,
max(baseInThousands) maxbaseInThousands
FROM sales
ORDER BY region ASC, salesInThousands DESC")
numberOfEmployees averagebaseInThousands maxbaseInThousands
1 24 32.29167 50
29.16 WHERE clause
# WHERE is directly analogous to the dplyr filter function.
#
# The WHERE clause identifies the rows that will be returned.
# It takes a logical expression that uses the names of the columns.
# For every row the SELECT statement analyzes the row and calculates the
# result of the logical expression for that row. If the logical expression
# for a row is TRUE you get the row back. If not you do not get the row.
sqldf("SELECT *
FROM sales
WHERE salesInThousands < 100")
first last title region baseInThousands salesInThousands yearsWithCompany
1 Barb Brown Sr. Salesperson asia 35 79 4
2 Jack Martin Manager europe 20 40 3
3 Kate Carey Sr. Salesperson europe 20 69 2
4 Emy Parsons Salesperson africa 25 87 3
5 Monica Oja Salesperson africa 35 98 10
# This query uses aggregate functions in the SELECT but does NOT have a WHERE.
# Therefore the result is a summary of ALL rows in the table.
sqldf("SELECT count(*), min(baseInThousands), max(baseInThousands), avg(baseInThousands)
FROM sales")
count(*) min(baseInThousands) max(baseInThousands) avg(baseInThousands)
1 24 20 50 32.29167
# This is the same query but adds WHERE region='asia'.
# As a result the summary row only reflects info about the rows for 'asia'.
# Notice the there are fewer rows in the count(*) column and some of the
# other summary statistics are also different.
sqldf("SELECT count(*), min(baseInThousands), max(baseInThousands), avg(baseInThousands)
FROM sales
WHERE region='asia'")
count(*) min(baseInThousands) max(baseInThousands) avg(baseInThousands)
1 6 20 46 31.33333
29.17 ORDER BY clause
# The ORDER BY clause is directly anaogous to dplyr's arrange function
sqldf("SELECT *
FROM sales
WHERE salesInThousands < 100
ORDER By salesInThousands")
first last title region baseInThousands salesInThousands yearsWithCompany
1 Jack Martin Manager europe 20 40 3
2 Kate Carey Sr. Salesperson europe 20 69 2
3 Barb Brown Sr. Salesperson asia 35 79 4
4 Emy Parsons Salesperson africa 25 87 3
5 Monica Oja Salesperson africa 35 98 10
# Just as with dplyr's arrange function the rows can be ordered
# from largest to smallest by specifying desc, i.e. a descending order
# for the values of a column.
sqldf("SELECT *
FROM sales
WHERE salesInThousands < 100
ORDER BY salesInThousands DESC")
first last title region baseInThousands salesInThousands yearsWithCompany
1 Monica Oja Salesperson africa 35 98 10
2 Emy Parsons Salesperson africa 25 87 3
3 Barb Brown Sr. Salesperson asia 35 79 4
4 Kate Carey Sr. Salesperson europe 20 69 2
5 Jack Martin Manager europe 20 40 3
# Just as with dplyr's arrange function you can specify that the order
# of the rows should depend on multiple columns.
#
# The first column specified is used to order all of the rows.
# Subsequent columns mentioned in ORDER By are used only for rows
# in which the values for the earlier columns are the same.
#
# Each column could have an ascending (asc) or descending (desc) order.
# If neither asc nor desc is specified, then the default is an ascending order.
sqldf("SELECT *
FROM sales
ORDER BY region ASC, salesInThousands DESC")
first last title region baseInThousands salesInThousands yearsWithCompany
1 Sue Aames Sr. Salesperson africa 35 600 10
2 Laura White Manager africa 20 281 8
3 Hugh Black Sr. Salesperson africa 40 261 9
4 Monica Oja Salesperson africa 35 98 10
5 Emy Parsons Salesperson africa 25 87 3
6 Amy Markovic Salesperson asia 46 340 3
7 Monica Zeitchik Salesperson asia 23 184 1
8 Emy Zeitchik Manager asia 34 166 4
9 Sue Sallinger Salesperson asia 30 150 10
10 Zeke Smith Salesperson asia 20 120 3
11 Barb Brown Sr. Salesperson asia 35 79 4
12 Sam Lincoln Manager europe 30 500 2
13 Joe Washington Sr. Salesperson europe 33 370 2
14 Jim Brown Salesperson europe 50 167 2
15 Larry Green Sr. Salesperson europe 20 113 4
16 Kate Carey Sr. Salesperson europe 20 69 2
17 Jack Martin Manager europe 20 40 3
18 Barb Aames Salesperson usa 21 255 7
19 Zeke Cohen Manager usa 30 250 10
20 Sam Lee Salesperson usa 40 200 6
21 Kate Zeitchik Sr. Salesperson usa 50 187 4
22 Jim Kuruzovich Sr. Salesperson usa 35 140 3
23 Jack Aames Salesperson usa 43 105 4
24 Joe Smith Manager usa 40 100 10
29.18 GROUP BY clause
# The GROUP BY clause in SQL is directly analogous to the group_by function in dplyr.
#
# All of the rows that have the same value for the specified GROUP BY columns
# are aggregated (i.e. summarized) in a single line of output.
#
# GROUP BY should only be used if the SELECT clause includes aggregate functions.
sqldf("SELECT title, count(*), avg(baseInThousands) avgBase, max(baseInThousands) maxBase
FROM sales
GROUP BY title
ORDER BY title")
title count(*) avgBase maxBase
1 Manager 6 29.0 40
2 Salesperson 10 33.3 50
3 Sr. Salesperson 8 33.5 50
# Grouping by a different column - region
sqldf("SELECT region, count(*), avg(baseInThousands) avgBase, max(baseInThousands) maxBase
FROM sales
GROUP BY region
ORDER BY region")
region count(*) avgBase maxBase
1 africa 5 31.00000 40
2 asia 6 31.33333 46
3 europe 6 28.83333 50
4 usa 7 37.00000 50
# Just as with dplyr, the groups can be defined by more than one column.
#
# The following query treats all of the rows that match in both the
# region and title columns as a single group.
#
# For example, all of the original rows from the sales table
# that have a title of "Salesperson" and a region of "asia"
# are considered to be part of the same group and are summarized
# in a single row of the output.
sqldf("SELECT title, region, count(*), avg(baseInThousands) avgBase, max(baseInThousands) maxBase
FROM sales
GROUP BY title, region
ORDER BY title, region")
title region count(*) avgBase maxBase
1 Manager africa 1 20.00000 20
2 Manager asia 1 34.00000 34
3 Manager europe 2 25.00000 30
4 Manager usa 2 35.00000 40
5 Salesperson africa 2 30.00000 35
6 Salesperson asia 4 29.75000 46
7 Salesperson europe 1 50.00000 50
8 Salesperson usa 3 34.66667 43
9 Sr. Salesperson africa 2 37.50000 40
10 Sr. Salesperson asia 1 35.00000 35
11 Sr. Salesperson europe 3 24.33333 33
12 Sr. Salesperson usa 2 42.50000 50
# A similar query without a GROUP BY returns just a single row that
# summarizes the data from all rows of the table.
sqldf("SELECT count(*), avg(baseInThousands) avgBase, max(baseInThousands) maxBase
FROM sales")
count(*) avgBase maxBase
1 24 32.29167 50
29.19 LIMIT clause
The LIMIT clause is similar to print(n=…) and slice_head(n=…) with dplyr. It limits the number of rows returned to the first few that would have been returned had the query not included the LIMIT clause.
LIMIT is often used in combination with the ORDER BY clause.
LIMIT must come at the very end of the SQL statement.
sqldf("SELECT *
FROM sales
ORDER BY salesInThousands desc
LIMIT 10")
first last title region baseInThousands salesInThousands yearsWithCompany
1 Sue Aames Sr. Salesperson africa 35 600 10
2 Sam Lincoln Manager europe 30 500 2
3 Joe Washington Sr. Salesperson europe 33 370 2
4 Amy Markovic Salesperson asia 46 340 3
5 Laura White Manager africa 20 281 8
6 Hugh Black Sr. Salesperson africa 40 261 9
7 Barb Aames Salesperson usa 21 255 7
8 Zeke Cohen Manager usa 30 250 10
9 Sam Lee Salesperson usa 40 200 6
10 Kate Zeitchik Sr. Salesperson usa 50 187 4
29.19.1 LIMIT vs LIMIT OFFSET
In general the limit clause has two different forms:
<SOME SELECT STATEMENT>
limit M
(where M is an integer) results in only the first M rows of data from what would normally have been displayed had the LIMIT clause not been specified.
<SOME SELECT STATEMENT>
limit M offset N
(where both M and N are integers)
starts the output from the N+1’th row of what would normally have been displayed without the limit clause and then displays the next M rows of data.
For example:
# Display the first 3 people
sqldf("SELECT *
FROM sales
ORDER BY last, first
LIMIT 3")
first last title region baseInThousands salesInThousands yearsWithCompany
1 Barb Aames Salesperson usa 21 255 7
2 Jack Aames Salesperson usa 43 105 4
3 Sue Aames Sr. Salesperson africa 35 600 10
# Display the next 3 people
sqldf("SELECT *
FROM sales
ORDER BY last, first
LIMIT 3 OFFSET 3")
first last title region baseInThousands salesInThousands yearsWithCompany
1 Hugh Black Sr. Salesperson africa 40 261 9
2 Barb Brown Sr. Salesperson asia 35 79 4
3 Jim Brown Salesperson europe 50 167 2
# Display the next 3 people
sqldf("SELECT *
FROM sales
ORDER BY last, first
LIMIT 3 OFFSET 6")
first last title region baseInThousands salesInThousands yearsWithCompany
1 Kate Carey Sr. Salesperson europe 20 69 2
2 Zeke Cohen Manager usa 30 250 10
3 Larry Green Sr. Salesperson europe 20 113 4
Note that the LIMIT clause is not standard SQL - some SQL flavors do not contain a LIMIT clause. However, the LIMIT clause or something similar is part of most SQL flavors.
29.20 DO NOT select both aggregate and non-aggregate data
SUMMARY OF THIS SECTION: If a SELECT clause includes aggregate functions then the SELECT clause should NOT include non-aggregated columns, unless those columns are part of the GROUP BY clause. (The rest of this section explains this rule.) IMPORTANT - if you don’t follow this rule, then you will NOT get an error but your output will be misleading.
Before reading the rest of this section I highly recommend that you review the sections above on aggreate functions and the GROUP BY clause
29.20.1 Example of INCORRECT select that includes both aggregate and non-aggregated columns
Examine the query shown below. The SELECT incorrectly uses the aggregate functions avg and max mixed together with non-aggregate data, i.e. region (Note that the region column is NOT part of the GROUP BY). The results are misleading.
Even though the underlying data in the sales table contains Managers for all of the regions (africa, asia, europe, usa), the results below seem to imply that the data in the “Manager” row is only for Managers for the “usa” region. However, that is NOT the case. The maxBase and avgYears columns show the maxBase and avgYears for ALL managers (not just for those from the usa region). The fact that the output shows “usa” and not “africa”, “asia” or “europe” is by chance. There is no guarantee that one of the other regions would be shown in similar queries.
# NOTE: the -- in the command below introduces a comment in SQLite sqldf("select title, region, -- DO NOT DO THIS - region is NOT in the GROUP BY avg(baseInThousands) as avgBase, max(baseInThousands) as maxBase, avg(yearsWithCompany) as avgYears from sales group by title order by title")
title region avgBase maxBase avgYears 1 Manager usa 29.0 40 6.166667 2 Salesperson europe 33.3 50 4.900000 3 Sr. Salesperson usa 33.5 50 4.750000
The following has the same problem but for a query that doesn’t contain a GROUP BY. The results shown below seem to imply that the avgBase, maxBase and avgYears columns only take into account rows from the “usa” region. However, that is NOT the case. the avgBase, maxBase and avgYears take into account rows from ALL regions.
# NOTE: the -- in the command below introduces a comment in SQLite sqldf("select region, -- DO NOT DO THIS - region is NOT in the GROUP BY avg(baseInThousands) as avgBase, max(baseInThousands) as maxBase, avg(yearsWithCompany) as avgYears from sales ")
region avgBase maxBase avgYears 1 usa 32.29167 50 5.166667
29.20.2 Explanation
If a SELECT clause contains an aggregate function then one of the following two things happen:
If there is no GROUP BY, then only one row is returned.
click for exampleThe following SELECT clause includes aggregate functions and there is NO group by. Therefore only one row is returned.
sqldf("select avg(baseInThousands) as avgBase, max(baseInThousands) as maxBase, avg(yearsWithCompany) as avgYears from sales")
avgBase maxBase avgYears 1 32.29167 50 5.166667
If there IS a GROUP BY then one row is returned per group of rows in the table. (see the GROUP BY clause above)
click for exampleThe following statement is very similar to the statement above. However, this statement includes a GROUP BY. Therefore a single row is returned PER GROUP. Note that the SELECT clause also includes the title column, which is also in the GROUP BY clause.
sqldf("select title, avg(baseInThousands) as avgBase, max(baseInThousands) as maxBase, avg(yearsWithCompany) as avgYears from sales group by title")
title avgBase maxBase avgYears 1 Manager 29.0 40 6.166667 2 Salesperson 33.3 50 4.900000 3 Sr. Salesperson 33.5 50 4.750000
This makes sense since an aggregate function (e.g. avg, sum, min, max, count) takes multiple values from different rows in the table and returns only a single value. A select statement that doesn’t have a GROUP BY passes the values from all rows in the table to the aggregate function. By contraste, a select with a GROUP BY calls the aggregate functions once for each group of rows.
A SELECT clause that includes aggregated columns as well as
non-aggregated columns is problematic.0 The aggregate function(s) return a single value so only a single row should be diplayed. However, the non-aggregated values could have multiple values and hence multiple rows. This causes a conflict (a) should only one row be returned (because of the aggregate functions) or (b) should more than one row be returned (because of the non-aggregated data)? The rules of SQL say that ONLY ONE ROW is returned. However, ONE OF THE SEVERAL non-aggregated values will also be returned. This could lead to misleading output.
29.20.3 ANSI SQL standard vs SQLite vs other databases
According to the ANSI SQL standard, queries such as those listed above that include both aggregate and non-aggregate columns should cause an ERROR. However, in actuality, different RDBMS software treat this issue differently. By default, when using sqldf() within R, the SQLite version of the SQL is used. SQLite does NOT return an error when issuing such queries. However, as shown above, the results can be very misleading. SQLite’s approach is that the person writing the query is expected to be responsible enough to not break this rule. Other RDBMS software may issue ERRORs or warnings or take a similar approach to SQLite (see the documentation for the specific RDBMS software).
Why doesn’t SQLite check for this error?
There are situations where this rule can be too limiting. For example, the following grades table lists different grades that students got in a class. Each row lists the student’s ID and the student’s social security number (ssn), both of which could theoretically be a primary key for the table (i.e. they are “candiate primary keys”).
sqldf("select * from grades")
studentId ssn assessment grade 1 001 111-11-1111 test1 70 2 001 111-11-1111 test2 80 3 002 222-22-2222 test1 90 4 002 222-22-2222 test2 100
The following query breaks the rule described above (ssn is in the SELECT but not in the GROUP BY). Nevertheless, the query gives accurate results. This is because the studentId and the ssn are guaranteed to each contain the same unique values for a specific student.
sqldf("select studentId, ssn, avg(grade) from grades group by studentId order by studentId")
studentId ssn avg(grade) 1 001 111-11-1111 75 2 002 222-22-2222 95
Similarly, this also breaks the rule (i.e. studentId is in the SELECT but not in the GROUP BY) but the results are accurate:
sqldf("select studentId, ssn, avg(grade) from grades group by ssn order by studentId")
studentId ssn avg(grade) 1 001 111-11-1111 75 2 002 222-22-2222 95
Bottom line is that SQLite requires the person who is writing the SQL to do it correctly!