library(DBI)
library(RSQLite)
library(sqldf)
37 37a. Working with SQLite databases in R
NOTE: This section contains a brief overview for how to work specifically with SQLite databases in R. This section gives just enough information to get working. More elaborate coverage for how to work with any type of RDBMS (relations database management system software) in R appears in the next section.
37.1 load the DBI , RSQLite and sqldf packages
In order to work with a SQLite database from R we will make use of the following packages.
DBI (i.e. “DataBase Interface) - this is a package that contains functions for interacting with many different relational database management system (RDBMS) software products. This same package would be used for communicating with all of the following SQL products:
- MySql
- Postgres
- SQLite
- SQL Server
- many many others
This package includes the following “generic” functions
- dbConnect()
- dbDisconnect()
- dbExecute()
- many other other functions
These “generic functions” provide some basic functionality that for databases. However, the specifics of each RDBMS system are not coded here. Rather each RDBMS (e.g. MySql, Postgres, SQLite, etc) would have a different package written just for that RDBMS that understands the specifics of how that particular package works.
RSQLite - this package provides functionality that is needed to work with the SQLite product. It provides functions that understand the specifics for how to work with SQLite databases (as opposed to MySql, Postgres or other competing offerings). You must use this in conjunction with the DBI package. In general, you call the functions from the DBI package which in turn call the functions in the RSQLite package.
sqldf - this is the same package we’ve used before for accessing R dataframes with SELECT statements. We will use the sqldf function now to work with data that is contained inside an actual SQLite database.
Details for how to use these packages are covered below
37.2 Housekeeping … start from a clean slate
The following code simply removes any database that we had from before. We just want to start from a clean slate for this presentation:
library(fs)
file_delete("mydb.sqlite") # remove the mydb.sqlite file
37.3 Connecting to a database
In order to access the data in a SQLite database (or any other RDBMS database) you must create a “connection” object. You do that by calling dbConnect()
37.3.1 RSQLite::SQLite() creates a “driver” for sqlite, dbConnect() creates a connection
Remember that the DBI package understands the generic concept of databases but the RSQLite package knows about the specifics related to SQLite. Therefore we call the RSQLite::SQLite() function to create a “driver” for the SQLite() database.
A “driver” is a piece of software that knows the specifics of how to use a specific pieced of software or hardware. For example when you buy a printer you may need to install “driver” software for your printer in order for your computer to know how to use your specific printer’s features. Very often these driver software products are already installed for major printer manufacturers on your computer. However, you many need to install an updated driver to get your computer to know how to use the new features on your brand new printer.
We pass the “driver” to the dbConnect function. dbConnect() then returns a connection object, that will need to be passed to any other function that we use to interface with out database.
# Create a connection to the database.
# If the database doesn't exist then this will also create the database.
= dbConnect(RSQLite::SQLite(), "mydb.sqlite") con
37.3.2 Call dbDisconnect(con) when you are done.
When you are finished using the database you should make sure to call dbDisconnect(con), otherwise the mydb.sqlite file will be considered to be “busy” and you will not be able to remove it or do other similar things with it.
37.4 Running SQL commands in the database with dbExecute and sqldf
We will use the dbExecute to run any type of SQL command (other than SELECT) in the database.
We will continue to use the sqldf function specifically for running SELECT statements in the database.
37.4.1 dbExecute
The dbExecute command will take a SQL command and execute it in the database. (NOTE that for SELECT commands we will continue to use sqldf instead of dbExecute).
The dbExecute function takes the connection object as its first argument and the actual SQL command as its 2nd argument.
The return value of dbExecute is the number of rows that were affected by the command. This is mainly an issue for INSERT, UPDATE, DELETE and other commands that modify the contents of a table. In the following example, we are calling dbExecute to create a table. Therefore the function returns 0 since zero rows are affected by the creation of a table.
# To run any SQL command other than SELECT you can use
# the dbExecute function from the DBI package.
#
# You need to supply the connection that you made as
# the first argument ot the dbExecute function.
dbExecute(conn = con,
"create table authors (
au_id text,
au_fname text,
au_lname text)"
)
[1] 0
37.4.2 Use sqldf( SOME_SELECT_STATEMENT, connection = con ) for SELECT statements
As always, you can use the sqldf function to run SELECT statements. To run the select against the SQLite database, you must provide the connection variable that you created above to the connection argument of sqldf.
Since we haven’t inserted any rows into this table yet, the SELECT should return a dataframe that contains zero rows.
sqldf("select * from authors", connection = con)
[1] au_id au_fname au_lname
<0 rows> (or 0-length row.names)
IMPORTANT - “connection=con” (not just “con”)
In the call to sqldf above, notice that we specified the name of the connection argument (i.e. connection=con
not just con
). If you leave out connection=
, you will likely get an error as shown here.
sqldf("select * from authors", con) # ERROR - doesn't include connection=con
Error: no such table: authors
The reason for this error is that the connection argument is NOT the second argument to sqldf. The full list of arguments for sqldf is listed in the “Usage” section at the top of the help page for sqldf (type ?sqldf).
Usage
sqldf(x, stringsAsFactors = FALSE, row.names = FALSE, envir = parent.frame(), method = getOption("sqldf.method"), file.format = list(), dbname, drv = getOption("sqldf.driver"), user, password = "", host = "localhost", port, dll = getOption("sqldf.dll"), connection = getOption("sqldf.connection"), verbose = isTRUE(getOption("sqldf.verbose")))
In the code above that generated the error, sqldf("select * from authors", con)
, the con object is actually being passed to the stringsAsFactors
argument (which is in fact the 2nd argument to sqldf) and not to the connection
argument. The connection
argument appears much later in sqldf’s argument list and isn’t being passed anything. This is why you need to specify connection=con
and not just con
.
Notice that the dbExecute function does not have this issue. ?dbExecute shows that the name of the argument is conn (not connection) and that conn is the first argument.
Usage
dbExecute(conn, statement, ...)
The reason for the inconsistency is probably because the DBI and sqldf packages were developed at different times by different people with different design philosophies. Unfortunately, this kind of inconsistency is relatively common in the R ecosystem because it’s grown organically with contributions from many different people.
37.5 Inserting data into the database
37.5.1 Inserting rows into a table
We can use the insert SQL statement to insert a row. Remember that the dbExecute function returns the number of rows affected by the command. Therefore, if everything works correctly, this call to dbExecute should return 1.
dbExecute(con, "insert into authors values('a100','anne','andrews')")
[1] 1
We now can see that row in the database.
sqldf("select * from authors", connection = con)
au_id au_fname au_lname
1 a100 anne andrews
37.5.2 Dynamically creating an INSERT statement from data.
# Suppose you wanted to insert the following data
# into a row of the authors table.
= 'a200'
authorId = 'bill'
authorFname = 'bell'
authorLname
# We can use the code below to build up the following SQL insert statement:
#
# insert into authors values( 'a200' , 'bill' , 'jones' )
= paste0(
insertStatement "insert into authors ",
"values(",
"'", authorId, "'", "," ,
"'", authorFname, "'", "," ,
"'", authorLname, "'",
")"
)
# This is the insert statement that we created.
insertStatement
[1] "insert into authors values('a200','bill','bell')"
# We can now use dbExecute to enter the data into the database
dbExecute(conn = con, insertStatement)
[1] 1
# We can query the database to make sure that the data is there.
sqldf("select * from authors order by au_id", connection = con)
au_id au_fname au_lname
1 a100 anne andrews
2 a200 bill bell
37.5.3 Using parameters with dbExecute()
# Another way (arguably a better way) to do this is to use the "params"
# argument to dbExecute. To use params, you should type "?" characters
# in the SQL command as placeholders for data to be inserted into the
# SQL command. The values that you pass to the params argument
# will replace the "?" characters. The values that you pass to params
# should be in an R list.
#
# In the following example, we insert a new author row using this technique.
# Let's pretend that these variables came from an API or any other source of info
# (e.g. a file or user input).
= 'a300'
authorId = 'carie'
authorFname = 'cox'
authorLname
# Notice the ? marks and the params argument that contains the actual
# values to replace the ? marks.
dbExecute(con, "insert into authors values (?,?,?)",
params = list( authorId, authorFname, authorLname ))
[1] 1
# We can see that we got the new author added.
sqldf("select * from authors order by au_id", connection = con)
au_id au_fname au_lname
1 a100 anne andrews
2 a200 bill bell
3 a300 carie cox
37.6 Using an API and dynamically inserting the returned data into a database.
# Suppose that you retrieved some data from an API
# and that you got the data from the API
# into an R dataframe by using jsonlite::fromJSON( ... )
#
# We will pretend that we did so and got back the following
# dataframe with these authors.
= data.frame(authorFirstName = c("dianne","ed", "fran"),
df authorLastName = c("davis", "ellis", "fink"))
df
authorFirstName authorLastName
1 dianne davis
2 ed ellis
3 fran fink
# We can use a loop to get the authors names from each row of the
# the dataframe into a separate insert statement and then
# insert that author's row.
# If the API didn't return any IDs for these authors,
# we can create our own IDs inside the loop using R code.
for (rownum in 1:nrow(df)){
= df[rownum, "authorFirstName" ]
authorFirstName = df[rownum, "authorLastName" ]
authorLastName
# Get the largest ID so far and add one to it
= sqldf("select max(au_id) from authors", connection = con)
largestId = gsub("[0-9]*$", "", largestId) # Remove the numbers on the right
leadingLetters = gsub("^[^0-9]*", "", largestId) # Remove the non-numeric characters on the left
numberPortion = paste0(leadingLetters, as.numeric(numberPortion) + 100)
newId
# Notice the ? marks and the params argument that contains the actual
# values to replace the ? marks.
dbExecute(con, "insert into authors values (?,?,?)",
params = list( newId, authorFirstName , authorLastName ))
}
# Show that all the rows were added to the database.
sqldf("select * from authors order by au_id", connection = con)
au_id au_fname au_lname
1 a100 anne andrews
2 a200 bill bell
3 a300 carie cox
4 a400 dianne davis
5 a500 ed ellis
6 a600 fran fink
37.7 Using AUTOINCREMENT to create IDs
The AUTOINCREMENT option in the CREATE TABLE statement will automatically insert a new unique integer in a new row. This is often very useful for primary key columns.
The AUTOINCREMENT columns should be INTEGER columns.
Whenever a new row is inserted if the value for the autoincrement column(s) are not specified then the value of that column will be one more than the max value currently in that column.
To demonstrate this, we will create a new table named titles that contains an INTEGER column named title_id with the “autoincrement” option.
dbExecute(conn = con,
"create table titles (
title_id integer primary key autoincrement,
title_name text,
pages integer
)"
)
[1] 0
If we insert a row, but don’t supply the title_id, a title_id will automatically be generated. To make this work you must use the form of the SQL INSERT statement that specified the names of the columns you are inserting into, for example:
insert into titles (title_name, pages) values(‘War & Peace’, 1000)“)
dbExecute(conn=con,
"insert into titles (title_name, pages) values('War & Peace', 1000)")
[1] 1
sqldf("select * from titles", connection = con)
title_id title_name pages
1 1 War & Peace 1000
dbExecute(conn=con,
"insert into titles (title_name, pages) values('Harry Potter 1', 800)")
[1] 1
dbExecute(conn=con,
"insert into titles (title_name, pages) values('Harry Potter 2', 950)")
[1] 1
# You can see that the title_id column values were automatically generated.
sqldf("select * from titles order by title_id", connection = con)
title_id title_name pages
1 1 War & Peace 1000
2 2 Harry Potter 1 800
3 3 Harry Potter 2 950
If you want to specify an explicit title_id, you still can do so. Any subsequent inserts that don’t specify the title_id will continue with the number after the currently largest title_id value.
# Specify a specific title_id value
dbExecute(conn=con,
"insert into titles (title_id, title_name, pages) values(999, 'Harry Potter 3', 950)")
[1] 1
sqldf("select * from titles order by title_id", connection = con)
title_id title_name pages
1 1 War & Peace 1000
2 2 Harry Potter 1 800
3 3 Harry Potter 2 950
4 999 Harry Potter 3 950
# New rows that use an AUTOINCREMENT'ed title_id will continue from there
dbExecute(conn=con,
"insert into titles (title_name, pages) values('Harry Potter 4', 750)")
[1] 1
# See the new rows
sqldf("select * from titles order by title_id", connection = con)
title_id title_name pages
1 1 War & Peace 1000
2 2 Harry Potter 1 800
3 3 Harry Potter 2 950
4 999 Harry Potter 3 950
5 1000 Harry Potter 4 750
37.8 Remember to call dbDisconnect(con) when you’re done.
When you are finished using the database you should make sure to call dbDisconnect(con), otherwise the mydb.sqlite file will be considered to be “busy” and you will not be able to remove it or do other similar things with it.
dbDisconnect(con)