34  34. Common Table Expressions (CTE)

34.1 SQL Common Table Expressions (CTE)

You can also see the following to learn about this important feature of modern SQL:

https://learnsql.com/blog/what-is-common-table-expression/


34.2 CTE with one or more SELECTs

“Common Table Expressions (CTE)” are designed to make complex queries easier to read and think about. Instead of embedding subqueries in the middle of an outer query CTEs allow you to several SELECT statements in an organized way. These SELECT statements are named and can be referred to as if they were tables inside of the main query. See below for exmaples.

(Note - these examples work but were generated by chatgpt. I should modify these to make them more interesting, but the current code works to get across the main ideas of CTEs)

The following is the database that we will be using to demo the CTEs below.

Click here to hide/show this section

34.3 The “Books Database”

The “books database” is a collection of several tables that contains data relating to books (i.e. “titles”), authors, publishers, etc.

You can download the data for the database in the following ways:

  • as a set of several CSV files or (see below)

  • as a single sqlite3 database file (see below)

See below for details.

34.3.1 SQLite file for “Books Database”

Click here to download all the data for the books datbase as a single “SQLite” database file.

To use this file you can do one of the following:

  1. Install SQLite software on your computer. There are many such programs avaiable. I recommend “DB Viewer for SQLite” which is a free open source program.

    Website (download from here): https://sqlitebrowser.org/

    Github repository: https://github.com/sqlitebrowser/sqlitebrowser

    You can search online for other similar programs for working with sqlite3 files.

  2. Free web based tool: https://y-rosenthal.github.io/yrSqliteViewer/yrSqliteViewer.html

    This is a website that lets you upload a sqlite3 file and run SQL Select statements against the data. I created this website myself (with the help of some AI coding assistants).

    PROs: you don’t need to install anything

    CONs: currently only works with SQL SELECT statment but not any other types of sql statements.

  3. See this chapter (in this book) https://y-rosenthal.github.io/DataManagementUsingR/sql0950-workingWithDatabasesInR-v010.html for instructions on how to access this database file directly via R commands.

34.3.2 CSV files for “Books Database”

Download the CSV files for the “books database”. These CSV files together comprise the data for the “books database”.

Download a tutorial on using Relational databases.

This word document contains:

  1. a complete description of the “books database”
  2. a tutorial on how “relational databases” are structured
  3. a tutorial on Entity Relationship Diagrams (ERDs)
  4. a set of questions and answers

Once you’ve downloaded the various CSV files you can run the following commands to import the data into R.

34.3.3 Import the data (books database)

if(!require(readr)){install.packages("readr");require(readr);}
Loading required package: readr
# Read in the data for the books database - see the 
titles = read_csv("titles.csv", na="NULL", show_col_types=FALSE)
authors = read_csv("authors.csv", na="NULL", show_col_types=FALSE)
publishers = read_csv("publishers.csv", na="NULL", show_col_types=FALSE)
title_authors = read_csv("title_authors.csv", na="NULL", show_col_types=FALSE)
royalties = read_csv("royalties.csv", na="NULL", show_col_types=FALSE)

34.3.4 Books Database - Entity Relationship Diagram (ERD)

The following is an “Entity Relationship Diagram (ERD)” that describes the relationships between the tables. See the Word document linked above for more info:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#ffffff', 'primaryTextColor': '#0000ff', 'primaryBorderColor': '#000000', 'lineColor': '#000000', 'secondaryColor': '#ffffff', 'tertiaryColor': '#ffffff', 'mainBkg': '#ffffff', 'background': '#ffffff', 'edgeLabelBackground': '#ffffff' }}}%%
erDiagram
    authors ||--o{ title_authors : ""
    title_authors }|--|| titles : ""
    titles ||--o| royalties : ""
    publishers |o--o{ titles : ""
    
    authors {
        string au_id PK
        string au_fname
        string au_lname
        string phone
        string address
        string city
        string state
        string zip
    }
    
    title_authors {
        string title_id FK1
        string au_id FK2
        int au_order
        float royalty_share
    }
    
    titles {
        string title_id PK
        string title_name
        string type
        string pub_id FK
        int pages
        float price
        int sales
        date pubdate
    }
    
    publishers {
        string pub_id PK
        string pub_name
        string city
        string state
        string country
    }
    
    royalties {
        string title_id PK
        float advance
        float royalty_rate
    }

Below are the contents of these tables:

sqldf("select * from titles")
   title_id                          title_name       type pub_id pages price
1       T01                               1977!    history    P01   107 21.99
2       T02           200 Years of German Humor    history    P03    14 19.95
3       T03        Ask Yor System Administrator   computer    P02  1226 39.95
4       T04           But I Did It Unconciously psychology    P01   510 12.99
5       T05              Exchange of Platitudes psychology    P01   201  6.95
6       T06                    How About Never?  biography    P01   473 19.95
7       T07                   I Blame My Mother  biography    P03   333 23.95
8       T08        Just Wait Until After School   children    P01    86 10.00
9       T09                     Kiss My Boo Boo   children    P01    22 13.95
10      T10         Not Without My Fabrerge Egg  biography    P05    NA    NA
11      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99
12      T12           Spontaneous, Not Annoying  biography    P05   507 12.99
13      T13 What Are The Civilian Applications?    history    P03   802 29.99
     sales         pubdate
1      566   8/1/2000 0:00
2     9566   4/1/1998 0:00
3    25667   9/1/2000 0:00
4    13001  5/31/1999 0:00
5   201440   1/1/2001 0:00
6    11320  7/31/2000 0:00
7  1500200  10/1/1999 0:00
8     4095   6/1/2001 0:00
9     5000  5/31/2002 0:00
10      NA            <NA>
11   94123 11/30/2000 0:00
12  100001  8/31/2000 0:00
13   10467  5/31/1999 0:00
sqldf("select * from publishers")
  pub_id             pub_name          city state country
1    P01    Abatis Publishers      New York    NY     USA
2    P02      Core Dump Books San Francisco    CA     USA
3    P03 Schandenfreude Press       Hamburg  <NA> Germany
4    P04    Tneterhooks Press      Berkeley    CA     USA
5    P05       AAA Publishing      Berkeley    CA     USA
sqldf("select * from title_authors")
   title_id au_id au_order royalty_shares
1       T01   A01        1            1.0
2       T02   A01        1            1.0
3       T03   A05        1            1.0
4       T04   A03        1            0.6
5       T04   A04        2            0.4
6       T05   A04        1            1.0
7       T06   A02        1            1.0
8       T07   A02        1            0.5
9       T07   A04        2            0.5
10      T08   A06        1            1.0
11      T09   A06        1            1.0
12      T10   A02        1            1.0
13      T11   A03        2            0.3
14      T11   A04        3            0.3
15      T11   A06        1            0.4
16      T12   A02        1            1.0
17      T13   A01        1            1.0
sqldf("select * from authors")
  au_id  au_fname    au_lname        phone              address          city
1   A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx
2   A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder
3   A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco
4   A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco
5   A05 Christian       Kells 212-771-4680       114 Horatio St      New York
6   A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto
7   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota
  state   zip
1    NY 10468
2    CO 80303
3    CA 94123
4    CA 94123
5    NY 10014
6    CA 94305
7    FL 34236
sqldf("select * from royalties")
   title_id advance royalty_rate
1       T01   10000         0.05
2       T02    1000         0.06
3       T03   15000         0.07
4       T04   20000         0.08
5       T05  100000         0.09
6       T06   20000         0.08
7       T07 1000000         0.11
8       T08       0         0.04
9       T09       0         0.05
10      T10      NA           NA
11      T11  100000         0.07
12      T12   50000         0.09
13      T13   20000         0.06

Goal: Show each author’s full name, how many books they’ve written, and a list with one row per book and a total count included.

sqldf("
WITH author_titles_cte AS (
  SELECT 
    a.au_id,
    a.au_fname || ' ' || a.au_lname AS author,
    t.title_name
  FROM authors a
  JOIN title_authors ta ON a.au_id = ta.au_id
  JOIN titles t ON ta.title_id = t.title_id
),
author_book_counts AS (
  SELECT au_id, COUNT(*) AS book_count
  FROM author_titles_cte
  GROUP BY au_id
)
SELECT 
  at.author,
  at.title_name,
  abc.book_count
FROM author_titles_cte at
JOIN author_book_counts abc ON at.au_id = abc.au_id
ORDER BY abc.book_count DESC, at.author, at.title_name
")
            author                          title_name book_count
1        Klee Hull           But I Did It Unconciously          4
2        Klee Hull              Exchange of Platitudes          4
3        Klee Hull                   I Blame My Mother          4
4        Klee Hull    Perhaps It's a Glandular Problem          4
5  Wendy Heydemark                    How About Never?          4
6  Wendy Heydemark                   I Blame My Mother          4
7  Wendy Heydemark         Not Without My Fabrerge Egg          4
8  Wendy Heydemark           Spontaneous, Not Annoying          4
9   Harvey Kellsey        Just Wait Until After School          3
10  Harvey Kellsey                     Kiss My Boo Boo          3
11  Harvey Kellsey    Perhaps It's a Glandular Problem          3
12   Sarah Buchman                               1977!          3
13   Sarah Buchman           200 Years of German Humor          3
14   Sarah Buchman What Are The Civilian Applications?          3
15     Hallie Hull           But I Did It Unconciously          2
16     Hallie Hull    Perhaps It's a Glandular Problem          2
17 Christian Kells        Ask Yor System Administrator          1

Goal: Find authors who have written books with more than 10,000 sales and show the average sales per author.

sqldf("
WITH high_selling_titles AS (
  SELECT title_id, sales
  FROM titles
  WHERE sales > 10000
),
authors_of_hits AS (
  SELECT a.au_id, a.au_fname || ' ' || a.au_lname AS author, t.sales
  FROM high_selling_titles h
  JOIN title_authors ta ON h.title_id = ta.title_id
  JOIN authors a ON ta.au_id = a.au_id
  JOIN titles t ON h.title_id = t.title_id
)
SELECT author, AVG(sales) AS avg_sales
FROM authors_of_hits
GROUP BY author
ORDER BY avg_sales DESC
")
           author avg_sales
1 Wendy Heydemark  537173.7
2       Klee Hull  452191.0
3  Harvey Kellsey   94123.0
4     Hallie Hull   53562.0
5 Christian Kells   25667.0
6   Sarah Buchman   10467.0

34.4 CTE Referring to an Earlier CTE

Goal: Find top-earning books and list their authors. Top-earning = (sales × price) > 500,000.

sqldf("
WITH book_earnings AS (
  SELECT title_id, title_name, sales * price AS revenue
  FROM titles
),
top_books AS (
  SELECT title_id, title_name, revenue
  FROM book_earnings
  WHERE revenue > 500000
),
top_books_authors AS (
  SELECT tb.title_name, tb.revenue, a.au_fname || ' ' || a.au_lname AS author
  FROM top_books tb
  JOIN title_authors ta ON tb.title_id = ta.title_id
  JOIN authors a ON ta.au_id = a.au_id
)
SELECT * FROM top_books_authors
ORDER BY revenue DESC
")
                        title_name    revenue          author
1                I Blame My Mother 35929790.0 Wendy Heydemark
2                I Blame My Mother 35929790.0       Klee Hull
3           Exchange of Platitudes  1400008.0       Klee Hull
4        Spontaneous, Not Annoying  1299013.0 Wendy Heydemark
5     Ask Yor System Administrator  1025396.7 Christian Kells
6 Perhaps It's a Glandular Problem   752042.8     Hallie Hull
7 Perhaps It's a Glandular Problem   752042.8       Klee Hull
8 Perhaps It's a Glandular Problem   752042.8  Harvey Kellsey

34.5 Recursive CTEs

This is a more advanced topic. See here for info on how to use. https://learnsql.com/blog/sql-recursive-cte/

The following is an example from the books database:

Goal: Given an author, find all other authors who co-authored any book with them, directly or through a chain of co-authors.

Let’s assume we start from author A03.

sqldf("
  WITH RECURSIVE coauthors_cte(au_id, coauthor_id, level) AS (
    -- Anchor: Get direct co-authors of A03
    SELECT ta1.au_id, ta2.au_id, 1
    FROM title_authors ta1
    JOIN title_authors ta2 ON ta1.title_id = ta2.title_id
    WHERE ta1.au_id = 'A03' AND ta2.au_id != 'A03'
  
    UNION
  
    -- Recursive: Get co-authors of the previous level
    SELECT c.coauthor_id, ta2.au_id, level + 1
    FROM coauthors_cte c
    JOIN title_authors ta1 ON c.coauthor_id = ta1.au_id
    JOIN title_authors ta2 ON ta1.title_id = ta2.title_id
    WHERE ta2.au_id != c.coauthor_id
  )
  SELECT DISTINCT coauthor_id
  FROM coauthors_cte
  ORDER BY coauthor_id;
")