39  38. Database Normalization

The same data can be arranged in a database in many different ways. This section describes how to intelligently arrange your data in a relational database.

Database normalization is the process of designing database tables so that

39.1 Normal Forms

Database normalization is accomplished in several steps. There are different “levels” of normalization, each level progressing towards a more perfect organization of the data. Each level is called a “Normal Form”. We say that a database is “in” a particular normal form when the tables are structured in a way that conforms to the rules of that normal form. This will all become much more clear as you progress through the material below. In a fully normalized database the only data that should be repeated are the primary/foreign keys.

The following are the various normal forms. You are not expected to understand the technical terms in the list below. These will be covered in the material that follows this section.

  • 1NF (1st Normal Form) - only atomic values in columns, no repeating groups or arrays
  • 2NF (2nd Normal Form) - no “partial functional dependencies (pFD)”
  • 3NF (3rd Normal Form) - no “transitive functional dependencies (tFD)”

Most database designers aim to have their databases at least in 3rd Normal Form. However, the following additional normal forms are used to further “normalize” a database to deal with special cases.

In this material we will only be covering 1st, 2nd and 3rd normal forms. The list below is provided in case you want to do your own research.

  • Bocye-Codd Normal Form (BCNF) - every determinant (attribute that determines other attributes) must be a candidate key
  • 4NF - no multi-valued dependencies
  • 5NF - tables cannot be losslessly decomposed into smaller tables
  • 6NF - tables must be irreducible, containing no non-trivial join dependencies

39.2 First Normal Form (1NF)

First normal form requires that

  • There be only one value in each “cell” of a table.
    (AKA “all attributes contain only atomic values”)

  • There should NOT be multiple columns for the same type of data
    (AKA There are no “repeating groups”.)

The following tables are not in 1st normal form (1NF).

  • This table contains multiple values in a single cell which violates the rules of 1NF

    orders
    orderNum products
    ord100 3 apples, 2 plums
    ord200 5 apples, 3 pears, 10 plums
  • This table contains multiple columns for similar types of data. This also violates the rules of 1NF.

    orders
    orderNum product1 quantity1 product2 quantity2 product1.1 quantity3
    ord100 apples 3 plums 2 NULL NULL
    ord200 apples 5 pears 3 plums 10

39.2.1 Converting the data to 1NF

The following table contains the same data in first normal form:

::: {.cell} ::: {.cell-output-display}

orders
orderNum product quantity
ord100 apples 3
ord100 plums 2
ord200 apples 5
ord200 pears 3
ord200 plums 10

::: :::

39.3 Example - college class offerings

The data shown below about classes, sections and professors can be arranged in many different formats.

39.3.1 All data in a single table

For example, the following data could be in a single table. This makes it very easy for people to read the info.

classOfferings
className classNum credits section room professor office email tel
Intro to IDS IDS1020 3 231 F201 joe jones B102 joe@abc.edu (212)555-5555
Intro to IDS IDS1020 3 241 F201 joe jones B102 joe@abc.edu (212)555-5555
Intro to IDS IDS1020 3 331 F310 sue smith B231 sue@abc.edu (212)999-9999
Intro to IDS IDS1020 3 341 F310 sue smith B231 sue@abc.edu (212)999-9999
Database Design IDS2040 4 133 F250 joe jones B102 joe@abc.edu (212)555-5555

39.3.2 “Normalize” tables to reduce duplication of data

If you look closely at the data in the example above, you’ll notice that the details for each professor (i.e. name, office, email, tel) is duplicated in multiple rows. This makes it very easy for people to read the data. However, this is NOT a great way to store information in a relational database (keep reading).

Having the same data in more than one place in a database can lead to inconsistencies in the data. For example if a database records a person’s name in more than one place and the person changes their name, then all copies of the person’s name would need to be updated. If due to some miscalculation, some copies of the person’s name were updated and some weren’t updated, the database would be inconsistent. One of the major goals of a good database design is to eliminate redundant data. All data, other than primary keys and foreign keys, should be recorded in one and only one place in the database.

39.3.3 split professor data into a different table

The data shown above can be reorganized into the following structure to avoid duplicating each professor’s details.

offerings
className classNum credits section room prof_id
Intro to IDS IDS1020 3 231 F201 P100
Intro to IDS IDS1020 3 241 F201 P100
Intro to IDS IDS1020 3 331 F310 P200
Intro to IDS IDS1020 3 341 F310 P200
Database Design IDS2040 4 133 F250 P100
professors
prof_id first last office email tel
P100 joe jones B102 joe@abc.edu (212)555-5555
P200 sue smith B231 sue@abc.edu (212)999-9999

SQL SELECT to regenerate original “all in one” table

We are able to see the data as it was originally presented by using the following SQL query:

queryResults = 
  sqldf("
    select className, classNum, credits, section, room, first || ' ' || last as professor, office, email, tel
    from offerings join professors on offerings.prof_id = professors.prof_id
    order by classNum, section
  ")

kable(queryResults, caption = "queryResults")
queryResults
className classNum credits section room professor office email tel
Intro to IDS IDS1020 3 231 F201 joe jones B102 joe@abc.edu (212)555-5555
Intro to IDS IDS1020 3 241 F201 joe jones B102 joe@abc.edu (212)555-5555
Intro to IDS IDS1020 3 331 F310 sue smith B231 sue@abc.edu (212)999-9999
Intro to IDS IDS1020 3 341 F310 sue smith B231 sue@abc.edu (212)999-9999
Database Design IDS2040 4 133 F250 joe jones B102 joe@abc.edu (212)555-5555

39.3.4 Split common class info into a different table

The same data can be still further reorganized into the following structure to avoid duplicating the details for each class:

classes
classNum className credits
IDS1020 Intro to IDS 3
IDS2040 Database Design 4
offerings
crn classNum section room prof_id
12100 IDS1020 231 F201 P100
12100 IDS1020 241 F201 P100
12100 IDS1020 331 F310 P200
12100 IDS1020 341 F310 P200
14301 IDS2040 133 F250 P100
professors
prof_id first last office email tel
P100 joe jones B102 joe@abc.edu (212)555-5555
P200 sue smith B231 sue@abc.edu (212)999-9999

SQL SELECT to regenerate original “all in one” table

We are able to see the data as it was originally presented by using the following SQL query:

queryResults = 
  sqldf("
    select className, classes.classNum, credits, section, room, 
           first || ' ' || last as professor, office, email, tel
    from classes join offerings on classes.classNum = offerings.classNum 
                 join professors on offerings.prof_id = professors.prof_id
    order by classes.classNum, section
  ")

kable(queryResults, caption = "queryResults", booktabs=TRUE)
queryResults
className classNum credits section room professor office email tel
Intro to IDS IDS1020 3 231 F201 joe jones B102 joe@abc.edu (212)555-5555
Intro to IDS IDS1020 3 241 F201 joe jones B102 joe@abc.edu (212)555-5555
Intro to IDS IDS1020 3 331 F310 sue smith B231 sue@abc.edu (212)999-9999
Intro to IDS IDS1020 3 341 F310 sue smith B231 sue@abc.edu (212)999-9999
Database Design IDS2040 4 133 F250 joe jones B102 joe@abc.edu (212)555-5555

39.4 2NF and 3NF: Formal rules for how to split a table into multiple tables to avoid duplicated data

The previous section showed an example of how to split a single table into multiple tables in order to avoid duplication of data. We didn’t give formal rules, we just used our intuition for how to do so.

It’s usually pretty straight forward to arrange your data correctly to avoid data duplication. In general think about the the different types of “entities” that your data describes (e.g. courses, students, professors, etc). Then create a different table for each type of entity and relate the tables using PK and FK columns. However, it’s possible to miss obvious issues unless you carefully analyze the data.

This section introduces a formal set of rules for how to split up tables to avoid data duplication. To this end, we will focus on how to put tables into “Second Normal Form (2NF)” and “Third Normal Form (3NF)”. A prerequisite to understanding 2nd and 3rd normal forms is to understand what a “functional dependency (FD)” is.

39.4.1 What is a Functional Dependency (FD)

A functional dependency describes a relationship between columns (attributes) in a table.

A functional dependency is a relationship between the columns in a table where the values in one column (or a set of columns) uniquely determines the values in another column. If the value in column A determines the value in column B (denoted as A → B), then for any given value of A, there can only ever be exactly one value of B associated with it. For example, if StudentID → StudentName, then knowing a specific StudentID uniquely identifies the corresponding StudentName.

39.4.2 Functional Dependency Rules: X, Y → A, B, C

A functional dependency rule can be written in the following format where A,B,C,D are column names.

A -> B,C,D

  • The left-hand side (LHS) contains the determinant(s) — the column(s) that determine the values of other columns.

  • The right-hand side (RHS) contains the dependent column(s) — the values that are determined.

  • Both the left hand side and the right hand side can have any number of columns.

For example a functional dependency can also take a form similar to:

A,B -> C,D,E

Where the values in columns A and B taken together functionally determine the values of columns C,D and E.

39.4.3 Example Table: StudentCourse

StudentID FirstName LastName CourseID CourseName Credits Grade
001 Alice Smith C101 Intro to Math 3 A
002 Bob Jones C101 Intro to Math 3 B
001 Alice Smith C102 English 101 4 A-

The functional dependencies are:

  1. StudentID → FirstName, LastName

    • A student ID uniquely determines the student’s name.
  2. CourseID → CourseName, Credits

    • A course ID determines the course details.
  3. StudentID, CourseID → Grade

    • The composite key of student and course determines the grade.

39.5 Primary Key (PK), candidate PK, composite PK

We simplified the last functional dependency rule a little in order to highlight that StudentId and CourseId determine the Grade value.

However, if you look carefully you should notice that since SudentId determines all of the student related details and CourseId determines all of the Course details, and these are ALL of the columns in the table. We can write that as:

StudentID, CourseID → FirstName, LastName, CourseName, Credicts, Grade

Since this functional dependency rule includes all of the columns in the table, we say that StudentId, CourseId is a candidate primary key for the table. If there would be more than one candidate PK (there isn’t in this table) then one of the candidate PK’s would be chosen as THE primary key for the table.

A Primary Key that is the combination of more than one column is known as a composite primary key.

39.6 New example - to demo 2NF and 3NF

Let’s start with a new example.

This example also involves courses in a college. It is similar to the previous example but the data is slightly different. This new example contains information about students, their courses, grades, and detailed department information. Let’s normalize this table step by step.

The table is already in first normal form (see above).

# Create the initial non-normalized table
first_normal_form <- data.frame(
  StudentID = c("001", "002", "001", "003", "002"),
  FirstName = c("Alice", "Bob", "Alice", "Carol", "Bob"),
  LastName = c("Smith", "Jones", "Smith", "Davis", "Jones"),
  CourseID = c("C101", "C101", "C102", "C103", "C103"),
  CourseName = c("Intro to Math", "Intro to Math", "English 101", "Calculus I", "Calculus I"),
  Credits = c(3, 3, 4, 4, 4),
  DepartmentID = c("D01", "D01", "D02", "D01", "D01"),
  DepartmentName = c("Mathematics", "Mathematics", "English", "Mathematics", "Mathematics"),
  DepartmentHead = c("Dr. Johnson", "Dr. Johnson", "Dr. Williams", "Dr. Johnson", "Dr. Johnson"),
  DeptPhone = c("(555) 123-4567", "(555) 123-4567", "(555) 987-6543", "(555) 123-4567", "(555) 123-4567"),
  DeptOffice = c("Room A100", "Room A100", "Room B200", "Room A100", "Room A100"),
  Grade = c("A", "B", "A-", "B+", "C+"),
  stringsAsFactors = FALSE
)

# Display the non-normalized table
library(knitr)
kable(first_normal_form, caption = "registration table")
registration table
StudentID FirstName LastName CourseID CourseName Credits DepartmentID DepartmentName DepartmentHead DeptPhone DeptOffice Grade
001 Alice Smith C101 Intro to Math 3 D01 Mathematics Dr. Johnson (555) 123-4567 Room A100 A
002 Bob Jones C101 Intro to Math 3 D01 Mathematics Dr. Johnson (555) 123-4567 Room A100 B
001 Alice Smith C102 English 101 4 D02 English Dr. Williams (555) 987-6543 Room B200 A-
003 Carol Davis C103 Calculus I 4 D01 Mathematics Dr. Johnson (555) 123-4567 Room A100 B+
002 Bob Jones C103 Calculus I 4 D01 Mathematics Dr. Johnson (555) 123-4567 Room A100 C+

39.6.1 The functional dependencies and primary key

Based on our general understanding of how students register for courses in college, the following are the functional dependencies:

  1. StudentID → FirstName, LastName
    • A student ID uniquely determines the student’s name.
  2. CourseID → CourseName, Credits, DepartmentID
    • A course ID determines the course details.
  3. DepartmentID → DepartmentName, DepartmentHead, DeptPhone, DeptOffice
    • A department ID determines the department details.
  4. StudentID, CourseID → Grade
    • The combination of student and course determines the grade.

The primary key is the composite key (StudentID, CourseID) since together they uniquely identify all values in each row.

39.7 Second Normal Form (2NF)

A table is in Second Normal Form (2NF) if:

  1. It is already in 1NF

  2. There are no “Partial Functional Dependencies (pFD)” (see below)

39.7.1 What is a “Partial Functional Dependency (pFD)” ?

A Partial Functional Dependency (pFD) is an FD for which the left hand side (LHS) is only part of the primary key (PK).

In our table, the primary key is the composite key (StudentID, CourseID)

Therefore the following functional dependencies are “partial functional dependencies” since the LHS of these FDs are only part the primary key.

  • StudentID → FirstName, LastName

  • CourseID → CourseName, Credits, DepartmentID

To achieve 2NF, we need to remove these partial dependencies by creating separate tables:

# Create Student table (2NF)
student_2nf <- first_normal_form %>%
  select(StudentID, FirstName, LastName) %>%
  distinct()

# Create StudentCourse table (2NF)
student_course_2nf <- first_normal_form %>%
  select(StudentID, CourseID, Grade)

# Create Course table (2NF but not yet in 3NF)
course_2nf <- first_normal_form %>%
  select(CourseID, CourseName, Credits, DepartmentID, DepartmentName, DepartmentHead, DeptPhone, DeptOffice) %>%
  distinct()

# Display the 2NF tables
kable(student_2nf, caption = "student table", booktabs=TRUE)
student table
StudentID FirstName LastName
001 Alice Smith
002 Bob Jones
003 Carol Davis
kable(student_course_2nf, caption = "registration table", booktabs=TRUE)
registration table
StudentID CourseID Grade
001 C101 A
002 C101 B
001 C102 A-
003 C103 B+
002 C103 C+
kable(course_2nf, caption = "course table (2NF but not yet in 3NF)", booktabs=TRUE)
course table (2NF but not yet in 3NF)
CourseID CourseName Credits DepartmentID DepartmentName DepartmentHead DeptPhone DeptOffice
C101 Intro to Math 3 D01 Mathematics Dr. Johnson (555) 123-4567 Room A100
C102 English 101 4 D02 English Dr. Williams (555) 987-6543 Room B200
C103 Calculus I 4 D01 Mathematics Dr. Johnson (555) 123-4567 Room A100

39.7.2 General rules for how to convert a table to 2NF

  1. For each partial functional dependency (pFD) create a new table that includes all of the columns from the pFD (choose a name for the new table)

  2. The PK for the new table will be those columns from the left hand side (LHS) of the pFD

  3. Remove from the original table those columns that were on the right hand side (RHS) of the pFD

  4. The columns from the LHS of the PFD will remain in the original table and become a foreign key (FK) to the new table.

39.8 Third Normal Form (3NF)

A table is in Third Normal Form (3NF) if:

  1. It is already in 2NF

  2. It has no “Transitive Functional Dependencies (tFD)” (see next section)

39.8.1 What is a “Transitive Functional Dependency (tFD)” ?

A transitive functional dependency occurs when there is an indirect relationship between attributes in a relation. Specifically, it happens when attribute A functionally determines attribute C through another attribute B. Formally, if:

  • A → B (A determines B) and
  • B → C (B determines C)

Then we say that

  • A transitively determines C
  • or in other words C is transitively dependent on A through B.

In our example since

  • CourseID → CourseName, Credits, DepartmentID
    i.e. CourseID determines DepartmentID

and

  • DepartmentID → DepartmentName, DepartmentHead, DeptPhone, DeptOffice

we can infer through the logic of transitivity that

  • CourseID → DepartmentName, DepartmentHead, DeptPhone, DeptOffice
    THIS IS A TRANSITIVE FUNCTIONAL DEPENDENCY

Notice that the primary key for this version of the courses table is JUST CourseID. This is true since CourseID directly or through indirectly through transitivity determines all columns in the table. Specifically:

  • CourseId directly determines CourseName, Credits, DepartmentID
  • CourseId indirectly (through DepartmentID) determines DepartmentName, DepartmentHead, DeptPhone, DeptOffice

Therefore the Primary Key (PK) of the courses table is JUST CourseId (not CourseId and DeptId). Therefore there is no “Partial Functional Dependency” in this table but there IS a “Transitive Functional Dependency” in this table.

39.8.2 Convert the Courses table to be in 3NF

Again, looking at the Course table, we notice a transitive dependency:

  • CourseID → DepartmentID (directly)
  • DepartmentID → DepartmentName, DepartmentHead, DeptPhone, DeptOffice (directly)
  • Therefore, CourseID → DepartmentName, DepartmentHead, DeptPhone, DeptOffice (transitively)

This is a classic transitive dependency where course details transitively determine department details through the DepartmentID. To achieve 3NF, we need to remove this transitive dependency from the course table.

# Create Course table (3NF)
course_3nf <- course_2nf %>%
  select(CourseID, CourseName, Credits, DepartmentID) %>%
  distinct()

# Create Department table (3NF)
department_3nf <- course_2nf %>%
  select(DepartmentID, DepartmentName, DepartmentHead, DeptPhone, DeptOffice) %>%
  distinct()

# Display the 3NF tables
kable(course_3nf, caption = "Course Table (3NF)", booktabs=TRUE)
Course Table (3NF)
CourseID CourseName Credits DepartmentID
C101 Intro to Math 3 D01
C102 English 101 4 D02
C103 Calculus I 4 D01
kable(department_3nf, caption = "Department Table (3NF)", booktabs=TRUE)
Department Table (3NF)
DepartmentID DepartmentName DepartmentHead DeptPhone DeptOffice
D01 Mathematics Dr. Johnson (555) 123-4567 Room A100
D02 English Dr. Williams (555) 987-6543 Room B200

39.9 The final design of all tables (i.e. the final table “schema”)

The final 3NF versions of the tables are shown below:

# Display the 3NF tables
kable(student_2nf, caption = "student table", booktabs=TRUE)
student table
StudentID FirstName LastName
001 Alice Smith
002 Bob Jones
003 Carol Davis
kable(student_course_2nf, caption = "student_course table", booktabs=TRUE)
student_course table
StudentID CourseID Grade
001 C101 A
002 C101 B
001 C102 A-
003 C103 B+
002 C103 C+
kable(course_3nf, caption = "course table", booktabs=TRUE)
course table
CourseID CourseName Credits DepartmentID
C101 Intro to Math 3 D01
C102 English 101 4 D02
C103 Calculus I 4 D01
kable(department_3nf, caption = "department table", booktabs=TRUE)
department table
DepartmentID DepartmentName DepartmentHead DeptPhone DeptOffice
D01 Mathematics Dr. Johnson (555) 123-4567 Room A100
D02 English Dr. Williams (555) 987-6543 Room B200

39.9.1 General rules for how to convert a table to 3NF

To convert a table to 3NF, given a table that contains the sets of columns A,B,C (where each of A,B,C represents one or more columns) and the following transitivity rules:

  • A -> B

  • B -> C

  1. Create a new table (give it a name) that contains all columns from B and C.

  2. The columns from B will become the primary key for the new table.

  3. Remove from the original table the columns represented by C

  4. The columns represented by B will remain in the original table and will become a foreign key to the new table.