orderNum | products |
---|---|
ord100 | 3 apples, 2 plums |
ord200 | 5 apples, 3 pears, 10 plums |
39 38. Database Normalization
The same data can be arranged in a database in many different ways. This section describes how to intelligently arrange your data in a relational database.
Database normalization is the process of designing database tables so that
they are in good form for manipulating with SQL (this the focus of “1st normal form” - see below)
there is minimal repetition of data (this is the focus of “2nd normal form” and “3rd normal form” - see below)
39.1 Normal Forms
Database normalization is accomplished in several steps. There are different “levels” of normalization, each level progressing towards a more perfect organization of the data. Each level is called a “Normal Form”. We say that a database is “in” a particular normal form when the tables are structured in a way that conforms to the rules of that normal form. This will all become much more clear as you progress through the material below. In a fully normalized database the only data that should be repeated are the primary/foreign keys.
The following are the various normal forms. You are not expected to understand the technical terms in the list below. These will be covered in the material that follows this section.
- 1NF (1st Normal Form) - only atomic values in columns, no repeating groups or arrays
- 2NF (2nd Normal Form) - no “partial functional dependencies (pFD)”
- 3NF (3rd Normal Form) - no “transitive functional dependencies (tFD)”
Most database designers aim to have their databases at least in 3rd Normal Form. However, the following additional normal forms are used to further “normalize” a database to deal with special cases.
In this material we will only be covering 1st, 2nd and 3rd normal forms. The list below is provided in case you want to do your own research.
- Bocye-Codd Normal Form (BCNF) - every determinant (attribute that determines other attributes) must be a candidate key
- 4NF - no multi-valued dependencies
- 5NF - tables cannot be losslessly decomposed into smaller tables
- 6NF - tables must be irreducible, containing no non-trivial join dependencies
39.2 First Normal Form (1NF)
First normal form requires that
There be only one value in each “cell” of a table.
(AKA “all attributes contain only atomic values”)There should NOT be multiple columns for the same type of data
(AKA There are no “repeating groups”.)
The following tables are not in 1st normal form (1NF).
This table contains multiple values in a single cell which violates the rules of 1NF
This table contains multiple columns for similar types of data. This also violates the rules of 1NF.
orders orderNum product1 quantity1 product2 quantity2 product1.1 quantity3 ord100 apples 3 plums 2 NULL NULL ord200 apples 5 pears 3 plums 10
39.2.1 Converting the data to 1NF
The following table contains the same data in first normal form:
::: {.cell} ::: {.cell-output-display}
orderNum | product | quantity |
---|---|---|
ord100 | apples | 3 |
ord100 | plums | 2 |
ord200 | apples | 5 |
ord200 | pears | 3 |
ord200 | plums | 10 |
::: :::
39.3 Example - college class offerings
The data shown below about classes, sections and professors can be arranged in many different formats.
39.3.1 All data in a single table
For example, the following data could be in a single table. This makes it very easy for people to read the info.
className | classNum | credits | section | room | professor | office | tel | |
---|---|---|---|---|---|---|---|---|
Intro to IDS | IDS1020 | 3 | 231 | F201 | joe jones | B102 | joe@abc.edu | (212)555-5555 |
Intro to IDS | IDS1020 | 3 | 241 | F201 | joe jones | B102 | joe@abc.edu | (212)555-5555 |
Intro to IDS | IDS1020 | 3 | 331 | F310 | sue smith | B231 | sue@abc.edu | (212)999-9999 |
Intro to IDS | IDS1020 | 3 | 341 | F310 | sue smith | B231 | sue@abc.edu | (212)999-9999 |
Database Design | IDS2040 | 4 | 133 | F250 | joe jones | B102 | joe@abc.edu | (212)555-5555 |
39.3.2 “Normalize” tables to reduce duplication of data
If you look closely at the data in the example above, you’ll notice that the details for each professor (i.e. name, office, email, tel) is duplicated in multiple rows. This makes it very easy for people to read the data. However, this is NOT a great way to store information in a relational database (keep reading).
Having the same data in more than one place in a database can lead to inconsistencies in the data. For example if a database records a person’s name in more than one place and the person changes their name, then all copies of the person’s name would need to be updated. If due to some miscalculation, some copies of the person’s name were updated and some weren’t updated, the database would be inconsistent. One of the major goals of a good database design is to eliminate redundant data. All data, other than primary keys and foreign keys, should be recorded in one and only one place in the database.
39.3.3 split professor data into a different table
The data shown above can be reorganized into the following structure to avoid duplicating each professor’s details.
className | classNum | credits | section | room | prof_id |
---|---|---|---|---|---|
Intro to IDS | IDS1020 | 3 | 231 | F201 | P100 |
Intro to IDS | IDS1020 | 3 | 241 | F201 | P100 |
Intro to IDS | IDS1020 | 3 | 331 | F310 | P200 |
Intro to IDS | IDS1020 | 3 | 341 | F310 | P200 |
Database Design | IDS2040 | 4 | 133 | F250 | P100 |
prof_id | first | last | office | tel | |
---|---|---|---|---|---|
P100 | joe | jones | B102 | joe@abc.edu | (212)555-5555 |
P200 | sue | smith | B231 | sue@abc.edu | (212)999-9999 |
SQL SELECT to regenerate original “all in one” table
We are able to see the data as it was originally presented by using the following SQL query:
=
queryResults sqldf("
select className, classNum, credits, section, room, first || ' ' || last as professor, office, email, tel
from offerings join professors on offerings.prof_id = professors.prof_id
order by classNum, section
")
kable(queryResults, caption = "queryResults")
className | classNum | credits | section | room | professor | office | tel | |
---|---|---|---|---|---|---|---|---|
Intro to IDS | IDS1020 | 3 | 231 | F201 | joe jones | B102 | joe@abc.edu | (212)555-5555 |
Intro to IDS | IDS1020 | 3 | 241 | F201 | joe jones | B102 | joe@abc.edu | (212)555-5555 |
Intro to IDS | IDS1020 | 3 | 331 | F310 | sue smith | B231 | sue@abc.edu | (212)999-9999 |
Intro to IDS | IDS1020 | 3 | 341 | F310 | sue smith | B231 | sue@abc.edu | (212)999-9999 |
Database Design | IDS2040 | 4 | 133 | F250 | joe jones | B102 | joe@abc.edu | (212)555-5555 |
39.3.4 Split common class info into a different table
The same data can be still further reorganized into the following structure to avoid duplicating the details for each class:
classNum | className | credits |
---|---|---|
IDS1020 | Intro to IDS | 3 |
IDS2040 | Database Design | 4 |
crn | classNum | section | room | prof_id |
---|---|---|---|---|
12100 | IDS1020 | 231 | F201 | P100 |
12100 | IDS1020 | 241 | F201 | P100 |
12100 | IDS1020 | 331 | F310 | P200 |
12100 | IDS1020 | 341 | F310 | P200 |
14301 | IDS2040 | 133 | F250 | P100 |
prof_id | first | last | office | tel | |
---|---|---|---|---|---|
P100 | joe | jones | B102 | joe@abc.edu | (212)555-5555 |
P200 | sue | smith | B231 | sue@abc.edu | (212)999-9999 |
SQL SELECT to regenerate original “all in one” table
We are able to see the data as it was originally presented by using the following SQL query:
=
queryResults sqldf("
select className, classes.classNum, credits, section, room,
first || ' ' || last as professor, office, email, tel
from classes join offerings on classes.classNum = offerings.classNum
join professors on offerings.prof_id = professors.prof_id
order by classes.classNum, section
")
kable(queryResults, caption = "queryResults", booktabs=TRUE)
className | classNum | credits | section | room | professor | office | tel | |
---|---|---|---|---|---|---|---|---|
Intro to IDS | IDS1020 | 3 | 231 | F201 | joe jones | B102 | joe@abc.edu | (212)555-5555 |
Intro to IDS | IDS1020 | 3 | 241 | F201 | joe jones | B102 | joe@abc.edu | (212)555-5555 |
Intro to IDS | IDS1020 | 3 | 331 | F310 | sue smith | B231 | sue@abc.edu | (212)999-9999 |
Intro to IDS | IDS1020 | 3 | 341 | F310 | sue smith | B231 | sue@abc.edu | (212)999-9999 |
Database Design | IDS2040 | 4 | 133 | F250 | joe jones | B102 | joe@abc.edu | (212)555-5555 |
39.4 2NF and 3NF: Formal rules for how to split a table into multiple tables to avoid duplicated data
The previous section showed an example of how to split a single table into multiple tables in order to avoid duplication of data. We didn’t give formal rules, we just used our intuition for how to do so.
It’s usually pretty straight forward to arrange your data correctly to avoid data duplication. In general think about the the different types of “entities” that your data describes (e.g. courses, students, professors, etc). Then create a different table for each type of entity and relate the tables using PK and FK columns. However, it’s possible to miss obvious issues unless you carefully analyze the data.
This section introduces a formal set of rules for how to split up tables to avoid data duplication. To this end, we will focus on how to put tables into “Second Normal Form (2NF)” and “Third Normal Form (3NF)”. A prerequisite to understanding 2nd and 3rd normal forms is to understand what a “functional dependency (FD)” is.
39.4.1 What is a Functional Dependency (FD)
A functional dependency describes a relationship between columns (attributes) in a table.
A functional dependency is a relationship between the columns in a table where the values in one column (or a set of columns) uniquely determines the values in another column. If the value in column A determines the value in column B (denoted as A → B), then for any given value of A, there can only ever be exactly one value of B associated with it. For example, if StudentID → StudentName, then knowing a specific StudentID uniquely identifies the corresponding StudentName.
39.4.2 Functional Dependency Rules: X, Y → A, B, C
A functional dependency rule can be written in the following format where A,B,C,D are column names.
A -> B,C,D
The left-hand side (LHS) contains the determinant(s) — the column(s) that determine the values of other columns.
The right-hand side (RHS) contains the dependent column(s) — the values that are determined.
Both the left hand side and the right hand side can have any number of columns.
For example a functional dependency can also take a form similar to:
A,B -> C,D,E
Where the values in columns A and B taken together functionally determine the values of columns C,D and E.
39.4.3 Example Table: StudentCourse
StudentID | FirstName | LastName | CourseID | CourseName | Credits | Grade |
---|---|---|---|---|---|---|
001 | Alice | Smith | C101 | Intro to Math | 3 | A |
002 | Bob | Jones | C101 | Intro to Math | 3 | B |
001 | Alice | Smith | C102 | English 101 | 4 | A- |
The functional dependencies are:
StudentID → FirstName, LastName
- A student ID uniquely determines the student’s name.
CourseID → CourseName, Credits
- A course ID determines the course details.
StudentID, CourseID → Grade
- The composite key of student and course determines the grade.
39.5 Primary Key (PK), candidate PK, composite PK
We simplified the last functional dependency rule a little in order to highlight that StudentId and CourseId determine the Grade value.
However, if you look carefully you should notice that since SudentId determines all of the student related details and CourseId determines all of the Course details, and these are ALL of the columns in the table. We can write that as:
StudentID, CourseID → FirstName, LastName, CourseName, Credicts, Grade
Since this functional dependency rule includes all of the columns in the table, we say that StudentId, CourseId
is a candidate primary key for the table. If there would be more than one candidate PK (there isn’t in this table) then one of the candidate PK’s would be chosen as THE primary key for the table.
A Primary Key that is the combination of more than one column is known as a composite primary key.
39.6 New example - to demo 2NF and 3NF
Let’s start with a new example.
This example also involves courses in a college. It is similar to the previous example but the data is slightly different. This new example contains information about students, their courses, grades, and detailed department information. Let’s normalize this table step by step.
The table is already in first normal form (see above).
# Create the initial non-normalized table
<- data.frame(
first_normal_form StudentID = c("001", "002", "001", "003", "002"),
FirstName = c("Alice", "Bob", "Alice", "Carol", "Bob"),
LastName = c("Smith", "Jones", "Smith", "Davis", "Jones"),
CourseID = c("C101", "C101", "C102", "C103", "C103"),
CourseName = c("Intro to Math", "Intro to Math", "English 101", "Calculus I", "Calculus I"),
Credits = c(3, 3, 4, 4, 4),
DepartmentID = c("D01", "D01", "D02", "D01", "D01"),
DepartmentName = c("Mathematics", "Mathematics", "English", "Mathematics", "Mathematics"),
DepartmentHead = c("Dr. Johnson", "Dr. Johnson", "Dr. Williams", "Dr. Johnson", "Dr. Johnson"),
DeptPhone = c("(555) 123-4567", "(555) 123-4567", "(555) 987-6543", "(555) 123-4567", "(555) 123-4567"),
DeptOffice = c("Room A100", "Room A100", "Room B200", "Room A100", "Room A100"),
Grade = c("A", "B", "A-", "B+", "C+"),
stringsAsFactors = FALSE
)
# Display the non-normalized table
library(knitr)
kable(first_normal_form, caption = "registration table")
StudentID | FirstName | LastName | CourseID | CourseName | Credits | DepartmentID | DepartmentName | DepartmentHead | DeptPhone | DeptOffice | Grade |
---|---|---|---|---|---|---|---|---|---|---|---|
001 | Alice | Smith | C101 | Intro to Math | 3 | D01 | Mathematics | Dr. Johnson | (555) 123-4567 | Room A100 | A |
002 | Bob | Jones | C101 | Intro to Math | 3 | D01 | Mathematics | Dr. Johnson | (555) 123-4567 | Room A100 | B |
001 | Alice | Smith | C102 | English 101 | 4 | D02 | English | Dr. Williams | (555) 987-6543 | Room B200 | A- |
003 | Carol | Davis | C103 | Calculus I | 4 | D01 | Mathematics | Dr. Johnson | (555) 123-4567 | Room A100 | B+ |
002 | Bob | Jones | C103 | Calculus I | 4 | D01 | Mathematics | Dr. Johnson | (555) 123-4567 | Room A100 | C+ |
39.6.1 The functional dependencies and primary key
Based on our general understanding of how students register for courses in college, the following are the functional dependencies:
StudentID → FirstName, LastName
- A student ID uniquely determines the student’s name.
CourseID → CourseName, Credits, DepartmentID
- A course ID determines the course details.
DepartmentID → DepartmentName, DepartmentHead, DeptPhone, DeptOffice
- A department ID determines the department details.
StudentID, CourseID → Grade
- The combination of student and course determines the grade.
The primary key is the composite key (StudentID, CourseID)
since together they uniquely identify all values in each row.
39.7 Second Normal Form (2NF)
A table is in Second Normal Form (2NF) if:
It is already in 1NF
There are no “Partial Functional Dependencies (pFD)” (see below)
39.7.1 What is a “Partial Functional Dependency (pFD)” ?
A Partial Functional Dependency (pFD) is an FD for which the left hand side (LHS) is only part of the primary key (PK).
In our table, the primary key is the composite key (StudentID, CourseID)
Therefore the following functional dependencies are “partial functional dependencies” since the LHS of these FDs are only part the primary key.
StudentID → FirstName, LastName
CourseID → CourseName, Credits, DepartmentID
To achieve 2NF, we need to remove these partial dependencies by creating separate tables:
# Create Student table (2NF)
<- first_normal_form %>%
student_2nf select(StudentID, FirstName, LastName) %>%
distinct()
# Create StudentCourse table (2NF)
<- first_normal_form %>%
student_course_2nf select(StudentID, CourseID, Grade)
# Create Course table (2NF but not yet in 3NF)
<- first_normal_form %>%
course_2nf select(CourseID, CourseName, Credits, DepartmentID, DepartmentName, DepartmentHead, DeptPhone, DeptOffice) %>%
distinct()
# Display the 2NF tables
kable(student_2nf, caption = "student table", booktabs=TRUE)
StudentID | FirstName | LastName |
---|---|---|
001 | Alice | Smith |
002 | Bob | Jones |
003 | Carol | Davis |
kable(student_course_2nf, caption = "registration table", booktabs=TRUE)
StudentID | CourseID | Grade |
---|---|---|
001 | C101 | A |
002 | C101 | B |
001 | C102 | A- |
003 | C103 | B+ |
002 | C103 | C+ |
kable(course_2nf, caption = "course table (2NF but not yet in 3NF)", booktabs=TRUE)
CourseID | CourseName | Credits | DepartmentID | DepartmentName | DepartmentHead | DeptPhone | DeptOffice |
---|---|---|---|---|---|---|---|
C101 | Intro to Math | 3 | D01 | Mathematics | Dr. Johnson | (555) 123-4567 | Room A100 |
C102 | English 101 | 4 | D02 | English | Dr. Williams | (555) 987-6543 | Room B200 |
C103 | Calculus I | 4 | D01 | Mathematics | Dr. Johnson | (555) 123-4567 | Room A100 |
39.7.2 General rules for how to convert a table to 2NF
For each partial functional dependency (pFD) create a new table that includes all of the columns from the pFD (choose a name for the new table)
The PK for the new table will be those columns from the left hand side (LHS) of the pFD
Remove from the original table those columns that were on the right hand side (RHS) of the pFD
The columns from the LHS of the PFD will remain in the original table and become a foreign key (FK) to the new table.
39.8 Third Normal Form (3NF)
A table is in Third Normal Form (3NF) if:
It is already in 2NF
It has no “Transitive Functional Dependencies (tFD)” (see next section)
39.8.1 What is a “Transitive Functional Dependency (tFD)” ?
A transitive functional dependency occurs when there is an indirect relationship between attributes in a relation. Specifically, it happens when attribute A functionally determines attribute C through another attribute B. Formally, if:
- A → B (A determines B) and
- B → C (B determines C)
Then we say that
- A transitively determines C
- or in other words C is transitively dependent on A through B.
In our example since
CourseID → CourseName, Credits, DepartmentID
i.e. CourseID determines DepartmentID
and
DepartmentID → DepartmentName, DepartmentHead, DeptPhone, DeptOffice
we can infer through the logic of transitivity that
CourseID → DepartmentName, DepartmentHead, DeptPhone, DeptOffice
THIS IS A TRANSITIVE FUNCTIONAL DEPENDENCY
Notice that the primary key for this version of the courses table is JUST CourseID
. This is true since CourseID directly or through indirectly through transitivity determines all columns in the table. Specifically:
CourseId
directly determinesCourseName, Credits, DepartmentID
CourseId
indirectly (through DepartmentID) determinesDepartmentName, DepartmentHead, DeptPhone, DeptOffice
Therefore the Primary Key (PK) of the courses table is JUST CourseId
(not CourseId and DeptId). Therefore there is no “Partial Functional Dependency” in this table but there IS a “Transitive Functional Dependency” in this table.
39.8.2 Convert the Courses table to be in 3NF
Again, looking at the Course table, we notice a transitive dependency:
CourseID → DepartmentID
(directly)DepartmentID → DepartmentName, DepartmentHead, DeptPhone, DeptOffice
(directly)- Therefore,
CourseID → DepartmentName, DepartmentHead, DeptPhone, DeptOffice
(transitively)
This is a classic transitive dependency where course details transitively determine department details through the DepartmentID. To achieve 3NF, we need to remove this transitive dependency from the course table.
# Create Course table (3NF)
<- course_2nf %>%
course_3nf select(CourseID, CourseName, Credits, DepartmentID) %>%
distinct()
# Create Department table (3NF)
<- course_2nf %>%
department_3nf select(DepartmentID, DepartmentName, DepartmentHead, DeptPhone, DeptOffice) %>%
distinct()
# Display the 3NF tables
kable(course_3nf, caption = "Course Table (3NF)", booktabs=TRUE)
CourseID | CourseName | Credits | DepartmentID |
---|---|---|---|
C101 | Intro to Math | 3 | D01 |
C102 | English 101 | 4 | D02 |
C103 | Calculus I | 4 | D01 |
kable(department_3nf, caption = "Department Table (3NF)", booktabs=TRUE)
DepartmentID | DepartmentName | DepartmentHead | DeptPhone | DeptOffice |
---|---|---|---|---|
D01 | Mathematics | Dr. Johnson | (555) 123-4567 | Room A100 |
D02 | English | Dr. Williams | (555) 987-6543 | Room B200 |
39.9 The final design of all tables (i.e. the final table “schema”)
The final 3NF versions of the tables are shown below:
# Display the 3NF tables
kable(student_2nf, caption = "student table", booktabs=TRUE)
StudentID | FirstName | LastName |
---|---|---|
001 | Alice | Smith |
002 | Bob | Jones |
003 | Carol | Davis |
kable(student_course_2nf, caption = "student_course table", booktabs=TRUE)
StudentID | CourseID | Grade |
---|---|---|
001 | C101 | A |
002 | C101 | B |
001 | C102 | A- |
003 | C103 | B+ |
002 | C103 | C+ |
kable(course_3nf, caption = "course table", booktabs=TRUE)
CourseID | CourseName | Credits | DepartmentID |
---|---|---|---|
C101 | Intro to Math | 3 | D01 |
C102 | English 101 | 4 | D02 |
C103 | Calculus I | 4 | D01 |
kable(department_3nf, caption = "department table", booktabs=TRUE)
DepartmentID | DepartmentName | DepartmentHead | DeptPhone | DeptOffice |
---|---|---|---|---|
D01 | Mathematics | Dr. Johnson | (555) 123-4567 | Room A100 |
D02 | English | Dr. Williams | (555) 987-6543 | Room B200 |
39.9.1 General rules for how to convert a table to 3NF
To convert a table to 3NF, given a table that contains the sets of columns A,B,C (where each of A,B,C represents one or more columns) and the following transitivity rules:
A -> B
B -> C
Create a new table (give it a name) that contains all columns from B and C.
The columns from B will become the primary key for the new table.
Remove from the original table the columns represented by C
The columns represented by B will remain in the original table and will become a foreign key to the new table.