27  27. “databases” and “database management systems”

A “database (DB)” is a collection of data that can be accessed by a computer programs (see below) in an accurate and efficient way.

A “Database Management System (DBMS)” is the software that is used to enable the storage and retrieval of the data in a database.

27.1 Common features of databases

Most databases and DBMS software provide the following features:

  • Able to be contolled by other software - Databases and DBMS software are usually used as part of a larger software system. Software that is coded in programming languages such as R, Python, Java, etc. have the ability to interact with a DBMS in order to store an retrieve data from a database.

    For example, websites that manipulate a lot of data generally consist of a “web server software program” that users communicate with. The “web server software” program in turn communicates with a DBMS in order to store and retrieve data for each user. For example, facebook.com makes use of one or more databases to store and retrieve messages, friend information, advertisements, etc. The web server program retrieves the information it needs from a database and uses that information to create a web page that is then returned to the user and displayed in the browser.

  • Reliable and Accurate Data – Databases have built-in rules to keep data correct and consistent. They prevent mistakes like duplicate entries or missing information by ensuring that all data follows set guidelines.

  • Multi User Access – Databases allow many people to access and update data at the same time without causing errors or losing information. They use special features to keep everything organized, even when multiple users make changes at once.

  • Handles Tremendous Amounts of Data – Databases are designed to store and manage huge amounts of data efficiently. They remain fast and responsive even as the amount of data grows.

  • Security – Databases have security settings to control who can see or change the data.

  • Reliability - Databases have safeguards to ensure that changes are properly saved and that information isn’t lost, even if something goes wrong (e.g. a blackout, earthquake, etc.).

Note that Excel is not usually considered a “database”. Excel is FANTASTIC as a tool for quickly gaining insights into data. However, Excel does not have most of the features listed above that describe typical databases.

27.2 Excel vs databases: what is the difference?

Excel and relational databases both organize information into rows and columns, so they can look similar at first. But they are designed for different goals.

  • Excel is primarily person-first software: A human usually opens a workbook, edits cells, adds formulas, and builds charts.

  • Databases are primarily application-first systems: A software program (website, app, API, report process, etc.) usually sends commands to a DBMS to read and write data.

In other words, Excel is an excellent analysis and presentation tool, while a database is a better long-term system for storing and serving data reliably.

27.2.1 Quick comparison

Topic Excel Relational Database (DBMS)
Typical use Interactive analysis by people Data storage/retrieval for software + teams
Main interface Spreadsheet UI SQL (Structured Query Language) + admin tools
Data size Great for small-to-medium data Designed for very large data volumes
Number of users editing at once Limited and can get messy Built for many concurrent users safely
Validation rules Basic validation options (useful, but limited) Strong constraints/rules to enforce quality
Structure Workbook with sheets Database with multiple related tables
How related data is organized Often spread across multiple sheets, with manual links Related tables are connected with keys, reducing duplication
Where data usually lives Usually a file (local drive/shared folder/cloud file app) Usually a database server (on a network or in the cloud)
Repeatability of data tasks Heavily dependent on workbook formulas and manual steps Queries/scripts can be saved and rerun consistently
Reliability Easier to break accidentally Transaction and recovery features reduce data loss
Security File-level sharing controls Fine-grained user/role permissions

27.2.2 Why this matters in practice

  1. Scale
    Excel can become slow and difficult to manage as data grows. Databases are engineered to keep performance reasonable even with large datasets.

  2. Data quality
    Excel does support validation (for example, requiring GPA to be between 0 and 4). But database systems provide stronger, centralized rules (types, keys, constraints, relationships) so all applications follow the same standards.

  3. Many people working at once
    Spreadsheets can be hard to coordinate when many people update data. Databases are designed for concurrent access and include locking/transaction logic to prevent collisions and inconsistent results.

  4. Program integration
    A DBMS is designed to be queried by programs using SQL. This makes databases a better fit for websites and business systems that must continuously read and update data.

  5. Relationships between topics
    In many real systems, data naturally spans multiple related topics (for example: students, courses, and grades). Databases are built to connect these topics cleanly across tables, while spreadsheets often require more manual linking across tabs.

  6. Consistency over time
    Spreadsheet workflows can drift over time (formula edits, copied tabs, and one-off manual fixes). Database queries and rules are easier to standardize, so teams are more likely to get the same result each time.

27.2.3 A helpful rule of thumb

  • Use Excel when the goal is quick exploration, ad hoc analysis, and visualization by a person or a small team.
  • Use a database when data must be shared across applications, edited by many users, protected by strong rules, and stored reliably over time.

27.3 Architecture of software that uses a database

Many modern software systems are “multi-user” systems. That means lots of users can use the same system at the same time.

Examples include:

  • social media sites (e.g. Facebook)
  • shopping sites (e.g. Amazon)
  • school systems (e.g. a university course registration system)
  • banking and payment systems

Even though these systems can look very different, they usually share a similar high-level architecture:

  1. A client (usually a browser or mobile app) sends a request.
  2. A web server program receives that request and runs business logic.
  3. The web server program asks a database server program (DBMS) for data.
  4. The DBMS reads/writes data in the database and returns results.
  5. The web server program sends a response back to the client.

27.3.1 One concrete example: university course registration

Suppose a student logs in and searches for open courses in Fall 2026.

  • The student clicks “Search” in a browser.
  • The web server receives the request (for example, with filters like department, course level, and meeting time).
  • The web server sends a SQL query to the DBMS.
  • The DBMS returns matching course records, seat counts, and instructor details.
  • The web server formats that information and sends an HTML/JSON response to the browser.
  • The student sees a list of available courses.

When the student clicks “Enroll”, another request is sent and the same architecture is used again. The DBMS updates enrollment data and the web server returns a success/failure message.

flowchart TD
    U[Student using browser or app]
    W[Web server program: receives requests, runs business logic, returns responses]
    D[(Database server program: stores and retrieves data about courses, sections, seats, students, and enrollments)]

    U -->|Step 1: request| W
    W -->|Step 2: SQL query| D
    D -->|Step 3: results/status| W
    W -->|Step 4: response| U

Where the numbered arrows mean:

  1. Request from client to web server
    The student submits an action in the browser/app (for example, search or enroll).

  2. Web server asks the DBMS for data/action
    The web server sends a SQL command to retrieve data or update records.

  3. DBMS returns results to web server
    The DBMS returns matching rows or update status (success/failure).

  4. Web server sends final response to client
    The web server formats the result and returns it to the browser/app.

27.3.2 “Single computer” mental model vs. real deployments

For learning, it is perfectly fine to imagine the web server program and DBMS running on one computer. That simple model helps explain the core idea.

In real production systems, organizations often use technical strategies (that we will not cover here) to run multiple web servers and multiple database servers for scale, reliability, and fault tolerance.

So the simple model is still conceptually correct; real systems are usually larger versions of the same pattern.

27.4 Different categories of databases

There are different types of databases and Database Management System (DBMS) software. These types can be categorized by how their data is structured.

Relational databases (sometimes referred to as “SQL” databases) arrange data in “tables” that are comprised of “rows” and “columns” (similar to an R dataframe or an Excel spreadsheet).

Non-Relational databases (often referred to as “NoSQL” databases) structure data in other forms (e.g. JSON or graphs - see below)

27.5 What is a “Relational Database”

Relational databases store information in “tables” that have rows and columns of information - very similar to the way the data is arranged in an R dataframe. A single relational database is a collection of one or more tables. Each column of a relational database table (a) has a name (b) has a datatype (e.g. numeric, logical, character, etc)

There are MANY different relational database products in use today. Some very popular ones are:

  • Postgres
  • MySql
  • MariaDb
  • Microsoft SQL Server
  • Oracle Relational Database Management System
  • SQLite
  • many, many others

27.6 What is SQL (often pronounced “sequel”)

Relational database software uses Structured Query Language (SQL) to manipulate database contents. SQL commands are categorized as follows:

Relational database software uses the Structured Query Language (SQL) (often pronounced “sequel”) to manipulate the contents of the database. Commands in the SQL language are often grouped into the following categories:

(For this class, we will focus on the first two categories listed below, i.e. DDL and DML)

Category Commands
DDL (Data Definition Language) Commands
Defines and modifies database structure.
CREATE DATABASE – Creates a new database.
CREATE TABLE – Creates a new table in an existing database.
DROP TABLE – Removes an entire table from a database.
DML (Data Manipulation Language) Commands
Manipulates data within tables.
SELECT – retrieves data from database tables. The SELECT statement has many different options that allow the data to be retireved in very customized formats and arrangements. E.g. SELECT can be used to summarize data, retrieve only data that has specific attributes, display data in specific order, etc.
INSERT – Adds new rows to a table.
DELETE – Removes rows from a table.
UPDATE – Modifies existing data in a table.
DCL (Data Control Language) Commands
Manages user permissions.
GRANT – Give permission to specific users to access specific data in the database.
REVOKE – Remove permission from specific users to access specific info in the database.
TCL (Transaction Control Language) Commands
Transactions ensure that a group of operations either all succeed or none take effect. If a failure occurs, the system ensures consistency using transaction control commands. (* see below for more info about transactions)

START TRANSACTION – Begins a transaction.
COMMIT – Saves changes permanently.
ROLLBACK – Reverts changes if needed.
SAVEPOINT – Creates checkpoints within a transaction.

(* transactions explained more fully): Most Relational databases offer “transaction control”. A transaction is a set of operations that are supposed to either all happen as a group or not happen at all. For example, when processing a money transfer between two bank accounts, the following steps must happen:

Step 1: remove money from account 1
Step 2: deposit money in account 2

These two steps are viewed as a “transaction”. Either both steps should happen - in which case the transaction was successful - or neither step should happen. Imagine that after the computer did Step 1, the computer crashes (e.g. due to a blackout) before it had a chance to do Step 2. Ideally, when the computer is restarted, either, Step 2 should happen or Step 1 should be “undone”. Most Relational Database Management Software contains features that ensure that this logic happens correctly.

(For this class, we will mostly focus on the DDL an DML categories of commands. Most of our time will be spent studying the numerous features of the SELECT statement)

27.7 Different “flavors” of SQL

Each relational database software product has its own “flavor” of the SQL language. You can think of this as different “dialects” of the same language. For example the difference between English as it’s used in the USA vs as it is used in England.

ANSI (American National Standards Institute) publishes a SQL Standard that is generally recognized by all relational database software programs.

However, most relational database products add additional capabilities and commands to its version of SQL to differentiate it from other relational database products in the market.

In this class we will be working with the SQLite version of SQL. However, we will try to focus primarily on the standard SQL features that are included in all relational database products and not on the extensions that are provided only by SQLite. When appropriate, we will try to highlight differences between the way that SQLite and other RDBMS products work.

The ANSI standard was first introduced in 1986 and has been reissued every few years since then. The following gives a flavor of features that were introduced or “deprecated” (keep reading) in different versions of the ANSI standard. To “deprecate” a feature means to recommend that it not be used as the feature might actually be removed in a future version.

Info about ANSI SQL standard
SQL Version Year Key Features Introduced Deprecated Features
SQL-86 (SQL-87) 1986/1987 First ANSI SQL standard, basic SQL syntax (SELECT, INSERT, UPDATE, DELETE), table creation, constraints None
SQL-89 1989 GRANT and REVOKE for access control None
SQL-92 1992 Expanded data types (DATE, TIME, INTERVAL), subqueries, joins (INNER JOIN, LEFT JOIN, RIGHT JOIN), string pattern matching (LIKE) Some legacy syntax
SQL:1999 1999 Recursive queries (WITH RECURSIVE), OLAP operations (ROLLUP, CUBE), triggers, procedural constructs (CASE, IF, LOOP), user-defined types (UDTs) Certain outdated syntax elements
SQL:2003 2003 Window functions (OVER, PARTITION BY), XML support, auto-increment (IDENTITY, GENERATED), MERGE statement Outdated join syntax
SQL:2006 2006 Full XML data handling, querying, and updates None
SQL:2008 2008 TRUNCATE TABLE, FETCH FIRST (limit control), improvements to MERGE None
SQL:2011 2011 Temporal tables (SYSTEM VERSIONING for time-travel queries) None
SQL:2016 2016 JSON support (JSON_VALUE, JSON_QUERY), row pattern recognition (MATCH_RECOGNIZE) None
SQL:2019 2019 Polymorphic table functions, new JSON functions and enhancements None

27.8 Non-Relational Databases (AKA NoSQL databases)

NoSQL (or non-relational) databases organize information in formats other than rows and columns.

SOME NoSql databases actually CAN be manipulated with the SQL language. However, the term NoSql has come to include any database that stores information in a form other than in strict rows and columns.

Some examples are shown below.

27.9 Document databases - e.g. MongoDB

MongoDb is an example of Database Management software that stores information in “JSON” notation. Other similar products include CouchDB, Firebase and Firestore. We will explore more about JSON later in the semester. However as an example, the above information could be represented in a JSON file as shown below:

           [
             {
               "name" : "annie",
               "friends" : [ "sam", "jack" ]
             },
             {
               "name" : "jack",
               "friends" : [ "annie" , "mac" , "doug" , "harry" ]
             },
             {
               "name" : "harry",
               "friends" : [ "jack" ]
             },
             {
               "name" : "sam",
               "friends" : [ "annie" ]
             },
             {
               "name" : "mac",
               "friends" : [ "jack" ]
             },
             {
               "name" : "harry",
               "friends" : [ "jack" ]
             },
             {
               "name" : "doug",
               "friends" : [ "jack" ]
             }
           ]

27.10 Graph databases - e.g. Neo4J

Neo4J is an example of a “graph database”. These types of databases organize their information in a “graph” structure. A graph looks like a “web” of information. It is comprised of “nodes” that contain data which are connected via “edges” that represent relationships between the nodes.

For example:

graph TD
    %% Position nodes to match example layout
    sam([sam])
    annie([annie])
    mac([mac])
    jack([jack])
    doug([doug])
    harry([harry])
    
    %% Create bidirectional connections with straight lines
    annie <-->|friend| sam
    jack <-->|friend| annie
    jack <-->|friend| mac
    jack <-->|friend| doug
    jack <-->|friend| harry
    
    %% Black and white styling
    classDef person fill:white,stroke:black,stroke-width:1px;
    class sam,annie,mac,jack,doug,harry person;
    
    %% Force specific layout
    sam ~~~ annie
    mac ~~~ jack ~~~ doug
    jack ~~~ harry
    
    %% Make invisible linkage lines transparent
    linkStyle 5,6,7 stroke:none;

This example shows a set of people and who is friends with whom. There is a different icon for each person. “Friend arrows” indicate who is friends with whom.

For many types of information a diagram as shown above is much more intuitive and understandable than a table of information. This is just one reason why graph databases are popular.

However, this same information could be converted to the following table and stored in a relational database.

            name      is_friends_with
            ----      ---------------
            sam       annie
            annie     sam
            annie     jack
            jack      annie
            jack      mac
            jack      doug
            jack      harry
            doug      jack
            harry     jack
            mac       jack

The following is version of the graph also includes “company” nodes and relationships indicating where people work. In general, graphs can be quite complex and be used to store all sorts of information.

graph LR
    %% Define nodes with type labels
    sam([Person: sam])
    annie([Person: annie])
    mac([Person: mac])
    jack([Person: jack])
    doug([Person: doug])
    harry([Person: harry])
    Acme([Company: Acme Inc.])
    MikroTek([Company: MikroTek])
    
    %% Friend relationships (bidirectional)
    annie <-->|friend| sam
    jack <-->|friend| annie
    jack <-->|friend| mac
    jack <-->|friend| doug
    jack <-->|friend| harry
    
    %% Work relationships
    sam -->|works at| Acme
    annie -->|works at| Acme
    mac -->|works at| MikroTek
    jack -->|works at| MikroTek
    doug -->|works at| MikroTek
    
    %% Styling for nodes - white background with black borders and text
    classDef person fill:white,stroke:black,stroke-width:1px,color:brown;
    classDef company fill:white,stroke:black,stroke-width:2px,stroke-dasharray:5 2,color:green;
    
    %% Apply styles
    class sam,annie,mac,jack,doug,harry person;
    class Acme,MikroTek company;

27.11 Other types of NoSQL databases

The are other types of NoSQL databases. The following are also popular. We will not go into these in more detail now.

The main thing that NoSQL databases have in common is that the data is NOT stored in a rectangular form (i.e. rows and columns).

  • Key-Value Stores
    • Store data as a collection of key-value pairs.
    • Fast reads and writes, ideal for caching and session management.
    • Examples: Redis, Amazon DynamoDB, Riak
  • Column-Family Stores
    • Organize data into columns rather than rows, optimized for analytical queries.
    • Designed for high scalability and distributed storage.
    • Examples: Apache Cassandra, HBase, ScyllaDB