28  28. “databases” and “database management systems”

A “database (DB)” is a collection of data that can be accessed by a computer programs (see below) in an accurate and efficient way.

A “Database Management System (DBMS)” is the software that is used to enable the storage and retrieval of the data in a database.

28.1 Common features of databases

Most databases and DBMS software provide the following features:

  • Able to be contolled by other software - Databases and DBMS software are usually used as part of a larger software system. Software that is coded in programming languages such as R, Python, Java, etc. have the ability to interact with a DBMS in order to store an retrieve data from a database.

    For example, websites that manipulate a lot of data generally consist of a “web server software program” that users communicate with. The “web server software” program in turn communicates with a DBMS in order to store and retrieve data for each user. For example, facebook.com makes use of one or more databases to store and retrieve messages, friend information, advertisements, etc. The web server program retrieves the information it needs from a database and uses that information to create a web page that is then returned to the user and displayed in the browser.

  • Reliable and Accurate Data – Databases have built-in rules to keep data correct and consistent. They prevent mistakes like duplicate entries or missing information by ensuring that all data follows set guidelines.

  • Multi User Access – Databases allow many people to access and update data at the same time without causing errors or losing information. They use special features to keep everything organized, even when multiple users make changes at once.

  • Handles Tremendous Amounts of Data – Databases are designed to store and manage huge amounts of data efficiently. They remain fast and responsive even as the amount of data grows.

  • Security – Databases have security settings to control who can see or change the data.

  • Reliability - Databases have safeguards to ensure that changes are properly saved and that information isn’t lost, even if something goes wrong (e.g. a blackout, earthquake, etc.).

Note that Excel is not usually considered a “database”. Excel is FANTASTIC as a tool for quickly gaining insights into data. However, Excel does not have most of the features listed above that describe typical databases.

28.2 Different categories of databases

There are different types of databases and Database Management System (DBMS) software. These types can be categorized by how their data is structured.

Relational databases (sometimes referred to as “SQL” databases) arrange data in “tables” that are comprised of “rows” and “columns” (similar to an R dataframe or an Excel spreadsheet).

Non-Relational databases (often referred to as “NoSQL” databases) structure data in other forms (e.g. JSON or graphs - see below)

28.3 What is a “Relational Database”

Relational databases store information in “tables” that have rows and columns of information - very similar to the way the data is arranged in an R dataframe. A single relational database is a collection of one or more tables. Each column of a relational database table (a) has a name (b) has a datatype (e.g. numeric, logical, character, etc)

There are MANY different relational database products in use today. Some very popular ones are:

  • Postgres
  • MySql
  • MariaDb
  • Microsoft SQL Server
  • Oracle Relational Database Management System
  • SQLite
  • many, many others

28.4 What is SQL (often pronounced “sequel”)

Relational database software uses Structured Query Language (SQL) to manipulate database contents. SQL commands are categorized as follows:

Relational database software uses the Structured Query Language (SQL) (often pronounced “sequel”) to manipulate the contents of the database. Commands in the SQL language are often grouped into the following categories:

(For this class, we will focus on the first two categories listed below, i.e. DDL and DML)

Category Commands
DDL (Data Definition Language) Commands
Defines and modifies database structure.
CREATE DATABASE – Creates a new database.
CREATE TABLE – Creates a new table in an existing database.
DROP TABLE – Removes an entire table from a database.
DML (Data Manipulation Language) Commands
Manipulates data within tables.
SELECT – retrieves data from database tables. The SELECT statement has many different options that allow the data to be retireved in very customized formats and arrangements. E.g. SELECT can be used to summarize data, retrieve only data that has specific attributes, display data in specific order, etc.
INSERT – Adds new rows to a table.
DELETE – Removes rows from a table.
UPDATE – Modifies existing data in a table.
DCL (Data Control Language) Commands
Manages user permissions.
GRANT – Give permission to specific users to access specific data in the database.
REVOKE – Remove permission from specific users to access specific info in the database.
TCL (Transaction Control Language) Commands
Transactions ensure that a group of operations either all succeed or none take effect. If a failure occurs, the system ensures consistency using transaction control commands. (* see below for more info about transactions)

START TRANSACTION – Begins a transaction.
COMMIT – Saves changes permanently.
ROLLBACK – Reverts changes if needed.
SAVEPOINT – Creates checkpoints within a transaction.

(* transactions explained more fully): Most Relational databases offer “transaction control”. A transaction is a set of operations that are supposed to either all happen as a group or not happen at all. For example, when processing a money transfer between two bank accounts, the following steps must happen:

Step 1: remove money from account 1
Step 2: deposit money in account 2

These two steps are viewed as a “transaction”. Either both steps should happen - in which case the transaction was successful - or neither step should happen. Imagine that after the computer did Step 1, the computer crashes (e.g. due to a blackout) before it had a chance to do Step 2. Ideally, when the computer is restarted, either, Step 2 should happen or Step 1 should be “undone”. Most Relational Database Management Software contains features that ensure that this logic happens correctly.

(For this class, we will mostly focus on the DDL an DML categories of commands. Most of our time will be spent studying the numerous features of the SELECT statement)

28.5 Different “flavors” of SQL

Each relational database software product has its own “flavor” of the SQL language. You can think of this as different “dialects” of the same language. For example the difference between English as it’s used in the USA vs as it is used in England.

ANSI (American National Standards Institute) publishes a SQL Standard that is generally recognized by all relational database software programs.

However, most relational database products add additional capabilities and commands to its version of SQL to differentiate it from other relational database products in the market.

In this class we will be working with the SQLite version of SQL. However, we will try to focus primarily on the standard SQL features that are included in all relational database products and not on the extensions that are provided only by SQLite. When appropriate, we will try to highlight differences between the way that SQLite and other RDBMS products work.

The ANSI standard was first introduced in 1986 and has been reissued every few years since then. The following gives a flavor of features that were introduced or “deprecated” (keep reading) in different versions of the ANSI standard. To “deprecate” a feature means to recommend that it not be used as the feature might actually be removed in a future version.

Info about ANSI SQL standard
SQL Version Year Key Features Introduced Deprecated Features
SQL-86 (SQL-87) 1986/1987 First ANSI SQL standard, basic SQL syntax (SELECT, INSERT, UPDATE, DELETE), table creation, constraints None
SQL-89 1989 GRANT and REVOKE for access control None
SQL-92 1992 Expanded data types (DATE, TIME, INTERVAL), subqueries, joins (INNER JOIN, LEFT JOIN, RIGHT JOIN), string pattern matching (LIKE) Some legacy syntax
SQL:1999 1999 Recursive queries (WITH RECURSIVE), OLAP operations (ROLLUP, CUBE), triggers, procedural constructs (CASE, IF, LOOP), user-defined types (UDTs) Certain outdated syntax elements
SQL:2003 2003 Window functions (OVER, PARTITION BY), XML support, auto-increment (IDENTITY, GENERATED), MERGE statement Outdated join syntax
SQL:2006 2006 Full XML data handling, querying, and updates None
SQL:2008 2008 TRUNCATE TABLE, FETCH FIRST (limit control), improvements to MERGE None
SQL:2011 2011 Temporal tables (SYSTEM VERSIONING for time-travel queries) None
SQL:2016 2016 JSON support (JSON_VALUE, JSON_QUERY), row pattern recognition (MATCH_RECOGNIZE) None
SQL:2019 2019 Polymorphic table functions, new JSON functions and enhancements None

28.6 Non-Relational Databases (AKA NoSQL databases)

NoSQL (or non-relational) databases organize information in formats other than rows and columns.

SOME NoSql databases actually CAN be manipulated with the SQL language. However, the term NoSql has come to include any database that stores information in a form other than in strict rows and columns.

Some examples are shown below.

28.7 Document databases - e.g. MongoDB

MongoDb is an example of Database Management software that stores information in “JSON” notation. Other similar products include CouchDB, Firebase and Firestore. We will explore more about JSON later in the semester. However as an example, the above information could be represented in a JSON file as shown below:

           [
             {
               "name" : "annie",
               "friends" : [ "sam", "jack" ]
             },
             {
               "name" : "jack",
               "friends" : [ "annie" , "mac" , "doug" , "harry" ]
             },
             {
               "name" : "harry",
               "friends" : [ "jack" ]
             },
             {
               "name" : "sam",
               "friends" : [ "annie" ]
             },
             {
               "name" : "mac",
               "friends" : [ "jack" ]
             },
             {
               "name" : "harry",
               "friends" : [ "jack" ]
             },
             {
               "name" : "doug",
               "friends" : [ "jack" ]
             }
           ]

28.8 Graph databases - e.g. Neo4J

Neo4J is an example of a “graph database”. These types of databases organize their information in a “graph” structure. A graph looks like a “web” of information. It is comprised of “nodes” that contain data which are connected via “edges” that represent relationships between the nodes.

For example:

graph TD
    %% Position nodes to match example layout
    sam([sam])
    annie([annie])
    mac([mac])
    jack([jack])
    doug([doug])
    harry([harry])
    
    %% Create bidirectional connections with straight lines
    annie <-->|friend| sam
    jack <-->|friend| annie
    jack <-->|friend| mac
    jack <-->|friend| doug
    jack <-->|friend| harry
    
    %% Black and white styling
    classDef person fill:white,stroke:black,stroke-width:1px;
    class sam,annie,mac,jack,doug,harry person;
    
    %% Force specific layout
    sam ~~~ annie
    mac ~~~ jack ~~~ doug
    jack ~~~ harry
    
    %% Make invisible linkage lines transparent
    linkStyle 5,6,7 stroke:none;

This example shows a set of people and who is friends with whom. There is a different icon for each person. “Friend arrows” indicate who is friends with whom.

For many types of information a diagram as shown above is much more intuitive and understandable than a table of information. This is just one reason why graph databases are popular.

However, this same information could be converted to the following table and stored in a relational database.

            name      is_friends_with
            ----      ---------------
            sam       annie
            annie     sam
            annie     jack
            jack      annie
            jack      mac
            jack      doug
            jack      harry
            doug      jack
            harry     jack
            mac       jack

The following is version of the graph also includes “company” nodes and relationships indicating where people work. In general, graphs can be quite complex and be used to store all sorts of information.

graph LR
    %% Define nodes with type labels
    sam([Person: sam])
    annie([Person: annie])
    mac([Person: mac])
    jack([Person: jack])
    doug([Person: doug])
    harry([Person: harry])
    Acme([Company: Acme Inc.])
    MikroTek([Company: MikroTek])
    
    %% Friend relationships (bidirectional)
    annie <-->|friend| sam
    jack <-->|friend| annie
    jack <-->|friend| mac
    jack <-->|friend| doug
    jack <-->|friend| harry
    
    %% Work relationships
    sam -->|works at| Acme
    annie -->|works at| Acme
    mac -->|works at| MikroTek
    jack -->|works at| MikroTek
    doug -->|works at| MikroTek
    
    %% Styling for nodes - white background with black borders and text
    classDef person fill:white,stroke:black,stroke-width:1px,color:brown;
    classDef company fill:white,stroke:black,stroke-width:2px,stroke-dasharray:5 2,color:green;
    
    %% Apply styles
    class sam,annie,mac,jack,doug,harry person;
    class Acme,MikroTek company;

28.9 Other types of NoSQL databases

The are other types of NoSQL databases. The following are also popular. We will not go into these in more detail now.

The main thing that NoSQL databases have in common is that the data is NOT stored in a rectangular form (i.e. rows and columns).

  • Key-Value Stores
    • Store data as a collection of key-value pairs.
    • Fast reads and writes, ideal for caching and session management.
    • Examples: Redis, Amazon DynamoDB, Riak
  • Column-Family Stores
    • Organize data into columns rather than rows, optimized for analytical queries.
    • Designed for high scalability and distributed storage.
    • Examples: Apache Cassandra, HBase, ScyllaDB