NoSQL and Graph Databases

This course covers several specific types of NoSQL databases


What is a database?

what is a database asks Sheldon?


Sometimes we hear the word database used to describe an Excel file.

How can we put something as simple as a CSV or Excel file on the same level as these engineering marvels that are PostgreSQL, Weaviate, MongoDB, Neo4j, Redis, MySQL, etc…

So I asked my friend GPT-4 to give me a definition of a database:

In simple terms: “A database is like a smart notebook or filing system that helps you keep track of lots of information and find exactly what you need in no time.”

which definitely includes CSV files, Excel files, JSON files, XML files, and many other simple file-based formats.

If we look for the definition of a database in a more classic source of knowledge, like the Encyclopedia Britannica, we get:

database, any collection of data or information specifically organized for rapid search by computer. Databases are structured to facilitate storage, retrieval, modification, and deletion of data.

See also the Database article on Wikipedia.

four main functional groups:

  • Data definition – Creation, modification and removal of definitions that detail how the data is to be organized.
  • Update – Insertion, modification, and deletion of the data itself.
  • Retrieval – Selecting data according to specified criteria (e.g., a query, a position in a hierarchy, or a position in relation to other data) and providing that data either directly to the user, or making it available for further processing by the database itself or by other applications. The retrieved data may be made available in a more or less direct form without modification, as it is stored in the database, or in a new form obtained by altering it or combining it with existing data from the database.
  • Administration – Registering and monitoring users, enforcing data security, monitoring performance, maintaining data integrity, dealing with concurrency control, and recovering information that has been corrupted by some event such as an unexpected system failure.

We’re no longer just talking about quickly finding information (the search part) but also about:

This is where a simple spreadsheet file no longer meets the objective.


What we expect from a DBMS

A database management system (DBMS) is distinguished from a simple spreadsheet by several essential functionalities.

Feature Description Excel DBMS
Data Storage and Retrieval Stores data in an organized manner and retrieves it as needed.
Data Manipulation Allows adding, modifying, or deleting data.
Data Querying Allows asking complex questions (queries) about the data. ✔️
Data Organization Structures data in formats such as tables, documents, or graphs to facilitate management.
Data Sharing Allows multiple users or applications to use the database simultaneously.
Data Security Protects data against unauthorized access or corruption. ✔️
Concurrency Control Manages multiple users modifying data at the same time without conflicts.  
Backup and Recovery Ensures that data is not lost and can be restored in case of failure. ✔️
Data Integrity Ensures that data remains accurate, consistent, and reliable.  
Performance Optimization Provides tools to optimize the speed and efficiency of data retrievals and updates.  
Support for Transactions Ensures that a group of operations (transactions) is completed entirely or not at all.
ACID compliance
 

A Brief History of Databases

Evolution of database from 1970 to 2024

new version

Evolution of database from 1970 to 2024

Evolution of database from 1970 to 2024

1970s - The Beginning of the Relational Era

1980s - The Domination of Relational

1990s - The Object-Oriented Wave

2000s - The Beginning of the NoSQL Revolution

And in 2009, 2 new databases are launched:

Tadaaah!

Why at this moment?

The rise of the world wide web (myspace - 2003 😍, youtube - 2005) and the massive increase in application scale by several orders of magnitude.

Suddenly, we have millions of people simultaneously trying to access and modify Terabytes of data in milliseconds.

Relational databases cannot keep up with the scale of applications, the chaotic nature of unstructured data, and the speed requirements.

The promise of NoSQL is volume and speed.

2010s - NoSQL Matures & Specialization - Big Data and Specialized Databases

and meanwhile, in 2013, Docker containers revolutionize database deployment

2020s: AI, Vector Databases, and Real-Time Needs

Vector search is booming. Vector search capabilities are being integrated into most existing DBMS, including PostgreSQL, MongoDB, and Neo4j.

In Brief

Main Categories of Databases Today

We have many databases to choose from. It all depends on scale, application nature, budget, etc.

Database Type Purpose Examples Application
Relational - SQL Fixed schema PostgreSQL, MySQL, Oracle Transactions, normalization, OLTP
Document Stores Flexible schema, JSON-like documents MongoDB, CouchDB Web applications, content management
Graph Databases Relationship-centered data Neo4j, ArangoDB Social networks, recommendation engines
Key-Value Stores Simple and fast lookups Redis, DynamoDB Caching, session management
Vector Databases Similarity search, AI embeddings Pinecone, Weaviate AI applications, semantic search
Column-Family Stores Wide-column data, high scalability Cassandra, HBase, BQ Time-series, big data applications. OLAP
Time-Series Databases Time-ordered data InfluxDB, TimescaleDB IoT, monitoring systems

So many databases! So little time!

Check the ranking of all databases at https://db-engines.com/en/ranking

Trends: https://db-engines.com/en/ranking_trend

database trends

Numerous players:

Database Market Map 2024

source: https://www.generativevalue.com/p/a-primer-on-databases

Also see this interactive map that lists all players in 2023.


Let’s take a step back and compare an SQL relational database and a NoSQL database.

SQL

An SQL relational database (and SQL variations):

A relational database can be compared to a collection of well-organized spreadsheets (tables) where each column is defined, and the tables are interconnected.

Tables have columns and rows of data. Each table has a unique key called a primary key.

Schema design relies on the concept of normalization/denormalization. In a normalized database, information exists only in one table. A normalized database validates a series of rules called NF1, NF2, …

Important concept of Normalization: information exists in one and only one place.

Hierarchy of an SQL Database:

SQL databases = rigid, controlled, consistent data, stable.

Can be complex.


NoSQL

A non-relational database is a flexible file system where you can store items

NoSQL databases use non-relational data models, such as:

NoSQL databases encompass very different data organization systems.

since NoSQL is litteraly No + SQL it contains everything that is not SQL. A definition by the negative.

The term “NoSQL” is fundamentally a negation, which is why it’s often criticized as being a poor category name — it tells you what it’s against but not what it’s for.

A data representation based on a data graph is very different from a document-type representation similar to JSON where content variability is unlimited.

Schema flexibility

In all cases, we talk about flexible schema with unstructured or semi-structured data. Schema flexibility allows adding new data types or changing the data structure without requiring complex migrations.

NoSQL databases are ideal for

Comparison of Replication and Partitioning

Although both SQL and NoSQL databases support replication and sharding, there are key differences in their implementation and use:

Aspect SQL Databases NoSQL Databases
Replication Often synchronous or asynchronous, with possible strong consistency. Often asynchronous, with eventual consistency.
Sharding More complex to implement, requires external tools or manual configurations. Natively integrated in many NoSQL databases (e.g., MongoDB, Cassandra).
Flexibility Less flexible due to rigid schema and ACID constraints. Very flexible, suited for unstructured data and varied data models.
Horizontal scalability Possible, but often more difficult to manage due to joins and distributed transactions. Designed from the outset for horizontal scalability, with native distributed architectures.

Terminology

In a document-type NoSQL database like MongoDB

All documents in a collection are similar in structure but don’t need to be exactly identical.

There is no concept of normalization.

MongoDB Hierarchy - NoSQL Database - MongoDB

MongoDB - SQL:

MongoDB SQL Database
database database
Collection Table
Document Record/Row
Field Column
Embedded Document Foreign Key
_id Primary Key
$lookup JOIN

An index remains an index


Schema-less - dynamic schema

In what situations do data change so often that we would need a special type of database?

The most common example of a NoSQL application is a social network.


Yuka and Food Products

Let’s take the example of a startup like Yuka

Yuka

In their own words: Yuka deciphers product labels and analyzes the health impact of food and cosmetic products.

Its underlying database is the open food facts database, a food product database made by everyone, for everyone, with more than 3.5 million food products.

Look, for example, at the information for Nutella and that for… Baguette

Also see this article exploring the dataset with python pandas: https://medium.com/@achrafelkhanjari99/a-deep-dive-into-the-open-food-facts-dataset-56259b162ac5

With so many products, the available information and associated information (Packaging, Carbon Impact) as well as the diversity of regulations (EU, US, UK, … etc.) constantly vary.

The data is constantly updated while the history of changes and new additions must be preserved.

So you start your database with a simple schema that includes:

But the schema becomes increasingly complex as the data, products, and company services evolve.


Nutriscore

Take for example, the Nutriscore label:

Nutriscore

The Nutriscore has recently evolved with a stricter new version. So you need the new Nutriscore labels while keeping the old one because not all products implement the new Nutriscore. Some companies have even completely abandoned labeling.

You started with a Nutriscore table in an SQL database:

product_id: key
nutriscore_label : array[A,B, .., E]

so your Nutriscore table requires a new column:

product_id: key
nutriscore_label: array[A,B, .., E]
nutriscore_new_label: array[A,B, .., E]

However, most products don’t yet have a new Nutriscore label.

And you end up with a lot of null values in this nutriscore_new_label column, and null values should be avoided 👹👹👹.

NULL values headache

You can also normalize the table and introduce a Nutriscore version column to help with Null values.

product_id: key
nutriscore_label: array[A,B, .., E]
nutriscore_version: Int

In both cases, you have to change all your SQL queries in your codebase.

Pain, worries, migraines, bugs, and additional costs $$$.

Introduction to Schema Flexibility

Schema Flexibility in MongoDB and other NoSQL databases refers to the ability to store data without requiring a predefined schema. This means that documents in the same collection can have different fields/attributes, structures, and data types.

Schema Flexibility helps manage unknown unknowns in a rapidly changing world.

Data Presence and Type

In MongoDB: You can simply add a new Nutriscore element to food products:

No Nutriscore

{
  "product_id": 198273,
  "name": "Chocapic",
}

Nutriscore is added, just add a field to the product document

{
  "product_id": 198273,
  "name": "Chocapic",
  "Nutriscore": "C"
}

A new version of Nutriscore arrives, just add the Nutriscore label as a dictionary with versions as keys:

{
  "product_id": 198273,
  "name": "Chocapic",
  "Nutriscore": {
    "v1": C,
    "v2": D,
  }
}

Several representations can therefore coexist in the same database:

Nested Data

A good example is a person’s address

In the same database, we can have people who have no address, one address, or several. And these multiple addresses have different roles: primary residence, secondary, etc….

If we use a JSON format to represent these 3 cases, we naturally have

// A person with no registered address
{
    "_id": "1",
    "name": "Anita Sharma",
    "age": 29,
    "email": "[email protected]"
}

// A person with a single address as a simple dictionary
{
    "_id": "2",
    "name": "Rahul Verma",
    "age": 42,
    "email": "[email protected]",
    "address": {
        "type": "home",
        "street": "12 MG Road",
        "locality": "Indiranagar",
        "city": "Bengaluru",
        "state": "Karnataka",
        "pincode": "560038",
        "country": "India"
    }
}

// A person with multiple addresses as a list of dictionaries
{
    "_id": "3",
    "name": "Priya Singh",
    "age": 35,
    "email": "[email protected]",
    "addresses": [
        {
            "type": "home",
            "street": "45/2 Lajpat Nagar",
            "locality": "Central Market",
            "city": "New Delhi",
            "state": "Delhi",
            "pincode": "110024",
            "country": "India"
        },
        {
            "type": "work",
            "street": "4th Floor, Tower B",
            "locality": "DLF Cyber City",
            "city": "Gurugram",
            "state": "Haryana",
            "pincode": "122002",
            "country": "India"
        }
    ]
}

In SQL, you would need to have an address table and a many-to-many relationship between the person table and the person table, so an intermediate table for the join.


Consequences of Schema Flexibility

Schema flexibility impacts every stage of a database’s lifecycle

Extra caution is necessary to avoid chaos and data inconsistencies. Query performance can be affected if changes in data structure lead to inefficient or inconsistent indexing.

With NoSQL databases, the cost of implementing changes in the nature of data is shifted from the database to the application level.

However, inconsistencies in the database can still occur if multiple applications interact differently with the same database.

In short, schema flexibility should be used with caution and only when useful and justified.


When to Choose NoSQL (Document Database) Rather Than SQL?

So when is a NoSQL document database a better choice than SQL?

MongoDB scales out, while PostgreSQL scales up.

MongoDB excels in:


Performance

In terms of performance, the comparison most often favors PostgreSQL over MongoDB. see MongoDB Vs PostgreSQL: A comparative study on performance aspects

And this other article, MongoDB vs PostgreSQL: Choosing the Best Database for Your Needs, summarizes it well:

_MongoDB shines in scenarios requiring the development of software applications that process various types of data in a scalable manner. It is particularly suitable for projects that need to support rapid iterative development and facilitate the collaboration of many teams.

In Postgres vs. MongoDB: a Complete Comparison in 2024


In Brief


What About Graph Databases?

SQL databases are called relational databases.

In an SQL database, the relationship between tables is explicitly defined by foreign keys between tables.

And the ERD for such a database only indicates the cardinality of the relationship:

ERD

When you ask an LLM to generate a diagram for a product database, it naturally adds meaningful information to the relationships between tables.

product veggies schema relation Mermaid diagram


Relationships

Graph databases like Neo4j are centered on the meaning of relationships between entities.

Relationships are as important as the data itself and are explicitly stored.

These relationships have their own properties and are stored as connections.

Whereas in SQL:

This is why Neo4j excels in questions like:

Here is an example of a knowledge graph with Obsidian

Obsidian Knowledge Graph


Conclusion

So SQL databases, called relational databases, are excellent for data structures that don’t change often and where relationships between objects are stable.

NoSQL databases like MongoDB: excellent when data specifications evolve rapidly or are not definitive, flexible schema, large scale

Graph databases: the relationship is key. It’s not just that there is a relationship but also what is the nature of that relationship.

In this session, you learned: