MongoDB: Deep Dive

What truly characterizes MongoDB:

Choosing a data store like MongoDB over a more standard SQL database is driven by the data and by the application that will consume that data.


SQL vs NoSQL

This is the first question you'll probably ask when starting to design your application.

We talked about this last time. But it's interesting to examine this in more detail.

In short, document-based databases make sense for flexible, polymorphic data that the application reads and writes as whole objects.


By the way, it's well known that MongoDB is webscale:


SQL vs NoSQL

You've moved beyond a simple SQLite database or flat JSON/CSV file storage.

What should you turn to next?

Some questions come up when comparing these different backends.

An excellent article on when to use NoSQL: https://medium.com/@sqlinsix/when-to-use-sql-or-nosql-b50d4a52c157


Sidenote: NoSQL supports SQL!

The article starts with:

NoSQL means Not only SQL.

=> With some NoSQL APIs, you may be able to write SQL against them.

And indeed, some NoSQL databases have evolved to support SQL-like query languages or SQL interfaces, even though they weren't originally designed with SQL in mind.


The workout example - flexible data

Imagine a gym/workout application where you track your sessions.

At the gym

The SQL part

There's a series of entities (think tables) that map naturally onto a SQL database: members, subscription plans, payments, and so on.

So subscription management fits well with a relational/SQL database.

The data seldom evolves, and the transactional aspect is important.


The NoSQL part: document

But the type of workout a user does varies a lot by its very nature.

A fixed data schema is not suitable.

Another extract from the same article:

"The complexity of joins would be enormous compared to a person's training, as there can be dozens of combinations. Additionally, workouts such as drop sets where the weight constantly changes and where a maximum (or minimum) repetition range must be respected would be extremely difficult to track with a fixed schema for a SQL database."

An application that consumes workout data is a good example where a flexible-schema data structure really makes sense.
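
To make this concrete, here is a minimal sketch assuming a hypothetical workouts collection: two sessions with completely different shapes can sit side by side in the same collection, with no ALTER TABLE required.

// Hypothetical "workouts" collection: documents with different shapes coexist.
db.workouts.insertMany([
  {
    user_id: 42,
    date: ISODate("2024-03-01"),
    type: "strength",
    exercises: [
      { name: "bench press", sets: [ { reps: 10, weight_kg: 60 }, { reps: 8, weight_kg: 70 } ] },
      // drop set: the weight changes on every set
      { name: "biceps curl", drop_set: true,
        sets: [ { reps: 12, weight_kg: 14 }, { reps: 10, weight_kg: 12 }, { reps: 8, weight_kg: 10 } ] }
    ]
  },
  {
    user_id: 42,
    date: ISODate("2024-03-03"),
    type: "cardio",
    machine: "rowing",
    duration_min: 30,
    avg_heart_rate: 142
  }
]);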


SQL vs NoSQL

You need to think about how the application consumes the data.

For example, the application may need to retrieve an entire workout for a given day in a single read.

Note: PostgreSQL now has JSON and JSONB column types that can be used for such polymorphic data cases.


SQL vs NoSQL

Data polymorphism

When data has attributes that vary significantly, a NoSQL document store database is better suited for the final application.

On the other hand, stable data that evolves little is better handled by a classic SQL database.


ACID:

ACID is a set of properties guaranteeing the reliability of transactions in a database:

  1. Atomicity: A transaction is either fully executed or canceled in case of failure (all or nothing).
  2. Consistency: A transaction brings the database from one valid state to another valid state, respecting integrity constraints.
  3. Isolation: Concurrent transactions execute without interfering with each other, as if they were executed sequentially.
  4. Durability: Once validated, a transaction is permanently recorded, even in case of failure.

These properties are essential for ensuring data integrity in relational databases and some NoSQL databases.

The need for ACID is detailed here.


| Criteria                    | PostgreSQL             | MongoDB                                  |
| --------------------------- | ---------------------- | ---------------------------------------- |
| Native ACID?                | ✅ Yes (complete)       | ⚠️ Partial (multi-document is expensive) |
| Multi-document transactions | ✅ Yes, and optimized   | ⚠️ Yes, but expensive                    |
| Strong isolation            | ✅ Yes (SERIALIZABLE)   | ⚠️ Not as strict                         |
| Scalability                 | ⚠️ Less flexible       | ✅ More flexible for big data            |

Is MongoDB less reliable for ACID than PostgreSQL?

👉 Yes, MongoDB is less strict than PostgreSQL regarding ACID, particularly for multi-document transactions and concurrent isolation.

Example: Multi-document transaction consistency problem

Banking database case: Money transfer between two accounts


Money transfer between two accounts

Imagine you have a bank database with two collections: accounts and transactions.

PostgreSQL (Robust ACID)

BEGIN;

UPDATE accounts
SET balance = balance - 500
WHERE account_id = 1;

UPDATE accounts
SET balance = balance + 500
WHERE account_id = 2;

INSERT INTO transactions (from_account, to_account, amount)
VALUES (1, 2, 500);

COMMIT;

MongoDB (Risk with multi-document transactions)

const session = db.getMongo().startSession();
session.startTransaction();

try {
  // Debit 500€ from account 1
  session.getDatabase("bank").accounts.updateOne(
    { account_id: 1 },
    { $inc: { balance: -500 } }
  );

  // Credit 500€ to account 2
  session.getDatabase("bank").accounts.updateOne(
    { account_id: 2 },
    { $inc: { balance: 500 } }
  );

  // Add a transaction record
  session.getDatabase("bank").transactions.insertOne(
    { from: 1, to: 2, amount: 500, timestamp: new Date() }
  );

  session.commitTransaction();
} catch (error) {
  session.abortTransaction();
}
session.endSession();
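
As an aside, a minimal sketch (not a rewrite of the code above): mongosh sessions also expose withTransaction(), which commits for you and retries the callback on transient errors such as write conflicts.

// Separate session (named session2 to avoid clashing with the one above).
const session2 = db.getMongo().startSession();
try {
  // withTransaction() retries the whole callback on transient transaction errors
  session2.withTransaction(() => {
    const bank = session2.getDatabase("bank");
    bank.accounts.updateOne({ account_id: 1 }, { $inc: { balance: -500 } });
    bank.accounts.updateOne({ account_id: 2 }, { $inc: { balance: 500 } });
    bank.transactions.insertOne({ from: 1, to: 2, amount: 500, timestamp: new Date() });
  });
} finally {
  session2.endSession();
}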

🚨 Potential problem with MongoDB: under concurrent load, these writes can hit write conflicts that abort the transaction and force a retry, and even when everything succeeds the isolation guarantee is weaker than PostgreSQL's SERIALIZABLE (see below).


MongoDB has Snapshot Isolation

When a transaction starts, MongoDB gives it a fixed snapshot of the data.

The transaction reads a frozen picture of the database.

It does not see changes from other concurrent transactions.

But it does NOT prevent certain race conditions, because it does not guarantee that concurrent transactions behave as if they had run one after the other (serializability).

This leaves room for write conflicts (which abort one of the transactions) and for anomalies like write skew, sketched below.
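
A sketch of what write skew looks like in practice, assuming the same hypothetical bank database and an application invariant that the sum of the two balances must stay >= 0 (exact behaviour depends on timing):

// Start: account 1 = 300, account 2 = 300 (sum = 600).
// Two withdrawals of 500 run concurrently, each in its own transaction.
const s1 = db.getMongo().startSession();
s1.startTransaction();                      // T1 sees the snapshot { 300, 300 }
const bank1 = s1.getDatabase("bank");

const sum1 = bank1.accounts
  .aggregate([ { $group: { _id: null, total: { $sum: "$balance" } } } ])
  .toArray()[0].total;                      // reads 600 on its snapshot

if (sum1 - 500 >= 0) {
  // T1 debits account 1. Meanwhile T2 runs the same check on the same
  // snapshot and debits account 2. The two transactions write DIFFERENT
  // documents, so no write conflict is detected and both can commit.
  bank1.accounts.updateOne({ account_id: 1 }, { $inc: { balance: -500 } });
}
s1.commitTransaction();
s1.endSession();
// If T2 performs the symmetric operation: balances end at -200 and -200 (sum = -400).
// A SERIALIZABLE engine would have forced one of the two transactions to fail or wait.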


🔎 Conclusion

| Criteria                    | PostgreSQL             | MongoDB                                  |
| --------------------------- | ---------------------- | ---------------------------------------- |
| Native ACID?                | ✅ Yes (complete)       | ⚠️ Partial (multi-document is expensive) |
| Multi-document transactions | ✅ Yes, and optimized   | ⚠️ Yes, but expensive                    |
| Strong isolation            | ✅ Yes (SERIALIZABLE)   | ⚠️ Not as strict                         |
| Scalability                 | ⚠️ Less flexible       | ✅ More flexible for big data            |

OLTP vs OLAP

---

OLTP vs OLAP

Consider 2 main types of database usage: transactions or dashboards

Do you need an OLAP (Online Analytical Processing) or OLTP (Online Transaction Processing) database?

And in which case will a MongoDB-type database be more suitable than a SQL database?


OLTP

On one side, OLTP (Online Transaction Processing) systems focus on managing a large number of short and atomic transactions in real time.

They are optimized for write operations and transactional integrity, commonly used for applications such as e-commerce, banking, and inventory management.

In OLTP systems, we maximize write reliability, data integrity, and database normalization.

Think OLTP == atomic transactions: many small, quick operations such as placing an order, decrementing a stock count, or reading a single user profile (see the sketch below).

=> Random access patterns: each operation touches a few specific records.
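
A minimal illustration with hypothetical e-commerce collections: each OLTP operation targets one specific record and must complete in milliseconds.

db.inventory.updateOne({ sku: "A-12" }, { $inc: { stock: -1 } });                   // decrement stock after a sale
db.orders.insertOne({ user_id: 7, sku: "A-12", qty: 1, created_at: new Date() });   // record the order
db.users.findOne({ user_id: 7 });                                                   // fetch a single user profile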


OLAP

On the other side, OLAP (Online Analytical Processing) systems are designed to query and analyze large volumes of data, often involving complex joins, aggregations, and data transformations.

These systems are read-intensive and require optimized data retrieval.

Denormalization simplifies queries and improves performance.

Think OLAP ~= dashboards.

Examples: monthly revenue per region, average rating per genre, year-over-year growth of active users (see the aggregation sketch below).

This requires scanning everything or huge portions of the dataset → sequential reads.
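
For instance, an OLAP-style question on the sample_mflix dataset used later in this lecture (a sketch; field names as in that dataset): the average IMDB rating per genre, which forces the server to read essentially every document.

db.movies.aggregate([
  { $unwind: "$genres" },
  { $group: { _id: "$genres", avgRating: { $avg: "$imdb.rating" }, movieCount: { $sum: 1 } } },
  { $sort: { avgRating: -1 } }
]);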


OLTP vs OLAP

So, the question is: should we use NoSQL or SQL for OLAP? for OLTP?


OLTP vs OLAP

MongoDB, as a document-oriented NoSQL database, is optimized for fast read and write operations on individual documents, which is ideal for OLTP applications.

However:

For OLTP applications in the strict sense of the term, we would still choose a SQL database, because its ACID guarantees are stronger than MongoDB's.

So

OLTP: Focuses on managing real-time transactional data, emphasizing speed and accuracy for operations like order processing, inventory management, and financial transactions.

OLAP: Centers on complex data analysis and reporting, handling large volumes of historical data to support decision-making processes.


Here's a recap table

| Category                 | MongoDB                                                                                | PostgreSQL                                                                                        |
| ------------------------ | -------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| Best Fit (OLTP vs OLAP)  | OLTP-style apps with flexible schemas; OLAP via pipelines but limited for heavy joins   | Strong OLTP; solid OLAP with SQL, CTEs, window functions; pairs with warehouses for large-scale OLAP |
| Strengths                | Single-document reads/writes, flexible schema, horizontal scaling                       | Relational integrity, powerful SQL, complex joins, mature tooling                                 |
| Join Handling            | Limited (denormalize / $lookup); joins not native like SQL                              | Native, optimized joins across normalized schemas                                                 |
| ACID Transactions        | Single-document ACID by default; multi-document transactions exist but add overhead     | Fully ACID-compliant with strong consistency guarantees                                           |
| Typical Use Cases        | Event logs, content/user profiles, catalogs, fast CRUD with evolving schema             | Financial/ordering systems, analytics on normalized data, complex queries                         |

OLAP vs OLTP

In short:

MongoDB is great for high-throughput OLTP workloads, as long as your application doesn't depend on strict multi-document ACID guarantees, an area where PostgreSQL is stronger.


[Thought EXERCISE] questioning an LLM

OK, MongoDB is not efficient for OLAP workloads that require many joins.

But

Scenario: if you offload the join and aggregation workload to a separate data aggregation step (run as regular background tasks, for example) and insert the results into a MongoDB collection, does a document database like MongoDB become competitive for OLAP applications?
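
One possible shape of that background step, as a sketch with hypothetical collections (orders feeding a daily_sales_by_region reporting collection; $dateTrunc requires MongoDB 5.0+):

// Scheduled job: compute the heavy aggregation once, materialize the result.
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: {
      _id: { day: { $dateTrunc: { date: "$created_at", unit: "day" } }, region: "$region" },
      revenue: { $sum: "$amount" },
      orders:  { $sum: 1 }
  } },
  // $merge upserts the results into the reporting collection
  { $merge: { into: "daily_sales_by_region", whenMatched: "replace", whenNotMatched: "insert" } }
]);

// The dashboard query then becomes a cheap read on a small collection:
db.daily_sales_by_region.find({ "_id.region": "EU" }).sort({ "_id.day": -1 }).limit(30);

Whether this makes MongoDB "competitive for OLAP" is exactly the question: the joins have not disappeared, they have been paid for up front in the background job.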


Cases better suited to MongoDB

beyond OLTP / OLAP

MongoDB is well-suited for applications that require flexibility, scalability, and the ability to handle diverse and evolving data types.


Moral of the story

This quote highlights when to use NoSQL:

extract from: https://softwareengineering.stackexchange.com/a/355964/440337

NOSQL is very much designed as a persistence layer for an application.
So access is optimized for a single object with children.

Many things to unpack here:

NoSQL document store databases are built specifically to meet the data storage needs of applications. The important word being "Application".

In comparison, SQL databases are designed for data organization and relationships, but the application needs are secondary.


So

Why is this optimized for a single object with children?

single is an important word here:

NoSQL databases physically store the main object and its sub-objects together in a single location on disk, instead of splitting the associated data across multiple tables as relational databases do.

Retrieving all associated data requires fewer disk operations and is less resource-intensive and therefore faster.

This also implies a key architectural principle: NoSQL databases are designed around the specific ways applications need to access data, rather than maintaining formal data relationships.

Application needs >>> Structured data organization

This makes them particularly efficient when an application needs to quickly retrieve/update an entire object with all its associated data in a single operation.
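
A minimal sketch with a hypothetical orders collection: the order and its line items live in one document, so one read returns the whole object.

db.orders.insertOne({
  order_id: 1001,
  customer: { id: 7, name: "Alice" },
  items: [
    { sku: "A-12", qty: 2, unit_price: 19.9 },
    { sku: "B-07", qty: 1, unit_price: 49.0 }
  ],
  created_at: new Date()
});

// One operation retrieves the order and all of its children.
db.orders.findOne({ order_id: 1001 });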


Let's recap

NoSQL / document when: the schema is flexible or polymorphic, the application reads and writes whole objects (a document and its children) in single operations, and horizontal scaling matters.

SQL when: the data is stable and normalized, relationships and multi-record transactions matter, and queries involve joins and aggregations.


NoSQL database performance

---

NoSQL database performance

Several technical factors can affect performance:

Random disk seeks (fetching documents scattered across the disk) are much slower than sequential reads, because the read head must physically move to different locations. This is why operations that require scanning many documents (like analytical queries) can be slower in MongoDB than in relational databases.

Regarding disk access: the problem of data scattered across the physical disk at random locations is true for all databases.

But PostgreSQL (for example) is much more optimized for this kind of task than MongoDB is.


Handling multi-record operations

There are key architectural differences in how the two databases handle multi-record operations:


Recap Access Efficiency


Joins and buffering

So, while both databases must perform disk operations, PostgreSQL includes specific optimizations to efficiently handle multi-record operations, particularly when it comes to joins and large table scans.

These optimizations are less present in MongoDB because its architecture prioritizes single-document operations.


Conclusion on SQL vs NoSQL - PostgreSQL vs MongoDB

It is impossible to escape complexity.

Speed is not the only factor to consider when choosing a database.

Regardless of any technical consideration, engineer availability and productivity have more impact on costs than a few milliseconds gained on a query.

What motivates MongoDB adoption over PostgreSQL is above all the flexible document model, how closely it matches the application's objects, and easier horizontal scaling.


Performance

PostgreSQL vs MongoDB performance comparison

Read this article https://medium.com/@vosarat1995/postgres-vs-mongo-performance-comparison-for-semi-structured-data-9a5ec6486cf6

The author creates a large dataset and tests the performance of MongoDB and PostgreSQL.

| Test                             | Mongo     | PostgreSQL |
| -------------------------------- | --------- | ---------- |
| Index creation on empty database | 152.9 ms  | 402.0 ms   |
| Batch insertion                  | 53.50 ms  | 219.15 ms  |
| One-by-one insertion             | 319.3 ms  | 641.9 ms   |
| Multiple record reading          | 21.83 ms  | 66.63 ms   |
| Index creation on 1000 rows      | 1.563 s   | 1.916 s    |
| Complex Query                    | 174.51 ms | 60.43 ms   |

MongoDB wins most of the time.

But PostgreSQL wins the final sprint: the complex query!

Furthermore: "It is worth noting that the tests were on semi-structured data which is the realm of Mongo."


Explanation of query planning in MongoDB

Without going into the details of index creation in MongoDB for now, let's analyze the query planning for:

db.movies.find({ "title": "The Perils of Pauline" }).explain("executionStats")

To EXPLAIN a query in MongoDB, append .explain("executionStats") to it, as above.

Note the following elements in the output: the winning plan stage (COLLSCAN vs IXSCAN), totalDocsExamined, nReturned, and executionTimeMillis.


Index creation and planning

Compare: db.movies.find({ "title": "The Perils of Pauline" }).explain("executionStats")

Then, create an index: db.movies.createIndex({ title: 1 })

and run the explain again:

db.movies.find({ "title": "The Perils of Pauline" }).explain("executionStats")
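
What to look for in the two outputs (abridged and illustrative; exact numbers depend on your data):

// Before the index: the winning plan is a full collection scan.
//   "winningPlan": { "stage": "COLLSCAN", ... }      // totalDocsExamined ≈ collection size
// After db.movies.createIndex({ title: 1 }): an index scan feeds a fetch.
//   "winningPlan": { "stage": "FETCH",
//                    "inputStage": { "stage": "IXSCAN", "keyPattern": { "title": 1 } } }
//   // totalDocsExamined drops to the number of matching documents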

Drop the index: db.movies.dropIndex({ title: 1 })

Now explain an aggregation pipeline

db.movies.explain("executionStats").aggregate([ { $group: { _id: "$genres", averageRating: { $avg: "$imdb.rating" } } }] )

Query planning comparison with PostgreSQL

Consider a similar query scenario that involves joining/linking data and aggregating results.

Query: find all movies from the 1990s with their directors, grouped by director with average ratings.

First, MongoDB:

db.movies.aggregate([
  { $match: {
      year: { $gte: 1990, $lt: 2000 }
  }},
  { $unwind: "$directors" },
  { $group: {
      _id: "$directors",
      avgRating: { $avg: "$imdb.rating" },
      movieCount: { $sum: 1 }
  }}
]).explain("executionStats")

The plan is:

{
  "stages": [
    {
      "$cursor": {
        "queryPlanner": {
          "plannerVersion": 1,
          "namespace": "sample_mflix.movies",
          "indexFilterSet": false,
          "parsedQuery": {
            "year": { "$gte": 1990, "$lt": 2000 }
          },
          "winningPlan": {
            "stage": "COLLSCAN",  // Full collection scan
            "filter": { "year": { "$gte": 1990, "$lt": 2000 } }
          },
          "rejectedPlans": []
        }
      }
    },
    { "$unwind": "$directors" },  // Memory-intensive operation
    {
      "$group": {
        "_id": "$directors",
        "avgRating": { "$avg": "$imdb.rating" },
        "movieCount": { "$sum": 1 }
      }
    }
  ]
}
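
Side note before moving to PostgreSQL: the COLLSCAN above is not inevitable. Creating an index on year (a suggestion, not part of the original benchmark) should let the $match stage use an IXSCAN; the $unwind/$group work is unchanged, only the way matching documents are located improves.

db.movies.createIndex({ year: 1 });
db.movies.aggregate([
  { $match: { year: { $gte: 1990, $lt: 2000 } } },
  { $unwind: "$directors" },
  { $group: { _id: "$directors", avgRating: { $avg: "$imdb.rating" }, movieCount: { $sum: 1 } } }
]).explain("executionStats");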

Now, the PostgreSQL equivalent (assuming a normalized schema):

EXPLAIN ANALYZE
SELECT d.name as director,
  AVG(m.rating) as avg_rating,
  COUNT(*) as movie_count
FROM movies m
JOIN movie_directors md ON m.id = md.movie_id
JOIN directors d ON md.director_id = d.id
WHERE m.year >= 1990 AND m.year < 2000
GROUP BY d.name;

Which gives:

QUERY PLAN
------------------------------------------------------------
HashAggregate  (cost=245.97..247.97 rows=200)
  Group Key: d.name
  ->  Hash Join  (cost=121.67..237.42 rows=1710)
        Hash Cond: (md.director_id = d.id)
        ->  Hash Join  (cost=66.50..164.42 rows=1710)
              Hash Cond: (md.movie_id = m.id)
              ->  Seq Scan on movie_directors md
              ->  Hash  (cost=58.00..58.00 rows=680)
                    ->  Index Scan using movies_year_idx on movies m
                          Index Cond: (year >= 1990 AND year < 2000)
        ->  Hash  (cost=40.50..40.50 rows=1173)
              ->  Seq Scan on directors d

The main differences in query planning are:

  1. Join handling:

    • PostgreSQL uses hash joins to efficiently combine data from multiple tables
    • MongoDB must $unwind the nested director array, which creates multiple documents in memory
  2. Index usage:

    • PostgreSQL can use multiple indexes simultaneously and choose different join strategies
    • MongoDB generally relies on a single index per aggregation stage
  3. Memory management:

    • PostgreSQL's hash joins create hash tables in memory with disk spillover if necessary
    • MongoDB's $unwind and $group stages must keep their data in memory
  4. Operation order:

    • PostgreSQL's query planner can reorder operations for efficiency
    • MongoDB must process aggregation pipeline stages in order
  5. Statistics usage:

    • PostgreSQL's explanation shows detailed cost estimates based on table statistics
    • MongoDB's explanation focuses more on operation type and index usage

The PostgreSQL plan shows that it can: use the movies_year_idx index to restrict the scan to 1990s movies, build in-memory hash tables to join the three tables, and finish with a single HashAggregate to group by director.

👽👽👽 This structured approach to complex queries is why PostgreSQL is often more performant for analytical workloads involving multiple tables/collections and aggregations.


Recap

MongoDB: To Remember

  1. Document model: flexible, polymorphic schemas; an object and its children live together in a single document.
  2. Optimized for single-document reads and writes (OLTP-style, application-centric access); multi-document transactions exist but are more expensive and less strict than PostgreSQL's.
  3. Join-heavy, analytical (OLAP) workloads remain the realm of SQL databases or dedicated warehouses.

which leads us to the next topic: schema
