Schema Design Patterns

MongoDB schema design is the most critical part of database design.

If you design the MongoDB schema like a relational schema, you lose sight of the advantages and power of NoSQL.

Read the article: MongoDB Schema Design: Best Practices for Data Modeling


How to model data for MongoDB?

Document databases have a rich vocabulary that is capable of expressing data relationships in a more nuanced way than SQL.

There are many things to consider when choosing a schema.


In SQL databases, normalization consists of distributing data across tables to avoid duplication.

In SQL: multiple tables, normal forms to apply (1NF, 2NF, ...), etc.


No rules

With MongoDB, there are:

No Rules

Which is absolutely not scary!!! 😳😳😳


How is the data consumed?

What matters is designing a schema that works best for the final application.

Two different applications that use exactly the same data can have very different schemas if the data is used differently.

🐲🐲🐲 The application dictates the schema! 🐲🐲🐲

Normalization Reminder

---

Normalization

The general goal of normalization is to reduce data redundancy and dependency by organizing data into separate, related tables.

More formally, a database is normalized if:

all column values depend only on the table's primary key.

With Denormalization, the idea is to have data redundancy to simplify queries and make OLAP queries faster.

Redundant data: the same data/info exists in multiple tables


Normal forms

A normal form is a rule that defines a level of normalization.

1NF

Each field contains a single value

A relation is in first normal form

if and only if

no attribute domain has relations as elements.

2NF

The table is in 2NF iff:

- it is in 1NF, and
- every non-prime attribute depends on the whole of every candidate key (no partial dependency).

Some cases of non-compliance with 2NF arise when the key is composite and an attribute depends on only part of it: for example, with key (student_id, course_id), the student's name depends on student_id alone.

3NF

A relation R is in 3NF if and only if both of the following conditions hold:

- R is in 2NF, and
- every non-prime attribute of R depends directly (non-transitively) on every key of R.


Transitive Dependency

A transitive dependency occurs when a non-prime attribute (an attribute that is not part of any key) depends on another non-prime attribute, rather than depending directly on the primary key.

In simple terms: A → B → C, where A is the primary key, but C depends on B instead of depending directly on A.

For instance, in a songs table we might have song_id → artist_name and artist_name → artist_country: the artist's country depends on the artist, not directly on the song.

This leads to data redundancy (Artist's info repeated), update anomalies (must change multiple rows to update one artist), and maintenance headaches.


Anomalies

When designing the schema for a SQL database, it helps to spot anomalies: insertion, update, and deletion anomalies that appear when data is insufficiently normalized.


Schema design

---

Document embeddings

Embedding documents in MongoDB nests related data within a single document -> for efficient retrieval and simplified data representation.

{
  "title": "Paris Metro Stations",
  "stations": [
    {
      "name": "ChΓ’telet",
      "lines": [
        { "number": "1", "direction": "La DΓ©fense" },
        { "number": "4", "direction": "Porte de Clignancourt" },
        { "number": "7", "direction": "La Courneuve - 8 Mai 1945" },
        { "number": "11", "direction": "Mairie des Lilas" },
        { "number": "14", "direction": "Olympiades" }
      ],
      "connections": ["RER A", "RER B", "RER D"],
      "accessibility": true
    },
    {
      "name": "Gare du Nord",
      "lines": [
        { "number": "4", "direction": "Mairie de Montrouge" },
        { "number": "5", "direction": "Bobigny - Pablo Picasso" }
      ],
      "connections": ["RER B", "RER D", "Eurostar", "Thalys"],
      "accessibility": true
    }
  ]
}

Advantages of embedding documents

- all related data is retrieved in a single query,
- no $lookup / join needed,
- updates to a single document are atomic.


Limitations of embedding

- large documents must be read and transferred in full, even when only part of the data is needed,
- embedded arrays can grow without bound,
- a single document cannot exceed 16 MB.

There is therefore a balance to be found between information completeness and document size.


Referencing (joins)

Reference another document by its unique ObjectID, and connect the documents at query time with the $lookup aggregation operator.

It works much like a JOIN in an SQL query.
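A minimal sketch of a $lookup stage, assuming hypothetical employees and buildings collections where each employee stores a building_id:

db.employees.aggregate([
  {
    $lookup: {
      from: "buildings",          // collection to join with
      localField: "building_id",  // field in the employees documents
      foreignField: "_id",        // field in the buildings documents
      as: "building"              // output array field added to each employee
    }
  }
])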


Advantages of referencing

- no duplication: each piece of data lives in one place,
- smaller documents, far from the 16 MB limit,
- related data can be updated in a single place.

Limitations of referencing

- reading related data requires extra queries or $lookup stages, which cost more than reading a single document.


Schema design and nature of relationships

We must consider the nature of relationships between entities


One-to-one

Modeled as key-value pairs in the database.

For example:
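// a minimal sketch with hypothetical fields:
// a user and their passport details as key-value pairs in one document
{
    "_id": ObjectID("AAF0"),
    "name": "Kate Pineapple",
    "passport_number": "X123456",
    "date_of_birth": ISODate("1990-05-01")
}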


One-to-few (tens)

A small list of items associated with the main entity, typically embedded as an array of sub-documents:
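// a minimal sketch: the few addresses of a person, embedded as an array
{
    "name": "Kate Pineapple",
    "addresses": [
        { "street": "123 Sesame Street", "city": "Anytown", "cc": "USA" },
        { "street": "123 Avenue Q", "city": "New York", "cc": "USA" }
    ]
}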


One-to-many (hundreds)

Typically modeled by storing an array of references (ObjectIDs) to the child documents inside the parent document.


One-to-squillions

Potentially millions of related documents: far too many to embed, or even to reference from the parent side. Instead, each child document stores a reference to its parent.


16 MB constraint

A MongoDB document (BSON) cannot exceed 16 MB. In the one-to-squillions case, this becomes problematic.

Example schema for server + log

Consider a server logging application where each server records a significant number of events.

So we have 2 entities: server and event.

3 options:

  1. embed the events in the server document: the 16 MB limit will eventually be hit,
  2. store an array of event references in the server document: the array grows without bound,
  3. store a reference to the server in each event document (parent referencing): this scales; see the sketch below.
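A sketch of option 3 (parent referencing), with hypothetical fields:

// each event stores a reference to its server
{
    "_id": ObjectID("AE90"),
    "server_id": ObjectID("AAF2"),  // reference to the parent server document
    "time": ISODate("2024-02-05T08:00:00Z"),
    "message": "disk usage at 92%"
}

// all events of a given server
db.events.find({ server_id: ObjectID("AAF2") })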


Many-to-many

Example of a project planning application:

An efficient schema is to store references in both directions: each user document holds an array of task IDs, and each task document holds an array of owner IDs.


Users:

{
    "_id": ObjectID("AAF1"),
    "name": "Kate Pineapple",
    "tasks": [ObjectID("ADF9"), ObjectID("AE02"), ObjectID("AE73")]
}

Tasks:

{
  "_id": ObjectID("ADF9"),
  "description": "Write a blog post about MongoDB schema design",
  "due_date": ISODate("2014-04-01"),
  "owners": [ObjectID("AAF1"), ObjectID("BB3G")]
}

In this example, each user has a sub-array of related tasks, and each task has a sub-array of owners.
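With this schema, fetching all the tasks of a given user takes two simple queries (or a single $lookup); a minimal sketch in the shell, reusing the pseudo-IDs above:

const user = db.users.findOne({ _id: ObjectID("AAF1") })
db.tasks.find({ _id: { $in: user.tasks } })  // all tasks owned by this user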


Summary:

- one-to-one: embed as key-value pairs,
- one-to-few: embed an array of sub-documents,
- one-to-many: store an array of child references in the parent,
- one-to-squillions: store a parent reference in each child,
- many-to-many: store arrays of references on both sides.


General rules for MongoDB schema design

From the article referenced above, the usual rules of thumb:

- favor embedding unless there is a compelling reason not to,
- needing to access an entity on its own is such a compelling reason,
- avoid $lookup when possible, but don't be afraid of it when it gives the better schema,
- arrays should not grow without bound,
- how you model your data depends entirely on your application's data access patterns.


Schema design patterns

Let's examine two schema patterns to illustrate how the application dictates data schema design.


Extended reference model

Let's review this article

https://www.mongodb.com/developer/products/mongodb/schema-design-anti-pattern-massive-arrays/

One of the rules of thumb when modeling data in MongoDB is to say that data that is accessed at the same time should be stored together.

-> A building has many employees: potentially too many for the 16 MB document limit.


Extended reference pattern

We reverse the situation:

-> The employee belongs to a building: we embed the building information into the employee document.

If the application frequently displays information about an employee and their building together, this model is probably wise.

Problem: we have way too much data duplication.

Updating a building's information involves updating all employee documents.

So, let's separate employees and building into 2 distinct collections and use $lookups.

But $lookups are expensive.

We therefore use the extended reference pattern where we duplicate some, but not all, of the data in both collections. We only duplicate data that is frequently accessed together.

For example, suppose the application has a user profile page that displays information about the user, plus the name of the building and the region where they work. We then embed the building name and region in the employee document, and keep the rest of the building-related information in a separate buildings collection (see the sketch below).
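A sketch of what the resulting employee document could look like (field names are hypothetical):

{
    "_id": ObjectID("AAF1"),
    "name": "Kate Pineapple",
    "building": {
        "name": "Tower A",         // duplicated: displayed on the profile page
        "region": "Île-de-France"  // duplicated: displayed on the profile page
    }
}

The full building document (address, floors, facilities, ...) lives in the separate buildings collection and is only fetched when actually needed.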


The outlier pattern model

The outlier pattern: only a few documents have a huge amount of embedded documents.

https://www.mongodb.com/blog/post/building-with-patterns-the-outlier-pattern

Consider a collection of books and the list of users who bought the book.

{
    "_id": ObjectID("507f1f77bcf86cd799439011"),
    "title": "A boring story",
    "author": "Sam K. Boring",
    …,
    "customers_purchased": ["user00", "user01", "user02"]
}

Most books only sell a few copies. This is the long tail of book sales.

For most books we can simply embed the list of buyers (ID and some relevant info) in the book document.

A small number of books sell millions of copies. Impossible to nest buyers in the book doc.

By adding a field (a flag or indicator) that signals that the book is very popular, we can adapt the schema according to this indicator.


{
    "_id": ObjectID("507f191e810c19729de860ea"),
    "title": "Harry Potter",
    "author": "J.K. Rowling",
    …,
    // we avoid embedding buyers for this book
    //    "customers_purchased": ["user00", "user01", "user02", …, "user9999"],
    "outlier": true
}

In the application code, we test for the presence of this indicator and handle the data differently if the indicator is present. For example by referencing buyers of very popular books instead of embedding them.

The outlier pattern is frequently used in situations where popularity is a factor, such as social media relationships, book sales, or movie reviews.


Other patterns

The Bucket Pattern

📌 When to use it?

For time-series or other high-volume data, when one document per data point would create a huge number of tiny documents.

✅ Example: Store sensor readings in buckets

Instead of creating a new document for every sensor reading, group them into time buckets.

{
  "_id": "sensor_1_2024-02-05",
  "sensor_id": 1,
  "date": "2024-02-05",
  "readings": [
    { "time": "08:00", "value": 22.5 },
    { "time": "08:30", "value": 23.1 }
  ]
}
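A new reading can then be appended to the right bucket with an upsert, which creates the bucket on its first write; a minimal sketch, assuming the readings collection above:

db.readings.updateOne(
  { _id: "sensor_1_2024-02-05" },
  {
    $push: { readings: { time: "09:00", value: 22.8 } },  // append the new reading
    $setOnInsert: { sensor_id: 1, date: "2024-02-05" }    // set only when the bucket is created
  },
  { upsert: true }
)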

The Computed Pattern

📌 When to use it?

When a value is expensive to compute and is read far more often than it changes: compute it on write and store the result.

✅ Example: Store total order value instead of computing it every time

{
  "_id": 101,
  "user_id": 1,
  "items": [
    { "name": "Laptop", "price": 1200 },
    { "name": "Mouse", "price": 50 }
  ],
  "total_price": 1250
}
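The computed field must be kept in sync on every write; a minimal sketch that adds an item and updates the stored total in one operation, assuming the orders collection above:

db.orders.updateOne(
  { _id: 101 },
  {
    $push: { items: { name: "Keyboard", price: 80 } },  // add the new item
    $inc: { total_price: 80 }                           // keep the pre-computed total in sync
  }
)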

The Polymorphic Pattern

📌 When to use it?

When documents are similar but not identical and it is convenient to query them together. A type field tells the application how to handle each document.

✅ Example: Store users and admins in the same collection

{
  "_id": 1,
  "type": "user",
  "name": "Alice",
  "email": "[email protected]"
}

{
  "_id": 2,
  "type": "admin",
  "name": "Bob",
  "email": "[email protected]",
  "admin_permissions": ["manage_users", "delete_posts"]
}
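Queries can then filter on the type discriminator, or directly on type-specific fields; a minimal sketch:

db.users.find({ type: "admin" })                      // all admins
db.users.find({ admin_permissions: "manage_users" })  // admins holding a given permission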

Recap

| Pattern | Use Case | Example |
| --- | --- | --- |
| Embedding | Small, frequently accessed data | Store a user with their addresses |
| Referencing | Large, reusable data needing updates | Users & orders stored separately |
| Bucket | Time-series or grouped data | Sensor readings stored in time buckets |
| Outlier | Avoids large documents with exceptions | Move excess comments to a separate collection |
| Computed | Avoids expensive calculations | Store total order price in the document |
| Polymorphic | Multiple object types in one collection | Users & admins stored together |

💡 Conclusion

Choosing the right pattern depends on query patterns, data size, and update frequency. 🚀


Schema definition and validation

Without explicit validation it is possible to put anything in a collection.

For example, the user's age could be stored as a string in one document, as an integer in another, or even as a negative number: nothing prevents it.

The absence of validation regarding acceptable values in a collection transfers complexity to the application level.

It's chaos!


Implicit schema during creation

In MongoDB, collections are created automatically when a first document is inserted into a non-existent collection. It is not necessary to create it manually beforehand.


📌 Example

db.characters.insertOne({ name: "Alice", age: 25 });

✅ If the characters collection doesn't exist, MongoDB creates it automatically and inserts the document.


🔹 When should you create a collection manually?

You might want to use createCollection() if you need to:

  1. Define specific parameters like a capped collection (fixed size).
  2. Apply validation rules to ensure data integrity.

Example of explicit collection creation

db.createCollection("users", {
  capped: true,
  size: 1024
});

✅ This creates a capped collection with a fixed size of 1024 bytes.


Capped collection

A capped collection maintains insertion order and automatically overwrites the oldest documents when the size limit is reached. Think of it like a circular buffer or ring buffer: once it's full, new data pushes out the oldest data.

Very fast insertions (no need to allocate new space)

Great for server logs, message queues, real-time data streams for instance.
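Besides size, a capped collection can also bound the number of documents with the optional max parameter; a minimal sketch, assuming a hypothetical logs collection:

db.createCollection("logs", {
  capped: true,
  size: 1048576,  // maximum total size in bytes (1 MB), always required
  max: 1000       // optional: maximum number of documents
})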

🔹 Limitations of automatic creation

Automatically created collections always use the default options: you cannot get a capped collection, validation rules, or a specific collation that way, and a typo in a collection name silently creates a new, unintended collection.

🚀 Conclusion: Yes, MongoDB automatically creates collections when inserting data, but explicit creation is useful for advanced parameters.


Applying a data type

Declare the data type when defining the schema

db.createCollection("movies", {
    validator: {
        $jsonSchema: {
            bsonType: "object",
            required: ["year", "title"],  // Fields that must be present
            properties: {
                year: {
                    bsonType: "int",  // Forces `year` to be an integer
                    required: "true",
                    description: "Must be an integer and is required"
                },
                title: {
                    bsonType: "string",
                    description: "Movie title, must be a string if present"
                },
                imdb: {
                    bsonType: "object",  // Nested object for IMDb data
                    properties: {
                        rating: {
                            bsonType: "double",  // IMDb rating must be a float
                            description: "Must be a float if present"
                        }
                    }
                }
            }
        }
    }
})

Key points:

MongoDB supports $jsonSchema validation starting from version 3.6, which allows you to enforce data types and other constraints on fields within a collection.

This is achieved using the $jsonSchema operator when creating or updating a collection.

https://www.digitalocean.com/community/tutorials/how-to-use-schema-validation-in-mongodb

When you add validation rules to an existing collection, the new rules will not affect existing documents until you try to modify them.


Schema Validation in MongoDB

MongoDB is known for its flexibility - you can store documents without predefined structures. However, as your application grows, you might want to ensure your data follows certain rules. This is where schema validation comes in.

How Schema Validation Works

When you create a collection with validation, MongoDB will check every new document (and updates to existing ones) against your rules. Here's what the basic structure looks like:

db.createCollection("collectionName", {
   validator: {
      $jsonSchema: {
         bsonType: "object",
         required: ["field1", "field2"],
         properties: {
            field1: { type: "string" },
            field2: { type: "number" }
         }
      }
   }
})

The $jsonSchema keyword tells MongoDB that we're using JSON Schema validation. Inside this schema, we define our rules using various building blocks.


Building Blocks of Validation

The most fundamental components are:

First, we specify which fields are mandatory using required. These fields must be present in every document.

Next, we define properties - this is where we describe what each field should look like. For each property, we can specify its type and additional rules. For example, if you're storing someone's age, you might want to ensure it's always a number and perhaps even set a minimum / maximum value.
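A minimal sketch of such a rule, assuming a hypothetical people collection (minimum and maximum are standard JSON Schema keywords supported by $jsonSchema):

db.createCollection("people", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name"],
      properties: {
        name: { bsonType: "string" },
        age: {
          bsonType: "int",
          minimum: 0,    // reject negative ages
          maximum: 120,  // reject implausibly large ages
          description: "must be an integer between 0 and 120"
        }
      }
    }
  }
})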

Let's look at how we handle more complex structures


Nested Objects

Sometimes your data has natural hierarchies. For instance, an address isn't just one piece of information - it has streets, cities, and zip codes. Here's how you validate nested structures:

properties: {
   address: {
      bsonType: "object",
      required: ["city"],      // City is mandatory in addresses
      properties: {
         city: { type: "string" },
         zip: { type: "string" }
      }
   }
}

Fine-Tuning Validation Behavior

MongoDB gives you control over how strict your validation should be. You can set two important behaviors:

The validationAction determines what happens when a document fails validation:

- "error" (the default): the insert or update is rejected,
- "warn": the operation is accepted, but the violation is logged.

The validationLevel controls which documents are validated:

- "strict" (the default): all inserts and all updates are checked,
- "moderate": inserts are checked, but updates only for documents that already satisfy the rules.
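Both options can be set when the collection is created, or changed later with the collMod command; a minimal sketch with a hypothetical users collection and rule:

db.runCommand({
  collMod: "users",
  validator: {
    $jsonSchema: { bsonType: "object", required: ["email"] }
  },
  validationLevel: "moderate",  // leave already non-conforming documents alone
  validationAction: "warn"      // log violations instead of rejecting writes
})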

Remember that validation only happens when documents are modified or inserted. Existing documents won't be validated until you try to update them. This makes it safe to add validation to collections that already contain data.

Through schema validation, MongoDB offers a balance between flexibility and control. You can start with a loose schema during early development, then gradually add more validation rules as your application's needs become clearer. This progressive approach to data quality helps ensure your database remains both reliable and adaptable.


Practice

Small example on the trees dataset

{
    "idbase": 249403,
    "location_type": "Tree",
    "domain": "Alignment",
    "arrondissement": "PARIS 20E ARRDT",
    "suppl_address": "54",
    "number": null,
    "address": "AVENUE GAMBETTA",
    "id_location": "1402008",
    "name": "Linden",
    "genre": "Tilia",
    "species": "tomentosa",
    "variety": null,
    "circumference": 85,
    "height": 10,
    "stage": "Adult",
    "remarkable": "NO",
    "geo_point_2d": "48.86685102642415, 2.400262189227641"
}

Write a schema and validator for this data.

The validator must apply appropriate types and value constraints (for example: numeric height and circumference, a non-empty name, a well-formed geo_point_2d).


Practice

Take a JSON version of the trees dataset where columns with null values have been removed.

  1. without constraints
    • insert all data without validation
    • query to check absurd values: height, geolocation
  2. more controlled approach
    • write a schema and validator: using MongoDB's $jsonSchema validator to "dry run" your data validation.
    • check the number of documents ignored
    • write a validator that recovers as many documents as possible while excluding absurd values
  3. index
    • add a unique index on geolocation

In MongoDB, the schema validator is usually created together with the collection, not as a separate step.
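For the "dry run" in step 2, note that $jsonSchema also works as a query operator: counting the documents that do not match a candidate schema shows how many would be rejected, without attaching any validator to the collection. A minimal sketch, with a hypothetical trees collection and thresholds:

// candidate schema: height must be a non-negative number below 100
const schema = {
  bsonType: "object",
  required: ["height", "geo_point_2d"],
  properties: {
    height: { bsonType: ["int", "double"], minimum: 0, maximum: 100 }
  }
}

// documents that would FAIL validation under this schema
db.trees.countDocuments({ $nor: [{ $jsonSchema: schema }] })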


Links

Schema design patterns: https://learn.mongodb.com/courses/schema-design-patterns
