MongoDB schema design is the most critical part of database design.
If you design the MongoDB schema like a relational schema, you lose sight of the advantages and power of NoSQL.
Read the article: MongoDB Schema Design: Best Practices for Data Modeling
Document databases have a rich vocabulary that is capable of expressing data relationships in a more nuanced way than SQL.
There are many things to consider when choosing a schema.
In SQL databases, normalization consists of distributing data across tables to avoid duplication.
In SQL: multiple tables, normalization forms (1NF, 2NF, …), etc.
With MongoDB, there are:
No rules! No process! No algorithm!
Which is absolutely not scary!!! 😳😳😳
What matters is designing a schema that works best for the final application.
Two different applications that use exactly the same data can have very different schemas if the data is used differently.
🐲🐲🐲 The application dictates the schema! 🐲🐲🐲
The general goal of normalization is to reduce data redundancy and dependency by organizing data into separate, related tables.
More formally, a database is normalized if all column values depend only on the table’s primary key.
With denormalization, the idea is to accept data redundancy in order to simplify queries and make OLAP queries faster.
Redundant data: the same data/info exists in multiple tables.
A normal form is a rule that defines a level of normalization.
First normal form (1NF): each field contains a single value.
A relation is in first normal form if and only if no attribute domain has relations as elements.
A table is in 2NF if and only if it is in 1NF and no non-prime attribute depends on a proper subset of any candidate key.
Some cases of non-compliance with 2NF
Employee(EmployeeID, Name, BirthDate, Age): Age is derived from BirthDate (EmployeeID → BirthDate → Age), so a non-key attribute depends on another non-key attribute instead of directly on the key.
A relation R is in 3NF if and only if both of the following conditions hold:
- R is in 2NF;
- every non-prime attribute of R depends directly (non-transitively) on every candidate key of R.
A transitive dependency occurs when a non-prime attribute (an attribute that is not part of any key) depends on another non-prime attribute, rather than depending directly on the primary key.
In simple terms: A → B → C, where A is the primary key, but C depends on B instead of directly on A.
For instance, in a songs table we have:
song_id → artist_name → artist_country
song_id → artist_name → artist_birth_year
This leads to data redundancy (the artist’s info is repeated), update anomalies (updating one artist means changing multiple rows), and maintenance headaches.
When designing the schema for a SQL database, it helps to spot these anomalies.
Embedding documents in MongoDB nests related data within a single document, which allows efficient retrieval and a simplified data representation.
{
"title": "Paris Metro Stations",
"stations": [
{
"name": "Châtelet",
"lines": [
{ "number": "1", "direction": "La Défense" },
{ "number": "4", "direction": "Porte de Clignancourt" },
{ "number": "7", "direction": "La Courneuve - 8 Mai 1945" },
{ "number": "11", "direction": "Mairie des Lilas" },
{ "number": "14", "direction": "Olympiades" }
],
"connections": ["RER A", "RER B", "RER D"],
"accessibility": true
},
{
"name": "Gare du Nord",
"lines": [
{ "number": "4", "direction": "Mairie de Montrouge" },
{ "number": "5", "direction": "Bobigny - Pablo Picasso" }
],
"connections": ["RER B", "RER D", "Eurostar", "Thalys"],
"accessibility": true
}
]
}
Document size negatively impacts query performance. We must be careful not to put everything in a document, but restrict to relevant information.
MongoDB documents are limited to a size of 16 MB.
There is therefore a balance to be found between information completeness and document size.
Reference another document using its unique ObjectId and connect them with the $lookup operator.
This works like the JOIN operator in an SQL query.
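For example, here is a minimal sketch of a $lookup join, assuming hypothetical users and tasks collections where each task stores the IDs of its owners:
db.users.aggregate([
  {
    $lookup: {
      from: "tasks",           // collection to join with
      localField: "_id",       // field in the users collection
      foreignField: "owners",  // field in the tasks collection (array of user IDs)
      as: "user_tasks"         // name of the resulting array of matching tasks
    }
  }
])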
We must consider the nature of relationships between entities:
- One-to-One: modeled as key-value pairs in the document.
- One-to-Few: for example, a small sequence of data associated with the main entity.
- One-to-Millions: potentially millions of embedded documents.
A MongoDB record cannot exceed 16 MB. In the case of One-to-Millions, this can become problematic.
Consider a server logging application where each server records a significant number of events.
So we have 2 entities: server and event.
3 options:
1. Embed: the server document integrates all events associated with the server. There is a high probability that this will exceed 16 MB per document quite quickly.
2. Reference: store events in a separate collection and join them with $lookup. But it’s slower to retrieve all events from a server.
3. Denormalize: duplicate the server name into each event document! Data duplication, but fast queries and no risk of exceeding the 16 MB size (see the sketch below).
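A minimal sketch of option 3, where the server name is duplicated into each event document (field names are assumptions, not from the original):
db.events.insertOne({
  server_name: "web-01",                      // duplicated from the server document
  timestamp: ISODate("2024-02-05T08:00:00Z"), // when the event occurred
  level: "error",
  message: "Disk usage above 90%"
})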
Example of a project planning application: an efficient schema is to store users and tasks in two collections that reference each other.
Users:
{
"_id": ObjectID("AAF1"),
"name": "Kate Pineapple",
"tasks": [ObjectID("ADF9"), ObjectID("AE02"), ObjectID("AE73")]
}
Tasks:
{
"_id": ObjectID("ADF9"),
"description": "Write a blog post about MongoDB schema design",
"due_date": ISODate("2014-04-01"),
"owners": [ObjectID("AAF1"), ObjectID("BB3G")]
}
In this example, each user has a sub-array of related tasks, and each task has a sub-array of owners.
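With this two-way referencing, fetching all of a user’s tasks could look like this (a sketch, using the shortened IDs from the documents above):
// Load the user, then fetch every task referenced in their "tasks" array.
const user = db.users.findOne({ name: "Kate Pineapple" })
db.tasks.find({ _id: { $in: user.tasks } })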
Let’s examine two schema models to illustrate how the application dictates the data schema design.
Let’s review this article
https://www.mongodb.com/developer/products/mongodb/schema-design-anti-pattern-massive-arrays/
One of the rules of thumb when modeling data in MongoDB is to say that data that is accessed at the same time should be stored together.
-> A building has many employees: potentially too many for the 16 MB document limit.
We reverse the relationship:
-> The employee belongs to a building: we embed the building information into the employee document.
If the application frequently displays information about an employee and their building together, this model is probably wise.
Problem: we have way too much data duplication.
Updating a building’s information involves updating all employee documents.
So, let’s separate employees and building into 2 distinct collections and use $lookups.
But $lookups are expensive.
We therefore use the extended reference pattern where we duplicate some, but not all, of the data in both collections. We only duplicate data that is frequently accessed together.
For example, if the application has a user profile page that displays information about the user as well as the name of the building and the region where they work, we embed the building name and region into the employee document and keep the other building-related info in a separate buildings collection.
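A minimal sketch of such an employee document (names and values are made up for illustration):
{
  "_id": ObjectID("AAF2"),
  "name": "Dan Pineapple",
  // extended reference: only the building fields shown on the profile page
  "building": {
    "name": "Tower A",
    "region": "Île-de-France"
  }
}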
The outlier pattern: only a few documents have a huge amount of embedded documents.
https://www.mongodb.com/blog/post/building-with-patterns-the-outlier-pattern
Consider a collection of books and the list of users who bought the book.
{
"_id": ObjectID("507f1f77bcf86cd799439011")
"title": "A boring story",
"author": "Sam K. Boring",
…,
"customers_purchased": ["user00", "user01", "user02"]
}
Most books only sell a few copies. This is the long tail of book sales.
For most books we can simply embed the list of buyers (ID and some relevant info) in the book document.
A small number of books sell millions of copies. Impossible to nest buyers in the book doc.
By adding a field, a flag, or indicator, that signals that the book is very popular, we can adapt the schema according to this indicator.
{
"_id": ObjectID("507f191e810c19729de860ea"),
"title": "Harry Potter",
"author": "J.K. Rowling",
…,
// we avoid integrating buyers for this book
// "customers_purchased": ["user00", "user01", "user02", …, "user9999"],
"outlier": "true"
}
In the application code, we test for the presence of this indicator and handle the data differently if the indicator is present. For example by referencing buyers of very popular books instead of embedding them.
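A sketch of that application-level check (the purchases collection is an assumption):
const book = db.books.findOne({ title: "Harry Potter" })
if (book.outlier) {
  // popular book: buyers are referenced in a separate collection
  db.purchases.find({ book_id: book._id })
} else {
  // typical book: buyers are embedded in the book document
  printjson(book.customers_purchased)
}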
The outlier pattern is frequently used in situations where popularity is a factor, such as social media relationships, book sales, or movie reviews.
The bucket pattern
📌 When to use it? For time-series data or readings that naturally arrive in groups.
✅ Example: Store sensor readings in buckets
Instead of creating a new document for every sensor reading, group them into time buckets.
{
"_id": "sensor_1_2024-02-05",
"sensor_id": 1,
"date": "2024-02-05",
"readings": [
{ "time": "08:00", "value": 22.5 },
{ "time": "08:30", "value": 23.1 }
]
}
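New readings can be appended to the current bucket with an upsert, e.g. (a sketch, assuming a readings collection):
db.readings.updateOne(
  { _id: "sensor_1_2024-02-05" },
  {
    $push: { readings: { time: "09:00", value: 22.9 } }, // append the new reading
    $setOnInsert: { sensor_id: 1, date: "2024-02-05" }   // only set when creating the bucket
  },
  { upsert: true }
)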
The computed pattern
📌 When to use it? To avoid re-running expensive calculations on every read.
✅ Example: Store total order value instead of computing every time
{
"_id": 101,
"user_id": 1,
"items": [
{ "name": "Laptop", "price": 1200 },
{ "name": "Mouse", "price": 50 }
],
"total_price": 1250
}
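The computed field has to be kept in sync whenever the items change, e.g. (a sketch, assuming an orders collection):
db.orders.updateOne(
  { _id: 101 },
  {
    $push: { items: { name: "Keyboard", price: 80 } }, // add the new item
    $inc: { total_price: 80 }                          // update the precomputed total
  }
)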
The polymorphic pattern
📌 When to use it? When documents of several related types live in the same collection.
✅ Example: Store users and admins in the same collection
{
"_id": 1,
"type": "user",
"name": "Alice",
"email": "[email protected]"
}
{
"_id": 2,
"type": "admin",
"name": "Bob",
"email": "[email protected]",
"admin_permissions": ["manage_users", "delete_posts"]
}
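Queries can then filter on the type discriminator, e.g. (assuming the shared collection is called users):
// Fetch only the admins from the shared collection.
db.users.find({ type: "admin" })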
| Pattern | Use Case | Example |
|---|---|---|
| Embedding | Small, frequently accessed data | Store a user with their addresses |
| Referencing | Large, reusable data needing updates | Users & orders stored separately |
| Bucket | Time-series or grouped data | Sensor readings stored in time buckets |
| Outlier | Avoids large documents with exceptions | Move excess comments to a separate collection |
| Computed | Avoids expensive calculations | Store total order price in the document |
| Polymorphic | Multiple object types in one collection | Users & admins stored together |
Choosing the right pattern depends on query patterns, data size, and update frequency. 🚀
Without explicit validation it is possible to put anything in a collection.
For example, the user’s age could be an integer in one document, a string in another, negative, or missing entirely, and so on.
The absence of validation regarding acceptable values in a collection transfers complexity to the application level.
It’s chaos!
In MongoDB, collections are created automatically when a first document is inserted into a non-existent collection. It is not necessary to create it manually beforehand.
📌 Example
db.characters.insertOne({ name: "Alice", age: 25 });
✅ If the characters collection doesn’t exist, MongoDB creates it automatically and inserts the document.
You might want to use createCollection() explicitly if you need advanced options, such as a capped collection or schema validation:
db.createCollection("users", {
capped: true,
size: 1024
});
✅ This creates a capped collection with a fixed size of 1024 bytes.
A capped collection maintains insertion order and automatically overwrites the oldest documents when the size limit is reached. Think of it like a circular buffer or ring buffer: once it’s full, new data pushes out the oldest data.
Very fast insertions (no need to allocate new space)
Great for server logs, message queues, real-time data streams for instance.
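Reading a capped collection back respects insertion order, e.g. with the users collection created above:
db.users.find()                         // returns documents in insertion order
db.users.find().sort({ $natural: -1 })  // newest documents first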
🚀 Conclusion: Yes, MongoDB automatically creates collections when inserting data, but explicit creation is useful for advanced parameters.
Declare the data type when defining the schema
db.createCollection("movies", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["year", "title"], // Fields that must be present
properties: {
year: {
bsonType: "int", // Forces `year` to be an integer
required: "true",
description: "Must be an integer and is required"
},
title: {
bsonType: "string",
description: "Movie title, must be a string if present"
},
imdb: {
bsonType: "object", // Nested object for IMDb data
properties: {
rating: {
bsonType: "double", // IMDb rating must be a float
description: "Must be a float if present"
}
}
}
}
}
}
})
MongoDB supports JSON Schema validation starting from version 3.6, which allows you to enforce data types and other constraints on fields within a collection.
This is achieved using the $jsonSchema operator when creating or updating a collection.
https://www.digitalocean.com/community/tutorials/how-to-use-schema-validation-in-mongodb
When you add validation rules to an existing collection, the new rules will not affect existing documents until you try to modify them.
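Rules are added to an existing collection with the collMod command, e.g. (a sketch for the movies collection above):
db.runCommand({
  collMod: "movies",
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["title"],
      properties: {
        title: { bsonType: "string" }
      }
    }
  },
  validationLevel: "moderate" // existing invalid documents are left alone on update
})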
MongoDB is known for its flexibility - you can store documents without predefined structures. However, as your application grows, you might want to ensure your data follows certain rules. This is where schema validation comes in.
When you create a collection with validation, MongoDB will check every new document (and updates to existing ones) against your rules. Here’s what the basic structure looks like:
db.createCollection("collectionName", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["field1", "field2"],
properties: {
field1: { type: "string" },
field2: { type: "number" }
}
}
}
})
The $jsonSchema keyword tells MongoDB that we’re using JSON Schema validation. Inside this schema, we define our rules using various building blocks.
The most fundamental components are:
First, we specify which fields are mandatory using required. These fields must be present in every document.
Next, we define properties - this is where we describe what each field should look like. For each property, we can specify its type and additional rules. For example, if you’re storing someone’s age, you might want to ensure it’s always a number and perhaps even set a minimum / maximum value.
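For instance, inside properties, an age field with bounds could be sketched as:
age: {
  bsonType: "int",
  minimum: 0,    // reject negative ages
  maximum: 150,  // sanity upper bound
  description: "must be an integer between 0 and 150"
}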
Let’s look at how we handle more complex structures
Sometimes your data has natural hierarchies. For instance, an address isn’t just one piece of information - it has streets, cities, and zip codes. Here’s how you validate nested structures:
properties: {
address: {
bsonType: "object",
required: ["city"], // City is mandatory in addresses
properties: {
city: { type: "string" },
zip: { type: "string" }
}
}
}
MongoDB gives you control over how strict your validation should be. You can set two important behaviors:
The validationAction determines what happens when a document fails validation: "error" (the default) rejects the write, while "warn" allows it but logs a warning.
The validationLevel controls when validation happens: "strict" (the default) checks all inserts and updates, while "moderate" skips updates to existing documents that already violate the rules.
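Both options are passed next to the validator, e.g. (a sketch with a hypothetical reviews collection):
db.createCollection("reviews", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["rating"],
      properties: {
        rating: { bsonType: "int", minimum: 1, maximum: 5 }
      }
    }
  },
  validationAction: "warn",   // log violations instead of rejecting the write
  validationLevel: "moderate" // skip updates to documents that are already invalid
})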
Remember that validation only happens when documents are modified or inserted. Existing documents won’t be validated until you try to update them. This makes it safe to add validation to collections that already contain data.
Through schema validation, MongoDB offers a balance between flexibility and control. You can start with a loose schema during early development, then gradually add more validation rules as your application’s needs become clearer. This progressive approach to data quality helps ensure your database remains both reliable and adaptable.
Small exercise on a trees dataset:
{
"idbase":249403,
"location_type":"Tree",
"domain":"Alignment",
"arrondissement":"PARIS 20E ARRDT",
"suppl_address":"54",
"number":null,
"address":"AVENUE GAMBETTA",
"id_location":"1402008",
"name":"Linden",
"genre":"Tilia",
"species":"tomentosa",
"variety":null,
"circumference":85,
"height":10,
"stage":"Adult",
"remarkable":"NO",
"geo_point_2d":"48.86685102642415, 2.400262189227641"
}
Write a schema and a validator for this data:
- the validator must apply
- take a JSON version of the trees dataset where columns with null values have been removed
- remember that in MongoDB, the schema validator is usually created with the collection, not as a separate step
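A minimal sketch of what this could look like, reusing the field names from the sample document above (the choice of required fields and types is an assumption):
db.createCollection("trees", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["idbase", "location_type", "name", "genre", "species"],
      properties: {
        idbase: { bsonType: "int" },          // unique tree identifier
        location_type: { bsonType: "string" },
        domain: { bsonType: "string" },
        arrondissement: { bsonType: "string" },
        address: { bsonType: "string" },
        name: { bsonType: "string" },
        genre: { bsonType: "string" },
        species: { bsonType: "string" },
        circumference: { bsonType: "int", minimum: 0 }, // in cm, never negative
        height: { bsonType: "int", minimum: 0 },        // in m, never negative
        stage: { bsonType: "string" },
        remarkable: { enum: ["YES", "NO"] },
        geo_point_2d: { bsonType: "string" }  // "lat, lon" as in the sample
      }
    }
  }
})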
Schema design patterns: https://learn.mongodb.com/courses/schema-design-patterns