to organize complex structured data
Spotify example
Social Network example
SQL in high demand: In 2025 job data, SQL appears in ~26% of software engineer listings and ~80% of data engineering roles — nearly on par with Python.
Core integration skill: SQL is a core integration skill, powering pipelines, ETL, and database queries across backend, analytics, and engineering—not just data engineering.
Design & Modeling: Tools like ModelCenter (used by NASA, Airbus) and DLR’s RCE orchestrate CAD, simulation, and optimization workflows, backed by relational/SQL data handling.
Digital Engineering & Integration: Aerospace is adopting digital twins, predictive maintenance, and model-based systems engineering (MBSE), where SQL supports structured data management, cross-tool integration, and traceability.
Maintenance & Operations (MRO): SQL is key for querying structured aircraft maintenance data;
in short:
Big software is SQL based
A digital twin is a virtual model of a physical object, system, or process that is continuously updated with real-world data.
It’s not just a simulation — it’s a living, data-driven model that mirrors the current state, behavior, and performance of its real-world counterpart.
Data Ingestion & Storage
Querying Real-Time States
Integration with Simulation & Modeling Tools
Predictive Analytics Pipelines
Lifecycle & Traceability
✅ SQL is the backbone of structured data management for digital twins.
It enables the ingestion, storage, querying, and integration of massive telemetry streams with design and simulation models.
Without SQL (or SQL-like systems), digital twins would struggle to scale or maintain traceability.
By the way, this applies to Finance, Health, Energy, etc…
Why use a database instead of just a spreadsheet or files like json or csv ?
Sometimes we hear the word database
used to describe an Excel
file. How can we put something as simple as a CSV or Excel file on the same level as these engineering marvels that are PostgreSQL, Weaviate, MongoDB, Neo4j, Redis, MySQL, etc…
So I asked my friend GPT-5 to give me a definition of a database:
In simple terms: “A database is like a smart notebook or filing system that helps you keep track of lots of information and find exactly what you need in no time.”
which definitely includes CSV files, Excel files, JSON files, XML files, and many other simple file-based formats.
From the Encyclopedia Britannica, we get:
database, any collection of data or information specifically organized for rapid search by computer. Databases are structured to facilitate storage, retrieval, modification, and deletion of data.
See also the Database article on Wikipedia.
No longer just talking about quickly finding information (the search part) but also about:
This is where a simple spreadsheet file no longer meets the objective.
Feature | Description | Excel | DBMS |
---|---|---|---|
Data Storage and Retrieval | Stores data in an organized manner and retrieves it as needed. | ✅ | ✅ |
Data Manipulation | Allows adding, modifying, or deleting data. | ✅ | ✅ |
Data Querying | Allows asking complex questions (queries) about the data. | ✔️ | ✅ |
Data Organization | Structures data in formats such as tables, documents, or graphs to facilitate management. | ✅ | ✅ |
Data Sharing | Allows multiple users or applications to use the database simultaneously. | ✅ | ✅ |
Data Security | Protects data against unauthorized access or corruption. | ✔️ | ✅ |
Concurrency Control | Manages multiple users modifying data at the same time without conflicts. | ✅ | |
Backup and Recovery | Ensures that data is not lost and can be restored in case of failure. | ✔️ | ✅ |
Data Integrity | Ensures that data remains accurate, consistent, and reliable. | X | ✅ |
Performance Optimization | Provides tools to optimize the speed and efficiency of data retrievals and updates. | X | ✅ |
Support for Transactions | Ensures that a group of operations (transactions) is completed entirely or not at all. ACID compliance |
X | ✅ |
A key feature of a real DBMS is ACID compliance.
1980s - Domination of Relational
More inportantly:
And in 2009, 2 new NoSQL databases are launched: 🥭 MongoDB 🥭 and 🎉 Neo4j 🎉
The rise of the world wide web (myspace - 2003 😍, youtube - 2005) and the massive increase in application scale by several orders of magnitude.
Suddenly, we have millions of people simultaneously trying to access and modify Terabytes of data in milliseconds.
Relational databases cannot keep up with the scale of applications, the chaotic nature of unstructured data, and the speed requirements.
The promise of NoSQL is volume and speed.
2010s - NoSQL Matures & Specialization - Big Data and Specialized Databases
and meanwhile, in 2013, Docker containers revolutionize database deployment
2020s: AI, Vector Databases, and Real-Time Needs
Vector search is booming and most existing DBMS, including PostgreSQL, MongoDB, and Neo4j are integrating vector search capabilities.
A relational database stores information in tables made up of rows
and columns
.
row
is a recordcolumn
is a fieldkeys
, so you can relate data across themThat’s why it’s called relational, the power comes from defining and using these relationships.
Main Categories of Databases Today
We have many databases to choose from. It all depends on scale, application nature, budget, etc.
Database Type | Purpose | Examples | Application |
---|---|---|---|
Relational - SQL | Fixed schema | PostgreSQL, MySQL, Oracle | Transactions, normalization |
Document Stores | Flexible schema, JSON-like documents | MongoDB, CouchDB | Web applications, content management |
Graph Databases | Relationship-centered data | Neo4j, ArangoDB | Social networks, recommendation engines |
Key-Value Stores | Simple and fast lookups | Redis, DynamoDB | Caching, session management |
Vector Databases | Similarity search, AI embeddings | Pinecone, Weaviate | AI applications, semantic search |
Column-Family Stores | Wide-column data, high scalability | Cassandra, HBase | Time-series, big data applications |
Time-Series Databases | Time-ordered data | InfluxDB, TimescaleDB | IoT, monitoring systems |
Check the ranking of all databases at https://db-engines.com/en/ranking
Trends: https://db-engines.com/en/ranking_trend
source: https://www.generativevalue.com/p/a-primer-on-databases
Also see this interactive map that lists all players in 2023.
Advantages
✅ Free to use
✅ Transparent (view + audit code)
✅ Customizable
✅ Strong community support
✅ No vendor lock-in
Disadvantages
❌ May need more setup & tuning
❌ Support = community or paid 3rd party
❌ Fewer enterprise features (sometimes)
Advantages
✅ Enterprise-grade support
✅ Robust tooling & integrations
✅ Optimized performance at scale
✅ Security/compliance features
Disadvantages
❌ Expensive licensing
❌ Closed codebase
❌ Vendor lock-in risk
❌ Limited customization
A database is a representation of a complex reality (ex: a company, an aircraft fleet, a maintenance service).
Modeling makes it possible to:
👉 Before creating data tables, the problem must be analyzed and structured.
MERISE is a modelisation framework that extends beyond databases to processes and project management
MERISE is rarely used today and if so only in France.
There are many more up to date modelisation methods
Nonetheless the concepts are similar
and this is part of the final exam.
also chatGPT knows everything about MERISE
Modern alternatives
Name | Main Concept | Advantages |
---|---|---|
UML (Unified Modeling Language) | Object-oriented diagrams for data + behavior | Matches modern software design, flexible, widely adopted |
Anchor Modeling | Agile data modeling with anchors, attributes, ties | Handles evolving schemas, temporal data, automation-ready |
Data Vault | Hubs, links, satellites for enterprise data warehouses | Scalable, auditable, integrates diverse sources |
Evolutionary Database Design | Iterative schema changes, migrations, CI/CD | Fits Agile workflows, supports continuous delivery |
Modern Tools (PowerDesigner, ER/Studio, dbdiagram.io…) | Collaborative modeling platforms | Automation, documentation, real-time teamwork |
Software development
Merise (like Waterfall):
Modern alternatives: UML, Anchor Modeling, Evolutionary DB Design, Data Vault, etc.
Better suited for today’s fast-changing, data-driven projects.
(a bit more Agile’ish):
Merise is to Waterfall
what modern modeling techniques are to Agile
— the first focuses on heavy planning and rigid steps, while the second focuses on adaptability and iterative improvement.
Merise : (Méthode d’Étude et de Réalisation Informatique par les Sous-Ensembles ou pour les Systèmes d’Entreprises).
aka : Méthode Éprouvée pour Retarder Indéfiniment la Sortie des Études (source wikipedia)
The overall approach to building an Information System (IS):
Three levels of concern
Each stakeholder has a different perspective: (examples: sales, marketing, billing)
Merise aims to provide representations suited to each level with:
2 families of models
3 levels of Information System (IS) description
6 models
A graphical and structured representation of an IS’s information.
Based on two notions: entities and associations, also called an Entity/Association diagram.
Steps to build it:
The CDM provides an interactive graphical representation that helps you understand the interrelations of different elements using coded diagrams.
At the start of database design, the CDM then leads to the PDM and LDM
from : MCD
The Data Dictionary is a document that lists and describes all the data of an information system. Serves as a common reference between analysts, developers, and users.
For each data item, specify:
Type:
Code | Designation | Type | Size | Constraints / Rules |
---|---|---|---|---|
id_client | Client ID | INT | - | Primary key, unique, not null |
nom | Client name | VARCHAR | 50 | Mandatory |
date_naissance | Birth date | DATE | - | >= 1900-01-01 and <= current date |
Example — Data Dictionary (CLIENT)
📌 In modern DBMS, SQL types like CHAR/VARCHAR are used instead of A, or INT/DECIMAL instead of N.
Entity Object (concrete or abstract) through which the organization (company, administration, etc.) recognizes an independent existence. (ex: supplier, client, order, airplane, etc.)
Property Characteristics of the entity (ex: name, price, date)
Identifier Special property that distinguishes each occurrence (ex: supplier No. 234, employee X456, etc.)
Entity: Book | name of the entity |
---|---|
Book_id | Identifier |
Title | Property |
Genre | Property |
Author_id (Property) | Property |
MCD – Basic Concepts (continued)
Relation Links between two (or more) entities (ex: Client places Order)
Cardinality Number of times (minimum, maximum) an entity is involved in a relation.
Value constraints:
Business rules: Define how the system works:
Functional dependencies (FD)
Between properties:
Between entities:
👉 Use: check consistency, avoid redundancy, help normalization
Uniqueness of data Each piece of data appears only once. If duplicate → they represent different realities.
Simplify relations Replace an n-dimension relation with several smaller relations.
Eliminate calculated data Example: birth date + today’s date ⇒ age
Graphical representation of the CDM Entity – Relation
Types of relations
Types of relations
Types of relations
Purpose: to translate the Conceptual Data Model (CDM/MCD) into a logical form adapted to a database (relational, hierarchical, network, depending on the DBMS).
What it contains:
Characteristics:
The CDM (MCD) describes what to store and the relationships, at a conceptual level.
The LDM (MLD) describes how to organize this data for storage in a DBMS (mainly relational).
Concepts at the MLD level:
Table (or relation)
Structured set of data organized in a table
Record Set of table properties associated with a key (row of a table, with key)
Key Property (or group of properties) uniquely identifying one entity among others of the same type. (ex: one employee among all employees)
View Grouping of properties (columns) from one or more tables corresponding to a specialized perspective of reality.
Transformation rules:
Each entity becomes a table
Each relation becomes either:
{0,1} vs {1,1} : optional vs mandatory <=> add constraints with NULL
ok or NOT NULL
CREATE TABLE Person (
person_id INT PRIMARY KEY,
name VARCHAR(50)
);
CREATE TABLE Passport (
passport_id INT PRIMARY KEY,
person_id INT UNIQUE, -- one-to-one
FOREIGN KEY (person_id) REFERENCES Person(person_id),
-- If mandatory:
person_id INT UNIQUE NOT NULL,
-- If optional:
person_id INT UNIQUE NULL
);
Graphical representation of tables
⚠️ There are no direct links between tables. Dashed lines may help understanding, but it’s the pairs “primary key – foreign key” that represent the relation from the CDM.
Example: key of T1 is property 4 of T2
More transformation rules from CDM to LDM.
Tables include:
Schemas (text representation):
PLANE (
id_plane : integer [primary key, required],
model : text(50) [required],
flight_hours : integer [optional]
)
Tuples:
After analysis, the following rules were identified:
create the MCD and MLD for this database.
intermission
To normalize a table means to apply one or more decompositions to eliminate potential internal redundancies.
Based on the notion of keys and functional dependencies.
There are 6 common normal forms (1NF, 2NF, 3NF, 4NF, 5NF, and Boyce-Codd Normal Form).
(in fact there are many more normalization rules see wikipedia normalization)
But then you have to handle duplicates of the data: inserts, updates, deletes etc
Orders(order_id, order_date, customer_id, customer_name, customer_address) Invoices(invoice_id, order_id, customer_id, customer_name, customer_address, amount)
Both Orders and Invoices store customer_name and customer_address.
The same customer info is duplicated in two tables.
makes sense if you quickly want to see the customer name and address in the orders and invoices in a dashboard.
More formally, a database is normalized if:
all column values depend only on the table primary key,
aka
each piece of information lives in just one place
Definition: A table is in 1NF if every attribute contains an atomic value. Lists, arrays, or complex structures cannot be attribute values.
attribute = value
actors : ['actor1', 'actor2', 'actor3'] => not in 1NF
actor : 'actor1' => ok
Solution 1: Create multiple attributes (horizontal storage).
actor_01 : 'actor1'
actor_02 : 'actor2'
actor_03 : 'actor3'
This does not scale and is a terrible solution!
etc
Solution 2: Create a new table for the multivalued attribute (vertical storage, cleaner).
Movie table : id, title, duration
Actor table : id, name, movie_id
one movie has many actors
Table schema:
PERSON (idPerson, name, firstName, address, cars)
Decomposition into tables:
PERSON (idPerson, name, firstName, streetNo, streetName, postalCode, city) CAR (idCar, model, brand, owner)
Note : The 1NF is not compatible with arrays or objects data types. But they exist and are 100% legit.
=> use common sense when applying normal forms
sometimes it’s better to list the values for a given record than to create a new table.
Definition: A relation is in 2NF if and only if:
non-key attribute
depends on the entire key, not just part of it👉 Removes redundancy
👉 Only concerns tables with a composite primary key (multiple attributes)
A key in a database is a column (or a set of columns) that uniquely identifies a row in a table.
Simple (single) key: one column does the job.
Example: student_id in a Students table.
Composite key: two or more columns together make it unique.
Example: (order_id, product_id) in an OrderDetails table.
👉 If you know the key, you can point to exactly one row, no confusion.
This creates redundancy : the same value appears multiple times in the table.
2NF: all non-key attributes depend on the entire key
Attention: 1NF + non composite key => 2NF
ASSIGNMENT (idPerson, idInstitution, name, firstName, institutionName)
ASSIGNMENT (idPerson, idInstitution, name, firstName, institutionName)
Is it in 2NF?
if the key is (idPerson, idInstitution) then institutionName only depends on idInstitution, not idPerson so the dependency is partial
if the key is just idPerson, then no problem since there is no partial dependency
https://en.wikipedia.org/wiki/Second_normal_form
_A relation (or a table, in SQL) is in 2NF if it is in first normal form (1NF) and contains no partial dependencies.
A partial dependency occurs when a non-prime attribute (that is, one not part of any candidate key) is functionally dependent on only a proper subset of the attributes making up a candidate key._
many to many relations
PERSON (idPerson, name, firstName)
INSTITUTION (idInstitution, institutionName, brand)
ASSIGNMENT (idPerson, idInstitution)
Or with idInstitution
as the foreign key : a person blongs to only one institution (one to many)
PERSON (idPerson, name, firstName, idInstitution)
INSTITUTION (idInstitution, institutionName, brand)
Definition: A relation is in 3NF if and only if:
Formally: For every dependency A → X, either A is a superkey or X is part of a key.
Orders(
order_id INT PRIMARY KEY,
customer_id INT,
customer_name VARCHAR(50),
customer_city VARCHAR(50),
city_zipcode VARCHAR(10)
)
where order_id
is the key (not order_id + customer_id
). the key is not composite. so it is in 2NF
❌ But not in 3NF:
city_zipcode
depends on customer_city
, not directly on the primary key (order_id
).This is a transitive dependency:
order_id → customer_city → city_zipcode
Split into 3 separate tables:
Customers(
customer_id INT PRIMARY KEY,
customer_name VARCHAR(50),
city_id INT,
FOREIGN KEY (city_id) REFERENCES Cities(city_id)
)
Cities(
city_id INT PRIMARY KEY,
city_name VARCHAR(50),
city_zipcode VARCHAR(10)
)
Orders(
order_id INT PRIMARY KEY,
customer_id INT,
FOREIGN KEY (customer_id) REFERENCES Customers(customer_id)
)
👉 Now:
Orders
→ linked to CustomersCustomers
→ linked to Citiescity_zipcode
depends only on city_id
(the key of Cities), not on a non-key attribute.So this design is in 3NF.
Example table schema:
Tracks(track_id, title, album_id, album_name, artist_id, artist_name)
where track_id
is the key
Decompose the relation into 2 tables:
CAR (regNo, type, color)
MODEL (type, brand, power)
1st Normal Form: For each relation, state if it is in 1NF.
2nd Normal Form: For each relation, state if it is in 2NF.
3rd Normal Form: For each relation, state if it is in 3NF.
Merise method:
🎯 Goal: go from business problem to a reliable relational database