Intermediate databases - Epita 2025


Who am I?

Alexis Perrier

  • PhD TelecomParis 95'
  • Developper, Teacher, Author, Data scientist
  • cycling, staying fit, scifi, nerd

The Course

--- # Course organization

15 sessions - 2h


Course content


What is this course about?

What you will be able to do:

=> which in turn should bring good health, fortune and happiness


Curriculum / Scope

Advanced Database Design: understanding of normalization and denormalization concepts to optimize database structure



Ifs

if we have time : Common cloud services; Google BigQuery
If there's a specific subject you want to address let me know.

datasets

Real datasets!

  • Trees of Paris from Paris open data
  • Ademe: building energy efficiency
  • Airdb : flights, passengers, airports
  • Classic kaggle datasets
trees
---

Questions ?

---

Getting to know you

Just a few questions to know more about you

form [forms.gle/VrpvjaPGEKCUji4Q9](https://forms.gle/VrpvjaPGEKCUji4Q9)

Today

Goals


resources

Lots of good resources around

Postgres tutorials

The best resource is the excellent documentation


Why SQL ? Why PostgresQL ?

---

Why a database ?

organize complex structured data

**Spotify** example - users, subscriptions - devices - songs - artists, albums - playlists - plays - etc …
**Social Network** example - users, subscriptions - followers - engagement - posts - media - etc …

SQL is everywhere


database vs excel file ?

Discussion : Why use a database instead of just … files like json, or excel / csv ?

what is a database asks Sheldon?

What is a database?

Sometimes we hear the word database used to describe an Excel file.

How can we put something as simple as a CSV or Excel file on the same level as these engineering marvels that are **PostgreSQL**, Weaviate, **MongoDB**, Neo4j, **Redis**, MySQL, etc... ?

A definition

So I asked my friend GPT-4 to give me a definition of a database:

In simple terms: "A database is like a smart notebook or filing system that helps you keep track of lots of information and find exactly what you need in no time."

which definitely includes CSV files, Excel files, JSON files, XML files, and many other simple file-based formats.


Another definition

From the Encyclopedia Britannica, we get:

database, any collection of data or information specifically organized for rapid search by computer. Databases are structured to facilitate storage, retrieval, modification, and deletion of data.

See also the Database article on Wikipedia.

a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data.


Not only about finding data

No longer just talking about quickly finding information (the search part) but also about:

This is where a simple spreadsheet file no longer meets the objective.


What we expect from a DBMS

A database management system (DBMS) is distinguished from a simple spreadsheet by several essential functionalities.

FeatureDescriptionExcelDBMS
Data Storage and RetrievalStores data in an organized manner and retrieves it as needed.
Data ManipulationAllows adding, modifying, or deleting data.
Data QueryingAllows asking complex questions (queries) about the data.✔️
Data OrganizationStructures data in formats such as tables, documents, or graphs to facilitate management.
Data SharingAllows multiple users or applications to use the database simultaneously.
Data SecurityProtects data against unauthorized access or corruption.✔️
Concurrency ControlManages multiple users modifying data at the same time without conflicts.
Backup and RecoveryEnsures that data is not lost and can be restored in case of failure.✔️
Data IntegrityEnsures that data remains accurate, consistent, and reliable.
Performance OptimizationProvides tools to optimize the speed and efficiency of data retrievals and updates.
Support for TransactionsEnsures that a group of operations (transactions) is completed entirely or not at all.
ACID compliance

A Brief History of Databases

--- Evolution of database from 1970 to 2024

1970s - The Beginning of the Relational Era


1980s - The Domination of Relational

PostgreSQL

1990s - The Object-Oriented Wave


2000s - The Beginning of the NoSQL Revolution

And in 2009, 2 new NoSQL databases are launched:


Mongodb is web scale

That was 15 years ago back in Aug 27, 2010.


Why at this moment?

The rise of the world wide web (myspace - 2003 😍, youtube - 2005) and the massive increase in application scale by several orders of magnitude.

Suddenly, we have millions of people simultaneously trying to access and modify Terabytes of data in milliseconds.

Relational databases cannot keep up with the scale of applications, the chaotic nature of unstructured data, and the speed requirements.

The promise of NoSQL is volume and speed.


2010s

NoSQL Matures & Specialization - Big Data and Specialized Databases

and meanwhile, in 2013, Docker containers revolutionize database deployment


2020s: AI, Vector Databases, and Real-Time Needs


Current Trends (2025)

Vector search is booming. Vector search capabilities are being integrated into most existing DBMS, including PostgreSQL, MongoDB, and Neo4j.


In Brief

Evolution of database from 1970 to 2024

Types of relational databases


Other Types of databases

There are other types of databases than relational databases :


Main Categories of Databases Today

We have many databases to choose from. It all depends on scale, application nature, budget, etc.

Database TypePurposeExamplesApplication
Relational - SQLFixed schemaPostgreSQL, MySQL, OracleTransactions, normalization
Document StoresFlexible schema, JSON-like documentsMongoDB, CouchDBWeb applications, content management
Graph DatabasesRelationship-centered dataNeo4j, ArangoDBSocial networks, recommendation engines
Key-Value StoresSimple and fast lookupsRedis, DynamoDBCaching, session management
Vector DatabasesSimilarity search, AI embeddingsPinecone, WeaviateAI applications, semantic search
Column-Family StoresWide-column data, high scalabilityCassandra, HBaseTime-series, big data applications
Time-Series DatabasesTime-ordered dataInfluxDB, TimescaleDBIoT, monitoring systems

Fuzzy separation

mongodb vecto search screenshot.

Ecosystem

Check the ranking of all databases at https://db-engines.com/en/ranking

Trends: https://db-engines.com/en/ranking_trend


database trends.

Multitude of players:

source: https://www.generativevalue.com/p/a-primer-on-databases

Also see this interactive map that lists all players in 2023.

https://mad.firstmark.com/


Database Market Map 2024


Open source vs closed

Open-Source Databases

Advantages

✅ Free to use

✅ Transparent (view + audit code)

✅ Customizable

✅ Strong community support

✅ No vendor lock-in

Disadvantages

❌ May need more setup & tuning

❌ Support = community or paid 3rd party

❌ Fewer enterprise features (sometimes)


Proprietary Databases

Advantages

✅ Enterprise-grade support

✅ Robust tooling & integrations

✅ Optimized performance at scale

✅ Security/compliance features

Disadvantages

❌ Expensive licensing

❌ Closed codebase

❌ Vendor lock-in risk

❌ Limited customization


Postgres

---

Postgres rules!

In terms of relational databases compare

https://opensource.com/article/19/1/open-source-databases


Postgres advantages

PostgreSQL's most mentioned advantage is the efficiency of its central algorithm, which means it outperforms many databases that are advertised as more advanced.

One of the most flexible open source databases around;

You can write functions in a wide range of server-side languages: Python, Perl, Java, Ruby, C, and R.


SQLITE

SQLite is arguably the most implemented database engine in the world,

unlike many other databases it is not a client-server engine; rather, the full software is embedded into each implementation

to build and implement a small database, SQLite is arguably the best way to go. It is extremely small,

also very Fast


PostgresQL Extensions

Extensions : add-ons that enhance the functionality of a PostgreSQL database.

postgresql extensions


Case study

The Guardian: from MongoDB to postgresQL

Take a few minutes to read the article

https://www.theguardian.com/info/2018/nov/30/bye-bye-mongo-hello-postgres

from 2018, so no hasty conclusion. the article underlines the difficulty of choosing the right database. MongoDB has evolved a lot since then.


Takeaways

Reasons for Moving from MongoDB to Migration Process: PostgreSQL:

  1. Operational Challenges: The Guardian faced significant issues with MongoDB's OpsManager, including time-consuming upgrades, lack of effective support during outages, and the need for extensive custom scripting and management.

  2. Cost and Efficiency: The high cost of MongoDB's support contract combined with the ongoing operational burden led them to seek a more manageable and cost-effective solution.

  3. Feature Limitations: Alternatives like DynamoDB were considered but lacked essential features like encryption at rest, which Postgres on AWS RDS provided.

  4. Parallel APIs: They created a new API using PostgreSQL and ran it in parallel with the old MongoDB API to ensure a smooth transition.

  5. Data Migration: Content was migrated using a script that compared and validated data between the two databases.

  6. Proxy Usage: A proxy was employed to replicate traffic to both databases, ensuring consistency and allowing for real-time testing.

  7. Gradual Switchover: The team gradually shifted traffic to the new Postgres API, eventually decommissioning MongoDB without causing downtime.


Next

Install postgresql

1 / 0