- normalization
- OLAP vs OLTP
- simplicity: create an understandable structure for developers and analysts
So the question is:
How to design the data structure of a database?
Spoiler alert: it always comes down to balancing between read and write performance
We start with Entity-Relationship Diagrams: ERDs
and then:
Peter Chen developed the ER diagram in 1976.
Chen’s main contributions are formalizing the concepts, developing a theory with a set of data definition and manipulation operations, and specifying the translation rules from the ER model to several major types of databases including the Relational Database.
1976!? That’s what computers looked like in 1976 🤪🤪🤪

The ER model was created to visualize data structures and relationships in many situations.
ER diagrams help in system design, information architecture, and application workflows.

Attributes are values like student_id, course_name, or employee_email.

Chen also proposed a mapping from English grammar to ER structure:

| English grammar structure | ER structure |
|---|---|
| Common noun | Entity type |
| Proper noun | Entity |
| Transitive verb | Relationship type |
| Intransitive verb | Attribute type |
| Adjective | Attribute for entity |
| Adverb | Attribute for relationship |
see the Wikipedia page
For a music streaming example:

Entities:

- song
- playlist
- user
- artist

Attributes:

- duration (3:42)
- play_count
- created_date
- popularity_score
- duration
- is_public (boolean)
- is_verified

Attributes on relationships:

- timestamp on the “creates” relationship
- position on the “contains” relationship
- sharing_type on the “shares” relationship
- play_frequency on the user–song relationship

Example sentences and the attributes they map to:

- “The verified artist Queen performs the popular song Bohemian Rhapsody”: is_verified (Artist), popularity_score (Song)
- “John recently created a public playlist called Summer Vibes”: timestamp (recently), is_public (public)
- “The playlist frequently contains long songs”: frequency (frequently), duration (long)

    [User] ----creates----> [Playlist] ----contains----> [Song]
      |       (when?)           |        (position?)        |
      |                         |                           |
     name                    is_public                   duration
     email                   name                        title
                             created_date                popularity

    [Artist]
       |
    performs
       |
    [Song]
In the context of relational databases, the ER diagram represents the structure of the database.
The ER diagram displays the relations between the entities (tables) present in the database and lists their attributes (columns).

You can change the notation used for the relationship type.


The usage of a database drives its data structure.

OLAP:

- can be asynchronous, query execution does not have to be lightning fast
- OLAP does not have to be super fast
- example: computing monthly sales per region for a quarterly report

OLTP:

- synchronous, real time, speed is super important
- OLTP has to be super fast
- example: recording a payment at checkout
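As a sketch, the contrast in SQL might look like this (sales and orders are hypothetical tables):

```sql
-- OLAP-style query: scans and aggregates many rows;
-- fine if it takes seconds and runs asynchronously.
SELECT region,
       date_trunc('month', sold_at) AS month,
       sum(amount)                  AS revenue
FROM sales
GROUP BY region, date_trunc('month', sold_at)
ORDER BY month, region;

-- OLTP-style query: touches one row by its key
-- and must answer in milliseconds.
SELECT status
FROM orders
WHERE order_id = 42;
```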
Further reading: Difference between OLAP and OLTP in DBMS.
Look at the table of differences and the Q&A at the end of the article.
For each scenario, determine whether it’s more suited for an OLTP (Online Transaction Processing) or OLAP (Online Analytical Processing) system.
Solution
A user can have multiple phones. Consider the 2 database designs:
1 table with multiple phone columns

2 tables
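As a sketch in SQL (hypothetical column names, consistent with the discussion below):

```sql
-- Design 1: one table, one column per phone "slot"
CREATE TABLE accounts_flat (
    account_id  SERIAL PRIMARY KEY,
    name        TEXT,
    home_phone  TEXT,
    work_phone  TEXT,
    cell_phone  TEXT
);

-- Design 2: two tables, one row per phone number
CREATE TABLE accounts (
    account_id  SERIAL PRIMARY KEY,
    name        TEXT
);

CREATE TABLE phones (
    phone_id    SERIAL PRIMARY KEY,
    account_id  INTEGER REFERENCES accounts(account_id),
    phone_type  TEXT,   -- 'home', 'work', ...
    number      TEXT
);
```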

Which design (1 or 2 tables) gives a faster or simpler query for:
The general goal of normalization is to reduce data redundancy and dependency by organizing data into separate, related tables.
A database is normalized if:
Each piece of information lives in exactly one place
or more formally
all column values depend only on the table primary key
This helps maintain data integrity and flexibility:
Normalized databases are easier to keep consistent and simpler to evolve.
In the 1-table design for the account and its phone numbers, a phone number value depends on the name of the phone column (home_phone, work_phone, …), not just on the account_id key: we can say it’s not normalized.
With a table dedicated to the phones (the design with 2 tables), the phone value depends only on the phone_id key: the tables are normalized.
The idea of denormalization is to create data redundancy to simplify queries and make OLAP queries faster.
Redundant data: the same data / info exists in multiple tables.
SELECT queries involve fewer JOINs.
However INSERT, UPDATE, DELETE queries are more complex as multiple tables must be accounted for. Therefore data integrity is more complex to preserve.
In a social network, imagine that you have two tables:
a users table with a user_id, name and email:

| user_id | user_name | email |
|---|---|---|
| 101 | Ulaf | … |
| 102 | Birgitte | … |
| 103 | Inge | … |
| 114 | Boris | … |
a posts table with a post_id, content, publication_date … and the post’s author information through the user_id:

| post_id | user_id | content | pub_date |
|---|---|---|---|
| 1 | 101 | I ❤️ postgreSQL | 2022-10-01 |
| 2 | 103 | Singing in the rain | 2022-10-02 |
| 3 | 102 | Nerds unite! | 2022-10-02 |
| 4 | 114 | Guys? i’m bored 😒 when’s the break? | 2022-10-02 |
| 5 | 103 | Taylor swift omg! #sweet | 2022-10-03 |
In a normalized database: users and their posts are only linked by the user_id which is a foreign key in the posts table.
The users table has no information about posts. And similarly the posts table is all about the posts and not their author (besides the user_id foreign key).
So when you need to display the user’s name next to their post, you need to JOIN the users and posts tables.
SELECT p.*, u.user_name
FROM posts p
JOIN users u ON p.user_id = u.user_id
WHERE p.pub_date = '2022-10-02';
Too many JOINs over large volumes of data will slow down the query.
In order to improve performance, you can denormalize the posts table by adding the user_name to the Posts table.
A Denormalized Posts table would then look like:
| post_id | user_id | user_name | content | pub_date |
|---|---|---|---|---|
| 1 | 101 | Ulaf | I ❤️ postgreSQL | 2022-10-01 |
| 2 | 103 | Inge | Singing in the rain | 2022-10-02 |
| 3 | 102 | Birgitte | Nerds unite! | 2022-10-02 |
| 4 | 114 | Boris | Guys? i’m bored 😒 when’s the break? | 2022-10-02 |
| 5 | 103 | Inge | Taylor swift omg! #sweet | 2022-10-03 |
SELECT p.*
FROM posts p
WHERE p.pub_date = '2022-10-02';
Faster read performance since you can fetch the user_name along with the post data without needing to perform a join between the Users and Posts tables.
But
Data redundancy: If Ulaf changes his name, you will need to update it in both the Users table and every row in the Posts table that references him. This increases the complexity of updates.
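For example, a rename then takes two statements instead of one (a sketch against the tables above):

```sql
-- Normalized schema: renaming Ulaf is a single-row update.
UPDATE users SET user_name = 'Ulaf II' WHERE user_id = 101;

-- Denormalized schema: the copy in posts must be updated as well,
-- ideally in the same transaction so the two copies never diverge.
BEGIN;
UPDATE users SET user_name = 'Ulaf II' WHERE user_id = 101;
UPDATE posts SET user_name = 'Ulaf II' WHERE user_id = 101;
COMMIT;
```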
So, given a database, how do you know if it needs to be normalized?
There are 3 types of anomalies that you can look for: insertion, update and deletion anomalies.
A normalized database will solve these anomalies.
Consider the table of teachers and their courses
| id | teacher | course |
|---|---|---|
| 1 | Bjorn | Intermediate Database |
| 2 | Sofia | Crypto Currencies |
| 3 | Hussein | Data Analytics |
Now a new teacher, Alexis, is recruited by the school. The teacher does not have a course assigned yet. If we want to insert the teacher in the table, we need to put a NULL value in the course column.
and NULL values can cause problems!!! (more on that later)
More generally:
Consider now the following movies, directors and production houses
| id | Film | Director | Production House |
|---|---|---|---|
| 1 | Sholay | Ramesh Sippy | Sippy Films |
| 2 | Dilwale Dulhania Le Jayenge | Aditya Chopra | Yash Raj Films |
| 3 | Kabhi Khushi Kabhie Gham | Karan Johar | Dharma Productions |
| 4 | Kuch Kuch Hota Hai | Karan Johar | Dharma Productions |
| 5 | Lagaan | Ashutosh Gowariker | Aamir Khan Productions |
(chatGPT knows Bollywood 😄) + (I ❤️ Shahrukh Khan)
If one of the production houses changes its name, we need to update multiple rows.
In a small example like this one, this is not a problem since the query is trivial but in large databases some rows may not be updated.
By moving the production house data into its own separate table, updating its name would only impact one row!
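A sketch of that fix (hypothetical names):

```sql
CREATE TABLE production_houses (
    house_id  SERIAL PRIMARY KEY,
    name      TEXT
);

CREATE TABLE films (
    film_id   SERIAL PRIMARY KEY,
    title     TEXT,
    director  TEXT,
    house_id  INTEGER REFERENCES production_houses(house_id)
);

-- Renaming a production house now touches exactly one row
-- (house_id = 3 is a made-up example):
UPDATE production_houses SET name = 'Dharma Productions Intl' WHERE house_id = 3;
```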
Here is a table about programmers, languages and (secret) projects
| Developer_Name | Language | Project_Name |
|---|---|---|
| Lisa | Python | Atomic Blaster |
| Bart | Java | Invisible tank |
| Montgomery | Python | Atomic Blaster |
| Maggie | JavaScript | Exploding Squirrel |
| Ned | C++ | War AI |
| Homer | Python | Atomic Blaster |
| Marge | Ruby | War AI |
Let’s say that for imperative reasons we need to cover up the existence of the War AI project, and we delete all rows that mention that project in the database. An unfortunate consequence is that we also delete all mentions of Marge and Ned, since they are not involved in other projects.
So by deleting rows about one attribute value, we also remove certain values of other attributes.
When you have different but related natural / logical entities packed together in the same table, this causes anomalies and you need to normalize.
Why is it a problem to have NULL values in a column?
- Ambiguity: NULL can mean “unknown”, “not applicable” or “missing”, leading to confusion.
- Comparisons behave differently (IS NULL vs =).
- NULL values are ignored in aggregate functions (like SUM, COUNT).
- Impacts indexing and performance, since NULLs may be excluded from indexes.
So avoid NULL values if you can!
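A quick demo of two classic NULL surprises:

```sql
-- NULL is not equal to anything, not even to itself:
SELECT NULL::int = NULL::int;   -- NULL, not true
SELECT NULL::int IS NULL;       -- true

-- Aggregates silently skip NULLs (t is a throwaway table):
CREATE TEMP TABLE t (x INTEGER);
INSERT INTO t VALUES (1), (2), (NULL);
SELECT count(*) AS all_rows,    -- 3
       count(x) AS non_null_x,  -- 2: the NULL row is not counted
       avg(x)   AS avg_x        -- 1.5: the NULL is ignored, not treated as 0
FROM t;
```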

This is the schema for the imdb top 1000 table
What anomalies can you spot?
CREATE TABLE imdb_raw (
Poster_Link TEXT,
Series_Title TEXT,
Released_Year TEXT,
Certificate TEXT,
Runtime TEXT,
Genre TEXT, -- "drama, sci-fi, action"
IMDB_Rating TEXT,
Overview TEXT,
Meta_score TEXT,
Director TEXT,
Star1 TEXT,
Star2 TEXT,
Star3 TEXT,
Star4 TEXT,
No_of_Votes TEXT,
Gross TEXT
);
1. [deletion] Losing Director Information
Problem: Deleting a movie shouldn’t mean losing knowledge about directors
2. [insertion] Can’t Track Actors Without Casting Them
Problem: Actor existence shouldn’t depend on being in top 4 billing
3. [update] Updating Actor Information
If “Tom Hanks” appears as Star1 in some rows and Star2, Star3 or Star4 in others, then to fix a typo (e.g., “Tom Hansk”) you have to search and update all four Star columns across every affected row.
4. [update] Certificate/Rating System Changes
If the rating system changes (e.g., “PG-13” becomes “T-13”), every row carrying the old value must be updated.
What anomalies can you find for each type: insertion, update and deletion?
Insertion Anomalies:
Update Anomalies:
Deletion Anomalies:
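One way to remove these anomalies is to split imdb_raw into several tables. A sketch (one possible design, not the only one):

```sql
CREATE TABLE movies (
    movie_id      SERIAL PRIMARY KEY,
    title         TEXT,
    released_year INTEGER,
    imdb_rating   NUMERIC
);

CREATE TABLE people (
    person_id SERIAL PRIMARY KEY,
    name      TEXT   -- directors and actors exist independently of movies
);

-- One row per (movie, actor) pair: no more Star1..Star4 columns
CREATE TABLE movie_actors (
    movie_id  INTEGER REFERENCES movies(movie_id),
    person_id INTEGER REFERENCES people(person_id),
    billing   INTEGER,
    PRIMARY KEY (movie_id, person_id)
);
```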
There’s a need for normalization when this rule is violated:

> every non-key attribute must provide a fact about the key, the whole key, and nothing but the key
A normal form is a rule that defines a level of normalization.
In general a database is considered normalized if it meets 3NF level.
There are multiple higher levels (BCNF, 4NF, 5NF, …).
Normal forms are a gradual step by step process towards normalization. Each form is a guide that focuses on a single type of problem.
We can always choose to apply a normalization form or to ignore it
In the following, we mention:
Each field contains a single value
A relation is in first normal form
if and only if
no attribute domain has relations as elements.
which translates as: no column has sets of values (relations) as elements.
In short: columns cannot hold composite values: arrays, JSON, …
More on 1NF:
Does the tree table follow 1NF? What about the imdb table?
There’s a contradiction between first normal form (1NF) and the existence of ARRAY and JSONB data types (see also POINT, HSTORE, RANGE, composite types, and many others) in SQL databases.
The rule of thumb is: when you frequently need to access, modify or handle the individual elements of the set of values for a given record, then apply 1NF.
For instance, consider a collaborative blog post: we start with 1 author (varchar), then a co-author comes along (array), then a third and a fourth. Then the first one gives up and no longer wants to be in the list of authors, etc. In that case, using an array to store the list of authors causes more trouble than it solves, and authors should have their own table.
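A sketch of the 1NF fix for that blog example (hypothetical names):

```sql
-- Instead of: blog_posts(post_id, title, authors TEXT[])
CREATE TABLE blog_posts (
    post_id SERIAL PRIMARY KEY,
    title   TEXT
);

CREATE TABLE authors (
    author_id SERIAL PRIMARY KEY,
    name      TEXT
);

-- One row per (post, author) pair
CREATE TABLE post_authors (
    post_id   INTEGER REFERENCES blog_posts(post_id),
    author_id INTEGER REFERENCES authors(author_id),
    PRIMARY KEY (post_id, author_id)
);

-- The first author quitting post 1 is now a one-row DELETE:
DELETE FROM post_authors WHERE post_id = 1 AND author_id = 1;
```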
In the end it’s a balance between simplicity and ease of control.
For instance:
The table is in 2NF iff: it is in 1NF, and every non-key attribute depends on the whole of the primary key, not just part of it (no partial dependency).
Some cases of non-compliance with 2NF
- Employee(EmployeeID, Name, BirthDate, Age): Age is derived from BirthDate, causing a partial dependency.
- Order(OrderID, ProductID, ProductName, ProductCategory): ProductCategory and ProductName both depend on ProductID, not the full (OrderID, ProductID) key (see the sketch below).
- B(Bid, B.attribute, A.attribute)
- R(A, B, C, D) where (A, B) is the composite primary key: if C depends only on A, we have a 2NF violation.
- The 1-table version of the account phone table was not in 2NF.
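For instance, the Order case is fixed by moving the product facts into their own table (a sketch, with hypothetical names):

```sql
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    name       TEXT,
    category   TEXT
);

-- product_name and product_category no longer ride along on every order row
CREATE TABLE order_items (
    order_id   INTEGER,
    product_id INTEGER REFERENCES products(product_id),
    quantity   INTEGER,
    PRIMARY KEY (order_id, product_id)
);
```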
The tree table has many 2NF violations, since all the categorical columns are in a one-to-many relation (a category (A) has many trees (B)).
2NF normalization tells us we should create a table for all the categorical columns.
But that may feel like overkill: instead of simplifying the logical structure of the data, it might add too much complexity without clear gain. (But we will still do it anyway.)
On the contrary, keeping the categories in the tree table is a form of denormalization.
You might consider normalizing a categorical attribute (column) into a separate table if:
In practice the design of the database can evolve:
A relation R is in 3NF if and only if both of the following conditions hold: R is in 2NF, and no non-prime attribute of R is transitively dependent on the primary key.
A transitive dependency occurs when a non-prime attribute (an attribute that is not part of any key) depends on another non-prime attribute, rather than depending directly on the primary key.
where:
It’s becoming a bit abstract 🥱😴
For the statisticians in the room, it’s a bit like confounders.
So basically, in a table you would have 2 columns that are not keys and that depend on each other.
Student(StudentID, Name, CourseID, CourseName, InstructorName)
This relation violates 3NF because:
- CourseID → CourseName (CourseName depends on CourseID, not on StudentID)
- CourseID → InstructorName (InstructorName depends on CourseID, not on StudentID)

Here, we have transitive dependencies:

- StudentID → CourseID → CourseName
- StudentID → CourseID → InstructorName
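A sketch of the 3NF fix, under the simplifying assumption that a student takes a single course: the course facts move into their own table, keyed by CourseID.

```sql
CREATE TABLE courses (
    course_id       SERIAL PRIMARY KEY,
    course_name     TEXT,
    instructor_name TEXT
);

-- Every remaining attribute depends directly on student_id
CREATE TABLE students (
    student_id SERIAL PRIMARY KEY,
    name       TEXT,
    course_id  INTEGER REFERENCES courses(course_id)
);
```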
3NF is similar to 2NF. The key differences:

- Nature of the dependency: 2NF is about partial dependencies (a non-key attribute depends on only part of a composite key); 3NF is about transitive dependencies (a non-key attribute depends on another non-key attribute).
- Scope of the problem: 2NF violations only occur with composite primary keys; 3NF violations can occur even when the primary key is a single column.
Denormalization means to choose not to follow a normal form to simplify things.
So, when can or should we denormalize?
> Once you have thoroughly normalized your database, some difficulties may appear, either because the fully normalized schema is too heavy to deal with at the application level without any benefits, or because having a highly normalized schema involves performance penalties that you’ve measured and cannot tolerate.
>
> Fully normalized schemas often have a high number of tables and references in between them. That means lots of foreign key constraints and lots of join operations in all your application queries.
>
> When there’s no way to speed up your application another way, then it is time to denormalize the schema, i.e. make a decision to put your data quality at risk in order to be able to serve your users and business.

Dimitri Fontaine, The Art of PostgreSQL (2022), p. 283
In short: Keep it simple
PostgreSQL offers several data types that can store multiple values in a single column, which breaks the First Normal Form (1NF).
These types are often used for performance optimization, improved query efficiency, or when the data naturally fits a more complex structure.
There are many data types in postgreSQL:
see for instance the long list of available data types when adding a column in pgAdmin

and the documentation on data types
Allows storing multiple values of the same type in a single column.
Example: INTEGER[], TEXT[]
When to use: small lists of values (tags, labels, …) that are read and written as a whole and rarely queried element by element.
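A minimal sketch (hypothetical articles table):

```sql
-- Tags stored as a TEXT[] column
CREATE TABLE articles (
    article_id SERIAL PRIMARY KEY,
    title      TEXT,
    tags       TEXT[]
);

INSERT INTO articles (title, tags)
VALUES ('Intro to SQL', ARRAY['sql', 'beginner']);

-- Find articles carrying a given tag:
SELECT title FROM articles WHERE 'sql' = ANY (tags);
```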
Stores JSON data with indexing and querying capabilities
When to use: semi-structured data whose shape varies from row to row (API payloads, user preferences, event metadata, …).
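A minimal sketch (hypothetical user_prefs table):

```sql
-- Free-form user preferences in a JSONB column
CREATE TABLE user_prefs (
    user_id SERIAL PRIMARY KEY,
    prefs   JSONB
);

INSERT INTO user_prefs (prefs)
VALUES ('{"theme": "dark", "notifications": {"email": true}}');

-- ->> extracts a field as text; @> (containment) can be served by a GIN index
SELECT user_id
FROM user_prefs
WHERE prefs ->> 'theme' = 'dark'
  AND prefs @> '{"notifications": {"email": true}}';
```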
also: HSTORE, RANGE, …
User-defined type that combines multiple fields into a single column.
Example: CREATE TYPE address AS ( street TEXT, city TEXT, zip TEXT );
When to use: a small group of fields that always travel together (an address, a money amount with its currency, …).
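A minimal usage sketch for the address type above (hypothetical customers table):

```sql
-- The address type from the example above
CREATE TYPE address AS (street TEXT, city TEXT, zip TEXT);

CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    name        TEXT,
    home        address
);

INSERT INTO customers (name, home)
VALUES ('Ada', ROW('1 Main St', 'Paris', '75001')::address);

-- Access a field with the (column).field syntax:
SELECT name FROM customers WHERE (home).city = 'Paris';
```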