PostgreSQL database normalization

Intermediate databases - Epita Fall 2025

Database design: normalization

Goals of a good data structure design

and most importantly:

as we will see, it always comes down to balancing read and write performance.

So the question is: how to design the data structure of a database?

Scope

We start with Entity-Relationship Diagrams: ERDs

and then:

Practice: we normalize the treesdb database


Entity-Relationship Diagrams

Peter Chen developed the ER diagram in 1976.

Chen’s main contributions are formalizing the concepts, developing a theory with a set of data definition and manipulation operations, and specifying the translation rules from the ER model to several major types of databases (including the relational database).

1976

That’s what an Apple computer looked like in 1976 🤪🤪🤪 (the Mac itself would only arrive in 1984)

The ER model was created to visualize data structures and relationships in many situations.

ER diagrams help in system design, information architecture, and application workflows.

The components of an ER diagram are: entities, attributes, and relationships.

Check out this article for a complete explanation of ER diagrams: Introduction of ER model

As you can see there are many types of entities and attributes: strong, weak, key, composite, etc.

Read also the Wikipedia page.

ER diagram for relational databases

In the context of relational databases, the ER diagram represents the structure of the database.

The ER diagram displays the relations between the entities (tables) present in the database and lists their attributes (columns).

Generate an ERD in pgAdmin

You can change the notation used for the relation type.


2 main types of databases: OLAP vs OLTP

The end usage of a database drives its data structure.

We consider two types of relational databases: analytical databases (OLAP) vs transactional databases (OLTP)

OLAP: Online Analytical Processing

OLAP runs large analytical queries (aggregations, reports) over lots of historical data; it does not have to be super fast.

OLTP: Online Transaction Processing

OLTP handles many small concurrent transactions (inserts, updates, lookups); it has to be super fast.

Further reading: difference between olap and oltp in dbms.

Look at the table of differences and the Q&A at the end of the article.

Quiz

For each scenario, determine whether it’s more suited for an OLTP (Online Transaction Processing) or OLAP (Online Analytical Processing) system.

Solution


An account with multiple phones

Consider the 2 database designs:

Which design (1 table or 2 tables) is better in terms of faster or simpler queries? See the sketch below.
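A minimal sketch of the two designs (table and column names are illustrative, not the exact course schema):

-- Design 1: one table, one column per kind of phone
CREATE TABLE accounts_v1 (
    account_id SERIAL PRIMARY KEY,
    name       TEXT NOT NULL,
    home_phone TEXT,
    work_phone TEXT,
    cell_phone TEXT
);

-- Design 2: two tables, one row per phone
CREATE TABLE accounts_v2 (
    account_id SERIAL PRIMARY KEY,
    name       TEXT NOT NULL
);

CREATE TABLE phones (
    phone_id   SERIAL PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES accounts_v2 (account_id),
    phone_type TEXT,
    number     TEXT NOT NULL
);

Reading all of an account's numbers is a single-row read in design 1 but needs a JOIN in design 2; adding a new kind of phone needs an ALTER TABLE in design 1 but a simple INSERT in design 2.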


Normalization

The general goal of normalization is to reduce data redundancy and dependency by organizing data into separate, related tables.

This helps maintain data integrity and flexibility:

Normalized databases are easier to keep consistent and to evolve.

More formally, a database is normalized if:

all column values depend only on the table primary key.

In the 1-table design for the account and its phone numbers, a phone number value depends on the name of the phone column (home_phone, work_phone, …), not just on the account_id key: we can say it’s not normalized.

With a table dedicated to the phones (the design with 2 tables), the phone value depends only on the phone_id key: the tables are normalized.


Denormalization

The idea of denormalization is to have data redundancy to simplify queries and make OLAP queries faster.

Redundant data: the same data / info exists in multiple tables.

SELECT queries involve fewer JOINs.

However, INSERT, UPDATE, and DELETE queries are more complex, as multiple tables must be kept in sync. Therefore data integrity is harder to preserve.

Scenario:

In a social network, imagine that you have two tables:

  1. The Users table: contains user information like user_id, user_name and email.

user_id  user_name  email
101      Ulaf       [email protected]
102      Birgitte   [email protected]
103      Inge       [email protected]
114      Boris      [email protected]
  2. The Posts table: contains posts made by users, with fields like post_id, content, publication_date, … and the post’s author information through the user_id.

post_id  user_id  content                               pub_date
1        101      I ❤️ postgreSQL                       2022-10-01
2        103      Singing in the rain                   2022-10-02
3        102      Nerds unite!                          2022-10-02
4        114      Guys? i’m bored 😒 when’s the break?  2022-10-02
5        103      Taylor swift omg! #sweet              2022-10-03

In a normalized database: users and their posts are only linked by the user_id, which is a foreign key in the posts table.

The users table has no information about posts. And similarly the posts table is all about the posts and not their author (besides the user_id foreign key).

So when you need to display the user’s name next to their post, you need to JOIN Users and Posts tables.

SELECT p.*, u.user_name
FROM posts p
JOIN users u ON p.user_id = u.user_id
WHERE p.pub_date = '2022-10-02';

In order to improve performance, you can denormalize by adding the user_name column to the Posts table.

A Denormalized Posts table would then look like:

post_id  user_id  user_name  content                               pub_date
1        101      Ulaf       I ❤️ postgreSQL                       2022-10-01
2        103      Inge       Singing in the rain                   2022-10-02
3        102      Birgitte   Nerds unite!                          2022-10-02
4        114      Boris      Guys? i’m bored 😒 when’s the break?  2022-10-02
5        103      Inge       Taylor swift omg! #sweet              2022-10-03

SELECT p.*
FROM posts p
WHERE p.pub_date = '2022-10-02';

Faster read performance since you can fetch the user_name along with the post data without needing to perform a join between the Users and Posts tables.

But

Data redundancy: If Ulaf changes his name, you will need to update it in both the Users table and every row in the Posts table that references him. This increases the complexity of updates.
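For instance, renaming Ulaf in the denormalized design takes two statements instead of one (a sketch using the illustrative tables above):

BEGIN;
-- fix the reference data
UPDATE users SET user_name = 'Ulaf the Great' WHERE user_id = 101;
-- and every denormalized copy of it in the posts
UPDATE posts SET user_name = 'Ulaf the Great' WHERE user_id = 101;
COMMIT;

In the normalized design, only the first UPDATE is needed.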


Anomalies

So, given a database, how do you know if it needs to be normalized?

There are 3 types of anomalies that you can look for: insertion, update, and deletion anomalies.

A normalized database will solve these anomalies.


Insertion anomalies

Consider the table of teachers and their courses

id  teacher  course
1   Bjorn    Intermediate Database
2   Sofia    Crypto Currencies
3   Hussein  Data Analytics

Now a new teacher, Alexis, is recruited by the school but does not have a course assigned yet. If we want to insert the teacher in the table we need to put a NULL value in the course column.

and NULL values can cause problems!!! (more on that later)
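In SQL terms (assuming the table above is called teachers_courses):

-- Alexis has no course yet: we are forced to insert a NULL
INSERT INTO teachers_courses (id, teacher, course)
VALUES (4, 'Alexis', NULL);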

More generally:


Update anomaly

Consider now the following movies, with their directors and production houses

id  Film                         Director            Production House
1   Sholay                       Ramesh Sippy        Sippy Films
2   Dilwale Dulhania Le Jayenge  Aditya Chopra       Yash Raj Films
3   Kabhi Khushi Kabhie Gham     Karan Johar         Dharma Productions
4   Kuch Kuch Hota Hai           Karan Johar         Dharma Productions
5   Lagaan                       Ashutosh Gowariker  Aamir Khan Productions
6   Dangal                       Nitesh Tiwari       Aamir Khan Productions
7   Bajrangi Bhaijaan            Kabir Khan          Salman Khan Films
8   My Name Is Khan              Karan Johar         Dharma Productions
9   Gully Boy                    Zoya Akhtar         Excel Entertainment
10  Zindagi Na Milegi Dobara     Zoya Akhtar         Excel Entertainment

(chatGPT knows Bollywood 😄)

If one of the production houses changes its name, we need to update multiple rows.

In a small example like this one, this is not a problem since the query is trivial, but in large databases some rows may end up not being updated.

By moving the production house data into its own separate table, updating its name would only impact one row!
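A sketch of both situations (table and column names are illustrative):

-- Denormalized: the new name must be written on every matching row
UPDATE films
SET production_house = 'Dharma Productions International'
WHERE production_house = 'Dharma Productions';

-- Normalized: films reference a production_houses table by id,
-- so the rename touches exactly one row
UPDATE production_houses
SET name = 'Dharma Productions International'
WHERE production_house_id = 3;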


Deletion anomalies

Here is a table about programmers, languages and (secret) projects

Developer_Name  Language    Project_Name
Lisa            Python      Atomic Blaster
Bart            Java        Invisible tank
Montgomery      Python      Atomic Blaster
Maggie          JavaScript  Exploding Squirrel
Ned             C++         War AI
Homer           Python      Atomic Blaster
Marge           Ruby        War AI

Let’s say that for imperative reasons we need to cover up the existence of the War AI project, so we delete all rows that mention that project in the database. An unfortunate consequence is that we also delete all mentions of Marge and Ned, since they are not involved in other projects.

So by deleting one attribute value we also remove certain values of other attributes.
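In SQL (assuming the table above is called assignments):

-- covering up the project...
DELETE FROM assignments
WHERE project_name = 'War AI';
-- ...also erases the only rows that mention Ned and Marge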

In short

When you have different but related natural / logical entities packed together in the same table, this causes anomalies and you need to normalize.


The problem with null values

Why is it a problem to have NULL values in a column?

NULL is the absence of a value: comparisons with NULL return NULL rather than true or false, count(column) silently skips NULLs, and predicates like != or NOT IN can drop rows you expected to see.

So avoid NULL values if you can!
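A few illustrations of NULL's three-valued logic (standard PostgreSQL behavior, using the teachers_courses sketch from above):

SELECT NULL = NULL;                          -- returns NULL, not true
SELECT count(course) FROM teachers_courses;  -- rows with a NULL course are not counted
SELECT teacher FROM teachers_courses
WHERE course != 'Crypto Currencies';         -- Alexis (course IS NULL) is silently excluded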


What about anomalies in the trees table?

What anomalies can you find for each type: insertion, update and deletion?

Insertion Anomalies:

Update Anomalies:

Deletion Anomalies:


In short

There’s a need for normalization when the same information is repeated across many rows, or when different logical entities are packed together in one table.

or simply said:

every non-key attribute must provide a fact about the key, the whole key, and nothing but the key


Normal forms

A normal form is a rule that defines a level of normalization.

There are multiple levels: 1NF, 2NF, 3NF, and higher forms (BCNF, 4NF, 5NF, …).

In general a database is considered normalized if it meets the 3NF level.

It gets very abstract very quickly.

Normal forms are a gradual, step-by-step process towards normalization. Each form is a guide that focuses on a single type of problem.

We can always choose to apply a normalization form or to ignore it.

In the following, we mention: 1NF, 2NF, and 3NF.

1NF

Each field contains a single value

A relation is in first normal form

if and only if

no attribute domain has relations as elements.

which translates to: no column has sets of values (relations) as elements

In short: columns cannot contain composite values: arrays, JSON, …

More on 1NF:

Does the tree table follow 1NF?

Wait, … what?

There’s a contradiction between first normal form (1NF) and the existence of ARRAY and JSONB data types (see also POINT, HSTORE, range types, composite types, and many others) in SQL databases.

The rule of thumb is:

When you frequently need to access, modify, or handle individual elements of a record's set of values, then apply 1NF.

For instance, take a collaborative blog post: we start with 1 author (varchar), then a co-author comes along (array), then a third and a fourth. Then the first one gives up and does not want to be in the list of authors anymore, and so on. In that case, using an array to store the list of authors causes more trouble than it solves, and authors should have their own table, as sketched below.
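Both options, as a sketch (illustrative names):

-- Option 1 (denormalized): authors stored as an array on the post
CREATE TABLE posts_v1 (
    post_id SERIAL PRIMARY KEY,
    title   TEXT NOT NULL,
    authors TEXT[]   -- fine while the author list rarely changes
);

-- Option 2 (1NF): authors get their own tables
CREATE TABLE posts_v2 (
    post_id SERIAL PRIMARY KEY,
    title   TEXT NOT NULL
);

CREATE TABLE authors (
    author_id SERIAL PRIMARY KEY,
    name      TEXT NOT NULL
);

CREATE TABLE post_authors (
    post_id   INTEGER REFERENCES posts_v2 (post_id),
    author_id INTEGER REFERENCES authors (author_id),
    PRIMARY KEY (post_id, author_id)
);

Dropping the first author is array surgery in option 1, but a one-row DELETE in option 2.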

In the end it’s a balance between simplicity and ease of control.



2NF

The table is in 2NF iff: it is in 1NF, and every non-key attribute depends on the whole of every candidate key (no partial dependencies).

Some cases of non-compliance with 2NF

The 1-table version of the account phones design was not in 2NF.

see:

2NF violation in the trees table?

The tree table has many 2NF violations, since all the categorical columns are in a one-to-many relation (a category (A) has many trees (B)).

2NF normalization tells us we should create a separate table for each of the categorical columns.

But that may feel like overkill: instead of simplifying the logical structure of the data it might add too much complexity without clear gain (but we will still do it anyway).

On the contrary, keeping the categories in the tree table is a form of denormalization.
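For a single categorical column, the extraction could look like this (stage is a hypothetical column name; the actual treesdb columns may differ):

-- extract the distinct categories into their own table
CREATE TABLE tree_stages (
    stage_id SERIAL PRIMARY KEY,
    name     TEXT UNIQUE NOT NULL
);

INSERT INTO tree_stages (name)
SELECT DISTINCT stage FROM trees WHERE stage IS NOT NULL;

-- reference the category by id instead of repeating its label
ALTER TABLE trees ADD COLUMN stage_id INTEGER REFERENCES tree_stages (stage_id);

UPDATE trees t
SET stage_id = s.stage_id
FROM tree_stages s
WHERE s.name = t.stage;

ALTER TABLE trees DROP COLUMN stage;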

When to apply 2NF to a categorical attribute?

You might consider normalizing a categorical attribute (column) into a separate table if:

In practice the design of the database can evolve:


3NF

A relation R is in 3NF if and only if both of the following conditions hold:

R is in 2NF, and no non-prime attribute of R is transitively dependent on a key.

A transitive dependency occurs when a non-prime attribute (an attribute that is not part of any key) depends on another non-prime attribute, rather than depending directly on the primary key.

It’s becoming a bit abstract 🥱😴

For the statisticians in the room, it’s a bit like confounders.

So basically, in a table you would have 2 non-key columns that depend on each other.

Example

Student(StudentID, Name, CourseID, CourseName, InstructorName)

This relation violates 3NF because: CourseName and InstructorName are non-prime attributes that depend on CourseID, which is itself a non-prime attribute, rather than depending directly on the key StudentID.

Here, we have transitive dependencies: StudentID → CourseID → CourseName, and StudentID → CourseID → InstructorName.
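Splitting the relation removes the transitive dependency (a sketch):

-- attributes that depend on CourseID move to their own table
CREATE TABLE courses (
    course_id       INTEGER PRIMARY KEY,
    course_name     TEXT NOT NULL,
    instructor_name TEXT
);

-- the student only keeps the foreign key
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    course_id  INTEGER REFERENCES courses (course_id)
);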

3NF on Wikipedia

Difference between 2NF and 3NF

3NF is similar to 2NF.

The key difference: 2NF removes partial dependencies (a non-key attribute depending on only part of a composite key), while 3NF removes transitive dependencies (a non-key attribute depending on another non-key attribute).


Denormalize

Denormalization means choosing not to follow a normal form in order to simplify things.

So, when can we denormalize instead?

Once you have thoroughly normalized your database, some difficulties may appear, either because the fully normalized schema is too heavy to deal with at the application level without any benefits, or because having a highly normalized schema involves performances penalties that you’ve measured and cannot tolerate.

Fully normalized schemas often have a high number of tables and references in between them. That means lots of foreign key constraints and lots of join operations in all your application queries.

When there’s no way to speed-up your application another way, then it is time to denormalize the schema, i.e. make a decision to put your data quality at risk in order to be able to serve your users and business.

Dimitri Fontaine, The Art of PostgreSQL (2022), p. 283

In short: Keep it simple


PostgreSQL multi-valued data types

PostgreSQL offers several data types that can store multiple values in a single column, which breaks the First Normal Form (1NF).

These types are often used for performance optimization, improved query efficiency, or when the data naturally fits a more complex structure.

There are many data types in PostgreSQL.

see for instance the long list of available data types when adding a column in pgAdmin

and the documentation on data types


Array Types:

Allows storing multiple values of the same type in a single column.

Example: INTEGER[], TEXT[]

When to use: when the elements are always read and written together, and you rarely need to query or update one element independently.
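A quick sketch (illustrative names):

CREATE TABLE articles (
    article_id SERIAL PRIMARY KEY,
    title      TEXT NOT NULL,
    tags       TEXT[]
);

INSERT INTO articles (title, tags)
VALUES ('Normal forms', ARRAY['sql', 'design']);

-- find articles carrying a given tag
SELECT title FROM articles WHERE 'sql' = ANY (tags);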

JSONB (and JSON):

Stores JSON data with indexing and querying capabilities

When to use: for semi-structured data whose shape varies from row to row, or attributes that don't warrant their own columns.
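A quick sketch (illustrative names; @> is the JSONB containment operator and GIN is its usual index type):

CREATE TABLE events (
    event_id SERIAL PRIMARY KEY,
    payload  JSONB
);

INSERT INTO events (payload)
VALUES ('{"type": "click", "page": "/home"}');

-- index that speeds up containment queries
CREATE INDEX events_payload_idx ON events USING GIN (payload);

SELECT * FROM events WHERE payload @> '{"type": "click"}';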

also: HSTORE, RANGE, …

Composite types

User-defined type that combines multiple fields into a single column.

Example: CREATE TYPE address AS ( street TEXT, city TEXT, zip TEXT );

When to use: when a small group of fields always travels together and has no meaning on its own.
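Using the address type defined above (a sketch; illustrative table name):

CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    addr        address
);

INSERT INTO customers (addr)
VALUES (ROW('12 rue Voltaire', 'Paris', '75011'));

-- parentheses are required when selecting a field of a composite column
SELECT (addr).city FROM customers;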

Next

Let’s apply the normalization:

  1. on the worlddb song table
  2. then on the trees table