Database design


normalization
OLAP vs OLTP

---

Expectations from a database


Database Design Goals

## Data Integrity
  • Entity integrity (primary keys)
  • Referential integrity (foreign keys)
  • Domain integrity (valid data types/constraints)

Minimize Redundancy

  • Eliminate duplicate data
  • Single source of truth
  • Reduce update anomalies
## Performance & Efficiency
  • Optimal query performance
  • Efficient storage utilization
  • Appropriate indexing

Scalability & Flexibility

  • Handle growing data volumes
  • Adapt to changing requirements
  • Support future enhancements

Simplicity

Simplicity: Create an understandable structure for developers and analysts

So the question is:

**How to design the data structure of a database ?**

Spoiler alert: it always comes down to balancing between read and write performance

Scope

We start with Entity-Relationship Diagrams: ERDs

and then:


Entity Relation Diagrams

Peter Chen developed the ER diagram in 1976.

Chen's main contributions are formalizing the concepts, developing a theory with a set of data definition and manipulation operations, and specifying the translation rules from the ER model to several major types of databases including the Relational Database.

1976 !? That's what the Mac looked like in 1976 πŸ€ͺπŸ€ͺπŸ€ͺ

The ER model was created to visualize data structures and relationships in many situations.

ER diagrams help in system design, information architecture, and application workflows.



Components of an ER diagram

[Introduction of ER model](https://www.geeksforgeeks.org/introduction-of-er-model/)

ER diagram rules

English grammar structureER structure
Common nounEntity type
Proper nounEntity
Transitive verbRelationship type
Intransitive verbAttribute type
AdjectiveAttribute for entity
AdverbAttribute for relationship

see the wikipedia page


ER Grammar Mapping: Songs & Playlists Example

  1. Common Noun β†’ Entity Type
  1. Proper Noun β†’ Entity (Instance)
  1. Transitive Verb β†’ Relationship Type (Verbs that need an object)
  1. Intransitive Verb β†’ Attribute Type (Verbs that don't need an object - describe state/action)
  1. Adjective β†’ Attribute for Entity (Describes characteristics of nouns/entities)
  1. Adverb β†’ Attribute for Relationship (Describes HOW the relationship occurs)

Complete Example Sentences Mapped to ER:

"The verified artist Queen performs the popular song Bohemian Rhapsody"

"John recently created a public playlist called Summer Vibes"

"The playlist frequently contains long songs"


Visual ER Representation:

[User] ----creates----> [Playlist] ----contains----> [Song]
  |        (when?)         |          (position?)       |
  |                        |                            |
  name                  is_public                   duration
  email                 name                        title
                        created_date                popularity

                                                    [Artist]
                                                        |
                                                    performs
                                                        |
                                                    [Song]

ER diagram for relational databases

In the context of relational databases, the ER Diagram represent the structure of the database.

The ER diagram displays the relations between the entities (tables) present in the database and lists their attributes (columns).


Generate an ERD in pgAdmin

  • connect to the remote server
  • right click on a database name
  • click on ERD for database

You can change notation for the relation type with


ERD - IMDB normalized


Database types: OLAP vs OLTP

--- # 2 main types of databases: OLAP vs OLTP

The usage of a database drives its data structure.


OLAP : Online Analytical Processing

OLAP does not have to be super fast

example:


OLTP: Online Transaction Processing,

OLTP has to be super fast

example:


Further reading : difference between olap and oltp in dbms.

Look at the table of differences and the Q&A at the end of the article.


Quiz

For each scenario, determine whether it's more suited for an OLTP (Online Transaction Processing) or OLAP (Online Analytical Processing) system.


Solution


Database design

---

A user can have multiple phones. Consider the 2 database designs:

**1 table** with multiple phone columns
**2 tables**

which design (1 or 2 tables) is better in terms of faster or simpler query for:


Normalization

The general goal of normalization is to reduce data redundancy and dependency by organizing data into separate, related tables.

A database is normalized if:

Each piece of information lives in exactly one place

or more formally

**all column values depend only on the table primary key**

This helps maintain data integrity and flexibility:

Normalized databases are

In the 1 table design for the account and its phone numbers, a phone number value depends on the name of the phone column (home_phone, work_phone, ...) not just the account_id key: We can say it's not normalized

With a table dedicated to the phones, (design with 2 tables), the phone value depends only on the phone_id key: the tables are normalized.

**1 table** with multiple phone columns
**2 tables**

Denormalization

The idea of denormalization is to create data redundancy to simplify queries and make OLAP queries faster.

Redundant data : the same data / info exists in multiple tables

SELECT queries involve fewer JOINs.

However INSERT, UPDATE, DELETE queries are more complex as multiple tables must be accounted for. Therefore data integrity is more complex to preserve.


Scenario:

In a social network, imagine that you have two tables:

  1. The Users table: Contains user information like user_id name and email.
user_iduser_nameemail
101Ulaf[email protected]
102Birgitte[email protected]
103Inge[email protected]
114Boris[email protected]
  1. The Posts table: Contains posts made by users, with fields like post_id content, publication_date ... and the post's author information through the user_id.
post_iduser_idcontentpub_date
1101I ❀️ postgreSQL2022-10-01
2103Singing in the rain2022-10-02
3102Nerds unite!2022-10-02
4114Guys? i'm bored πŸ˜’ when's the break?2022-10-02
5103Taylor swift omg! #sweet2022-10-03

In a normalized database: users and their posts are only linked by the user_id which is a foreign key in the posts table.

The users table has no information about posts. And similarly the posts table is all about the posts and not their author (besides the user_id foreign key).

So when you need to display the user's name next to their post, you need to JOIN Users and Posts tables.

SELECT p.*, u.name
FROM POSTS p
JOIN users u on p.user_id = u.id
WHERE p.pub_date = '2022-10-02';

Too many JOINs over large volumes of data will slow down the query.


Denormalization

In order to improve performance, you can denormalize the posts table by adding the user_name to the Posts table.

A Denormalized Posts table would then look like:

post_iduser_iduser_namecontentpub_date
1101UlafI ❀️ postgreSQL2022-10-01
2103IngeSinging in the rain2022-10-02
3102BirgitteNerds unite!2022-10-02
4114BorisGuys? i'm bored πŸ˜’ when's the break?2022-10-02
5103IngeTaylor swift omg! #sweet2022-10-03
SELECT p.*
FROM POSTS p
WHERE p.pub_date = '2022-10-02';

Faster read performance since you can fetch the user_name along with the post data without needing to perform a join between the Users and Posts tables.

But

Data redundancy: If Ulaf changes his name, you will need to update it in both the Users table and every row in the Posts table that references him. This increases the complexity of updates.


Anomalies

---

Anomalies

So, given a database, how do you know if it needs to be normalized?

There are 3 types of anomalies that you can look for:

A normalized database will solve these anomalies.


Insertion anomalies

Consider the table of teachers and their courses

idteachercourse
1BjornIntermediate Database
2SofiaCrypto Currencies
3HusseinData Analytics

Now a new teacher (Alexis), is recruited by the school. The teacher does not have a course assigned to yet. If we want to insert the teacher in the table we need to put a NULL value in the course column.

and NULL values can cause problems!!! (more on that later)

More generally:


Update anomaly

Consider now the following movies, directors, year and production house

idFilmDirectorProduction House
1SholayRamesh SippySippy Films
2Dilwale Dulhania Le JayengeAditya ChopraYash Raj Films
3Kabhi Khushi Kabhie GhamKaran JoharDharma Productions
4Kuch Kuch Hota HaiKaran JoharDharma Productions
5LagaanAshutosh GowarikerAamir Khan Productions

(chatGPT knows Bollywood πŸ˜„) + (I ❀️ Shahrukh Khan)

If one of the production house changes its name, we need to update multiple rows.

In a small example like this one, this is not a problem since the query is trivial but in large databases some rows may not be updated.

By moving the production house data in its own separate table, then updating its name would only impact one row!


Deletion errors

Here is a table about programmers, languages and (secret) projects

Developer_NameLanguageProject_Name
LisaPythonAtomic Blaster
BartJavaInvisible tank
MontgomeryPythonAtomic Blaster
MaggieJavaScriptExploding Squirrel
NedC++War AI
HomerPythonAtomic Blaster
MargeRubyWar AI

Let's say that for imperative reasons we need to cover up the existence of the War AI project. And we will delete all rows that mention that project in the databse. Then an unfortunate consequence is that we will also also delete all mentions of Marge and Ned since they are not involved in other projects.

So by deleting an attribute we also remove certain valus of other attributes.


In short

When you have different but related natural / logical entities packed together in the same table, this causes anomalies and you need to normalize.


The problem with null values

Why is it a problem to have NULL values in a column?

So avoid NULL values if you can!


Anomalies in the imdb top 1000 table

This is the schema for the imdb top 1000 table

What anomalies can you spot?

CREATE TABLE imdb_raw (
  Poster_Link   TEXT,
  Series_Title  TEXT,
  Released_Year TEXT,
  Certificate   TEXT,
  Runtime       TEXT,
  Genre         TEXT,       -- "drama, sci-fi, action"
  IMDB_Rating   TEXT,
  Overview      TEXT,
  Meta_score    TEXT,
  Director      TEXT,
  Star1         TEXT,
  Star2         TEXT,
  Star3         TEXT,
  Star4         TEXT,
  No_of_Votes   TEXT,
  Gross         TEXT
);

1. [deletion] Losing Director Information

Problem: Deleting a movie shouldn't mean losing knowledge about directors

2. [insertion] Can't Track Actors Without Casting Them

Problem: Actor existence shouldn't depend on being in top 4 billing

3. [update] Updating Actor Information

If "Tom Hanks" appears as:

To fix a typo (e.g., "Tom Hansk"):

4. [update] Certificate/Rating System Changes

If rating system changes (e.g., "PG-13" becomes "T-13"):


What about anomalies in the trees table ?

What anomalies can you find for each type: insertion, update and deletion

Insertion Anomalies:

Update Anomalies:

Deletion Anomalies:


In short

There's a need for normalization when:
  • the same data exists in multiple rows
  • multiple columns with data of same nature, different types
  • same data exists in multiple tables

every non-key attribute
must provide a fact about the key,
the whole key
and nothing but the key


Normal forms

---

Normal forms

A normal form is a rule that defines a level of normalization.

  • UNF: Unnormalized form
  • 1NF: First normal form
  • 2NF: Second normal form
  • 3NF: Third normal form
In general a database is considered normalized if it meets 3NF level.
There are multiple higher levels
  • EKNF: Elementary key normal form
  • BCNF: Boyce–Codd normal form
  • 4NF: Fourth normal form
  • ETNF: Essential tuple normal form
  • 5NF: Fifth normal form
  • DKNF: Domain-key normal form
  • 6NF: Sixth normal form

Normal forms are a gradual step by step process towards normalization. Each form is a guide that focuses on a single type of problem.

We can always choose to apply a normalization form or to ignore it

In the following, we mention:


1NF

Each field contains a single value

A relation is in first normal form
if and only if
no attribute domain has **relations** as elements.

which translates as no column as sets of values (relations) as elements

In short: columns cannot composite values : arrays, json, ...

More on 1NF:

Does the tree table follow 1NF ? what about the imdb table ?


Wait, ... what ?

There's a contradiction between first normal form (1NF) and the existence of ARRAY and JSONB data types (see also POINT, HRANGE, composite types, and many other) in SQL databases.

The rule of thumb is :

When you frequently need to access, modify, handle the elements of the sets of values for a given record then apply 1NF

For instance a collaborative blog post where we start with 1 author (varchar), then a co-author comes along (array), then a third and a fourth. Then the 1st one give up and does not want to be in the list of authors etc etc etc. In that case using an Array to store the list of authors causes more trouble than it solves. and authors should have their own table.

In the end it's a balance between simplicity and ease of control.

For instance:


2NF

The table is in 2NF iff :

Some cases of non-compliance with 2NF

The 1 table version of the account phone table was not in 2NF.

see :


2NF violation in the trees table ?

The tree table has many 2NF violations, since all the categorical columns are into a one to many (category (A) has many trees (B)) relation.

2NF normalization tells us we should create a table for all the categorical columns.

But that may feel overkill and instead of simplifying the logical structure of the data it might add too much complexity it without clear gain. (but we will still do it anyway)

On the contrary, keeping the categories in the tree table is a form of denormalization.


When to apply 2NF to a categorical attribute ?

You might consider normalizing a categorical attribute (column) into a separate table if:

In practice the design of the database can evolve:


3NF

A relation R is in 3NF if and only if both of the following conditions hold:

A transitive dependency occurs when a non-prime attribute (an attribute that is not part of any key) depends on another non-prime attribute, rather than depending directly on the primary key.

where:

It's becoming a bit abstract πŸ₯±πŸ˜΄

For the statisticians in the room, it's a bit like confonders.

So basically, in a table you would have 2 columns that are not keys that sort of depends on each other.


Example

Student(StudentID, Name, CourseID, CourseName, InstructorName)

This relation violates 3NF because:

Here, we have transitive dependencies:

3NF on Wikipedia


Difference between 2NF and 3NF

3NF is similar to 2NF

The key differences:


Denormalize

Denormalization means to choose not to follow a normal form to simplify things.

So, when can or should we denormalize?

Once you have thoroughly normalized your database, some difficulties may appear, either because the fully normalized schema is too heavy to deal with at the application level without any benefits, or because having a highly normalized schema involves performances penalties that you’ve measured and cannot tolerate.

Fully normalized schemas often have a high number of tables and references in between them. That means lots of foreign key constraints and lots of join operations in all your application queries.

When there’s no way to speed-up your application another way, then it is time to denormalize the schema, i.e. make a decision to put your data quality at risk in order to be able to serve your users and business.

Dimitri Fontaine - The Art of PostgreSQL (2022) The Art of PostgreSQL p 283

In short: Keep it simple


PostgreSQL multi-valued data types

PostgreSQL offers several data types that can store multiple values in a single column, which breaks the First Normal Form (1NF).

These types are often used for performance optimization, improved query efficiency, or when the data naturally fits a more complex structure.

There are many data types in postgreSQL

see for instance the long list of availabe data types when adding a column in pgAdmin

and the documentation on data types


Array Types:

Allows storing multiple values of the same type in a single column.

Example: INTEGER[], TEXT[]

When to use:

JSONB (and JSON):

Stores JSON data with indexing and querying capabilities

When to use:

also : HSTORE, RANGE …

Composite types

User-defined type that combines multiple fields into a single column.

Example: CREATE TYPE address AS ( street TEXT, city TEXT, zip TEXT );

When to use:

1 / 0