Postgresql database normalization lab on trees dataset

Normalization lab on trees

Normalize treesdb

Context

In this document we are working with the treesdb_v02 database.

The file is available here treesdb_02.sql

database	version
treesdb_v01	raw data from the original csv, one table `trees`
treesdb_v02	improved `trees` table with primary key, geopoints and identified height and circumference anomalies
treesdb_v03	normalized database

the file was created with

pg_dump --file "treesdb_v02.sql"
--host "localhost"
--port "5432" --username "alexis" --no-password --format=p --encoding "UTF8" --section=pre-data --section=data --section=post-data --no-owner --no-privileges --create --clean --if-exists --verbose --table "public.trees" "treesdb_v02"

Download the file to your local machine

Then

restore with psql (pgadmin and terminal)

either connect to the default postgres database in the terminal

or in pgadmin, connect to the server and click on the tools tab, select PSQL

then in the PSQL window

\i <path to your file>/treesdb_v02.sql

On windows whatch out for the backslash, double quotes etc

This file will create the database treesdb_v02 and the trees table

Verify

Test that the data was uploaded. Execute the following query :

select * from trees order by random() limit 1;

This should return the number of tree records in the trees table.

check the trees table structure

\d trees

Goal

The goal of this exercise is to transform the flat database , one table, trees into a fully normalized database by applying 1NF, 2NF and 3NF forms.

For each entity that needs a dedicated table, the process is:

create a new table with a primary key
inserts values from the trees column
add a foreign key to the trees table
delete the original column

The entity can be composed of multiple original columns.

For instance, it makes sense to regroup address, arrondissement, geolocation and suppl_address in a location table.

Example with `name`

The name column in the trees table is a categorical column.

the most frequent tree name is Platane with 42841 occurences followed by 24945 marronniers and 22317 tilleuls.
for a total of 214 different names.
28 names correspond to one tree. Deletion anomaly
186 names occur more than once. Update anomaly
1339 trees have a NULL value for name)

The column name is a good candidate for normalization.

First create a new table called tree_names. Very simple, it has one primary key and only one column name

create table tree_names (
    id serial primary key,
    name varchar unique not null
);

Note the unique and not null constraints. There is no reason to have duplicates or null values in this table.

Having null values in the tree_names table would defeat the purpose of normalization.

Then inserts values from the trees column

INSERT INTO tree_names (name)
SELECT DISTINCT t.name
FROM trees t
WHERE t.name IS NOT NULL;

note that the insert query gets the values from another query

create a foreign key in the trees table: name_id

ALTER TABLE trees ADD COLUMN name_id INTEGER;

then update the foreign key column with the correct IDs: trees.name_id <-> tree_names.id matching on both tables name columns.

UPDATE trees t
SET name_id = tn.id
FROM tree_names tn
WHERE t.name = tn.name;

finally add the foreign key constraint

ALTER TABLE trees
ADD CONSTRAINT fk_tree_names
FOREIGN KEY (name_id)
REFERENCES tree_names(id);

check that all the names in the trees table are present in the tree_names table

The query should return 0 rows

select count(*)
from trees t
join tree_names tn on tn.id = t.name_id
where t.name != tn.name;

finally, delete the name column in the main trees table

ALTER TABLE trees
DROP COLUMN name;

variations

Notice that the names in the new tree_names table follow no particular order.

To keep things really tidy, it would be great to have the name column be ordered alphabetically

To do so we must modify the insert query:

INSERT INTO tree_names (name)
SELECT DISTINCT name
FROM trees
WHERE name IS NOT NULL;

And use instead:

INSERT INTO tree_names (name)
SELECT DISTINCT name
FROM trees t
WHERE t.name IS NOT NULL
ORDER BY t.name asc;

Exercice: Insert the names ordered by their frequency of occurences in the trees table so that most frequent names are at the top of the tree_names table.

INSERT INTO tree_names (name)
SELECT name
FROM trees t
WHERE t.name IS NOT NULL
GROUP BY name
ORDER BY COUNT(*) DESC;

Query the names

Now to get the name of a tree we need to JOIN the tables trees and tree_names.

select t.*, tn.*
from trees t
join tree_names tn on tn.id = t.name_id
where t.id = 888;

Though a little bit more complex than a simple query, we now can

insert new names (for instance for future trees) simply by adding a new row in the tree_names table,
update a particular name without updating the trees table,
and even delete a tree with a unique name without making the name disappear.

Identify the logical entities

Which logical entities do we need to extract and regroup in a dedicated table?

localisation
naming, species, genre, variety ?
dimensions
what about idbase, remarkable, and outlier ? can these stay in the trees table ?
does arrondissement need its own table ?

Let’s look at the columns and group them by logical entities

Ids

We can keep the ids in the trees table:

id
id_location
idbase

Same with the flags remarkable and outlier

Domain & Stage

Let’s start with the columns domain & stage.

Both columns are categorical columns, show insertion, update and deletion anomalies an have null values. So they should each be given their dedicated table:

Addresses and locations

It makes sense to regroup all the columns associated with the localization of the tree.

address
suppl_address
arrondissement
geolocation

Note the weird format of the arrondissement, the lack of postal code etc … These addresses are not standard.

Reconciling keys

One step in the normalization process consists in reconciling the keys by matching the value of the fields in the main and secondary tables. For the locations table we could assume that each trees has a unique geolocation. And we would be wrong!

There are 12 geolocation duplicates in the trees table.

SELECT COUNT(*) as tree_count, geolocation::text
FROM locations
GROUP BY geolocation::text
HAVING COUNT(*) > 1
ORDER BY tree_count DESC;

A quick check will show that geolocation duplicates all have the same address. So we can delete these duplicates and fall back to reconciling the keys based on the geolocation points.

The sql to delete the geolocation duplicates is more complex.

WITH numbered_duplicates AS (
    SELECT id, geolocation,
           ROW_NUMBER() OVER (PARTITION BY geolocation::text ORDER BY id) as row_num
    FROM locations
    WHERE geolocation::text IN (
        SELECT geolocation::text
        FROM locations
        GROUP BY geolocation::text
        HAVING COUNT(*) > 1
    )
)
DELETE FROM locations
WHERE id IN (
    SELECT id
    FROM numbered_duplicates
    WHERE row_num > 1
);

Note: Another way to avoid duplicates is to cast geolocation as text and use insert from select distinct in the insert query and then to recast geolocation as point. That may be simpler.

the BAN

There is a national database of addresses called BAN (Base Nationale d’Adresse) available at adresse.data.gouv.fr/base-adresse-nationale

It has an api and you can find an address given location coordinates

curl “https://api-adresse.data.gouv.fr/reverse/?lon=2.323473607304248&lat=48.88237276393427&limit=1”

returns

{
    "type":"FeatureCollection",
    "version":"draft",
    "features":[
        {
        "type":"Feature",
        "geometry":{"type":"Point","coordinates":[2.323387,48.882396]},
        "properties":{
            "label":"25 Rue Clapeyron 75008 Paris",
            "score":0.9999999813107335,
            "housenumber":"25",
            "id":"75108_2067_00025",
            "name":"25 Rue Clapeyron",
            "postcode":"75008",
            "citycode":"75108",
            "x":650379.56,
            "y":6864921.3,
            "city":"Paris".....

We could sanitize the addresses by feeding the long, lat into the API, and adding the properly formatted address to the location of the trees. But that’s out of the scope of the exercise

Taxonomy

Taxonomy includes everythign related to the nature of the tree. In our case it englobes the columns : name, genre, species and variety

Note that there’s is an intrinsic relation between the name, genre, species and variety of a tree. Not sure which one. but this hierarchy probably should be kept.

So, our options are:

one table per column (2NF) (but we loses the relation between the different categories)

one taxonomy table (keeps the relations between the different categories but will require having NULL values)
and one table per column: name, genre, species, variety

We will implement the second choice to conserve the relation between the trees name, genre etc ….

Measurements

Should the tree measurements be kept in the trees table or should they have their own separate table ?

circumference
height
diameter

To answer consider these questions

Functional dependency: Do these dimensions depend solely on the tree’s ID, or could they change independently?
Update anomalies: Would keeping them in the main table cause update anomalies?
Data redundancy: Is there any repetition of this data across multiple trees?
Query patterns: How frequently are these dimensions accessed along with other tree data?

One piece of data is painfully missing from the trees table … we don’t know when each tree was last measured, there is no survey date.

If a survey date was available then we could record multiple measures for a given tree. It would definitely make sense to give the measurements a separated dedicated table.

However since there is no date for each measurements, each tree has one unique set of measures that is not going to change often (or so we guess). So let’s keep the measurements in the trees table to keep things simple.

Final structure

The target structure is

column	table
id	trees
id_location	trees
idbase	trees
remarkable	trees
anomaly	trees
height	trees
circumference	trees
diameter	trees

column	table
address	locations
suppl_address	locations
arrondissement	locations
geolocation	locations

column	table
domain	tree_domains
stage	tree_stages

column	table
name	tree_names
genre	tree_genres
species	tree_species
variety	tree_varieties

We also create the taxonomy table to link the tree to its name, genre, species and variety

Note

In the example given above we normalized the name column.

In the target schema, the tree_names table is linked to the taxonomy table not directly to the trees table.

To avoid confusion you should restart from scratch and restore the treesdb_v02 database.