Postgresql database normalization lab on trees dataset

Normalization lab on trees

Normalize treesdb

Context

In this document we are working with the treesdb_v02 database.

If you haven’t done so already, connect to your local postgreSQL server and restore the file from the github repo

github > SkatAI > epitadb > data > treesdb_v02.01.sql.backup

In the terminal (mac or powershell ), on local

psql with (Mac):

psql -U <localusername> -d postgres

or (Windows):

psql -U postgres -d postgres

create the database treesdb_v03 , (1st delete it if it exists):

DROP database if exists treesdb_v03 WITH(force) ;

CREATE DATABASE treesdb_v03
    WITH
    OWNER = <postgres (win) or your_local_username (mac)>
    ENCODING = 'UTF8'
    LOCALE_PROVIDER = 'libc'
    CONNECTION LIMIT = -1
    IS_TEMPLATE = False;

Then back in the terminal (exit psql with \q)

pg_restore -d treesdb_v03 \
-U postgres \
--no-owner \
--no-privileges \
--no-data-for-failed-tables \
--section=pre-data \
--section=data \
--section=post-data \
--verbose \
--exit-on-error  \
--single-transaction \
<path to your file>/treesdb_v02.01.sql.backup

Nice

You can test that the data was uploaded from the command line in the terminal by using psql to connect execute a query at the same time :

psql  -U postgres -d treesdb_v03 -c "select * from trees order by random() limit 1;"

This should retturn the number of tree records in the trees table.

Goal

The goal of this exercise is to transform the flat, one table, treesdb database into a fully normalized database by applying 1NF, 2NF and 3NF forms.

For each entity that needs a dedicated table, the process is:

  1. create a new table with a primary key
  2. inserts values from the trees column
  3. add a foreign key to the trees table
  4. delete the original column

The entity can be composed of multiple original columns. For instance, it makes sense to regroup address, arrondissement, geolocation and suppl_address in a location table.

Example with name

The name column in the trees table is a categorical column.

The column name is a good candidate for normalization.

create table tree_names (
    id serial primary key,
    name varchar unique not null
);

Note the unique and not null constraints.

Having null values in the tree_names table would defeat the purpose of normalization.

INSERT INTO tree_names (name)
SELECT DISTINCT t.name
FROM trees t
WHERE t.name IS NOT NULL;
ALTER TABLE trees ADD COLUMN name_id INTEGER;
UPDATE trees t
SET name_id = tn.id
FROM tree_names tn
WHERE t.name = tn.name;
ALTER TABLE trees
ADD CONSTRAINT fk_tree_names
FOREIGN KEY (name_id)
REFERENCES tree_names(id);

The query should return 0 rows

select count(*)
from trees t
join tree_names tn on tn.id = t.name_id
where t.name != tn.name;
ALTER TABLE trees
DROP COLUMN name;

variations

Notice that the names in the new tree_names table follow no particular order.

To keep things really tidy, it would be great to have the name column be ordered alphabetically

To do so we must modify the insert query:

INSERT INTO tree_names (name)
SELECT DISTINCT name
FROM trees
WHERE name IS NOT NULL;

And use instead:

INSERT INTO tree_names (name)
SELECT DISTINCT name
FROM trees t
WHERE t.name IS NOT NULL
ORDER BY t.name asc;

Exercice: Insert the names ordered by their frequency of occurences in the trees table so that most frequent names are at the top of the tree_names table.

Query the names

Now to get the name of a tree we need to JOIN the tables trees and tree_names.

select t.*, tn.*
from trees t
join tree_names tn on tn.id = t.name_id
where t.id = 888;

Though a little bit more complex than a simple query, we now can


Identify the logical entities

Which logical entities do we need to extract and regroup in a dedicated table?

Let’s look at the columns and group them by logical entities

Ids

We can keep the ids in the trees table:

Same with the flags remarkable and anomaly

Domain & Stage

Let’s start with the columns domain & stage.

Both columns are categorical columns, show insertion, update and deletion anomalies an have null values. So they should each be given their dedicated table:

Addresses and locations

It makes sense to regroup all the columns associated with the localization of the tree.

Note the weird format of the arrondissement, the lack of postal code etc … These addresses are not standard.

Reconciling keys

One step in the normalization process consists in reconciling the keys by matching the value of the fields in the main and secondary tables. For the locations table we could assume that each trees has a unique geolocation. And we would be wrong!

There are 12 geolocation duplicates in the trees table.

SELECT COUNT(*) as tree_count, geolocation::text
FROM locations
GROUP BY geolocation::text
HAVING COUNT(*) > 1
ORDER BY tree_count DESC;

A quick check will show that geolocation duplicates all have the same address. So we can delete these duplicates and fall back to reconciling the keys based on the geolocation points.

The sql to delete the geolocation duplicates is more complex.

WITH numbered_duplicates AS (
    SELECT id, geolocation,
           ROW_NUMBER() OVER (PARTITION BY geolocation::text ORDER BY id) as row_num
    FROM locations
    WHERE geolocation::text IN (
        SELECT geolocation::text
        FROM locations
        GROUP BY geolocation::text
        HAVING COUNT(*) > 1
    )
)
DELETE FROM locations
WHERE id IN (
    SELECT id
    FROM numbered_duplicates
    WHERE row_num > 1
);

Note: Another way to avoid duplicates is to cast geolocation as text and use insert from select distinct in the insert query and then to recast geolocation as point. That may be simpler.

the BAN

There is a national database of addresses called BAN (Base Nationale d’Adresse) available at adresse.data.gouv.fr/base-adresse-nationale

It has an api and you can find an address given location coordinates

curl “https://api-adresse.data.gouv.fr/reverse/?lon=2.323473607304248&lat=48.88237276393427&limit=1”

returns

{
    "type":"FeatureCollection",
    "version":"draft",
    "features":[
        {
        "type":"Feature",
        "geometry":{"type":"Point","coordinates":[2.323387,48.882396]},
        "properties":{
            "label":"25 Rue Clapeyron 75008 Paris",
            "score":0.9999999813107335,
            "housenumber":"25",
            "id":"75108_2067_00025",
            "name":"25 Rue Clapeyron",
            "postcode":"75008",
            "citycode":"75108",
            "x":650379.56,
            "y":6864921.3,
            "city":"Paris".....

We could sanitize the addresses by feeding the long, lat into the API, and adding the properly formatted address to the location of the trees. But that’s out of the scope of the exercise

Taxonomy

Taxonomy includes everythign related to the nature of the tree. In our case it englobes the columns : name, genre, species and variety

Note that there’s is an intrinsic relation between the name, genre, species and variety of a tree. Not sure which one. but this hierarchy probably should be kept.

So, our options are:

or

We will implement the second choice to conserve the relation between the trees name, genre etc ….

Measurements

Should the tree measurements be kept in the trees table or should they have their own separate table ?

To answer consider these questions

One piece of data is painfully missing from the trees table … we don’t know when each tree was last measured, there is no survey date.

If a survey date was available then we could record multiple measures for a given tree. It would definitely make sense to give the measurements a separated dedicated table.

However since there is no date for each measurements, each tree has one unique set of measures that is not going to change often (or so we guess). So let’s keep the measurements in the trees table to keep things simple.

Final structure

The target structure is

column table
id trees
id_location trees
idbase trees
remarkable trees
anomaly trees
height trees
circumference trees
diameter trees
column table
address locations
suppl_address locations
arrondissement locations
geolocation locations
column table
domain tree_domains
stage tree_stages
column table
name tree_names
genre tree_genres
species tree_species
variety tree_varieties

We also create the taxonomy table to link the tree to its name, genre, species and variety

Note

In the example given above we normalized the name column.

In the target schema, the tree_names table is linked to the taxonomy table not directly to the trees table.

To avoid confusion you should restart from scratch and restore the treesdb_v02 database.