In this document we are working with the treesdb_v02 database.
The file is available here treesdb_02.sql
| database | version |
|---|---|
| treesdb_v01 | raw data from the original csv, one table trees |
| treesdb_v02 | improved trees table with primary key, geopoints and identified height and circumference anomalies |
| treesdb_v03 | normalized database |
the file was created with
pg_dump --file "treesdb_v02.sql"
--host "localhost"
--port "5432" --username "alexis" --no-password --format=p --encoding "UTF8" --section=pre-data --section=data --section=post-data --no-owner --no-privileges --create --clean --if-exists --verbose --table "public.trees" "treesdb_v02"
Download the file to your local machine
Then
either connect to the default postgres database in the terminal
or in pgadmin, connect to the server and click on the tools tab, select PSQL
then in the PSQL window
\i <path to your file>/treesdb_v02.sql
On windows whatch out for the backslash, double quotes etc
This file will create the database treesdb_v02 and the trees table
Test that the data was uploaded. Execute the following query :
select * from trees order by random() limit 1;
This should return the number of tree records in the trees table.
check the trees table structure
\d trees
The goal of this exercise is to transform the flat database , one table, trees into a fully normalized database by applying 1NF, 2NF and 3NF forms.
For each entity that needs a dedicated table, the process is:
The entity can be composed of multiple original columns.
For instance, it makes sense to regroup address, arrondissement, geolocation and suppl_address in a location table.
nameThe name column in the trees table is a categorical column.
name)The column name is a good candidate for normalization.
tree_names. Very simple, it has one primary key and only one column namecreate table tree_names (
id serial primary key,
name varchar unique not null
);
Note the unique and not null constraints. There is no reason to have duplicates or null values in this table.
Having null values in the tree_names table would defeat the purpose of normalization.
INSERT INTO tree_names (name)
SELECT DISTINCT t.name
FROM trees t
WHERE t.name IS NOT NULL;
note that the insert query gets the values from another query
name_idALTER TABLE trees ADD COLUMN name_id INTEGER;
trees.name_id <-> tree_names.id matching on both tables name columns.UPDATE trees t
SET name_id = tn.id
FROM tree_names tn
WHERE t.name = tn.name;
ALTER TABLE trees
ADD CONSTRAINT fk_tree_names
FOREIGN KEY (name_id)
REFERENCES tree_names(id);
The query should return 0 rows
select count(*)
from trees t
join tree_names tn on tn.id = t.name_id
where t.name != tn.name;
ALTER TABLE trees
DROP COLUMN name;
Notice that the names in the new tree_names table follow no particular order.
To keep things really tidy, it would be great to have the name column be ordered alphabetically
To do so we must modify the insert query:
INSERT INTO tree_names (name)
SELECT DISTINCT name
FROM trees
WHERE name IS NOT NULL;
And use instead:
INSERT INTO tree_names (name)
SELECT DISTINCT name
FROM trees t
WHERE t.name IS NOT NULL
ORDER BY t.name asc;
Exercice: Insert the names ordered by their frequency of occurences in the trees table so that most frequent names are at the top of the tree_names table.
INSERT INTO tree_names (name)
SELECT name
FROM trees t
WHERE t.name IS NOT NULL
GROUP BY name
ORDER BY COUNT(*) DESC;
Now to get the name of a tree we need to JOIN the tables trees and tree_names.
select t.*, tn.*
from trees t
join tree_names tn on tn.id = t.name_id
where t.id = 888;
Though a little bit more complex than a simple query, we now can
tree_names table,Which logical entities do we need to extract and regroup in a dedicated table?
Let's look at the columns and group them by logical entities
We can keep the ids in the trees table:
idid_locationidbaseSame with the flags remarkable and outlier
Let's start with the columns domain & stage.
Both columns are categorical columns, show insertion, update and deletion anomalies an have null values. So they should each be given their dedicated table:
It makes sense to regroup all the columns associated with the localization of the tree.
Note the weird format of the arrondissement, the lack of postal code etc ... These addresses are not standard.
One step in the normalization process consists in reconciling the keys by matching the value of the fields in the main and secondary tables. For the locations table we could assume that each trees has a unique geolocation. And we would be wrong!
There are 12 geolocation duplicates in the trees table.
SELECT COUNT(*) as tree_count, geolocation::text
FROM locations
GROUP BY geolocation::text
HAVING COUNT(*) > 1
ORDER BY tree_count DESC;
A quick check will show that geolocation duplicates all have the same address. So we can delete these duplicates and fall back to reconciling the keys based on the geolocation points.
The sql to delete the geolocation duplicates is more complex.
WITH numbered_duplicates AS (
SELECT id, geolocation,
ROW_NUMBER() OVER (PARTITION BY geolocation::text ORDER BY id) as row_num
FROM locations
WHERE geolocation::text IN (
SELECT geolocation::text
FROM locations
GROUP BY geolocation::text
HAVING COUNT(*) > 1
)
)
DELETE FROM locations
WHERE id IN (
SELECT id
FROM numbered_duplicates
WHERE row_num > 1
);
Note: Another way to avoid duplicates is to cast geolocation as text and use
insert from select distinct in the insert query and then to recast geolocation as point.
That may be simpler.
There is a national database of addresses called BAN (Base Nationale d'Adresse) available at adresse.data.gouv.fr/base-adresse-nationale
It has an api and you can find an address given location coordinates
curl "https://api-adresse.data.gouv.fr/reverse/?lon=2.323473607304248&lat=48.88237276393427&limit=1"
returns
{
"type":"FeatureCollection",
"version":"draft",
"features":[
{
"type":"Feature",
"geometry":{"type":"Point","coordinates":[2.323387,48.882396]},
"properties":{
"label":"25 Rue Clapeyron 75008 Paris",
"score":0.9999999813107335,
"housenumber":"25",
"id":"75108_2067_00025",
"name":"25 Rue Clapeyron",
"postcode":"75008",
"citycode":"75108",
"x":650379.56,
"y":6864921.3,
"city":"Paris".....
We could sanitize the addresses by feeding the long, lat into the API, and adding the properly formatted address to the location of the trees. But that's out of the scope of the exercise
Taxonomy includes everythign related to the nature of the tree. In our case it englobes the columns : name, genre, species and variety
Note that there's is an intrinsic relation between the name, genre, species and variety of a tree. Not sure which one. but this hierarchy probably should be kept.
So, our options are:
or
name, genre, species, varietyWe will implement the second choice to conserve the relation between the trees name, genre etc ....
Should the tree measurements be kept in the trees table or should they have their own separate table ?
To answer consider these questions
One piece of data is painfully missing from the trees table ... we don't know when each tree was last measured, there is no survey date.
If a survey date was available then we could record multiple measures for a given tree. It would definitely make sense to give the measurements a separated dedicated table.
However since there is no date for each measurements, each tree has one unique set of measures that is not going to change often (or so we guess). So let's keep the measurements in the trees table to keep things simple.
The target structure is
| column | table |
|---|---|
| id | trees |
| id_location | trees |
| idbase | trees |
| remarkable | trees |
| anomaly | trees |
| height | trees |
| circumference | trees |
| diameter | trees |
| column | table |
|---|---|
| address | locations |
| suppl_address | locations |
| arrondissement | locations |
| geolocation | locations |
| column | table |
|---|---|
| domain | tree_domains |
| stage | tree_stages |
| column | table |
|---|---|
| name | tree_names |
| genre | tree_genres |
| species | tree_species |
| variety | tree_varieties |
We also create the taxonomy table to link the tree to its name, genre, species and variety

In the example given above we normalized the name column.
In the target schema, the tree_names table is linked to the taxonomy table not directly to the trees table.
To avoid confusion you should restart from scratch and restore the treesdb_v02 database.