Postgresql Loading data

Improve trees

Improving the trees dataset

In this document we explore and improve the trees dataset treesdb


Scope


The trees dataset

The raw version of the dataset is available from the Paris Open data portal

This includes street trees (along roads), those in green spaces and municipal public facilities. It does not include groves (bosquets), forest areas, trees on private property, etc.

Please note: this data is not updated in real time and significant discrepancies may appear (particularly in green spaces) and its scope covers neither all of the City’s actions nor private trees.

note: the data from the website may differ from the dataset we are working with.


Trees planted in Paris

Why is the city planting so many trees ?

planted trees in 20-25

Conclusion of a study on how to live in a Paris at 50 degrees : plant trees.

See for instance: here (fr), and here (en)


Restore the database

we restore the plain SQL version of the backup

The plain SQL backup was created with

pg_dump --file "treesdb_v01.sql"
--host "localhost"
--port "5432"
--username "alexis" --no-password
--format=p --encoding "UTF8"
--section=pre-data --section=data --section=post-data
--no-owner --no-privileges
--create --clean --if-exists --verbose
--table "public.trees" "treesdb_v01"

The SQL backup file is available here

The export includes the CREATE DATABASE statement. So you do not have to create the database before restoring the dump.

Explanation of the pg_dump statement

That command makes a plain-text SQL dump (because --format=p) of just the table public.trees from the database treesdb_v01.

Key points:

restore the database from the SQL file

To restore the database from that backup sql file:

  1. download the file from https://skatai.com/assets/db/treesdb_v01.sql
  2. open a PSQL window
  3. run the sql file with
\i /path/to/your/treesdb_v01.sql

or in the terminal with

psql -U username -d postgres -f /path/to/your/treesdb_v01.sql

on Windows : the unix style path will not work. use the windows style path instead: forward slashes C:/path/to/your/treesdb_v01.sql instead of backslashes C:\path\to\your\treesdb_v01.sql.

To open a PSQL window in pgadmin

  1. launch pgadmin
  2. go to tools tab
  3. click on “PSQL Tool”

Binary backup

A binary dump is also available. It was created with the psql command pg_dump

pg_dump --file "treesdb_v01.dump" --host "localhost" --port "5432" --username "alexis" --no-password --format=c --large-objects --compress "9" --encoding "UTF8" --section=pre-data --section=data --section=post-data --no-owner --no-privileges --create --clean --if-exists --verbose "treesdb_v01"

The binary dump file is available here

Restoring the binary backup

Since the pg_dump command produced a custom-format binary dump (--format=c), we must use pg_restore.

I included --create and --clean, the dump already contains instructions to drop and recreate the database, so you typically restore into the Postgres cluster (not into an existing database).

Here’s the matching pg_restore command:

pg_restore --host "localhost" \
           --port "5432" \
           --username "alexis" \
           --no-password \
           --dbname "postgres" \
           --create \
           --clean \
           --if-exists \
           --verbose \
           "treesdb_v01.dump"

Explanation:

Note: in psql (terminal and pgadmin), you can check which folder you are in with the \! pwd command. Change directory with \! cd /path/to/your/directory


check

Once the database loaded, check that things are correct with

connect to the database

\c treesdb_v01

list the tables

\dt treesdb_v01

inspect the trees table

\d trees

check how many rows you have (211339)

select count(*) from trees;

inspect a random sample

select * from trees order by random() limit 1;

The columns of the tree table

Here are the columns of the tree table


Goal

The trees dataset is not perfect. there are anomalies, missing data

The table data types is not optimal for some columns and we are missing a primary key

This dataset is a good example of real world datasets that are never perfect.

In this document we get a sense of the data, deal with some anomalies and transform the table with more appropriate data types.

We also leverage the postGIS / earthdistance extension for spatial data.


data analysis

Let’s do some data analysis of the dataset.

select a random row from the trees db

     Column     |       value
----------------+-------------------
idbase         | 273252
location_type  | Arbre
domain         | Alignement
arrondissement | PARIS 18E ARRDT
suppl_address  |
number         |
address        | RUE DE LA CHAPELLE
id_location    | 000602007
name           | Tilleul
genre          | tilia
species        | cordata
variety        | Greenspire
circumference  | 34
height         | 5
stage          | Jeune (arbre)
geo_point_2d   | 48.89291084026716, 2.359807495821241
remarkable    | f

We see that we have:


explore

Let’s query the tree table and get a feeling for the values of the different columns

any other thing you can think of ?


PRIMARY KEY


Where’s the primary key ?

The table loaded from a csv has no primary key although idbase seems like a good candidate. (after all it has id in it… must mean something right ?)

Could the idbase column be a primary key ?


What’s a primary key and what is it used for ?

A primary key in SQL is a unique identifier for each record in a table.

A primary key is a constraint that enforces uniqueness and non-nullability for the column or columns it is applied to.

Having a primary key allows databases to index the table more efficiently, making searches and retrievals faster when accessing records by their primary key value.

foreign key

In relational databases, primary keys are often used as foreign keys in other tables. A foreign key is a column (or a combination of columns) that references a primary key in another table.

conditions for a primary key

A column is a good candidate as a primary key if:

and


SERIAL

A primary key is usually also SERIAL.

In PostgreSQL, SERIAL is a special data type used for auto-incrementing integer columns, commonly employed for primary keys.

When you define a column with SERIAL, PostgreSQL automatically creates a sequence and sets it up so that each new row gets the next value from this sequence. (we’ll come back to that in a moment)


back to serial

Based on this definition, the idbase column be a good candidate for primary key ?

Also, can it be serial ?

Let’s check uniqueness of the values first :

Write a query to find out if some values of idbase have more than 2 rows

-- your query

This query shows that there are 2 rows with the same idbase

Maybe these are exact duplicates of trees? Let’s check

You can select the trees that have duplicates idbase with

select * from trees where idbase in (
  select idbase from (
    SELECT idbase, COUNT(*)
    FROM trees
    GROUP BY idbase
    HAVING COUNT(*) > 1
  )
) order by idbase asc;

These trees have the exact same data except for some null values for heights and circumference which are different.

At this point there’s no way to know which is the tree with the true values.

Note: although since duplicate trees have the same geolocation, they most probably are the same tree. we could just delete one of the record an dkeep the tree with the most complete data or highest height and circumference.

So to make idbase the primary we would have to delete the one of the duplicated trees.

Note a SERIAL primary does not have to be continuous. It can have gaps.

So we have 2 options

The second option requires to delete some records and also to create a sequence.

So the 1st option (create new primary key) is the easiest one since the sequence is created automatically when adding a primary key.

The easiest solution is to create a new serial primary key

We can do that with

alter table trees add COLUMN id SERIAL PRIMARY KEY;

As expected, creating a primary key also creates a sequence ;

treesdb=# \d


 Schema |     Name     |   Type   | Owner
--------+--------------+----------+--------
 public | trees        | table    | alexis
 public | trees_id_seq | sequence | alexis

Sequence


Postgres Sequence

a sequence is a special database object designed to generate a sequence of unique, incremental numbers. It is commonly used to create auto-incrementing values, typically for columns like primary keys.


\d+ trees_id_seq

  Type   | Start | Minimum |  Maximum   | Increment | Cycles? | Cache
---------+-------+---------+------------+-----------+---------+-------
 integer |     1 |       1 | 2147483647 |         1 | no      |     1
Owned by: public.trees.id

and

\d+ trees

    Column     |       Type        | Collation | Nullable |              Default
----------------+-------------------+-----------+----------+-----------------------------------
 id             | integer           |           | not null | nextval('trees_id_seq'::regclass)

notice the nextval('trees_id_seq'::regclass) which increments the counter in the sequence each time it is called

also notice the new index

Indexes:
    "trees_pkey" PRIMARY KEY, btree (id)

We shall keep the idbase column for future references but we will use id as the primary key.


data types


Improving the column types

At this point, all columns are varchar (synonym for text in postgresSQL) except for the height and circumference (integers)

That does not make sense for columns such as: remarkable, or geo_point_2d (latitude and longitude)

remarkable should be a boolean

What are the values taken by remarkable ?

How to transform the column which has NON, OUI, or '' (empty string) as values insetad of regular boolean values t, f and null

Your task:

Use UPDATE and CASE to copy / convert the values

update trees
SET remarkable_bool =
    CASE remarkable
        WHEN 'OUI' THEN TRUE
        WHEN 'NON' THEN FALSE
        ELSE NULL  -- handles empty string and any other values
    END;


Geo location


geo_point_2d is a string

geo_point_2d holds the latitude and longitude of the trees. We could transform geo_point_2d as an array of floats. However there are specific data types for geo localization.

Using the proper data type will allow us to more easily carry out calculations specific to locations. For instance find the nearest trees, or calculate the distance between trees.

We have a choice to use a native POINT data type which is available by default in PostgreSQL or a PostGIS geography data type (needs the PostGIS extension).


Comparison between POINT and GEOGRAPHY

Feature POINT GEOGRAPHY
Coordinate System Flat, Cartesian (x, y) Spherical (longitude, latitude)
Earth’s Curvature Not accounted for Accounted for
Distance Calculations Euclidean (straight line on flat plane) Great circle (curved line on Earth’s surface)
Accuracy over Large Distances Less accurate Maintains global accuracy
Performance Generally faster for basic operations May be slower but more accurate for geographic calculations
Use Cases Local, small-scale applications (e.g., floor plans, 2D games) Global, large-scale geographic applications (e.g., GPS, GIS)
Additional Functionality Limited to basic geometric operations Extensive GIS functions available through PostGIS
Data Representation Simple (x, y) coordinates Complex spheroidal calculations
Spatial Reference System Typically assumes a flat plane Supports various geographic coordinate systems (e.g., WGS84)
Storage Size Smaller Larger due to additional metadata

PostGIS vs Earth distance

Installing the PostGIS extension is a bit complex. We will use the native POINT data type for now and the earthdistance extension.

see this article for a good comparison: https://hashrocket.com/blog/posts/juxtaposing-earthdistance-and-postgis


How to use earthdistance

Install earthdistance extension

in PSQL

CREATE EXTENSION cube;
CREATE EXTENSION earthdistance;

check that both extensions are installed with

select * from pg_extension;

or

\dx

in pgadmin, expand the databse name and click right on the expansion menu

earthdistance_extension


Calculate distances between 2 points

Then you can select distances between 2 points

SELECT earth_distance(
  ll_to_earth(48.8566, 2.3522),   -- Paris
  ll_to_earth(51.5074, -0.1278)   -- London
) AS distance_meters;

This gives the distance in meters.


using POINT in earthdistance

⚡ If you currently store locations as POINT

If you have a column like:

location POINT

You can still use it, just extract its coordinates:

SELECT earth_distance(
  ll_to_earth( location[1], location[0] ),  -- POINT(x,y) = (lon,lat)
)
FROM places;

⚠️ Important: POINT(x, y) is usually (x=longitude, y=latitude), so make sure you pass them as (lat, lon) to ll_to_earth.


Transform the geo_point_2d from varchar to POINT

We use the following functions SPLIT_PART, TRIM and CAST to extract the latitude and longitude from the string. See postgres documentation strings functions

Your query should look like that

UPDATE trees
SET geolocation = point(
    CAST(TRIM(SPLIT_PART(geo_point_2d, ',', 2)) AS double precision), -- x = longitude
    CAST(TRIM(SPLIT_PART(geo_point_2d, ',', 1)) AS double precision)  -- y = latitude
)
-- here check that geo_point_2d is not null and not empty and has a comma
WHERE geo_point_2d IS NOT NULL
  AND TRIM(geo_point_2d) <> ''
  AND POSITION(',' IN geo_point_2d) > 0;

finally delete the geo_point_2d column

ALTER TABLE trees DROP COLUMN IF EXISTS geo_point_2d;

Closest tree

Given a tree index (id = 1234), find the N closest trees

WITH anchor AS (
  SELECT x(geolocation) AS lon, y(geolocation) AS lat
  FROM trees
  WHERE id = 1234
)
SELECT t.id,
       earth_distance(
         ll_to_earth(a.lat, a.lon),
         ll_to_earth(y(t.geolocation), x(t.geolocation))
       ) AS meters
FROM anchor a
JOIN trees t ON t.id <> 1234
ORDER BY meters
LIMIT 10;  -- put N here

What the WITH (CTE) does

anchor runs once to fetch the reference tree’s coordinates.

It exposes them as a.lon and a.lat for the main query, avoiding repeated subqueries.

Why y(t.geolocation)?

In Postgres, the point type is (x, y) = (longitude, latitude).

The helper functions:

So for each row t, we pass y(t.geolocation) (lat) first, then x(t.geolocation) (lon).

We can then adapt the query to find the trees near a given location if we know its coordinates


Trees with crazy measurements

Anomalies in the data


Some common queries

To finish our work on the trees table, we’d like to flag trees that have anomalies in terms of height and circumference.

Let’s look at some stats first to undertsand what normal data looks like

To detect anomalies for a given column, you can get a good insight about a variable distribution by writing a query to find : min, max, average, median, 95 and 5 percentiles for a given float column. This mimics the df.describe() in pandas data frames.

You can use the function PERCENTILE_CONT(p) with p as the percentile value (0.5 for median, 0.9, 0.95 0.99 etc …)

Note: you need to sort the values before applying PERCENTILE_CONT and calculating the percentile.

The right way to calculate a percentile in sql is

PERCENTILE_CONT(0.5)           -- What: Find the 50th percentile
WITHIN GROUP                    -- Where: Within the current group of rows
(ORDER BY height)              -- How: Order by height to determine position
SELECT
  MIN(height)        AS min_height,
  MAX(height)        AS max_height,
  AVG(height)        AS avg_height,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY height) AS median_height,
  PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY height) AS p05_height,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY height) AS p95_height,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY height) AS p99_height

FROM trees;
SELECT COUNT(*) AS trees_above_p99
FROM trees
WHERE height > (
  SELECT PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY height)
  FROM trees
);

notice the embedded select clauses

Check which trees (id) have height greater than the 99th percentile, order by height desc, limit to 10;

SELECT id, height
FROM trees
WHERE height > (
  SELECT PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY height)
  FROM trees
)
ORDER BY height DESC
LIMIT 10;

Flag the trees with outlier height

So let’s create a BOOLEAN column called outlier that indicates that there’s an anomaly with a tree record.

ALTER TABLE trees ADD COLUMN outlier boolean;

the circumference is in centimeters while the height is in meters and in the original dataset, both are recorded as ints (no decimal points are available)

Use the 99 percentile as a threshold for anomalous values.

UPDATE trees
SET outlier = height > (
  SELECT PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY height)
  FROM trees
);

Now let’s check for circumference anomalies

but first let’s convert the circumference to diameter (diameter = circumference / pi).

-- add a new column diameter as float

and update the diameter column with circumference/pi

-- your query

Now find trees that have a insanely high diameter (calculate the 99 percentile of the diameter)

Uupdate the anomaly column with these trees

And finally also set the anomaly column to true for duplicates of idbase


other anomalies ?

Are there other anomalies such as duplicates addresses ?

Although zero values for height and circumference could simply indicate a young small tree whose measures have been rounded down a bit harshly.

It may be difficult to decide if the height or diameter of a tree is an anomaly or not.

for instance the tree 187635 has a height of 98m.

column value
idbase 2018097
location_type Arbre
domain Alignement
arrondissement PARIS 18E ARRDT
suppl_address 108V
number  
address RUE DE LA CHAPELLE
id_location 2002004
name Platane
genre Platanus
species x hispanica
variety  
circumference 68
height 98
stage Jeune (arbre)
geo_point_2d 48.89815810816667, 2.3591531336170086
id 187635
remarkable f
diameter 21.645072260497766
geolocation (2.3591531336170086,48.89815810816667)

that’s a lot but is it a valid height for a tree? Are they trees that tall in Paris ?

We can investigate in 2 ways

input the coordinates in google maps and look at the photo of the street

or check the height of nearby trees with the closest trees query

WITH given_tree AS (
  SELECT id, geolocation
  FROM trees
  WHERE id = 187635
)
SELECT
  t.id,
  t.height,
  t.geolocation,
  ROUND(
    SQRT(
      POW((t.geolocation[0] - gt.geolocation[0]) * 73000, 2) +
      POW((t.geolocation[1] - gt.geolocation[1]) * 111000, 2)
    )
  ) AS distance_meters
FROM trees t, given_tree gt

ORDER BY t.geolocation <-> gt.geolocation
LIMIT 9;

All surrounding trees have a height of 5 to 16 meters. So 98 meters is not a valid measurement.

Finally let’s dump the database

using pgAdmin export the database as plain format with name treesdb_v03.sql


Recap

The table loaded from the csv / sql dump was lacking a primary key, proper datatypes and had many data anomalies

The current table is much more clean and in a state more compatible with production.

Next we move away from the single table database and start building a proper relational database from that dataset.