In this document we explore and improve the trees dataset treesdb
The raw version of the dataset is available from the Paris Open data portal
This includes street trees (along roads), those in green spaces and municipal public facilities. It does not include groves (bosquets), forest areas, trees on private property, etc.
Please note: this data is not updated in real time and significant discrepancies may appear (particularly in green spaces) and its scope covers neither all of the City's actions nor private trees.
note: the data from the website may differ from the dataset we are working with.
Why is the city planting so many trees ?
Conclusion of a study on how to live in a Paris at 50 degrees : plant trees.
See for instance: here (fr), and here (en)
we restore the plain SQL version of the backup
The plain SQL backup was created with
pg_dump --file "treesdb_v01.sql"
--host "localhost"
--port "5432"
--username "alexis" --no-password
--format=p --encoding "UTF8"
--section=pre-data --section=data --section=post-data
--no-owner --no-privileges
--create --clean --if-exists --verbose
--table "public.trees" "treesdb_v01"
The export includes the CREATE DATABASE statement. So you do not have to create the database before restoring the dump.
That command makes a plain-text SQL dump (because --format=p) of just the table public.trees from the database treesdb_v01.
Key points:
treesdb_v01.sql.trees table only.--create --clean --if-exists → SQL will drop and recreate the DB/table.--no-owner --no-privileges → skips ownership/GRANT statements.--verbose → prints progress.To restore the database from that backup sql file:
\i /path/to/your/treesdb_v01.sql
or in the terminal with
psql -U username -d postgres -f /path/to/your/treesdb_v01.sql
on Windows : the unix style path will not work. use the windows style path instead: forward slashes
C:/path/to/your/treesdb_v01.sqlinstead of backslashesC:\path\to\your\treesdb_v01.sql.
To open a PSQL window in pgadmin
A binary dump is also available. It was created with the psql command pg_dump
pg_dump --file "treesdb_v01.dump" --host "localhost" --port "5432" --username "alexis" --no-password --format=c --large-objects --compress "9" --encoding "UTF8" --section=pre-data --section=data --section=post-data --no-owner --no-privileges --create --clean --if-exists --verbose "treesdb_v01"
Since the pg_dump command produced a custom-format binary dump (--format=c), we must use pg_restore.
I included --create and --clean, the dump already contains instructions to drop and recreate the database, so you typically restore into the Postgres cluster (not into an existing database).
Here’s the matching pg_restore command:
pg_restore --host "localhost" \
--port "5432" \
--username "alexis" \
--no-password \
--dbname "postgres" \
--create \
--clean \
--if-exists \
--verbose \
"treesdb_v01.dump"
--dbname "postgres" → Need to connect to an existing database (usually postgres or template1) because the dump will recreate treesdb_v01.--create → ensures the database defined in the dump is created.--clean --if-exists → ensures objects are dropped before being recreated, matching the dump options.--verbose → prints progress.Note: in psql (terminal and pgadmin), you can check which folder you are in with the
\! pwdcommand. Change directory with\! cd /path/to/your/directory
Once the database loaded, check that things are correct with
connect to the database
\c treesdb_v01
list the tables
\dt treesdb_v01
inspect the trees table
\d trees
check how many rows you have (211339)
select count(*) from trees;
inspect a random sample
select * from trees order by random() limit 1;
Here are the columns of the tree table
idbase: integerlocation_type: character varyingdomain: character varyingarrondissement: character varyingsuppl_address: character varyingnumber: character varyingaddress: character varyingid_location: character varyingname: character varyinggenre: character varyingspecies: character varyingvariety: character varyingcircumference: integerheight: integerstage: character varyinggeo_point_2d: character varyingremarkable: booleanThe trees dataset is not perfect. there are anomalies, missing data
The table data types is not optimal for some columns and we are missing a primary key
This dataset is a good example of real world datasets that are never perfect.
In this document we get a sense of the data, deal with some anomalies and transform the table with more appropriate data types.
We also leverage the postGIS / earthdistance extension for spatial data.
Let's do some data analysis of the dataset.
select a random row from the trees db
Column | value
----------------+-------------------
idbase | 273252
location_type | Arbre
domain | Alignement
arrondissement | PARIS 18E ARRDT
suppl_address |
number |
address | RUE DE LA CHAPELLE
id_location | 000602007
name | Tilleul
genre | tilia
species | cordata
variety | Greenspire
circumference | 34
height | 5
stage | Jeune (arbre)
geo_point_2d | 48.89291084026716, 2.359807495821241
remarkable | f
We see that we have:
domain, arrondissementname, genre, species, variety,stage,height and circumference (in meters)address, suppl_address, number, ...remarkable flagLet's query the tree table and get a feeling for the values of the different columns
domain or arrondissementstage, genre, species ...domain, stage or remarkablelocation_type and number columns ?any other thing you can think of ?
The table loaded from a csv has no primary key although idbase seems like a good candidate. (after all it has id in it... must mean something right ?)
Could the idbase column be a primary key ?
A primary key in SQL is a unique identifier for each record in a table.
A primary key is a constraint that enforces uniqueness and non-nullability for the column or columns it is applied to.
Having a primary key allows databases to index the table more efficiently, making searches and retrievals faster when accessing records by their primary key value.
In relational databases, primary keys are often used as foreign keys in other tables. A foreign key is a column (or a combination of columns) that references a primary key in another table.
A column is a good candidate as a primary key if:
and
A primary key is usually also SERIAL.
In PostgreSQL, SERIAL is a special data type used for auto-incrementing integer columns, commonly employed for primary keys.
When you define a column with SERIAL, PostgreSQL automatically creates a sequence and sets it up so that each new row gets the next value from this sequence. (we'll come back to that in a moment)
Based on this definition, the idbase column be a good candidate for primary key ?
Also, can it be serial ?
Let's check uniqueness of the values first :
Write a query to find out if some values of idbase have more than 2 rows
-- your query
This query shows that there are 2 rows with the same idbase
Maybe these are exact duplicates of trees? Let's check
You can select the trees that have duplicates idbase with
select * from trees where idbase in (
select idbase from (
SELECT idbase, COUNT(*)
FROM trees
GROUP BY idbase
HAVING COUNT(*) > 1
)
) order by idbase asc;
These trees have the exact same data except for some null values for heights and circumference which are different.
At this point there's no way to know which is the tree with the true values.
Note: although since duplicate trees have the same geolocation, they most probably are the same tree. we could just delete one of the record an dkeep the tree with the most complete data or highest height and circumference.
So to make idbase the primary we would have to delete the one of the duplicated trees.
idbase values sequential / continuous ? (check min and max and total count of trees)Note a SERIAL primary does not have to be continuous. It can have gaps.
So we have 2 options
id as SERIAL primary keyidbase as SERIAL primary key after deleting one of the duplicated tree.The second option requires to delete some records and also to create a sequence.
So the 1st option (create new primary key) is the easiest one since the sequence is created automatically when adding a primary key.
The easiest solution is to create a new serial primary key
We can do that with
alter table trees add COLUMN id SERIAL PRIMARY KEY;
As expected, creating a primary key also creates a sequence ;
treesdb=# \d
Schema | Name | Type | Owner
--------+--------------+----------+--------
public | trees | table | alexis
public | trees_id_seq | sequence | alexis
a sequence is a special database object designed to generate a sequence of unique, incremental numbers. It is commonly used to create auto-incrementing values, typically for columns like primary keys.
\d+ trees_id_seq
Type | Start | Minimum | Maximum | Increment | Cycles? | Cache
---------+-------+---------+------------+-----------+---------+-------
integer | 1 | 1 | 2147483647 | 1 | no | 1
Owned by: public.trees.id
and
\d+ trees
Column | Type | Collation | Nullable | Default
----------------+-------------------+-----------+----------+-----------------------------------
id | integer | | not null | nextval('trees_id_seq'::regclass)
notice the nextval('trees_id_seq'::regclass) which increments the counter in the sequence each time it is called
also notice the new index
Indexes:
"trees_pkey" PRIMARY KEY, btree (id)
We shall keep the idbase column for future references but we will use id as the primary key.
At this point, all columns are varchar (synonym for text in postgresSQL) except for the height and circumference (integers)
That does not make sense for columns such as: remarkable, or geo_point_2d (latitude and longitude)
remarkable should be a booleanWhat are the values taken by remarkable ?
How to transform the column which has NON, OUI, or '' (empty string) as values insetad of regular boolean values t, f and null
Your task:
remarkable_boolUse UPDATE and CASE to copy / convert the values
update trees
SET remarkable_bool =
CASE remarkable
WHEN 'OUI' THEN TRUE
WHEN 'NON' THEN FALSE
ELSE NULL -- handles empty string and any other values
END;
geo_point_2d is a stringgeo_point_2d holds the latitude and longitude of the trees. We could transform geo_point_2d as an array of floats. However there are specific data types for geo localization.
Using the proper data type will allow us to more easily carry out calculations specific to locations. For instance find the nearest trees, or calculate the distance between trees.
We have a choice to use a native POINT data type which is available by default in PostgreSQL or a PostGIS geography data type (needs the PostGIS extension).
| Feature | POINT | GEOGRAPHY |
|---|---|---|
| Coordinate System | Flat, Cartesian (x, y) | Spherical (longitude, latitude) |
| Earth's Curvature | Not accounted for | Accounted for |
| Distance Calculations | Euclidean (straight line on flat plane) | Great circle (curved line on Earth's surface) |
| Accuracy over Large Distances | Less accurate | Maintains global accuracy |
| Performance | Generally faster for basic operations | May be slower but more accurate for geographic calculations |
| Use Cases | Local, small-scale applications (e.g., floor plans, 2D games) | Global, large-scale geographic applications (e.g., GPS, GIS) |
| Additional Functionality | Limited to basic geometric operations | Extensive GIS functions available through PostGIS |
| Data Representation | Simple (x, y) coordinates | Complex spheroidal calculations |
| Spatial Reference System | Typically assumes a flat plane | Supports various geographic coordinate systems (e.g., WGS84) |
| Storage Size | Smaller | Larger due to additional metadata |
Installing the PostGIS extension is a bit complex. We will use the native POINT data type for now and the earthdistance extension.
see this article for a good comparison: https://hashrocket.com/blog/posts/juxtaposing-earthdistance-and-postgis
earthdistancein PSQL
CREATE EXTENSION cube;
CREATE EXTENSION earthdistance;
check that both extensions are installed with
select * from pg_extension;
or
\dx
in pgadmin, expand the databse name and click right on the expansion menu
Then you can select distances between 2 points
SELECT earth_distance(
ll_to_earth(48.8566, 2.3522), -- Paris
ll_to_earth(51.5074, -0.1278) -- London
) AS distance_meters;
This gives the distance in meters.
POINTIf you have a column like:
location POINT
You can still use it, just extract its coordinates:
SELECT earth_distance(
ll_to_earth( location[1], location[0] ), -- POINT(x,y) = (lon,lat)
)
FROM places;
⚠️ Important: POINT(x, y) is usually (x=longitude, y=latitude), so make sure you pass them as (lat, lon) to ll_to_earth.
geo_point_2d from varchar to POINTgeolocation with type POINT.ALTER TABLE trees ADD COLUMN IF NOT EXISTS geolocation point;
ST_MakePoint(lon as float, lat as float) to create a point from the longitude and latitude.geo_point_2d columnWe use the following functions SPLIT_PART, TRIM and CAST to extract the latitude and longitude from the string.
See postgres documentation strings functions
Your query should look like that
UPDATE trees
SET geolocation = point(
CAST(TRIM(SPLIT_PART(geo_point_2d, ',', 2)) AS double precision), -- x = longitude
CAST(TRIM(SPLIT_PART(geo_point_2d, ',', 1)) AS double precision) -- y = latitude
)
-- here check that geo_point_2d is not null and not empty and has a comma
WHERE geo_point_2d IS NOT NULL
AND TRIM(geo_point_2d) <> ''
AND POSITION(',' IN geo_point_2d) > 0;
finally delete the geo_point_2d column
ALTER TABLE trees DROP COLUMN IF EXISTS geo_point_2d;
Given a tree index (id = 1234), find the N closest trees
WITH anchor AS (
SELECT x(geolocation) AS lon, y(geolocation) AS lat
FROM trees
WHERE id = 1234
)
SELECT t.id,
earth_distance(
ll_to_earth(a.lat, a.lon),
ll_to_earth(y(t.geolocation), x(t.geolocation))
) AS meters
FROM anchor a
JOIN trees t ON t.id <> 1234
ORDER BY meters
LIMIT 10; -- put N here
anchor runs once to fetch the reference tree’s coordinates.
It exposes them as a.lon and a.lat for the main query, avoiding repeated subqueries.
Why y(t.geolocation)?
In Postgres, the point type is (x, y) = (longitude, latitude).
The helper functions:
x(point) → returns x = longitude.y(point) → returns y = latitude.ll_to_earth(lat, lon) expects (latitude, longitude) in that order.So for each row t, we pass y(t.geolocation) (lat) first, then x(t.geolocation) (lon).
We can then adapt the query to find the trees near a given location if we know its coordinates
Anomalies in the data
To finish our work on the trees table, we'd like to flag trees that have anomalies in terms of height and circumference.
Let's look at some stats first to undertsand what normal data looks like
To detect anomalies for a given column, you can get a good insight about a variable distribution by writing a query to find : min, max, average, median, 95 and 5 percentiles for a given float column.
This mimics the df.describe() in pandas data frames.
You can use the function PERCENTILE_CONT(p) with p as the percentile value (0.5 for median, 0.9, 0.95 0.99 etc ...)
Note: you need to sort the values before applying
PERCENTILE_CONTand calculating the percentile.
The right way to calculate a percentile in sql is
PERCENTILE_CONT(0.5) -- What: Find the 50th percentile
WITHIN GROUP -- Where: Within the current group of rows
(ORDER BY height) -- How: Order by height to determine position
SELECT
MIN(height) AS min_height,
MAX(height) AS max_height,
AVG(height) AS avg_height,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY height) AS median_height,
PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY height) AS p05_height,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY height) AS p95_height,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY height) AS p99_height
FROM trees;
SELECT COUNT(*) AS trees_above_p99
FROM trees
WHERE height > (
SELECT PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY height)
FROM trees
);
notice the embedded select clauses
Check which trees (id) have height greater than the 99th percentile, order by height desc, limit to 10;
SELECT id, height
FROM trees
WHERE height > (
SELECT PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY height)
FROM trees
)
ORDER BY height DESC
LIMIT 10;
So let's create a BOOLEAN column called outlier that indicates that there's an anomaly with a tree record.
ALTER TABLE trees ADD COLUMN outlier boolean;
the circumference is in centimeters while the height is in meters and in the original dataset, both are recorded as ints (no decimal points are available)
Use the 99 percentile as a threshold for anomalous values.
UPDATE trees
SET outlier = height > (
SELECT PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY height)
FROM trees
);
Now let's check for circumference anomalies
but first let's convert the circumference to diameter (diameter = circumference / pi).
-- add a new column diameter as float
and update the diameter column with circumference/pi
-- your query
Now find trees that have a insanely high diameter (calculate the 99 percentile of the diameter)
Uupdate the anomaly column with these trees
And finally also set the anomaly column to true for duplicates of idbase
Are there other anomalies such as duplicates addresses ?
Although zero values for height and circumference could simply indicate a young small tree whose measures have been rounded down a bit harshly.
It may be difficult to decide if the height or diameter of a tree is an anomaly or not.
for instance the tree 187635 has a height of 98m.
| column | value |
|---|---|
| idbase | 2018097 |
| location_type | Arbre |
| domain | Alignement |
| arrondissement | PARIS 18E ARRDT |
| suppl_address | 108V |
| number | |
| address | RUE DE LA CHAPELLE |
| id_location | 2002004 |
| name | Platane |
| genre | Platanus |
| species | x hispanica |
| variety | |
| circumference | 68 |
| height | 98 |
| stage | Jeune (arbre) |
| geo_point_2d | 48.89815810816667, 2.3591531336170086 |
| id | 187635 |
| remarkable | f |
| diameter | 21.645072260497766 |
| geolocation | (2.3591531336170086,48.89815810816667) |
that's a lot but is it a valid height for a tree? Are they trees that tall in Paris ?
We can investigate in 2 ways
input the coordinates in google maps and look at the photo of the street
or check the height of nearby trees with the closest trees query
WITH given_tree AS (
SELECT id, geolocation
FROM trees
WHERE id = 187635
)
SELECT
t.id,
t.height,
t.geolocation,
ROUND(
SQRT(
POW((t.geolocation[0] - gt.geolocation[0]) * 73000, 2) +
POW((t.geolocation[1] - gt.geolocation[1]) * 111000, 2)
)
) AS distance_meters
FROM trees t, given_tree gt
ORDER BY t.geolocation <-> gt.geolocation
LIMIT 9;
All surrounding trees have a height of 5 to 16 meters. So 98 meters is not a valid measurement.
using pgAdmin export the database as plain format with name treesdb_v03.sql
The table loaded from the csv / sql dump was lacking a primary key, proper datatypes and had many data anomalies
idbase column was not a good choice as a primary keyThe current table is much more clean and in a state more compatible with production.
Next we move away from the single table database and start building a proper relational database from that dataset.