We'll work on the normalized version of the trees db
download the sql backup from https://skatai.com/assets/db/treesdb_v03_normalized.sql
restore the database
from the terminal
psql -d postgres -f <path to the sql backup>/treesdb_v03_normalized.sql
or within a PSQL session
\i <path to the sql backup>/treesdb_v03_normalized.sql
This will create a new database called treesdb_v03_normalized
Check out the tables, make a few queries to get a sense of the data.
The ERD is
Why: Before diving into analysis, we need to know: How many trees do we have? How many are missing critical measurements? What's the data quality like?
Goal: Show overall counts and percentages of NULL values for key measurements.
Hint: Use COUNT(*) OVER() for total trees, and COUNT(column) OVER() for non-null counts. Calculate percentages by dividing counts.
Compare with
GROUP BYandCOUNT(*)sql queries to see the difference.
Query Structure:
SELECT
-- Show a sample of tree IDs
-- Calculate total trees using window function
-- Count non-null heights
-- Count non-null circumferences
-- Calculate percentage of trees with height data
FROM trees
LIMIT 10;
Key Insight: All rows show the same totals because window functions without PARTITION BY operate on the entire dataset.
Why: Paris has 20 arrondissements, the dataset also includes surrounding parks. We want to understand if data quality varies by location - are some districts better at recording tree measurements?
Goal: For each arrondissement, count total trees and trees with valid measurements (height > 0).
Hint:
Query Structure:
SELECT DISTINCT
l.arrondissement,
COUNT(*) OVER(PARTITION BY ...) AS total_trees,
-- Count trees where height > 0 within each arrondissement
-- Calculate percentage
FROM trees t
JOIN locations l ON ...
WHERE l.arrondissement IS NOT NULL
ORDER BY ...;
Key Insight: Using DISTINCT gives us one row per arrondissement. We see which districts have better data quality.
note using window functions and
distinctis overkill. A more simple way to get the results is simply to usegroup byandcount
For instance
SELECT
l.arrondissement,
COUNT(*) AS total_trees,
COUNT(CASE WHEN t.height > 0 THEN 1 END) AS trees_with_valid_height,
ROUND(
COUNT(CASE WHEN t.height > 0 THEN 1 END)::numeric /
COUNT(*) * 100,
2
) AS pct_valid_height
FROM trees t
JOIN locations l ON t.location_id = l.id
WHERE l.arrondissement IS NOT NULL
GROUP BY l.arrondissement
ORDER BY l.arrondissement;
But for the sake of the exercise, try to write the query using window functions and distinct.
You should get this result:
arrondissement | total_trees | trees_with_valid_height | pct_valid_height
-------------------+-------------+-------------------------+------------------
BOIS DE BOULOGNE | 4143 | 4130 | 99.69
BOIS DE VINCENNES | 11804 | 9256 | 78.41
...
Why: The stage field has inconsistent values. We need to see what stages exist and how many trees are in each stage vs how many have NULL stage.
Goal: Show the distribution of trees across different development stages, including NULL values.
Hint:
Query Structure:
SELECT DISTINCT
COALESCE(ts.stage, 'Unknown/NULL') AS stage,
COUNT(*) OVER(PARTITION BY ...) AS trees_in_stage,
ROUND(COUNT(*) OVER(PARTITION BY ...)::numeric / COUNT(*) OVER() * 100, 2) AS pct_of_total
FROM trees t
LEFT JOIN tree_stages ts ON ...
ORDER BY trees_in_stage DESC;
Key Insight: You'll likely see that many trees have NULL stage, and the stage names are inconsistent ('Jeune Arbre(Adulte)' mixed with 'Adulte').
The 2 first lines of results are:
stage | trees_in_stage | pct_of_total
---------------------+----------------+--------------
Adulte | 79627 | 37.68
Unknown/NULL | 46742 | 22.12
Why: We have outliers and zeros in the height data. Let's see the distribution: What are typical heights? Where are the outliers?
Goal: Calculate percentiles and quartiles for tree heights to understand the distribution, excluding zeros.
Hint:
Query Structure:
SELECT
id,
height,
-- percentile rank
-- quartile
-- min/max within quartile
FROM trees
WHERE height > 0
ORDER BY height -- DESC or ASC
LIMIT 100;
Key Insight: By ordering DESC and limiting, you can see the outliers. Try changing to ORDER BY height ASC to see the smallest values.
Why: Some measurements might be data entry errors. Let's find trees that are unusually tall or short compared to their domain (public parks vs streets have different typical sizes).
Goal: Compare each tree's height to the average in its domain and flag extreme outliers.
Hint:
Query Structure:
SELECT
t.id,
td.domain,
t.height,
ROUND(AVG(t.height) OVER(PARTITION BY ...)::numeric, 2) AS domain_avg_height,
ROUND(STDDEV(t.height) OVER(PARTITION BY ...)::numeric, 2) AS domain_stddev,
-- Calculate z-score
CASE WHEN ... THEN 'Outlier' ELSE 'Normal' END AS outlier_flag
FROM trees t
LEFT JOIN tree_domains td ON t.domain_id = td.id
WHERE t.height > 0
ORDER BY ABS((height - AVG...) / STDDEV...) DESC
LIMIT 50;
Key Insight: NULLIF prevents division by zero. You'll see which trees have suspicious measurements.
Why: Each arrondissement might want to showcase its most impressive trees. Let's find the top 5 tallest trees in each district.
Goal: Rank trees by height within each arrondissement and filter to top 5.
Hint:
Query Structure:
SELECT
l.arrondissement,
tn.name AS tree_name,
t.height,
RANK() OVER(...) AS height_rank_in_arr
FROM trees t
JOIN locations l ON ...
JOIN taxonomy tax ON ...
LEFT JOIN tree_names tn ON ...
WHERE t.height > 0 AND l.arrondissement IS NOT NULL
ORDER BY l.arrondissement, height_rank_in_arr
-- Note: This will show many rows. How would you filter to top 5 per arrondissement?
LIMIT 100;
Note: In PostgreSQL, you can use window functions directly in WHERE clause with some limitations. More commonly, you'd use a subquery:
SELECT * FROM (
SELECT
l.arrondissement,
tn.name AS tree_name,
t.height,
RANK() OVER(PARTITION BY l.arrondissement ORDER BY t.height DESC) AS height_rank
FROM trees t
JOIN locations l ON t.location_id = l.id
JOIN taxonomy tax ON t.taxonomy_id = tax.id
LEFT JOIN tree_names tn ON tax.name_id = tn.id
WHERE t.height > 0 AND l.arrondissement IS NOT NULL
) ranked
WHERE height_rank <= 5
ORDER BY arrondissement, height_rank;
Key Insight: This gives you exactly 5 trees per arrondissement (or fewer if ties exist with RANK).
Why: Understanding when to use ROW_NUMBER vs RANK is crucial. Let's see them side-by-side with trees that have identical heights.
Goal: Find trees with height = 20 meters (a common height) and compare how ROW_NUMBER and RANK behave.
Hint:
Query Structure:
SELECT
t.id,
tn.name,
t.height,
ROW_NUMBER() OVER(ORDER BY ...) AS row_num,
RANK() OVER(ORDER BY ...) AS rank_num,
DENSE_RANK() OVER(ORDER BY ...) AS dense_rank_num
FROM trees t
JOIN taxonomy tax ON ...
LEFT JOIN tree_names tn ON ...
WHERE t.height = 20
LIMIT 30;
Key Insight:
Since all have height=20, we order by circumference as a tiebreaker to see the differences.
Why: Which tree species are most common in Paris? Let's rank species by count across the entire city.
Goal: Count trees per species and rank species by popularity.
Hint:
Query Structure:
SELECT DISTINCT
ts.species,
COUNT(*) OVER(PARTITION BY ...) AS tree_count,
RANK() OVER(ORDER BY COUNT(*) OVER(...) DESC) AS popularity_rank
FROM trees t
JOIN taxonomy tax ON ...
JOIN tree_species ts ON ...
WHERE ts.species IS NOT NULL
ORDER BY popularity_rank
LIMIT 20;
Key Insight: You'll see which species dominate Paris's urban forest. The top species likely make up a large percentage of all trees.
Why: When trees are ordered by height, what are the gaps between sizes? Are there natural groupings or sudden jumps?
Goal: Order trees by height and show the difference from the previous tree's height.
Hint:
Query Structure:
SELECT
id,
height,
LAG(height) OVER(ORDER BY height) AS previous_height,
height - LAG(...) AS height_jump
FROM trees
WHERE height > 5
ORDER BY height
LIMIT 100;
Key Insight: You can see how height increases gradually. Large jumps might indicate measurement groupings or data entry patterns.
Why: Within a species, how do individual trees compare? Is there one exceptionally tall specimen?
Goal: For a specific species, show each tree and how it compares to the next smaller and larger tree of the same species.
Hint:
Query Structure:
SELECT
t.id,
ts.species,
t.height,
LAG(t.height) OVER(PARTITION BY ... ORDER BY t.height DESC) AS taller_tree_height,
LEAD(t.height) OVER(PARTITION BY ... ORDER BY t.height DESC) AS shorter_tree_height
FROM trees t
JOIN taxonomy tax ON ...
JOIN tree_species ts ON ...
WHERE ts.species = 'platanifolia' -- or another common species
AND t.height > 0
ORDER BY t.height DESC
LIMIT 50;
Key Insight: You can see the size range within a species and identify champion trees.
Why: The first row in each partition has no previous row. Let's handle this gracefully using LAG's default parameter.
Goal: Compare each tree to the previous tree in its arrondissement, but for the first tree, use its own height as default.
Hint:
Query Structure:
SELECT
l.arrondissement,
t.id,
t.height,
LAG(t.height, 1, t.height) OVER(PARTITION BY ... ORDER BY ...) AS prev_height_or_self,
t.height - LAG(t.height, 1, t.height) OVER(...) AS height_difference
FROM trees t
JOIN locations l ON ...
WHERE t.height > 0 AND l.arrondissement IS NOT NULL
ORDER BY l.arrondissement, t.height DESC
LIMIT 100;
Key Insight: The tallest tree in each arrondissement shows height_difference = 0 (comparing to itself). All others show the gap to the next taller tree.
Why: Some trees are marked as "remarkable" - presumably due to age, size, or historical significance. Are remarkable trees actually bigger?
Goal: Compare statistics between remarkable and non-remarkable trees within each arrondissement.
Hint:
Query Structure:
SELECT DISTINCT
l.arrondissement,
t.remarkable,
COUNT(*) OVER(PARTITION BY l.arrondissement, t.remarkable) AS count_in_group,
ROUND(AVG(t.height) OVER(PARTITION BY l.arrondissement, t.remarkable)::numeric, 2) AS avg_height_group,
ROUND(AVG(t.height) OVER(PARTITION BY l.arrondissement)::numeric, 2) AS avg_height_arrondissement
FROM trees t
JOIN locations l ON t.location_id = l.id
WHERE t.height > 0 AND l.arrondissement IS NOT NULL
ORDER BY l.arrondissement, t.remarkable DESC;
Key Insight: You can see if remarkable trees are indeed larger than average trees in their district. The percentage difference shows the magnitude.
Why: Different domains (parks, streets, gardens) might have trees at different life stages. Understanding this helps with urban planning.
Goal: For each domain-stage combination, show tree counts and typical sizes.
Hint:
Query Structure:
SELECT
td.domain,
COALESCE(ts.stage, 'Unknown') AS stage,
t.height,
COUNT(*) OVER(PARTITION BY t.domain_id, t.stage_id) AS trees_in_group,
ROUND(AVG(t.height) OVER(PARTITION BY t.domain_id, t.stage_id)::numeric, 2) AS avg_height_group,
PERCENT_RANK() OVER(PARTITION BY t.domain_id, t.stage_id ORDER BY t.height) AS percentile_in_group
FROM trees t
LEFT JOIN tree_domains td ON ...
LEFT JOIN tree_stages ts ON ...
WHERE t.height > 0
ORDER BY td.domain, ts.stage, t.height DESC
LIMIT 200;
Key Insight: You'll see different patterns - for example, "Adulte" trees in parks might be larger than "Adulte" trees on streets due to space constraints.
Why: Biodiversity is important for urban ecosystems. Which arrondissements have the most diverse tree populations?
Goal: Count distinct species per arrondissement and rank arrondissements by diversity.
Hint:
Query Structure:
-- First, count trees per species per arrondissement
-- Then aggregate to show species count per arrondissement
SELECT DISTINCT
l.arrondissement,
COUNT(DISTINCT tax.species_id) OVER(PARTITION BY l.arrondissement) AS species_diversity,
COUNT(*) OVER(PARTITION BY l.arrondissement) AS total_trees,
RANK() OVER(ORDER BY COUNT(DISTINCT tax.species_id) OVER(PARTITION BY l.arrondissement) DESC) AS diversity_rank
FROM trees t
JOIN locations l ON ...
JOIN taxonomy tax ON ...
WHERE l.arrondissement IS NOT NULL
AND tax.species_id IS NOT NULL
ORDER BY diversity_rank;
This is challenging with pure window functions. Here's a practical approach:
-- Using a subquery approach since COUNT(DISTINCT) doesn't work in window functions
SELECT
arrondissement,
species_count,
total_trees,
RANK() OVER(ORDER BY species_count DESC) AS diversity_rank,
ROUND(species_count::numeric / total_trees * 100, 2) AS species_per_100_trees
FROM (
SELECT
l.arrondissement,
COUNT(DISTINCT tax.species_id) AS species_count,
COUNT(*) AS total_trees
FROM trees t
JOIN locations l ON t.location_id = l.id
JOIN taxonomy tax ON t.taxonomy_id = tax.id
WHERE l.arrondissement IS NOT NULL
AND tax.species_id IS NOT NULL
GROUP BY l.arrondissement
) diversity_stats
ORDER BY diversity_rank;
Alternative using window functions more directly:
SELECT DISTINCT
l.arrondissement,
COUNT(*) OVER(PARTITION BY l.arrondissement) AS total_trees,
-- Count unique species by counting distinct species_id appearances
(SELECT COUNT(DISTINCT tax2.species_id)
FROM trees t2
JOIN locations l2 ON t2.location_id = l2.id
JOIN taxonomy tax2 ON t2.taxonomy_id = tax2.id
WHERE l2.arrondissement = l.arrondissement
AND tax2.species_id IS NOT NULL
) AS species_count,
RANK() OVER(ORDER BY
(SELECT COUNT(DISTINCT tax2.species_id)
FROM trees t2
JOIN locations l2 ON t2.location_id = l2.id
JOIN taxonomy tax2 ON t2.taxonomy_id = tax2.id
WHERE l2.arrondissement = l.arrondissement
AND tax2.species_id IS NOT NULL
) DESC
) AS diversity_rank
FROM trees t
JOIN locations l ON t.location_id = l.id
JOIN taxonomy tax ON t.taxonomy_id = tax.id
WHERE l.arrondissement IS NOT NULL
ORDER BY diversity_rank;
Key Insight: This shows the limitation of window functions - they can't do COUNT(DISTINCT) easily. The first solution using GROUP BY in a subquery is cleaner. This demonstrates when NOT to force window functions!
Why: Urban planners need to know which tree genres (like Platanus, Acer, Tilia) thrive in different environments (parks vs streets).
Goal: For each genre-domain combination, calculate average size and compare it to the genre's overall average.
Hint:
Query Structure:
SELECT DISTINCT
tg.genre,
td.domain,
COUNT(*) OVER(PARTITION BY tax.genre_id, t.domain_id) AS trees_in_combo,
ROUND(AVG(t.height) OVER(PARTITION BY tax.genre_id, t.domain_id)::numeric, 2) AS avg_height_combo,
ROUND(AVG(t.height) OVER(PARTITION BY tax.genre_id)::numeric, 2) AS avg_height_genre_overall,
-- Calculate percentage difference
FROM trees t
JOIN taxonomy tax ON ...
LEFT JOIN tree_genres tg ON ...
LEFT JOIN tree_domains td ON ...
WHERE t.height > 0
AND tg.genre IS NOT NULL
AND td.domain IS NOT NULL
ORDER BY tg.genre, avg_height_combo DESC;
Key Insight: You'll see that some tree genres grow much better in parks than on streets, showing positive percentage differences. This is actionable data for urban planning.
Why: Instead of looking at individual tree heights, let's smooth the data to see trends. A moving average of 100 trees shows the general pattern.
Goal: Calculate a 100-tree moving average of heights to see how tree size changes across the ordered dataset.
Hint:
Query Structure:
SELECT
id,
height,
ROUND(AVG(height) OVER(
ORDER BY height
ROWS BETWEEN 49 PRECEDING AND 50 FOLLOWING
)::numeric, 2) AS moving_avg_100,
ROW_NUMBER() OVER(ORDER BY height) AS position
FROM trees
WHERE height > 0
-- Sample evenly across distribution
ORDER BY height
LIMIT 1000;
Key Insight: The moving average smooths out individual variations and shows the overall distribution curve. Trees far from the moving average are unusual for their size category.
Why: What percentage of trees are shorter than 10 meters? Than 20 meters? A cumulative distribution answers this.
Goal: For each height value, calculate what percentage of trees are at or below that height.
Hint:
Query Structure:
SELECT
height,
COUNT(*) OVER(
ORDER BY height
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS trees_up_to_height,
COUNT(*) OVER() AS total_trees,
ROUND(
COUNT(*) OVER(ORDER BY height ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)::numeric /
COUNT(*) OVER() * 100,
2
) AS cumulative_pct
FROM trees
WHERE height > 0
ORDER BY height;
Key Insight: This shows you can answer questions like "50% of Paris trees are shorter than X meters". Very useful for understanding the dataset distribution.
Why: We want to understand tree size distribution at multiple levels: overall, by domain, and by arrondissement.
Goal: Assign each tree to a quartile in three different contexts simultaneously.
Hint:
Query Structure:
SELECT
t.id,
l.arrondissement,
td.domain,
t.height,
NTILE(4) OVER(ORDER BY t.height) AS overall_quartile,
NTILE(4) OVER(PARTITION BY t.domain_id ORDER BY t.height) AS domain_quartile,
NTILE(4) OVER(PARTITION BY l.arrondissement ORDER BY t.height) AS arr_quartile
FROM trees t
JOIN locations l ON ...
LEFT JOIN tree_domains td ON ...
WHERE t.height > 0
AND l.arrondissement IS NOT NULL
ORDER BY t.height DESC
LIMIT 100;
Key Insight: A tree might be in the top quartile citywide but only average in its domain (e.g., a tall street tree that's average for a park). This multi-dimensional view reveals context-dependent rankings.
Why: Some species might have one exceptional specimen that's much larger than all others of its kind. These "champion trees" are botanically interesting.
Goal: For each species, find trees that are exceptionally tall compared to others of their species.
Hint:
Query Structure:
SELECT
ts.species,
t.id,
t.height,
RANK() OVER(PARTITION BY tax.species_id ORDER BY t.height DESC) AS rank_in_species,
MAX(t.height) OVER(PARTITION BY tax.species_id) AS tallest_in_species,
t.height - LAG(t.height) OVER(PARTITION BY tax.species_id ORDER BY t.height DESC) AS gap_to_next_tallest,
ROUND(AVG(t.height) OVER(PARTITION BY tax.species_id)::numeric, 2) AS species_avg_height
FROM trees t
JOIN taxonomy tax ON ...
JOIN tree_species ts ON ...
WHERE t.height > 0
AND ts.species IS NOT NULL
ORDER BY gap_to_next_tallest DESC NULLS LAST
LIMIT 50;
Key Insight: This identifies remarkable individual trees that are far superior to others of their species - potential candidates for heritage tree status or special protection.
After completing these exercises, consider:
When would GROUP BY be better than window functions?
What are the limitations of window functions you discovered?
How do NULL values affect window function results?
Performance considerations:
Try these on your own:
Density analysis: Trees per square kilometer by arrondissement (you'll need to research arrondissement sizes)
Age estimation: If height correlates with age, estimate relative ages within species
Maintenance prioritization: Identify arrondissements with many old (large circumference) trees that might need more care
Biodiversity scoring: Create a composite score combining species diversity, size distribution, and remarkable tree density
Climate adaptation: Which species have the most consistent growth (low standard deviation) vs highly variable growth - might indicate climate sensitivity