We'll work on the normalized version of the trees db
download the sql backup from https://skatai.com/assets/db/treesdb_v03_normalized.sql
restore the database
from the terminal
psql -d postgres -f <path to the sql backup>/treesdb_v03_normalized.sql
or within a PSQL session
\i <path to the sql backup>/treesdb_v03_normalized.sql
This will create a new database called treesdb_v03_normalized
Check out the tables, make a few queries to get a sense of the data.
The ERD is
This comprehensive series of exercises will help you master window functions in PostgreSQL using a real-world dataset of trees in Paris.
important: all results are obtained after ordering by random() and limiting to a few rows. all your queries should also be ordered by random() and limited to a few rows.
Goal: For each tree, show its arrondissement and the total count of trees in that arrondissement.
Concepts: Basic COUNT() window function with OVER(PARTITION BY)
Step-by-step guide:
trees tablelocations to get arrondissement informationCOUNT(*) OVER (PARTITION BY arrondissement) to count trees per arrondissementHints:
GROUP BY doesPARTITION BY clause divides the data into groups for the window functionFor comparison - GROUP BY approach:
results
id | arrondissement | trees_in_arrondissement
--------+-------------------+-------------------------
108631 | PARIS 16E ARRDT | 17128
208989 | PARIS 1ER ARRDT | 1645
18713 | PARIS 13E ARRDT | 17462
159547 | PARIS 19E ARRDT | 15235
Solution:
SELECT
t.id,
l.arrondissement,
COUNT(*) OVER (PARTITION BY l.arrondissement) as trees_in_arrondissement
FROM trees t
JOIN locations l ON t.location_id = l.id
WHERE l.arrondissement IS NOT NULL
ORDER BY trees_in_arrondissement DESC, l.arrondissement;
SELECT
l.arrondissement,
COUNT(*) as tree_count
FROM trees t
JOIN locations l ON t.location_id = l.id
WHERE l.arrondissement IS NOT NULL
GROUP BY l.arrondissement
ORDER BY tree_count DESC;
Goal: For each tree, show how its height compares to the maximum height in its arrondissement.
Concepts: MAX() window function, comparing values to aggregates
Step-by-step guide:
trees and locations tablesMAX(height) OVER (PARTITION BY arrondissement) to find the tallest tree per areaHints:
height - MAX(height) OVER (...) tells youto get a better view of the results, you can limit the number of rows returned and order by random()
ORDER BY random() <other order by> limit 20;
Solution:
SELECT
t.id,
l.arrondissement,
t.height,
MAX(t.height) OVER (PARTITION BY l.arrondissement) as max_height_in_area,
t.height - MAX(t.height) OVER (PARTITION BY l.arrondissement) as diff_from_max
FROM trees t
JOIN locations l ON t.location_id = l.id
WHERE t.height IS NOT NULL
AND l.arrondissement IS NOT NULL
ORDER BY random() , l.arrondissement, diff_from_max DESC limit 20;
results
id | arrondissement | height | max_height_in_area | diff_from_max
--------+-------------------+--------+--------------------+---------------
5422 | PARIS 11E ARRDT | 35 | 35 | 0
133998 | PARIS 20E ARRDT | 15 | 30 | -15
61004 | BOIS DE VINCENNES | 10 | 120 | -110
86518 | PARIS 20E ARRDT | 12 | 30 | -18
Here's the corrected query for Exercise 3:
Goal: Identify genres with missing height data by showing the count and percentage of trees with missing (0) heights.
Concepts: COUNT() with conditions, calculating percentages with window functions
Step-by-step guide:
COUNT(*) OVER (PARTITION BY genre) for total trees per genreCOUNT(CASE WHEN height > 0 THEN 1 END) OVER (PARTITION BY genre) for trees with valid heightsHints:
COUNT(CASE WHEN height > 0 THEN 1 END) only counts trees with actual height valuesCOUNT(*) counts all rowsDISTINCT in the outer query to avoid repeated calculationsSolution:
SELECT DISTINCT
tg.genre,
COUNT(*) OVER (PARTITION BY tg.genre) as total_trees,
COUNT(CASE WHEN t.height > 0 THEN 1 END) OVER (PARTITION BY tg.genre) as trees_with_height,
COUNT(CASE WHEN t.height = 0 THEN 1 END) OVER (PARTITION BY tg.genre) as missing_height,
ROUND(100.0 * COUNT(CASE WHEN t.height = 0 THEN 1 END) OVER (PARTITION BY tg.genre) /
COUNT(*) OVER (PARTITION BY tg.genre), 2) as pct_missing
FROM trees t
JOIN taxonomy tx ON t.taxonomy_id = tx.id
JOIN tree_genres tg ON tx.genre_id = tg.id
ORDER BY pct_missing DESC;
Alternative approach (more explicit):
SELECT DISTINCT
tg.genre,
COUNT(*) OVER (PARTITION BY tg.genre) as total_trees,
COUNT(*) OVER (PARTITION BY tg.genre) - COUNT(CASE WHEN t.height = 0 THEN 1 END) OVER (PARTITION BY tg.genre) as trees_with_height,
COUNT(CASE WHEN t.height = 0 THEN 1 END) OVER (PARTITION BY tg.genre) as missing_height,
ROUND(100.0 * COUNT(CASE WHEN t.height = 0 THEN 1 END) OVER (PARTITION BY tg.genre) /
COUNT(*) OVER (PARTITION BY tg.genre), 2) as pct_missing
FROM trees t
JOIN taxonomy tx ON t.taxonomy_id = tx.id
JOIN tree_genres tg ON tx.genre_id = tg.id
ORDER BY pct_missing DESC;
results
genre | total_trees | trees_with_height | missing_height | pct_missing
-------------------+-------------+-------------------+----------------+-------------
Rhododendron | 8 | 0 | 8 | 100.00
Podocarpus | 2 | 0 | 2 | 100.00
Tsuga | 1 | 0 | 1 | 100.00
Cephalotaxus | 1 | 0 | 1 | 100.00
Juniperus | 5 | 0 | 5 | 100.00
Cercidiphyllum | 8 | 1 | 7 | 87.50
Note: a SQL query with
GROUP BYwould be simpler,
For instance
SELECT
tg.genre,
COUNT(*) as total_trees,
COUNT(CASE WHEN t.height > 0 THEN 1 END) as trees_with_height,
COUNT(CASE WHEN t.height = 0 THEN 1 END) as missing_height,
ROUND(100.0 * COUNT(CASE WHEN t.height = 0 THEN 1 END) / COUNT(*), 2) as pct_missing
FROM trees t
JOIN taxonomy tx ON t.taxonomy_id = tx.id
JOIN tree_genres tg ON tx.genre_id = tg.id
GROUP BY tg.genre
ORDER BY pct_missing DESC;
Goal: Assign a sequential ranking to trees by height within each domain.
Concepts: ROW_NUMBER(), ordering within partitions
Step-by-step guide:
ROW_NUMBER() OVER (PARTITION BY domain ORDER BY height DESC)Hints:
ROW_NUMBER() always gives unique sequential numbers (1, 2, 3, ...)ORDER BY height DESC puts tallest trees firstQuery Skeleton:
SELECT * FROM (
SELECT
...,
ROW_NUMBER() OVER (...) as rn
FROM ...
WHERE ...
) subquery
WHERE rn <= 5;
Solution:
SELECT * FROM (
SELECT
t.id,
td.domain,
t.height,
ROW_NUMBER() OVER (PARTITION BY td.domain ORDER BY t.height DESC) as height_rank
FROM trees t
JOIN tree_domains td ON t.domain_id = td.id
WHERE t.height IS NOT NULL
AND td.domain IS NOT NULL
) ranked
WHERE height_rank <= 5
ORDER BY domain, height_rank;
results
id | domain | height | height_rank
--------+--------------+--------+-------------
172122 | Alignement | 2524 | 1
152402 | Alignement | 2019 | 2
166842 | Alignement | 720 | 3
168045 | Alignement | 225 | 4
93247 | Alignement | 200 | 5
88686 | CIMETIERE | 119 | 1
91418 | CIMETIERE | 116 | 2
91576 | CIMETIERE | 35 | 3
Goal: Compare RANK() and DENSE_RANK() behavior when trees have the same circumference within each arrondissement.
Concepts: RANK(), DENSE_RANK(), understanding ties
Step-by-step guide:
RANK() and DENSE_RANK() with the same orderingHints:
RANK() leaves gaps after ties: 1, 2, 2, 4, 5DENSE_RANK() has no gaps: 1, 2, 2, 3, 4Solution:
SELECT
t.id,
l.arrondissement,
t.circumference,
RANK() OVER (PARTITION BY l.arrondissement ORDER BY t.circumference DESC) as rank_with_gaps,
DENSE_RANK() OVER (PARTITION BY l.arrondissement ORDER BY t.circumference DESC) as dense_rank_no_gaps,
RANK() OVER (PARTITION BY l.arrondissement ORDER BY t.circumference DESC) -
DENSE_RANK() OVER (PARTITION BY l.arrondissement ORDER BY t.circumference DESC) as gap_size
FROM trees t
JOIN locations l ON t.location_id = l.id
WHERE t.circumference IS NOT NULL
AND l.arrondissement IS NOT NULL
ORDER BY l.arrondissement, t.circumference DESC
LIMIT 100;
Goal: Identify unusually tall or short trees by comparing them to statistical measures within their species.
Concepts: AVG(), STDDEV() window functions, statistical outlier detection
Step-by-step guide:
AVG(height) and STDDEV(height) per species using window functions(height - avg) / stddevHints:
ABS() function gets absolute valueQuery Skeleton:
SELECT * FROM (
SELECT
...,
AVG(...) OVER (PARTITION BY ...) as avg_height,
STDDEV(...) OVER (PARTITION BY ...) as stddev_height,
(height - AVG(...) OVER (...)) / STDDEV(...) OVER (...) as z_score
FROM ...
WHERE ...
) stats
WHERE ABS(z_score) > 2
AND stddev_height IS NOT NULL;
Solution:
SELECT * FROM (
SELECT
t.id,
ts.species,
t.height,
AVG(t.height) OVER (PARTITION BY ts.species) as avg_species_height,
STDDEV(t.height) OVER (PARTITION BY ts.species) as stddev_species_height,
COUNT(*) OVER (PARTITION BY ts.species) as species_count,
(t.height - AVG(t.height) OVER (PARTITION BY ts.species)) /
NULLIF(STDDEV(t.height) OVER (PARTITION BY ts.species), 0) as z_score
FROM trees t
JOIN taxonomy tx ON t.taxonomy_id = tx.id
JOIN tree_species ts ON tx.species_id = ts.id
WHERE t.height IS NOT NULL
) outlier_analysis
WHERE ABS(z_score) > 2
AND stddev_species_height IS NOT NULL
AND species_count >= 30
ORDER BY ABS(z_score) DESC;
Goal: Calculate what percentage of trees in each domain have a smaller diameter than each tree.
Concepts: PERCENT_RANK(), CUME_DIST(), understanding percentiles
Step-by-step guide:
PERCENT_RANK() to get relative rank (0 to 1)CUME_DIST() to get cumulative distributionHints:
PERCENT_RANK() returns 0 for the smallest value, 1 for the largestCUME_DIST() returns the fraction of rows <= current rowSolution:
SELECT
t.id,
td.domain,
t.diameter,
PERCENT_RANK() OVER (PARTITION BY td.domain ORDER BY t.diameter) as pct_rank,
ROUND(100 * PERCENT_RANK() OVER (PARTITION BY td.domain ORDER BY t.diameter), 2) as percentile,
CUME_DIST() OVER (PARTITION BY td.domain ORDER BY t.diameter) as cum_dist,
ROUND(100 * CUME_DIST() OVER (PARTITION BY td.domain ORDER BY t.diameter), 2) as pct_at_or_below
FROM trees t
JOIN tree_domains td ON t.domain_id = td.id
WHERE t.diameter IS NOT NULL
AND td.domain IS NOT NULL
ORDER BY td.domain, t.diameter DESC
LIMIT 200;
Goal: Create a two-level ranking showing the tallest trees of each species within each arrondissement.
Concepts: Multiple window functions, complex partitioning
Step-by-step guide:
Hints:
OVER() clauses with different PARTITION BY conditionsPARTITION BY arrondissement, speciesPARTITION BY arrondissementQuery Skeleton:
SELECT * FROM (
SELECT
...,
ROW_NUMBER() OVER (PARTITION BY arr, species ORDER BY height DESC) as rank_in_species,
ROW_NUMBER() OVER (PARTITION BY arr ORDER BY height DESC) as rank_in_area
FROM ...
JOIN ... ON ...
WHERE ...
) ranked
WHERE rank_in_species <= 3;
Solution:
SELECT * FROM (
SELECT
t.id,
l.arrondissement,
ts.species,
t.height,
ROW_NUMBER() OVER (PARTITION BY l.arrondissement, ts.species
ORDER BY t.height DESC) as rank_within_species,
ROW_NUMBER() OVER (PARTITION BY l.arrondissement
ORDER BY t.height DESC) as rank_in_arrondissement,
COUNT(*) OVER (PARTITION BY l.arrondissement, ts.species) as species_count_in_area
FROM trees t
JOIN locations l ON t.location_id = l.id
JOIN taxonomy tx ON t.taxonomy_id = tx.id
JOIN tree_species ts ON tx.species_id = ts.id
WHERE t.height IS NOT NULL
AND l.arrondissement IS NOT NULL
AND ts.species IS NOT NULL
) ranked_trees
WHERE rank_within_species <= 3
AND species_count_in_area >= 10
ORDER BY arrondissement, species, rank_within_species;
Goal: Analyze the distribution of the remarkable flag, showing how imbalanced it is across different domains and stages.
Concepts: Conditional aggregation with window functions, handling boolean/NULL values
Step-by-step guide:
remarkable = trueremarkable IS NULLDISTINCT to avoid row repetitionHints:
COUNT(CASE WHEN remarkable = true THEN 1 END) counts only TRUE valuesCOUNT(CASE WHEN remarkable IS NULL THEN 1 END) counts NULLsQuery Skeleton:
SELECT DISTINCT
domain,
stage,
COUNT(*) OVER (PARTITION BY ...) as total,
COUNT(CASE WHEN ... THEN 1 END) OVER (PARTITION BY ...) as remarkable_count,
COUNT(CASE WHEN ... IS NULL THEN 1 END) OVER (PARTITION BY ...) as null_count,
-- percentages
FROM ...
ORDER BY ...;
Solution:
SELECT DISTINCT
td.domain,
ts.stage,
COUNT(*) OVER (PARTITION BY td.domain, ts.stage) as total_trees,
COUNT(CASE WHEN t.remarkable = true THEN 1 END) OVER (PARTITION BY td.domain, ts.stage) as remarkable_count,
COUNT(CASE WHEN t.remarkable IS NULL THEN 1 END) OVER (PARTITION BY td.domain, ts.stage) as null_count,
ROUND(100.0 * COUNT(CASE WHEN t.remarkable = true THEN 1 END) OVER (PARTITION BY td.domain, ts.stage) /
COUNT(*) OVER (PARTITION BY td.domain, ts.stage), 2) as pct_remarkable,
ROUND(100.0 * COUNT(CASE WHEN t.remarkable IS NULL THEN 1 END) OVER (PARTITION BY td.domain, ts.stage) /
COUNT(*) OVER (PARTITION BY td.domain, ts.stage), 2) as pct_null
FROM trees t
LEFT JOIN tree_domains td ON t.domain_id = td.id
LEFT JOIN tree_stages ts ON t.stage_id = ts.id
ORDER BY total_trees DESC;
Goal: Create a running total showing how many trees exist at or below each circumference threshold within each genus.
Concepts: SUM() with window frames, running totals, ROWS BETWEEN
Step-by-step guide:
SUM(1) OVER (PARTITION BY genus ORDER BY circumference ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)Hints:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW means "from start to here"SUM(1) is the same as COUNT(*)Query Skeleton:
SELECT
...,
SUM(1) OVER (PARTITION BY ...
ORDER BY ...
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total
FROM ...
WHERE ...
ORDER BY ...;
Solution:
SELECT
t.id,
tg.genre,
t.circumference,
ROUND(t.circumference / 10.0) * 10 as circumference_bucket,
SUM(1) OVER (PARTITION BY tg.genre
ORDER BY t.circumference
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total,
COUNT(*) OVER (PARTITION BY tg.genre) as total_in_genus,
ROUND(100.0 * SUM(1) OVER (PARTITION BY tg.genre
ORDER BY t.circumference
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
COUNT(*) OVER (PARTITION BY tg.genre), 2) as pct_at_or_below
FROM trees t
JOIN taxonomy tx ON t.taxonomy_id = tx.id
JOIN tree_genres tg ON tx.genre_id = tg.id
WHERE t.circumference IS NOT NULL
AND tg.genre IS NOT NULL
ORDER BY tg.genre, t.circumference
LIMIT 500;
Goal: For trees ordered by height within each species, compare each tree to the previous and next tree.
Concepts: LAG(), LEAD(), accessing adjacent rows
Step-by-step guide:
LAG(height) OVER (PARTITION BY species ORDER BY height) to get previous tree's heightLEAD(height) OVER (PARTITION BY species ORDER BY height) to get next tree's heightHints:
LAG(column, offset) looks back offset rows (default is 1)LEAD(column, offset) looks forward offset rowsSolution:
SELECT * FROM (
SELECT
t.id,
ts.species,
t.height,
LAG(t.height) OVER (PARTITION BY ts.species ORDER BY t.height) as prev_height,
LEAD(t.height) OVER (PARTITION BY ts.species ORDER BY t.height) as next_height,
t.height - LAG(t.height) OVER (PARTITION BY ts.species ORDER BY t.height) as gap_from_prev,
LEAD(t.height) OVER (PARTITION BY ts.species ORDER BY t.height) - t.height as gap_to_next,
COUNT(*) OVER (PARTITION BY ts.species) as total_species_trees
FROM trees t
JOIN taxonomy tx ON t.taxonomy_id = tx.id
JOIN tree_species ts ON tx.species_id = ts.id
WHERE t.height IS NOT NULL
AND ts.species IS NOT NULL
) gaps
WHERE total_species_trees >= 50
AND (gap_from_prev > 5 OR gap_to_next > 5)
ORDER BY species, height;
Goal: Analyze the confusing stage labels by showing how stage distribution varies across domains and highlighting unusual patterns.
Concepts: Multiple window functions, conditional logic, complex filtering
Step-by-step guide:
Hints:
Query Skeleton:
SELECT DISTINCT
domain,
stage,
COUNT(*) OVER (PARTITION BY domain, stage) as count_domain_stage,
COUNT(*) OVER (PARTITION BY domain) as count_domain,
COUNT(*) OVER (PARTITION BY stage) as count_stage,
-- percentages and other metrics
FROM ...
ORDER BY ...;
Solution:
SELECT DISTINCT
td.domain,
COALESCE(ts.stage, 'NULL/Missing') as stage,
COUNT(*) OVER (PARTITION BY td.domain, ts.stage) as trees_in_domain_stage,
COUNT(*) OVER (PARTITION BY td.domain) as total_in_domain,
COUNT(*) OVER (PARTITION BY ts.stage) as total_in_stage,
ROUND(100.0 * COUNT(*) OVER (PARTITION BY td.domain, ts.stage) /
COUNT(*) OVER (PARTITION BY td.domain), 2) as pct_of_domain,
ROUND(100.0 * COUNT(*) OVER (PARTITION BY td.domain, ts.stage) /
COUNT(*) OVER (), 2) as pct_of_all_trees,
COUNT(DISTINCT ts.stage) OVER (PARTITION BY td.domain) as distinct_stages_in_domain
FROM trees t
LEFT JOIN tree_domains td ON t.domain_id = td.id
LEFT JOIN tree_stages ts ON t.stage_id = ts.id
WHERE td.domain IS NOT NULL
ORDER BY td.domain, trees_in_domain_stage DESC;
Goal: For each arrondissement, find the top 3 most common tree varieties and provide comprehensive statistics including rankings, percentages, and comparisons.
Concepts: Everything combined - multiple window functions, rankings, aggregations, complex joins
Step-by-step guide:
Hints:
DENSE_RANK() to handle ties appropriatelyQuery Skeleton:
SELECT * FROM (
SELECT
arrondissement,
variety,
COUNT(*) OVER (PARTITION BY arrondissement, variety) as variety_count,
DENSE_RANK() OVER (PARTITION BY arrondissement ORDER BY COUNT(*) OVER (PARTITION BY arrondissement, variety) DESC) as variety_rank,
AVG(...) OVER (PARTITION BY arrondissement, variety) as avg_stat,
COUNT(*) OVER (PARTITION BY arrondissement) as total_in_arrond,
-- more statistics
FROM trees t
JOIN ... ON ...
WHERE ...
) ranked_varieties
WHERE variety_rank <= 3
ORDER BY ...;
Solution:
SELECT * FROM (
SELECT DISTINCT
l.arrondissement,
tv.variety,
COUNT(*) OVER (PARTITION BY l.arrondissement, tv.variety) as variety_count,
DENSE_RANK() OVER (PARTITION BY l.arrondissement
ORDER BY COUNT(*) OVER (PARTITION BY l.arrondissement, tv.variety) DESC) as variety_rank,
COUNT(*) OVER (PARTITION BY l.arrondissement) as total_trees_in_arrond,
ROUND(100.0 * COUNT(*) OVER (PARTITION BY l.arrondissement, tv.variety) /
COUNT(*) OVER (PARTITION BY l.arrondissement), 2) as pct_of_arrondissement,
ROUND(AVG(t.height) OVER (PARTITION BY l.arrondissement, tv.variety), 1) as avg_height,
MAX(t.circumference) OVER (PARTITION BY l.arrondissement, tv.variety) as max_circumference,
COUNT(CASE WHEN t.remarkable = true THEN 1 END) OVER (PARTITION BY l.arrondissement, tv.variety) as remarkable_count
FROM trees t
JOIN locations l ON t.location_id = l.id
JOIN taxonomy tx ON t.taxonomy_id = tx.id
JOIN tree_varieties tv ON tx.variety_id = tv.id
WHERE l.arrondissement IS NOT NULL
AND tv.variety IS NOT NULL
) ranked_varieties
WHERE variety_rank <= 3
ORDER BY arrondissement, variety_rank;
Goal: Divide trees within each genus into 4 equal groups (quartiles) based on height to understand distribution.
Concepts: NTILE(), quantile analysis
Step-by-step guide:
NTILE(4) to divide trees into 4 groupsHints:
NTILE(n) divides rows into n approximately equal groupsSolution:
SELECT DISTINCT
tg.genre,
quartile,
MIN(height) OVER (PARTITION BY tg.genre, quartile) as min_height_in_quartile,
MAX(height) OVER (PARTITION BY tg.genre, quartile) as max_height_in_quartile,
ROUND(AVG(height) OVER (PARTITION BY tg.genre, quartile), 2) as avg_height_in_quartile,
COUNT(*) OVER (PARTITION BY tg.genre, quartile) as trees_in_quartile
FROM (
SELECT
t.*,
tg.genre,
t.height,
NTILE(4) OVER (PARTITION BY tg.genre ORDER BY t.height) as quartile
FROM trees t
JOIN taxonomy tx ON t.taxonomy_id = tx.id
JOIN tree_genres tg ON tx.genre_id = tg.id
WHERE t.height IS NOT NULL
AND tg.genre IS NOT NULL
) quartile_data
ORDER BY genre, quartile;
COUNT(), SUM(), AVG(), MAX(), MIN(), STDDEV()ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE()PERCENT_RANK(), CUME_DIST()LAG(), LEAD()COUNT(CASE WHEN ... END)ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWEach exercise builds on the previous ones, introducing new concepts while reinforcing earlier ones. Practice these queries, experiment with different window function combinations, and try creating your own variations!