We'll work on the normalized version of the trees db
download the sql backup from https://skatai.com/assets/db/treesdb_v03_normalized.sql
restore the database
from the terminal
psql -d postgres -f <path to the sql backup>/treesdb_v03_normalized.sql
or within a PSQL session
\i <path to the sql backup>/treesdb_v03_normalized.sql
This will create a new database called treesdb_v03_normalized
Check out the tables, make a few queries to get a sense of the data.
The ERD is
This comprehensive series of exercises will help you master window functions in PostgreSQL using a real-world dataset of trees in Paris.
important: all results are obtained after ordering by random() and limiting to a few rows. all your queries should also be ordered by random() and limited to a few rows.
Goal: For each tree, show its arrondissement and the total count of trees in that arrondissement.
Concepts: Basic COUNT() window function with OVER(PARTITION BY)
Step-by-step guide:
trees tablelocations to get arrondissement informationCOUNT(*) OVER (PARTITION BY arrondissement) to count trees per arrondissementHints:
GROUP BY doesPARTITION BY clause divides the data into groups for the window functionFor comparison - GROUP BY approach:
SELECT
l.arrondissement,
COUNT(*) as tree_count
FROM trees t
JOIN locations l ON t.location_id = l.id
WHERE l.arrondissement IS NOT NULL
GROUP BY l.arrondissement
ORDER BY tree_count DESC;
results
id | arrondissement | trees_in_arrondissement
--------+-------------------+-------------------------
108631 | PARIS 16E ARRDT | 17128
208989 | PARIS 1ER ARRDT | 1645
18713 | PARIS 13E ARRDT | 17462
159547 | PARIS 19E ARRDT | 15235
Goal: For each tree, show how its height compares to the maximum height in its arrondissement.
Concepts: MAX() window function, comparing values to aggregates
Step-by-step guide:
trees and locations tablesMAX(height) OVER (PARTITION BY arrondissement) to find the tallest tree per areaHints:
height - MAX(height) OVER (...) tells youto get a better view of the results, you can limit the number of rows returned and order by random()
ORDER BY random() <other order by> limit 20;
results
id | arrondissement | height | max_height_in_area | diff_from_max
--------+-------------------+--------+--------------------+---------------
5422 | PARIS 11E ARRDT | 35 | 35 | 0
133998 | PARIS 20E ARRDT | 15 | 30 | -15
61004 | BOIS DE VINCENNES | 10 | 120 | -110
86518 | PARIS 20E ARRDT | 12 | 30 | -18
Here's the corrected query for Exercise 3:
Goal: Identify genres with missing height data by showing the count and percentage of trees with missing (0) heights.
Concepts: COUNT() with conditions, calculating percentages with window functions
Step-by-step guide:
COUNT(*) OVER (PARTITION BY genre) for total trees per genreCOUNT(CASE WHEN height > 0 THEN 1 END) OVER (PARTITION BY genre) for trees with valid heightsHints:
COUNT(CASE WHEN height > 0 THEN 1 END) only counts trees with actual height valuesCOUNT(*) counts all rowsDISTINCT in the outer query to avoid repeated calculationsAlternative approach (more explicit):
SELECT DISTINCT
tg.genre,
COUNT(*) OVER (PARTITION BY tg.genre) as total_trees,
COUNT(*) OVER (PARTITION BY tg.genre) - COUNT(CASE WHEN t.height = 0 THEN 1 END) OVER (PARTITION BY tg.genre) as trees_with_height,
COUNT(CASE WHEN t.height = 0 THEN 1 END) OVER (PARTITION BY tg.genre) as missing_height,
ROUND(100.0 * COUNT(CASE WHEN t.height = 0 THEN 1 END) OVER (PARTITION BY tg.genre) /
COUNT(*) OVER (PARTITION BY tg.genre), 2) as pct_missing
FROM trees t
JOIN taxonomy tx ON t.taxonomy_id = tx.id
JOIN tree_genres tg ON tx.genre_id = tg.id
ORDER BY pct_missing DESC;
results
genre | total_trees | trees_with_height | missing_height | pct_missing
-------------------+-------------+-------------------+----------------+-------------
Rhododendron | 8 | 0 | 8 | 100.00
Podocarpus | 2 | 0 | 2 | 100.00
Tsuga | 1 | 0 | 1 | 100.00
Cephalotaxus | 1 | 0 | 1 | 100.00
Juniperus | 5 | 0 | 5 | 100.00
Cercidiphyllum | 8 | 1 | 7 | 87.50
Note: a SQL query with
GROUP BYwould be simpler,
For instance
SELECT
tg.genre,
COUNT(*) as total_trees,
COUNT(CASE WHEN t.height > 0 THEN 1 END) as trees_with_height,
COUNT(CASE WHEN t.height = 0 THEN 1 END) as missing_height,
ROUND(100.0 * COUNT(CASE WHEN t.height = 0 THEN 1 END) / COUNT(*), 2) as pct_missing
FROM trees t
JOIN taxonomy tx ON t.taxonomy_id = tx.id
JOIN tree_genres tg ON tx.genre_id = tg.id
GROUP BY tg.genre
ORDER BY pct_missing DESC;
Goal: Assign a sequential ranking to trees by height within each domain.
Concepts: ROW_NUMBER(), ordering within partitions
Step-by-step guide:
ROW_NUMBER() OVER (PARTITION BY domain ORDER BY height DESC)Hints:
ROW_NUMBER() always gives unique sequential numbers (1, 2, 3, ...)ORDER BY height DESC puts tallest trees firstQuery Skeleton:
SELECT * FROM (
SELECT
...,
ROW_NUMBER() OVER (...) as rn
FROM ...
WHERE ...
) subquery
WHERE rn <= 5;
results
id | domain | height | height_rank
--------+--------------+--------+-------------
172122 | Alignement | 2524 | 1
152402 | Alignement | 2019 | 2
166842 | Alignement | 720 | 3
168045 | Alignement | 225 | 4
93247 | Alignement | 200 | 5
88686 | CIMETIERE | 119 | 1
91418 | CIMETIERE | 116 | 2
91576 | CIMETIERE | 35 | 3
Goal: Compare RANK() and DENSE_RANK() behavior when trees have the same circumference within each arrondissement.
Concepts: RANK(), DENSE_RANK(), understanding ties
Step-by-step guide:
RANK() and DENSE_RANK() with the same orderingHints:
RANK() leaves gaps after ties: 1, 2, 2, 4, 5DENSE_RANK() has no gaps: 1, 2, 2, 3, 4Goal: Identify unusually tall or short trees by comparing them to statistical measures within their species.
Concepts: AVG(), STDDEV() window functions, statistical outlier detection
Step-by-step guide:
AVG(height) and STDDEV(height) per species using window functions(height - avg) / stddevHints:
ABS() function gets absolute valueQuery Skeleton:
SELECT * FROM (
SELECT
...,
AVG(...) OVER (PARTITION BY ...) as avg_height,
STDDEV(...) OVER (PARTITION BY ...) as stddev_height,
(height - AVG(...) OVER (...)) / STDDEV(...) OVER (...) as z_score
FROM ...
WHERE ...
) stats
WHERE ABS(z_score) > 2
AND stddev_height IS NOT NULL;
Goal: Calculate what percentage of trees in each domain have a smaller diameter than each tree.
Concepts: PERCENT_RANK(), CUME_DIST(), understanding percentiles
Step-by-step guide:
PERCENT_RANK() to get relative rank (0 to 1)CUME_DIST() to get cumulative distributionHints:
PERCENT_RANK() returns 0 for the smallest value, 1 for the largestCUME_DIST() returns the fraction of rows <= current rowGoal: Create a two-level ranking showing the tallest trees of each species within each arrondissement.
Concepts: Multiple window functions, complex partitioning
Step-by-step guide:
Hints:
OVER() clauses with different PARTITION BY conditionsPARTITION BY arrondissement, speciesPARTITION BY arrondissementQuery Skeleton:
SELECT * FROM (
SELECT
...,
ROW_NUMBER() OVER (PARTITION BY arr, species ORDER BY height DESC) as rank_in_species,
ROW_NUMBER() OVER (PARTITION BY arr ORDER BY height DESC) as rank_in_area
FROM ...
JOIN ... ON ...
WHERE ...
) ranked
WHERE rank_in_species <= 3;
Goal: Analyze the distribution of the remarkable flag, showing how imbalanced it is across different domains and stages.
Concepts: Conditional aggregation with window functions, handling boolean/NULL values
Step-by-step guide:
remarkable = trueremarkable IS NULLDISTINCT to avoid row repetitionHints:
COUNT(CASE WHEN remarkable = true THEN 1 END) counts only TRUE valuesCOUNT(CASE WHEN remarkable IS NULL THEN 1 END) counts NULLsQuery Skeleton:
SELECT DISTINCT
domain,
stage,
COUNT(*) OVER (PARTITION BY ...) as total,
COUNT(CASE WHEN ... THEN 1 END) OVER (PARTITION BY ...) as remarkable_count,
COUNT(CASE WHEN ... IS NULL THEN 1 END) OVER (PARTITION BY ...) as null_count,
-- percentages
FROM ...
ORDER BY ...;
Goal: Create a running total showing how many trees exist at or below each circumference threshold within each genus.
Concepts: SUM() with window frames, running totals, ROWS BETWEEN
Step-by-step guide:
SUM(1) OVER (PARTITION BY genus ORDER BY circumference ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)Hints:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW means "from start to here"SUM(1) is the same as COUNT(*)Query Skeleton:
SELECT
...,
SUM(1) OVER (PARTITION BY ...
ORDER BY ...
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total
FROM ...
WHERE ...
ORDER BY ...;
Goal: For trees ordered by height within each species, compare each tree to the previous and next tree.
Concepts: LAG(), LEAD(), accessing adjacent rows
Step-by-step guide:
LAG(height) OVER (PARTITION BY species ORDER BY height) to get previous tree's heightLEAD(height) OVER (PARTITION BY species ORDER BY height) to get next tree's heightHints:
LAG(column, offset) looks back offset rows (default is 1)LEAD(column, offset) looks forward offset rowsGoal: Analyze the confusing stage labels by showing how stage distribution varies across domains and highlighting unusual patterns.
Concepts: Multiple window functions, conditional logic, complex filtering
Step-by-step guide:
Hints:
Query Skeleton:
SELECT DISTINCT
domain,
stage,
COUNT(*) OVER (PARTITION BY domain, stage) as count_domain_stage,
COUNT(*) OVER (PARTITION BY domain) as count_domain,
COUNT(*) OVER (PARTITION BY stage) as count_stage,
-- percentages and other metrics
FROM ...
ORDER BY ...;
Goal: For each arrondissement, find the top 3 most common tree varieties and provide comprehensive statistics including rankings, percentages, and comparisons.
Concepts: Everything combined - multiple window functions, rankings, aggregations, complex joins
Step-by-step guide:
Hints:
DENSE_RANK() to handle ties appropriatelyQuery Skeleton:
SELECT * FROM (
SELECT
arrondissement,
variety,
COUNT(*) OVER (PARTITION BY arrondissement, variety) as variety_count,
DENSE_RANK() OVER (PARTITION BY arrondissement ORDER BY COUNT(*) OVER (PARTITION BY arrondissement, variety) DESC) as variety_rank,
AVG(...) OVER (PARTITION BY arrondissement, variety) as avg_stat,
COUNT(*) OVER (PARTITION BY arrondissement) as total_in_arrond,
-- more statistics
FROM trees t
JOIN ... ON ...
WHERE ...
) ranked_varieties
WHERE variety_rank <= 3
ORDER BY ...;
Goal: Divide trees within each genus into 4 equal groups (quartiles) based on height to understand distribution.
Concepts: NTILE(), quantile analysis
Step-by-step guide:
NTILE(4) to divide trees into 4 groupsHints:
NTILE(n) divides rows into n approximately equal groupsCOUNT(), SUM(), AVG(), MAX(), MIN(), STDDEV()ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE()PERCENT_RANK(), CUME_DIST()LAG(), LEAD()COUNT(CASE WHEN ... END)ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW