You will be working with the New York trees dataset, which contains a census of 65,423 trees in Manhattan. This exam tests your ability to query, analyze, and optimize a PostgreSQL database using intermediate SQL techniques.
Duration: 2 hours (submissions close at 12h00 - no late submissions)
Format: All queries must be submitted through the Google Form along with a database backup file.
If you have not already loaded the dataset, restore it using the SQL file available at: https://skatai.com/assets/db/new_york_trees.sql
In your psql session, run:
\i <path to the sql file>/new_york_trees.sql
This creates:
new_york_trees databasetrees table containing 65,423 tree records| Attribute | Data Type | Description |
|---|---|---|
| tree_id | INTEGER | PRIMARY KEY |
| block_id | INTEGER | Identifier linking each tree to its mapped block |
| created_at | DATE | Date the tree census data was collected |
| tree_dbh | INTEGER | Tree diameter at breast height (~54"/137cm above ground, in inches) |
| stump_diam | INTEGER | Stump diameter measured through center, in inches |
| status | STRING | Tree status: 'Alive', 'Dead', or 'Stump' |
| health | STRING | Perceived health of tree |
| spc_latin | STRING | Scientific species name (e.g., "Acer rubrum") |
| spc_common | STRING | Common species name (e.g., "red maple") |
| user_type | STRING | Category of data collector |
| problems | STRING | Comma-separated list of tree problems |
| root_stone | STRING | Root problem from paving stones ('Yes'/'No') |
| root_grate | STRING | Root problem from metal grates ('Yes'/'No') |
| root_other | STRING | Other root problems ('Yes'/'No') |
| trunk_wire | STRING | Trunk problem from wires/rope ('Yes'/'No') |
| trnk_light | STRING | Trunk problem from installed lighting ('Yes'/'No') |
| trnk_other | STRING | Other trunk problems ('Yes'/'No') |
| brch_light | STRING | Branch problem from lights/wires ('Yes'/'No') |
| brch_shoe | STRING | Branch problem from sneakers ('Yes'/'No') |
| brch_other | STRING | Other branch problems ('Yes'/'No') |
| address | STRING | Nearest estimated address |
| zipcode | INTEGER | Five-digit zipcode |
| zip_city | STRING | City derived from zipcode |
| boroname | STRING | Borough name |
| nta_name | STRING | Neighborhood Tabulation Area (from 2010 US Census) |
| latitude | FLOAT | Decimal latitude |
| longitude | FLOAT | Decimal longitude |
tree_dbh, stump_diamhealth, statususer_typespc_common, spc_latinproblems (general), root_* (root problems), trnk_* (trunk problems), brch_* (branch problems)address, zipcode, zip_city, boroname, nta_name, latitude, longitudeQuery Storage: Every SELECT query must be saved into a temporary table using the following syntax:
SELECT <columns>
INTO <target_table_name>
FROM trees;
Each question specifies the exact target table name. For example:
-- Question: Count trees per neighborhood
-- Target table: nta_tree_count
SELECT nta_name, COUNT(*) AS tree_count
INTO nta_tree_count
FROM trees
GROUP BY nta_name;
After each query, verify the table was created:
\d nta_tree_count
pg_dump -U username -d new_york_trees > backup.sqlThis section focuses on basic querying and data exploration.
Objective: Identify the top 10 most frequently occurring tree species.
Question: What are the top 10 tree species by count, ordered by common name?
Target table: most_common_trees
Expected columns: spc_common, count (or similar naming)
Hint: Use GROUP BY with COUNT(*), ORDER BY DESC, and LIMIT 10.
Objective: Quantify trees affected by a specific, unusual problem.
Question: How many trees have shoes in their branches (where brch_shoe indicates 'Yes')?
Target table: trees_with_shoes_in_branches
Expected columns: A single count value or tree_id listing
Hint: Use COUNT(*) and filter on the brch_shoe column.
Objective: Create a crosstab summary of tree conditions.
Question: How many trees exist for each combination of status and health?
Target table: trees_by_status_and_health
Expected columns: status, health, count
Solution:
SELECT status, health, COUNT(*) AS tree_count
INTO trees_by_status_and_health
FROM trees
GROUP BY status, health;
Objective: Identify incomplete data records.
Question: Which trees have null values in any of the following columns: problems, spc_common, spc_latin, or health?
Target table: trees_with_null_values
Expected columns: tree_id
Instructions:
Hint: Use the OR operator to combine multiple IS NULL conditions.
Solution:
SELECT tree_id
INTO trees_with_null_values
FROM trees
WHERE problems IS NULL
OR spc_common IS NULL
OR spc_latin IS NULL
OR health IS NULL;
Follow-up: Create a quick query to count alive trees in this set:
SELECT COUNT(DISTINCT t.tree_id) AS alive_with_nulls
FROM trees t
JOIN trees_with_null_values tn ON t.tree_id = tn.tree_id
WHERE t.status = 'Alive';
Objective: Calculate statistical metrics for tree physical dimensions, stratified by tree status.
Question: For each tree status, calculate the minimum, maximum, average, and 90th percentile of tree diameter (tree_dbh) and stump diameter (stump_diam). Round averages and percentiles to 2 decimal places.
Target table: dimensions_by_status
Expected columns: status, min_dbh, max_dbh, avg_dbh, percentile_90_dbh, min_stump, max_stump, avg_stump, percentile_90_stump
Difficulty: Intermediate
Instructions:
status is NULLPERCENTILE_CONT() for calculating the 90th percentilenumeric before roundingROUND(value, 2) to format numerical outputKey PostgreSQL Functions:
PERCENTILE_CONT(fraction) WITHIN GROUP (ORDER BY column_name) - calculates continuous percentile::numeric - casting operator to convert to numeric typeROUND(value, decimals) - rounds to specified decimal placesHint: You will need two similar PERCENTILE_CONT() calls in your query. Avoid writing the formula twice by carefully structuring your SELECT clause.
Solution:
SELECT
status,
MIN(tree_dbh) AS min_dbh,
MAX(tree_dbh) AS max_dbh,
ROUND(AVG(tree_dbh)::numeric, 2) AS avg_dbh,
ROUND(PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY tree_dbh)::numeric, 2) AS percentile_90_dbh,
MIN(stump_diam) AS min_stump,
MAX(stump_diam) AS max_stump,
ROUND(AVG(stump_diam)::numeric, 2) AS avg_stump,
ROUND(PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY stump_diam)::numeric, 2) AS percentile_90_stump
INTO dimensions_by_status
FROM trees
WHERE status IS NOT NULL
GROUP BY status;
Objective: Identify and rank trees by the number of problems they have.
Question: Create a list of all trees showing how many problems each tree has. Problems are identified by counting occurrences of 'Yes' in the following columns: root_stone, root_grate, root_other, trunk_wire, trnk_light, trnk_other, brch_light, brch_shoe, brch_other. Sort by problem count descending.
Target table: trees_with_problems
Expected columns: tree_id, problem_count
Difficulty: Intermediate (CTE + CASE statements)
Strategy:
Why use a CTE? CTEs improve readability and allow intermediate results to be tested and debugged separately.
Key Concept - CASE Statements: Convert text values to numeric:
CASE WHEN column_name = 'Yes' THEN 1 ELSE 0 END AS column_name_yes
Hint: Your CTE should have 9 CASE statements (one for each problem column). The main query should sum all 9.
Solution:
WITH yes_values AS (
SELECT
tree_id,
CASE WHEN root_stone = 'Yes' THEN 1 ELSE 0 END AS root_stone_yes,
CASE WHEN root_grate = 'Yes' THEN 1 ELSE 0 END AS root_grate_yes,
CASE WHEN root_other = 'Yes' THEN 1 ELSE 0 END AS root_other_yes,
CASE WHEN trunk_wire = 'Yes' THEN 1 ELSE 0 END AS trunk_wire_yes,
CASE WHEN trnk_light = 'Yes' THEN 1 ELSE 0 END AS trnk_light_yes,
CASE WHEN trnk_other = 'Yes' THEN 1 ELSE 0 END AS trnk_other_yes,
CASE WHEN brch_light = 'Yes' THEN 1 ELSE 0 END AS brch_light_yes,
CASE WHEN brch_shoe = 'Yes' THEN 1 ELSE 0 END AS brch_shoe_yes,
CASE WHEN brch_other = 'Yes' THEN 1 ELSE 0 END AS brch_other_yes
FROM trees
)
SELECT
tree_id,
root_stone_yes + root_grate_yes + root_other_yes +
trunk_wire_yes + trnk_light_yes + trnk_other_yes +
brch_light_yes + brch_shoe_yes + brch_other_yes AS problem_count
INTO trees_with_problems
FROM yes_values
ORDER BY problem_count DESC;
This section focuses on schema optimization through data type conversion and index creation.
Objective: Convert string-based Yes/No data to proper boolean columns for storage efficiency and query performance.
Question: Add 9 new boolean columns to the trees table to replace string-based problem indicators.
Instructions:
false:
root_stone_boolroot_grate_boolroot_other_booltrunk_wire_booltrnk_light_booltrnk_other_boolbrch_light_boolbrch_shoe_boolbrch_other_boolSQL to add columns:
ALTER TABLE trees ADD COLUMN root_stone_bool BOOLEAN DEFAULT false;
ALTER TABLE trees ADD COLUMN root_grate_bool BOOLEAN DEFAULT false;
ALTER TABLE trees ADD COLUMN root_other_bool BOOLEAN DEFAULT false;
ALTER TABLE trees ADD COLUMN trunk_wire_bool BOOLEAN DEFAULT false;
ALTER TABLE trees ADD COLUMN trnk_light_bool BOOLEAN DEFAULT false;
ALTER TABLE trees ADD COLUMN trnk_other_bool BOOLEAN DEFAULT false;
ALTER TABLE trees ADD COLUMN brch_light_bool BOOLEAN DEFAULT false;
ALTER TABLE trees ADD COLUMN brch_shoe_bool BOOLEAN DEFAULT false;
ALTER TABLE trees ADD COLUMN brch_other_bool BOOLEAN DEFAULT false;
true for all rows where the corresponding string column equals 'Yes':UPDATE trees SET root_stone_bool = true WHERE root_stone = 'Yes';
UPDATE trees SET root_grate_bool = true WHERE root_grate = 'Yes';
UPDATE trees SET root_other_bool = true WHERE root_other = 'Yes';
UPDATE trees SET trunk_wire_bool = true WHERE trunk_wire = 'Yes';
UPDATE trees SET trnk_light_bool = true WHERE trnk_light = 'Yes';
UPDATE trees SET trnk_other_bool = true WHERE trnk_other = 'Yes';
UPDATE trees SET brch_light_bool = true WHERE brch_light = 'Yes';
UPDATE trees SET brch_shoe_bool = true WHERE brch_shoe = 'Yes';
UPDATE trees SET brch_other_bool = true WHERE brch_other = 'Yes';
SELECT tree_id, root_stone, root_stone_bool FROM trees LIMIT 5;
Objective: Recalculate tree problems using the new boolean columns.
Question: Recreate the trees with problems query using the new boolean columns. Note that boolean values cast to integers: TRUE → 1, FALSE → 0.
Target table: trees_with_problems_bool
Expected columns: tree_id, problem_count
Key Concept - Boolean Casting: In PostgreSQL, you can cast boolean to integer using ::int:
root_stone_bool::int -- Returns 1 for TRUE, 0 for FALSE
Advantage: This eliminates the need for a CTE with CASE statements.
Hint: Your query will be significantly simpler than the CTE version. Add all 9 boolean columns cast to integers.
Solution:
SELECT
tree_id,
root_stone_bool::int + root_grate_bool::int + root_other_bool::int +
trunk_wire_bool::int + trnk_light_bool::int + trnk_other_bool::int +
brch_light_bool::int + brch_shoe_bool::int + brch_other_bool::int AS problem_count
INTO trees_with_problems_bool
FROM trees
ORDER BY problem_count DESC;
Objective: Use geographic bounding to filter relevant data.
Question: Find all trees located within Central Park using the following approximate boundaries.
Central Park Geographic Bounds:
Target table: trees_in_central_park
Expected columns: tree_id, latitude, longitude
Instructions: Assume Central Park is a rectangle and select all trees falling within these bounds.
Hint: Use BETWEEN for latitude and longitude ranges. Note that western longitudes are negative.
Note: Central Park contains approximately 8.6% of Manhattan's trees.
Solution:
SELECT tree_id, latitude, longitude
INTO trees_in_central_park
FROM trees
WHERE latitude BETWEEN 40.7829 AND 40.8019
AND longitude BETWEEN -73.9730 AND -73.9485;
This section covers advanced SQL techniques including window functions, advanced aggregation, and database design patterns.
Objective: Use aggregation and window functions to calculate statistics, then insert a synthetic record.
Question: Plant a new tree in Central Park with average and most common attributes based on existing Central Park trees. Use the following specifications:
Target: Insert one new row into the trees table
Expected values for the new tree:
tree_id: MAX(tree_id) + 1latitude / longitude: 40.7915, -73.9608tree_dbh: Rounded average of Central Park treesspc_latin & spc_common: Most common speciesuser_type: Most common collector typeDifficulty: Advanced (CTE + window functions + data aggregation)
Strategy:
Create a CTE named central_park_stats that:
trees_in_central_park table from Part 2.3)tree_dbh (rounded), most common spc_latin, most common spc_common, most common user_typeMODE() WITHIN GROUP (ORDER BY column) to find the most common categorical valuesIn the main query: Use INSERT INTO...SELECT to insert one record from the CTE
Key PostgreSQL Function:
MODE() WITHIN GROUP (ORDER BY column) - Returns the most frequent value (statistical mode). If there's a tie, it returns the first value encountered.
Hints:
(SELECT MAX(tree_id) FROM trees) + 1trees_in_central_park from section 2.3 in your WHERE clauseSolution:
WITH central_park_stats AS (
SELECT
ROUND(AVG(t.tree_dbh)) AS avg_tree_dbh,
MODE() WITHIN GROUP (ORDER BY t.spc_latin) AS common_spc_latin,
MODE() WITHIN GROUP (ORDER BY t.spc_common) AS common_spc_common,
MODE() WITHIN GROUP (ORDER BY t.user_type) AS common_user_type
FROM trees t
WHERE t.tree_id IN (SELECT tree_id FROM trees_in_central_park)
)
INSERT INTO trees (
tree_id, latitude, longitude, tree_dbh, status, health,
spc_latin, spc_common, user_type, problems,
zipcode, zip_city, boroname, nta_name
)
SELECT
(SELECT MAX(tree_id) + 1 FROM trees),
40.7915,
-73.9608,
avg_tree_dbh,
'Alive',
'Good',
common_spc_latin,
common_spc_common,
common_user_type,
NULL,
'10024',
'Manhattan',
'Manhattan',
'Upper West Side'
FROM central_park_stats;
Verification: Query the new tree to confirm insertion:
SELECT tree_id, spc_common, tree_dbh, latitude, longitude FROM trees WHERE latitude = 40.7915;
Objective: Encapsulate query logic in a reusable view.
Question: Create a view named v_species_summary that shows, for each common species name:
spc_common)avg_dbh)Expected View Columns: spc_common, tree_count, avg_dbh
Instructions:
CREATE OR REPLACE VIEW v_species_summary AS
SELECT
spc_common,
COUNT(*) AS tree_count,
ROUND(AVG(tree_dbh)::numeric, 2) AS avg_dbh
FROM trees
WHERE spc_common IS NOT NULL
GROUP BY spc_common
ORDER BY tree_count DESC;
Verification:
SELECT * FROM v_species_summary LIMIT 10;
This section focuses on writing reusable SQL and PL/pgSQL functions.
Objective: Create a simple scalar function.
Question: Write a SQL function that converts inches to centimeters. Round the result to 1 decimal place. (1 inch = 2.54 cm)
Function name: inches_to_cm
Input parameter: inches (NUMERIC type)
Returns: NUMERIC type
Solution:
CREATE OR REPLACE FUNCTION inches_to_cm(inches NUMERIC)
RETURNS NUMERIC AS $$
SELECT ROUND(inches * 2.54, 1);
$$ LANGUAGE SQL;
Testing:
SELECT inches_to_cm(10); -- Should return 25.4
SELECT inches_to_cm(50); -- Should return 127.0
Objective: Write a conditional PL/pgSQL function that generates descriptive text.
Question: Write a function that takes a tree_id as input and returns a descriptive text string about the tree. The function should:
inches_to_cm function)Function name: tree_description
Input parameter: tree_id_input (INTEGER type)
Returns: TEXT type
Difficulty: Intermediate (PL/pgSQL + conditional logic + type casting)
Instructions:
inches_to_cm function for unit conversionExample calls and expected outputs:
SELECT tree_description(6650);
-- Output: "Dead tree, tree has 5.1 cm of height"
SELECT tree_description(371966);
-- Output: "Alive tree, tree has 10.2 cm of height"
SELECT tree_description(1); -- Assuming this is a stump
-- Output: "Stump tree, stump has X.X cm of stump diameter"
Hint: You can call one SQL function from within a PL/pgSQL function. Use:
SELECT inches_to_cm(tree_dbh) INTO variable_name FROM ...
Solution:
CREATE OR REPLACE FUNCTION tree_description(tree_id_input INT)
RETURNS TEXT AS $$
DECLARE
v_status TEXT;
v_tree_dbh INTEGER;
v_stump_diam INTEGER;
v_result TEXT;
BEGIN
SELECT status, tree_dbh, stump_diam INTO v_status, v_tree_dbh, v_stump_diam
FROM trees
WHERE tree_id = tree_id_input;
IF v_status = 'Stump' THEN
v_result := 'Stump tree, stump has ' || ROUND(inches_to_cm(v_stump_diam)::numeric, 1) || ' cm of stump diameter';
ELSE
v_result := v_status || ' tree, tree has ' || ROUND(inches_to_cm(v_tree_dbh)::numeric, 1) || ' cm of height';
END IF;
RETURN v_result;
END;
$$ LANGUAGE plpgsql;
Objective: Create a function that finds trees with a given problem.
Question: Write a function that takes a problem name as input and returns all tree_ids that have this specific problem. The problems column contains comma-separated values, so you'll need to parse them.
Function name: get_trees_with_problem
Input parameter: problem_item (TEXT type)
Returns: TABLE with column id (INTEGER)
Difficulty: Intermediate (PL/pgSQL + string parsing + RETURN QUERY)
Key PostgreSQL Functions:
STRING_TO_ARRAY(string, delimiter) - Splits a string into an arrayANY(array) - Array membership operatorRETURN QUERY - Returns multiple rows from a query resultStrategy:
problems string into an array using comma delimiterANY operatorSolution:
CREATE OR REPLACE FUNCTION get_trees_with_problem(problem_item TEXT)
RETURNS TABLE(id INT) AS $$
BEGIN
RETURN QUERY
SELECT trees.tree_id
FROM trees
WHERE problems IS NOT NULL
AND problem_item = ANY(STRING_TO_ARRAY(problems, ','));
END;
$$ LANGUAGE plpgsql;
Testing:
-- First, see what problems exist:
SELECT DISTINCT item
FROM (
SELECT STRING_TO_ARRAY(problems, ',') AS problem_array
FROM trees
WHERE problems IS NOT NULL
), UNNEST(problem_array) AS item
ORDER BY item;
-- Then query:
SELECT COUNT(*) FROM get_trees_with_problem('Debris');
This section focuses on index creation and query plan analysis.
Objective: Understand and interpret EXPLAIN output.
Question: Run the following query and capture the execution plan using EXPLAIN:
SELECT tree_id, spc_common, spc_latin, health, status
FROM trees
WHERE nta_name = 'Manhattanville';
Instructions:
EXPLAIN (FORMAT JSON) to get a detailed, machine-readable plan (or use default format)Note: Without an index, PostgreSQL performs a Sequential Scan, reading every row.
Example EXPLAIN output (without index):
Seq Scan on trees (cost=0.00..1850.75 rows=903 width=47)
Filter: ((nta_name)::text = 'Manhattanville'::text)
Objective: Optimize queries that filter by neighborhood.
Question: Create a B-tree index on the nta_name column:
CREATE INDEX idx_trees_nta_name ON trees(nta_name);
Objective: Observe performance improvement from index usage.
Question: Run the same query from 5.1 again and capture the new execution plan:
EXPLAIN SELECT tree_id, spc_common, spc_latin, health, status
FROM trees
WHERE nta_name = 'Manhattanville';
What changed?
Example EXPLAIN output (with index):
Bitmap Heap Scan on trees (cost=11.29..1476.70 rows=903 width=47)
Recheck Cond: ((nta_name)::text = 'Manhattanville'::text)
-> Bitmap Index Scan on idx_trees_nta_name (cost=0.00..11.06 rows=903 width=0)
Index Cond: ((nta_name)::text = 'Manhattanville'::text)
Document your observations:
Consider where else indexes might help. The following columns are candidates for indexing based on their use in queries:
High-cardinality columns with frequent equality lookups:
tree_id (PRIMARY KEY - already indexed)zipcode (frequently filtered)boroname (frequently filtered)spc_common (frequently grouped/filtered)status (frequently filtered)Question (Optional): Create one additional index on a column you think is frequently queried. Justify your choice.
Before submitting, ensure:
pg_dump and saved as .sql file\d target_table_name