You work with the new york trees dataset. This is a subset of the dataset available on Google Big Query public datasets. Our subset only contains trees in Manhattan.
If you have not yet loaded the dataset into a new SQL database, the sql file to restore the database is available at https://skatai.com/assets/db/new_york_trees.sql.
As usual, you can restore the database from a PSQL session with the following command:
\i <path to the sql file>/new_york_trees.sql
For windows, make sure to correctly handle the path to the sql file.
The above restore command creates the new_york_trees database which contains only one table: the trees table which has records of 65423 trees.
Divided into 4 parts :
You have two hours. You must submit your answers at 18h. No late submissions.
In the google form you should
You must provide a SQL file backup, not a binary file.
You have online access: google, postgres documentation etc
DO NOT USE chatgpt or any other AI tool to answer the questions. I will know. It’s easy to spot.
THIS IS IMPORTANT
When you export the database follow these rules for naming the sql file
trees_{first_name}_{last_name}_{version}.sqlFor example, if I, Alexis Perrier, save the database version 3, the filename is trees_alexis_perrier_3.sql
NOT : trees Alexis 3.sql, trees_{alexis}_{perrier}_{3}.sql, trees-Paul18.sql etc ...
It is better to work in a psql session. If you are using pgadmin, make sure you are not in the query windows but in the proper psql session.
You should make regular backups of your database as plain sql. If you make a mistake, then you can restore the database from the backup. If you are rushed finishing the exam, you can always upload the most recent version of the database.
You can use the backup menu in pgadmin to backup your database. Make sure to select plain format. Or you can use the following command in the terminal
pg_dump -U username -d database_name > filename.sql
You must store the output of all your queries into their own table
CREATE TABLE <target_table> AS
<your query>
Replace <target_table> by the target table name indicated in the question.
nta_tree_countThe query should be :
CREATE TABLE nta_tree_count AS
select nta_name, count(*)
from trees
group by nta_name;
Then make sure the table nta_tree_count has been created by running \d nta_tree_count in your psql session.
The new_york_trees database has one table trees with records of 65423 trees in Manhattan.
The trees table has the following attributes:
| Attribute | Data Type | Description |
|---|---|---|
| tree_id | INTEGER | PRIARY KEY |
| block_id | INTEGER | Identifier linking each tree to its mapped block |
| created_at | DATE | Date the tree census data was collected |
| tree_dbh | INTEGER | Tree diameter at breast height (~54"/137cm above ground, in inches) |
| stump_diam | INTEGER | Stump diameter measured through center, in inches |
| status | STRING | Tree status: 'Alive', 'Dead', or 'Stump' |
| health | STRING | Perceived health of tree |
| spc_latin | STRING | Scientific species name (e.g., "Acer rubrum") |
| spc_common | STRING | Common species name (e.g., "red maple") |
| user_type | STRING | Category of data collector |
| problems | STRING | Comma-separated list of tree problems |
| root_stone | STRING | Root problem from paving stones ('Yes'/'No') |
| root_grate | STRING | Root problem from metal grates ('Yes'/'No') |
| root_other | STRING | Other root problems ('Yes'/'No') |
| trunk_wire | STRING | Trunk problem from wires/rope ('Yes'/'No') |
| trnk_light | STRING | Trunk problem from installed lighting ('Yes'/'No') |
| trnk_other | STRING | Other trunk problems ('Yes'/'No') |
| brch_light | STRING | Branch problem from lights/wires ('Yes'/'No') |
| brch_shoe | STRING | Branch problem from sneakers ('Yes'/'No') |
| brch_other | STRING | Other branch problems ('Yes'/'No') |
| address | STRING | Nearest estimated address |
| zipcode | INTEGER | Five-digit zipcode |
| zip_city | STRING | City derived from zipcode |
| boroname | STRING | Borough name |
| nta_name | STRING | Neighborhood Tabulation Area (from 2010 US Census) |
| latitude | FLOAT | Decimal latitude |
| longitude | FLOAT | Decimal longitude |
The columns can be grouped according to:
tree_dbh and stump_diamhealth and statususer_typespc_common and spc_latinproblems, root related problems root_stone, roo_grate, root_other, trunk related problems trunk_wire, trunk_light, trunk_other, branch related problems brch_light, brch_shoe, brch_otheraddress, zipcode, zip_city, boroname, nta_name, latitude, longitudeIn the google form, fill in your name, epita email and start by answering the warmup questions
The google form is available at
For each of the following questions you must store the output of any query into its own table
CREATE TABLE <target_table> AS
<your query>
replace <target_table> by the target table name indicated in the question.
spc_common ?most_common_treesnumber_trees , spc_commonnumber_trees desctrees_with_shoes_in_branchesnumber_trees, 1 rowtrees_by_status_and_healthnumber_trees, status, healthstatus, healthnull valuesThe following columns have null values : problems, spc_common, spc_latin and health
tree_id of trees that have a null value in either one of these columnstrees_with_null_valuestree_idGo further for your own sake: to understand the data better, check out the status of the trees that are in this trees_with_null_values table.
dimensions_by_statusstatus, status, min, max, round, percentile_90hint:
numeric in order to apply ROUNDpercentile_cont ( fraction ) WITHIN GROUP ( ORDER BY <column name> )
Find trees that have the most problems : where one of the root, trunk or branch problems is 'Yes'.
Note: This is not about the problems column, just the 'Yes', 'No' columns : root_..., trunk_..., branch_....
Attention one of the trunk related column is named trunk_wire while the other trunk related columns are named trnk_light and trnk_other, no u.
trees_with_problemstrees_with_problems table should have the following columns : tree_id, yes_countThe strategy to build the query is to use a CTE with a first clause (yes_values) and a main clause
in the yes_values clause
case to transform the string 'Yes' into 1 and 'No' into 0 : CASE WHEN root_stone = 'Yes' THEN 1 ELSE 0 ENDAS root_stone_yesthen in the main clause just sum up all the alias columns
Check your results, the top 5 trees with the most problems are :
tree_id | yes_count
---------+-----------
27108 | 6
76743 | 6
125230 | 6
300525 | 6
331004 | 6
Each tree has a common name spc_common and a latin name spc_latin.
We want to know if there is a 1-to-1 equivalence between the common name and the latin name ?
This query returns number of latin names per common name where the count is greater than 1.
SELECT COUNT(DISTINCT spc_latin) AS _count, spc_common as spc
FROM trees
GROUP BY spc_common
HAVING COUNT(DISTINCT spc_latin) > 1
Using UNION, add the case where you count the number of common names for each latin name.
If the mapping is one-to-one, this query should return no results. Any rows returned indicate violations of the 1:1 mapping.
tree_name_multiples_count, spcThis is the end of section two: Exploration.
Before moving on to the next part, backup your database as plain sql. and upload it in the google form.
Important: Name the sql file with your name following this trees_{first_name}_{last_name}_1.sql
{first_name} and {last_name} with your frst and last name..sql extensionThe current primary key is tree_id, it is not sequential.
select tree_id from tree order by tree_id limit 10 shows that the primary key is not sequential.
You're going to create a new sequential primary key instead.
In this section you should use the ALTER TABLE command to modify the table followed by ADD (column, primary key) or DROP (primary key, column).
first, create a new column called id with data type serial
then, drop the primary key index
\d trees to find the name of the primary key index (there should be only one index on the table.)finally, add the primary key constraint on the new column
This way, we keep the original tree_id as a column but use id as the primary key.
Check that the sequence trees_id_seq has been created.
Yes, No columns into booleansThe columns related to tree problems are strings with values 'Yes' or 'No'. This is not very SQL friendly.
root_stone -> root_stone_bool.False by defaultNote that casting a boolean to int will transform it to 1 or 0. TRUE becomes 1, FALSE becomes 0.
select tree_id, root_stone_bool::int from trees;
Rewrite the query from question 2.6 to count the number of problems per trees using the new boolean columns
tree_id, problem_countproblem_count, tree_idCentral Park latitude and longitude bounds:
Assume Central Park is a rectangle. We want to extract all the trees in Central Park into a separate table.
tree_id, latitude, longitudecheck your results: 8.6% of manhattan trees are in Central park.
Plant a new tree in the middle of Central Park
Use the following values:
status: Aliveproblems: null;zipcode: 83zip_city: Central Parkboroname: Manhattannta_name: Upper West Sidestump_diam: 0(SELECT MAX(tree_id) FROM trees) + 1for all the other values, we use the most common value for categorical columns, and the average value for dimensions.
Create a CTE named central_park_stats that:
tree_dbh (rounded), most common spc_latin, most common spc_common, most common user_typeMODE() WITHIN GROUP (ORDER BY column) to find the most common categorical valuesThen in the main query: Use INSERT INTO…SELECT to insert one record from the CTE result.
If you struggle with writing the full CTE, manually look up the most common values and use these values directly into the INSERT clause
This is the end of section three: Improve the database.
Before moving on to the next part, backup your database as plain sql and upload it in the google form.
Important: Name the sql file with your name following this trees_{first_name}_{last_name}_2.sql
{first_name} and {last_name} with your first and last name..sql extensionMake sure to use proper variable prefixes for all your functions
p_ prefixv_ prefixout_ prefix (for OUT parameters)Write a simple sql function inches_to_cm that converts inches to cm, round to 1 decimal.
one inch = 2.54 centimeters
Attention: copy paste the function in the last part of the google form
Write a function tree_description that takes a tree id (not tree_id) as input
and returns
So:
Attention: copy paste the function in the last part of the google form
You can check that your function works with the following examples:
select tree_description(6650);
tree_description
--------------------------------------
Dead tree, tree has 5.1 cm of height
select tree_description(371966);
tree_description
----------------------------------------
Alive tree, tree has 10.2 cm of height
Normalize problems into a single table
problemThe problems columns is a list of comma separated values. This is a good candidate for normalization.
Since trees can have many problems and a set of problem can be relevant for many trees, the problems <-> trees relation is a many-to-many relations. We would therefore need a many-to-many join table to fully normalize the database. In this exercise, you only create the table of unique problems.
The task is to : write a query that returns the unique values of all comma separated problems
Start by splitting the problem into 2 CTEs.
STRING_TO_ARRAY to split the problems into an array.UNNEST to unnest the array into a table.The final query uses SELECT DISTINCT to return the unique values.
The resulting table should only have 10 different problems.
This is the end of section four: Functions and normalization.
Backup your database as plain sql and upload it in the google form.
Important: Name the sql file with your name following this trees_{first_name}_{last_name}_3.sql
{first_name} and {last_name} with your first and last name..sql extensionYou're all set!
Thank you