The New York Trees census dataset - Exam

Explore the 2015 census of all the trees in Manhattan

You work with the new york trees dataset. This is a subset of the dataset available on Google Big Query public datasets. Our subset only contains trees in Manhattan.

If you have not yet loaded the dataset into a new SQL database, the sql file to restore the database is available at https://skatai.com/assets/db/new_york_trees.sql.

As usual, you can restore the database from a PSQL session with the following command:

\i <path to the sql file>/new_york_trees.sql

For windows, make sure to correctly handle the path to the sql file.

The above restore command creates the new_york_trees database which contains only one table: the trees table which has records of 65423 trees.


The exam

Divided into 4 parts :

Rules

File Naming

THIS IS IMPORTANT

When you export the database follow these rules for naming the sql file

For example, if I, Alexis Perrier, save the database version 3, the filename is trees_alexis_perrier_3.sql

NOT : trees Alexis 3.sql, trees_{alexis}_{perrier}_{3}.sql, trees-Paul18.sql etc …

PSQL session

It is better to work in a psql session. If you are using pgadmin, make sure you are not in the query windows but in the proper psql session.

How to backup you database

You should make regular backups of your database as plain sql. If you make a mistake, then you can restore the database from the backup. If you are rushed finishing the exam, you can always upload the most recent version of the database.

You can use the backup menu in pgadmin to backup your database. Make sure to select plain format. Or you can use the following command in the terminal

pg_dump -U username -d database_name > filename.sql

Select into a table

You must store the output of all your queries into their own table

CREATE TABLE <target_table> AS
<your query>

Replace <target_table> by the target table name indicated in the question.

Example

The query should be :

CREATE TABLE nta_tree_count AS
select nta_name, count(*)
from trees
group by nta_name;

Then make sure the table nta_tree_count has been created by running \d nta_tree_count in your psql session.


Data

The new_york_trees database has one table trees with records of 65423 trees in Manhattan.

The schema

The trees table has the following attributes:

Attribute Data Type Description
tree_id INTEGER PRIARY KEY
block_id INTEGER Identifier linking each tree to its mapped block
created_at DATE Date the tree census data was collected
tree_dbh INTEGER Tree diameter at breast height (~54”/137cm above ground, in inches)
stump_diam INTEGER Stump diameter measured through center, in inches
status STRING Tree status: ‘Alive’, ‘Dead’, or ‘Stump’
health STRING Perceived health of tree
spc_latin STRING Scientific species name (e.g., “Acer rubrum”)
spc_common STRING Common species name (e.g., “red maple”)
user_type STRING Category of data collector
problems STRING Comma-separated list of tree problems
root_stone STRING Root problem from paving stones (‘Yes’/’No’)
root_grate STRING Root problem from metal grates (‘Yes’/’No’)
root_other STRING Other root problems (‘Yes’/’No’)
trunk_wire STRING Trunk problem from wires/rope (‘Yes’/’No’)
trnk_light STRING Trunk problem from installed lighting (‘Yes’/’No’)
trnk_other STRING Other trunk problems (‘Yes’/’No’)
brch_light STRING Branch problem from lights/wires (‘Yes’/’No’)
brch_shoe STRING Branch problem from sneakers (‘Yes’/’No’)
brch_other STRING Other branch problems (‘Yes’/’No’)
address STRING Nearest estimated address
zipcode INTEGER Five-digit zipcode
zip_city STRING City derived from zipcode
boroname STRING Borough name
nta_name STRING Neighborhood Tabulation Area (from 2010 US Census)
latitude FLOAT Decimal latitude
longitude FLOAT Decimal longitude

The columns can be grouped according to:


Part 1: Warmup quiz

In the google form, fill in your name, epita email and start by answering the warmup questions

The google form is available at

https://forms.gle/S8f2qgFdYcon23c99


2. Explore the dataset

For each of the following questions you must store the output of any query into its own table

CREATE TABLE <target_table> AS
<your query>

replace <target_table> by the target table name indicated in the question.

2.1 Most common trees

2.2 Trees with shoes in branches

2.3 count trees by status and health and both

2.4 columns with null values

The following columns have null values : problems, spc_common, spc_latin and health

Go further for your own sake: to understand the data better, check out the status of the trees that are in this trees_with_null_values table.

2.5 tree Dimensions

hint:

percentile_cont ( fraction  ) WITHIN GROUP ( ORDER BY <column name> )

2.6 trees with problems

Find trees that have the most problems : where one of the root, trunk or branch problems is ‘Yes’.

Note: This is not about the problems column, just the ‘Yes’, ‘No’ columns : root_..., trunk_..., branch_....

Attention one of the trunk related column is named trunk_wire while the other trunk related columns are named trnk_light and trnk_other, no u.

The strategy to build the query is to use a CTE with a first clause (yes_values) and a main clause

in the yes_values clause

then in the main clause just sum up all the alias columns

Check your results, the top 5 trees with the most problems are :

 tree_id | yes_count
---------+-----------
   27108 |         6
   76743 |         6
  125230 |         6
  300525 |         6
  331004 |         6

2.7 tree names

Each tree has a common name spc_common and a latin name spc_latin.

We want to know if there is a 1-to-1 equivalence between the common name and the latin name ?

This query returns number of latin names per common name where the count is greater than 1.

SELECT COUNT(DISTINCT spc_latin) AS _count, spc_common as spc
FROM trees
GROUP BY spc_common
HAVING COUNT(DISTINCT spc_latin) > 1

Using UNION, add the case where you count the number of common names for each latin name.

If the mapping is one-to-one, this query should return no results. Any rows returned indicate violations of the 1:1 mapping.


Backup

This is the end of section two: Exploration.

Before moving on to the next part, backup your database as plain sql. and upload it in the google form.

Important: Name the sql file with your name following this trees_{first_name}_{last_name}_1.sql


Part 3: improving the table

3.1 Primary key

The current primary key is tree_id, it is not sequential.

select tree_id from tree order by tree_id limit 10 shows that the primary key is not sequential.

You’re going to create a new sequential primary key instead.

In this section you should use the ALTER TABLE command to modify the table followed by ADD (column, primary key) or DROP (primary key, column).

This way, we keep the original tree_id as a column but use id as the primary key.

Check that the sequence trees_id_seq has been created.

3.2 Yes, No columns into booleans

The columns related to tree problems are strings with values ‘Yes’ or ‘No’. This is not very SQL friendly.

3.3 trees with problems

Note that casting a boolean to int will transform it to 1 or 0. TRUE becomes 1, FALSE becomes 0.

select tree_id, root_stone_bool::int from trees;

Rewrite the query from question 2.6 to count the number of problems per trees using the new boolean columns

3.4 Central Park

Central Park latitude and longitude bounds:

Assume Central Park is a rectangle. We want to extract all the trees in Central Park into a separate table.

check your results: 8.6% of manhattan trees are in Central park.

3.5 Plant a tree

Plant a new tree in the middle of Central Park

Use the following values:

for all the other values, we use the most common value for categorical columns, and the average value for dimensions.

Create a CTE named central_park_stats that:

Then in the main query: Use INSERT INTO…SELECT to insert one record from the CTE result.

If you struggle with writing the full CTE, manually look up the most common values and use these values directly into the INSERT clause


Backup

This is the end of section three: Improve the database.

Before moving on to the next part, backup your database as plain sql and upload it in the google form.

Important: Name the sql file with your name following this trees_{first_name}_{last_name}_2.sql

Part 4: Functions & Normalization

Make sure to use proper variable prefixes for all your functions

4.1 inches to cm SQL

Write a simple sql function inches_to_cm that converts inches to cm, round to 1 decimal.

one inch = 2.54 centimeters

Attention: copy paste the function in the last part of the google form

4.2 Dimensions with Pg/Plsql function

Write a function tree_description that takes a tree id (not tree_id) as input

and returns

So:

Attention: copy paste the function in the last part of the google form

You can check that your function works with the following examples:

select tree_description(6650);
           tree_description
--------------------------------------
 Dead tree, tree has 5.1 cm of height
select tree_description(371966);
        tree_description
----------------------------------------
Alive tree, tree has 10.2 cm of height

4.3 Unique tree problems

Normalize problems into a single table

The problems columns is a list of comma separated values. This is a good candidate for normalization.

Since trees can have many problems and a set of problem can be relevant for many trees, the problems <-> trees relation is a many-to-many relations. We would therefore need a many-to-many join table to fully normalize the database. In this exercise, you only create the table of unique problems.

The task is to : write a query that returns the unique values of all comma separated problems

Start by splitting the problem into 2 CTEs.

The final query uses SELECT DISTINCT to return the unique values.

The resulting table should only have 10 different problems.

Backup

This is the end of section four: Functions and normalization.

Backup your database as plain sql and upload it in the google form.

Important: Name the sql file with your name following this trees_{first_name}_{last_name}_3.sql

You’re all set!

Thank you