Normalization and advanced querying
In this worksheet you will:
- load a SQL backup of the worldhits database
- normalize the database
- use CTEs and window functions to query the database
- store the results of the queries into tables
- export your database as a SQL backup
- upload your final database backup as a sql file in a google form
The google form is available at this address: https://forms.gle/dqBZiRGjH8rn24dg7.
Make sure to
- fill in your name and Epita email address
- answer the question(s)
- at the end, upload your database sql file
- if you have not finished the worksheet, backup your file and submit it as is
you must provide a SQL file backup, not a binary file. If you are not sure how to do that, ask me.
Important rules
- you have online access: google, postgres documentation etc
- do not use a chatgpt or any other AI tool to answer the questions. I will know. It’s easy to spot.
I. Data
The data is the worldhits dataset from spotify
This dataset is a curated collection of world music, featuring 326 tracks from 66 diverse artists spanning six decades, from 1958 to 2019. It offers a rich tapestry of global sounds, from traditional rhythms to contemporary fusions. Each track is meticulously tagged with Spotify audio features, providing insights into tempo, key, energy, and more. This dataset is ideal for exploring the evolution of world music, analyzing trends across different cultures, or even training machine learning models to recognize unique musical patterns. Whether you’re a researcher, data scientist, or music enthusiast, this dataset offers a fascinating glimpse into the world’s rich musical heritage.
You can read the data reference : Spotify 300 world music tracks
It is a very clean dataset. There are no NULL values, no outliers, no crazy stuff.
Load the data
Download the SQL backup from https://skatai.com/assets/db/worldhits_backup.sql
You can restore the database
- from the terminal
psql -d postgres -f <path to your file>/worldhits_backup.sql
- in a psql session
\i <path to your file>/worldhits_backup.sql
This will create a new database : worldhitsdb
Once the database is restored, connect to the newly created database with
\c worldhitsdb
The database
The database is just one table : tracks
with the following columns
select * from tracks limit 5;
\d tracks
Table "public.tracks"
Column | Type | Collation | Nullable | Default
------------------+------------------------+-----------+----------+------------------------------------
id | integer | | not null | nextval('tracks_id_seq'::regclass)
track | character varying(255) | | |
artist | character varying(255) | | |
album | character varying(255) | | |
year | integer | | |
duration | integer | | |
time_signature | integer | | |
danceability | double precision | | |
energy | double precision | | |
key | integer | | |
loudness | double precision | | |
mode | integer | | |
speechiness | double precision | | |
acousticness | double precision | | |
instrumentalness | double precision | | |
liveness | double precision | | |
valence | double precision | | |
tempo | double precision | | |
popularity | integer | | |
Indexes:
"tracks_pkey" PRIMARY KEY, btree (id)
The table has 326 rows.
II. Anomalies
Anomalies and the artist
column
Run the following query to understand the artist column. How many tracks per artist.
select count(*) as n , artist
from tracks
group by artist
order by n DESC limit 10;
Change the order to ASC to see the least frequent artists.
Spot the update, delete and insert anomalies with respect to the artist
column.
Explain in the google form how the artist column can suffer from each type of anomaly.
III. Normalization
In this section you will normalize the artist
column of the database
Follow the process:
- create a
artists
table withid
(primary key) andname
column - insert the
UNIQUE
artists names from the originaltracks
table - add a artist_id foreign key in the
tracks
table- first add a INTEGER artist_id column
- then ADD CONSTRAINT to make it a foreign key
- reconcile the
tracks.artist_id
with theartist.id
.- hint use the
artist.name = tracks.artist
condition in the UPDATE statement
- hint use the
- double check that the artists table
- has 66 rows
- the artist names are unique
- the list of tracks for
Deepak Dev
isMazha Paadum
andNilakudame
- Once you are sure that the data has been imported, delete the original
artist
column from the tracks table.
-- 1. Create the artists table
CREATE TABLE artists (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL UNIQUE
);
-- 2. Insert unique artist names from tracks table
INSERT INTO artists (name)
SELECT DISTINCT artist
FROM tracks
WHERE artist IS NOT NULL
ORDER BY artist;
-- 3. Add artist_id foreign key column to tracks table
ALTER TABLE tracks
ADD COLUMN artist_id INTEGER;
-- Add foreign key constraint
ALTER TABLE tracks
ADD CONSTRAINT fk_tracks_artist
FOREIGN KEY (artist_id) REFERENCES artists(id);
-- 4. Reconcile tracks.artist_id with artists.id
UPDATE tracks
SET artist_id = artists.id
FROM artists
WHERE tracks.artist = artists.name;
-- Optional: Verify the reconciliation
-- SELECT t.id, t.track, t.artist, t.artist_id, a.name
-- FROM tracks t
-- LEFT JOIN artists a ON t.artist_id = a.id
-- LIMIT 10;
-- Optional: After verifying, you can drop the original artist column
-- ALTER TABLE tracks DROP COLUMN artist;
IV. First backup
Once you have normalized the artist
column, backup your database as plain sql file
- the filename should be: **worldhitsdb_
_01.sql**. - no spaces in the filename.
If you are using the terminal, run the following command to backup the database. replace the filename placeholder with your filename
pg_dump --file "worldhitsdb_<your_name>_01.sql" --format=p "worldhitsdb"
or you can use pgadmin.
Make sure
- to select the plain format
- and in the data tab, check “no owner”
Once the you have the first SQL backup of the database, upload it using the google form
Saving the query results into tables
You can store the output of any query into its own table with
CREATE TABLE tracks_analysis_results AS
<your query>
For instance, this query will create a table called artist_track_count
with the number of tracks per artist and the artist name.
CREATE TABLE artist_track_count AS
select count(*) as track_count, a.name
from tracks t
join artists a on t.artist_id = a.id
group by a.name
order by a.name desc;
Run that query and check that the table has been created.
In the following questions you are asked to write queries and store the results into tables. The table name will be given.
V. Window functions
You can refer to the window functions postgres documentationor course slides (window functions).
V.1. Compare Artists popularity
Write a query that compares each track’s popularity to the average popularity of their artist.
store the query result into a table called
artist_popularity
.
The output should look like (this is only the first row)
track | artist_name | popularity | artist_avg_popularity |
---|---|---|---|
When You’re Falling | Afro Celt Sound System | 36 | 29.00 |
- hint : use
AVG ... OVER (PARTITION BY ...)
for the avg popularity - alias artist.name as artist_name
- make sure to
ROUND
the avg popularity to 2 decimal places.
CREATE TABLE artist_popularity AS
SELECT
t.track,
a.name AS artist_name,
t.popularity,
ROUND(AVG(t.popularity) OVER (PARTITION BY t.artist_id), 2) AS artist_avg_popularity
FROM tracks t
JOIN artists a ON t.artist_id = a.id
ORDER BY a.name, t.popularity DESC;
V.2. Rank tracks by energy
Rank all tracks by energy
for each artist
Store the query result into a table called
artist_energy_rank
- use
RANK() OVER()
for energy_rank - in the
OVER
clause, usePARTITION BY
andORDER BY
The output should be similar to
track | artist_name | energy | energy_rank
---------------------------------+------------------------+--------+-------------
To the End of the World | Pat Metheny Group | 0.54 | 1
Last Train Home | Pat Metheny Group | 0.488 | 2
The Way Up: Opening & Part One | Pat Metheny Group | 0.41 | 3
Phase Dance | Pat Metheny Group | 0.399 | 4
CREATE TABLE artist_energy_rank AS
SELECT
t.track,
a.name AS artist_name,
t.energy,
RANK() OVER (PARTITION BY t.artist_id ORDER BY t.energy DESC) AS energy_rank
FROM tracks t
JOIN artists a ON t.artist_id = a.id
-- where a.name != 'Al Di Meola'
ORDER BY a.name, energy_rank;
VI. CTE with clause
In the following section, you will write CTE queries.
- first write the main query
- then name the CTE and select from it
VI.1. Compare popularity with tempo for each artist
Write a query that calculates each artist average popularity and tempo, and ranks them on both metrics.
store the final query result into a table called
artist_popularity_tempo_rank
The 1st query will calculate each artist avg_popularity and avg_tempo by grouping by artist name.
The 1st query results will look like that
id | artist_name | track_count | avg_popularity | avg_tempo
----+--------------------+-------------+---------------------+--------------------
54 | Ofra Haza | 7 | 13.0000000000000000 | 125.89957142857143
29 | Enya | 7 | 54.4285714285714286 | 92.07600000000001
4 | Ali Mohammed Birra | 5 | 19.0000000000000000 | 143.159
Now, name that query (for instance artist_stats
) and select from it.
WITH artist_stats as (
<1st query>
)
SELECT ...
from artist_stats
ORDER BY popularity_rank, tempo_rank;
The main query will rank the artists using a DENSE_RANK window function ordered by avg_popularity + avg_tempo
The final results should look like
artist | track_count | artist_avg_popularity | artist_avg_tempo | popularity_rank | tempo_rank
-------------------+-------------+-----------------------+------------------+-----------------+------------
Manu Chao | 9 | 59.78 | 129.90 | 1 | 8
Enya | 7 | 54.43 | 92.08 | 2 | 52
Ilaiyaraaja | 5 | 50.80 | 127.36 | 3 | 13
store the final query result into a table called
artist_popularity_tempo_rank
hints:
- Use GROUP BY with the artist: When calculating averages per artist, group by a.name:
- Apply aggregations for averages: Use
AVG()
function on the tracks’ columns popularity and tempo - Don’t forget
ROUND(...::numeric, 2)
for clean numbers - you need to cast avg_tempo and avg_popularioty to numeric for the DENSE_RANK() window function to work
- use DENSE_RANK() window functions
- [optional] Filter with HAVING
HAVING
filters groups AFTER aggregation:HAVING COUNT(*) >= 3
CREATE TABLE artist_popularity_tempo_rank AS
WITH artist_stats AS (
SELECT
a.id,
a.name AS artist_name,
COUNT(*) AS track_count,
AVG(t.popularity) AS avg_popularity,
AVG(t.tempo) AS avg_tempo
FROM tracks t
JOIN artists a ON t.artist_id = a.id
WHERE t.popularity IS NOT NULL AND t.tempo IS NOT NULL
GROUP BY a.id, a.name
HAVING COUNT(*) >= 3
)
SELECT
artist_name,
track_count,
ROUND(avg_popularity::numeric, 2) AS avg_popularity,
ROUND(avg_tempo::numeric, 2) AS avg_tempo,
DENSE_RANK() OVER (ORDER BY avg_popularity DESC) AS popularity_rank,
DENSE_RANK() OVER (ORDER BY avg_tempo DESC) AS tempo_rank
FROM artist_stats
ORDER BY popularity_rank, tempo_rank;
VI.2 Find each artist’s most popular track
store the final query result into a table called
artist_most_popular_track
first query: Using ROW_NUMBER or RANK or DENSE_RANK (your choice) list the tracks per artist ranked by popularity
The result should look like that
artist | track | popularity | rank
-------------------+--------------------------------+------------+------
Youssou N'Dour | 7 Seconds (feat. Neneh Cherry) | 47 | 1
Youssou N'Dour | Set | 25 | 2
Youssou N'Dour | Salimata | 20 | 3
Youssou N'Dour | Birima | 18 | 4
Youssou N'Dour | Mame Bamba | 16 | 5
Now name the CTE artist_tracks_ranked
and select from it to get the most popular track per artist
Save the full query in a table named most_popular_track_per_artist
The final result should look like
artist | track | popularity
----------------------------+--------------------------------+------------
Youssou N'Dour | 7 Seconds (feat. Neneh Cherry) | 47
Vieux Farka Touré | Tongo Barra | 52
The Kingston Trio | Tom Dooley | 41
CREATE TABLE artist_most_popular_track AS
WITH artist_tracks_ranked AS (
SELECT
a.name AS artist,
t.track,
t.popularity,
ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY t.popularity DESC) AS rank
FROM tracks t
JOIN artists a ON t.artist_id = a.id
WHERE t.popularity IS NOT NULL
)
SELECT artist, track, popularity
FROM artist_tracks_ranked
WHERE rank = 1
ORDER BY popularity DESC;
VI.3 Loudness evolution of the years
Store the query result into a table called
loudness_correlation_with_year
Using a CTE you are going to calculate the correlation of loudness with year
- 1st write a query that groups tracks by year and calculates the average loudness for each year.
-
then name that query and select from it to calculate the correlation of year and avg_loudness
- use
CORR
function:CORR(year, avg_loudness)
- round the result to 3 decimal places
The result should be a single number
loudness_correlation_with_year
--------------------------------
0.386
-- Correlation of loudness with year
CREATE TABLE loudness_correlation_with_year AS
WITH yearly_stats AS (
SELECT
year,
AVG(loudness) AS avg_loudness
FROM tracks
WHERE year IS NOT NULL
GROUP BY year
)
SELECT
ROUND(CORR(year, avg_loudness)::numeric, 3) AS loudness_correlation_with_year
FROM yearly_stats;
VI.4 Top 3 fastest and slowest songs per artist
Store the query result into a table called
fast_slow_tracks
The 1st query uses RANK to list the tracks per artist ranked by tempo both for fastest and slowest tracks.
The result shoud look like that
artist | track | tempo | fast_rank | slow_rank
------------------------+---------------------------------+-------+-----------+-----------
Afro Celt Sound System | Saor / Free / News from Nowhere | 82.0 | 4 | 1
Afro Celt Sound System | When You're Falling | 99.0 | 3 | 2
Afro Celt Sound System | Release | 100.1 | 2 | 3
The 2nd query uses the CTE to select the top fastest and top slowest tracks per artist.
Each artists song should be tagged as “Fastest” or “Slowest”
The result should look like :
artist | track | tempo | category
------------------------+------------------------------------+-------+----------
Afro Celt Sound System | Whirl-Y-Reel 1 | 132.0 | Fastest
Afro Celt Sound System | Saor / Free / News from Nowhere | 82.0 | Slowest
Al Di Meola | The Wizard | 157.9 | Fastest
Al Di Meola | Race With Devil On Spanish Highway | 88.6 | Slowest
[optional] Finally, we also want to know the number of tracks per artist
Write another named query that groups by artist and counts the number of tracks per artist.
the end result should look like
artist | track_count | track | tempo | category
------------------------+-------------+------------------------------------+-------+----------
Afro Celt Sound System | 4 | Whirl-Y-Reel 1 | 132.0 | Fastest
Afro Celt Sound System | 4 | Saor / Free / News from Nowhere | 82.0 | Slowest
Al Di Meola | 6 | The Wizard | 157.9 | Fastest
Al Di Meola | 6 | Race With Devil On Spanish Highway | 88.6 | Slowest
CREATE TABLE fast_slow_tracks AS
WITH ranked_tracks AS (
SELECT
a.name AS artist,
t.track,
ROUND(t.tempo::numeric, 1) AS tempo,
ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY t.tempo DESC) AS fast_rank,
ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY t.tempo ASC) AS slow_rank
FROM tracks t
JOIN artists a ON t.artist_id = a.id
WHERE t.tempo IS NOT NULL
),
track_count AS (
SELECT
a.id AS artist_id,
a.name AS artist,
COUNT(*) AS track_count
FROM tracks t
JOIN artists a ON t.artist_id = a.id
GROUP BY a.id, a.name
)
SELECT
rt.artist,
tc.track_count,
rt.track,
rt.tempo,
CASE
WHEN rt.fast_rank = 1 THEN 'Fastest'
WHEN rt.slow_rank = 1 THEN 'Slowest'
END AS category
FROM ranked_tracks rt
JOIN track_count tc ON rt.artist = tc.artist
WHERE rt.fast_rank = 1 OR rt.slow_rank = 1
ORDER BY rt.artist, rt.tempo DESC;
VII final backup
Create a new SQL backup of your database.
- the filename should be: **worldhitsdb_
_02.sql**. - no spaces in the filename.
Note the _02
postfix in the filename.
If you are using the terminal, run the following command to backup the database. replace the filename placeholder with your filename
pg_dump --file "worldhitsdb_<your_name>_02.sql" --format=p "worldhitsdb"
or you can use pgadmin.
make sure
- to select the plain format
- and in the data tab, check “no owner”
Once the you have the first SQL backup of the database, upload it using the google form
EXTRA STUFF
More complex CTE
Imagine :
You’re a music producer. A data driven music producer. You need to understand the music trends so you can decide which artists you want to work with and what types of songs you should produce.
You’re curious about how the energy and danceability of hit songs have changed over the years. This could help you understand if there’s a trend towards more upbeat, danceable music or if the market is shifting towards more mellow tracks.
Let’s take a look at danceability and energy of tracks over the years
SELECT
year,
AVG(energy) AS avg_energy,
AVG(danceability) AS avg_danceability
FROM tracks
GROUP BY year;
You want to reuse these results to see how they evolve year after year. This is a job for the LAG()
function!
So let’s name the above query as yearly_stats
:
-- name the data gathering query as: yearly_stats
WITH yearly_stats AS (
SELECT
year,
AVG(energy) AS avg_energy,
AVG(danceability) AS avg_danceability
FROM tracks
GROUP BY year
)
SELECT
year,
avg_energy,
avg_danceability,
avg_energy - LAG(avg_energy) OVER (ORDER BY year) AS energy_change,
avg_danceability - LAG(avg_danceability) OVER (ORDER BY year) AS danceability_change
-- reuse the yearly_stats results from the named query
FROM yearly_stats
ORDER BY year DESC
LIMIT 10;
Music style versatility per artist
Next, you want to identify artists who are versatile in their musical style, as they might be open to suggestions and easy to work with. You create a versatility score based on the variance (
STDDEV()
) in some of their tracks’ features.
This versatility score is defined as the sum of (energy_std + danceability_std + acousticness_std + instrumentalness_std)
Let’s first get the raw data, the stats.
We only keep the artists with more than 4 tracks so that the standard deviation makes sense.
-- get the main sets of results
SELECT
artist,
COUNT(*) AS track_count,
ROUND(CAST(STDDEV(energy) AS numeric), 2) AS energy_std,
ROUND(CAST(STDDEV(danceability) AS numeric), 2) AS danceability_std,
ROUND(CAST(STDDEV(acousticness) AS numeric), 2) AS acousticness_std,
ROUND(CAST(STDDEV(instrumentalness) AS numeric), 2) AS instrumentalness_std
FROM tracks
GROUP BY artist
HAVING COUNT(*) >= 4;
Then calculate the versatility_score by resuing the above query as a named subquery
WITH artist_stats AS (
SELECT
artist,
COUNT(*) AS track_count,
ROUND(CAST(STDDEV(energy) AS numeric), 2) AS energy_std,
ROUND(CAST(STDDEV(danceability) AS numeric), 2) AS danceability_std,
ROUND(CAST(STDDEV(acousticness) AS numeric), 2) AS acousticness_std,
ROUND(CAST(STDDEV(instrumentalness) AS numeric), 2) AS instrumentalness_std
FROM tracks
GROUP BY artist
HAVING COUNT(*) >= 4
)
SELECT
artist,
track_count,
ROUND(
CAST( (energy_std + danceability_std + acousticness_std + instrumentalness_std) AS numeric), 2) AS versatility_score
FROM artist_stats;
And finally we order by the versatility_score
by re-using that last query as a named query artist_versatility
The final total query is
WITH artist_stats AS (
SELECT
artist,
COUNT(*) AS track_count,
ROUND(CAST(STDDEV(energy) AS numeric), 2) AS energy_std,
ROUND(CAST(STDDEV(danceability) AS numeric), 2) AS danceability_std,
ROUND(CAST(STDDEV(acousticness) AS numeric), 2) AS acousticness_std,
ROUND(CAST(STDDEV(instrumentalness) AS numeric), 2) AS instrumentalness_std
FROM tracks
GROUP BY artist
HAVING COUNT(*) >= 5
),
artist_versatility AS (
SELECT
artist,
track_count,
ROUND(CAST( (energy_std + danceability_std + acousticness_std + instrumentalness_std) AS numeric), 2) AS versatility_score
FROM artist_stats
)
SELECT
artist,
track_count,
versatility_score
FROM artist_versatility
ORDER BY versatility_score DESC
LIMIT 10;
Imagine what you can do now with CTEs and windows functions !!!
What does this query do ?
WITH artist_yearly_stats AS (
SELECT
artist,
year,
AVG(popularity) AS avg_popularity,
COUNT(*) AS track_count
FROM tracks
-- WHERE year >= 2010
GROUP BY artist, year
),
artist_growth AS (
SELECT
artist,
SUM(track_count) AS total_tracks,
MIN(avg_popularity) AS min_popularity,
MAX(avg_popularity) AS max_popularity,
MAX(avg_popularity) - MIN(avg_popularity) AS popularity_growth
FROM artist_yearly_stats
GROUP BY artist
HAVING SUM(track_count) >= 3 AND COUNT(DISTINCT year) >= 2
)
SELECT
artist,
total_tracks,
min_popularity,
max_popularity,
popularity_growth
FROM artist_growth
-- WHERE max_popularity > 60
ORDER BY popularity_growth DESC
LIMIT 10;
More WITH … AS
We’ve seen the simple WITH … AS clause.
A useful property of WITH queries is that they are normally evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work.
The downside is that
The multiply-referenced WITH query will be evaluated as written, without suppression of rows that the parent query might discard afterwards.
The query optimizer (next week) will sometimes choose to store the results of the WITH query in a temporary table, … or not.
There are cases where we want to enforce or avoid storing the results of the WITH query into a temp table.
We can do that ba adding MATERIALIZED
or NOT MATERIALIZED
to the WITH clause:
WITH w AS (NOT) MATERIALIZED (
SELECT * FROM big_table
)
SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref
WHERE w2.key = 123;
Recursive
Using RECURSIVE
, a WITH query can refer to its own output. A simple example is this query to sum the integers from 1 through 100:
WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
More on recursive WITH queries in https://www.postgresql.org/docs/current/queries-with.html#QUERIES-WITH-RECURSIVE
Further readings
- Documentation : https://www.postgresql.org/docs/current/queries-with.html
- youtube techFTQ channel : https://www.youtube.com/watch?v=QNfnuK-1YYY