Normalization and advanced querying

In this worksheet you will:

load a SQL backup of the worldhits database
normalize the database
use CTEs and window functions to query the database
store the results of the queries into tables
export your database as a SQL backup
upload your final database backup as a sql file in a google form

The google form is available at this address: https://forms.gle/dqBZiRGjH8rn24dg7.

Make sure to

fill in your name and Epita email address
answer the question(s)
at the end, upload your database sql file
if you have not finished the worksheet, backup your file and submit it as is

you must provide a SQL file backup, not a binary file. If you are not sure how to do that, ask me.

Important rules

you have online access: google, postgres documentation etc

do not use a chatgpt or any other AI tool to answer the questions. I will know. It’s easy to spot.

I. Data

The data is the worldhits dataset from spotify

This dataset is a curated collection of world music, featuring 326 tracks from 66 diverse artists spanning six decades, from 1958 to 2019. It offers a rich tapestry of global sounds, from traditional rhythms to contemporary fusions. Each track is meticulously tagged with Spotify audio features, providing insights into tempo, key, energy, and more. This dataset is ideal for exploring the evolution of world music, analyzing trends across different cultures, or even training machine learning models to recognize unique musical patterns. Whether you’re a researcher, data scientist, or music enthusiast, this dataset offers a fascinating glimpse into the world’s rich musical heritage.

You can read the data reference : Spotify 300 world music tracks

It is a very clean dataset. There are no NULL values, no outliers, no crazy stuff.

Load the data

Download the SQL backup from https://skatai.com/assets/db/worldhits_backup.sql

You can restore the database

from the terminal

psql -d postgres  -f <path to your file>/worldhits_backup.sql

in a psql session

\i <path to your file>/worldhits_backup.sql

This will create a new database : worldhitsdb

Once the database is restored, connect to the newly created database with

\c worldhitsdb

The database

The database is just one table : tracks with the following columns

select * from tracks limit 5;

 \d tracks
                                         Table "public.tracks"
      Column      |          Type          | Collation | Nullable |              Default
------------------+------------------------+-----------+----------+------------------------------------
 id               | integer                |           | not null | nextval('tracks_id_seq'::regclass)
 track            | character varying(255) |           |          |
 artist           | character varying(255) |           |          |
 album            | character varying(255) |           |          |
 year             | integer                |           |          |
 duration         | integer                |           |          |
 time_signature   | integer                |           |          |
 danceability     | double precision       |           |          |
 energy           | double precision       |           |          |
 key              | integer                |           |          |
 loudness         | double precision       |           |          |
 mode             | integer                |           |          |
 speechiness      | double precision       |           |          |
 acousticness     | double precision       |           |          |
 instrumentalness | double precision       |           |          |
 liveness         | double precision       |           |          |
 valence          | double precision       |           |          |
 tempo            | double precision       |           |          |
 popularity       | integer                |           |          |
Indexes:
    "tracks_pkey" PRIMARY KEY, btree (id)

The table has 326 rows.

II. Anomalies

Anomalies and the `artist` column

Run the following query to understand the artist column. How many tracks per artist.

select count(*) as n , artist
from tracks
group by artist
order by n DESC limit 10;

Change the order to ASC to see the least frequent artists.

Spot the update, delete and insert anomalies with respect to the artist column.

Explain in the google form how the artist column can suffer from each type of anomaly.

III. Normalization

In this section you will normalize the artist column of the database

Follow the process:

create a artists table with id (primary key) and name column
insert the UNIQUE artists names from the original tracks table
add a artist_id foreign key in the tracks table
- first add a INTEGER artist_id column
- then ADD CONSTRAINT to make it a foreign key
reconcile the tracks.artist_id with the artist.id.
- hint use the artist.name = tracks.artist condition in the UPDATE statement
double check that the artists table
- has 66 rows
- the artist names are unique
- the list of tracks for Deepak Dev is Mazha Paadum and Nilakudame
Once you are sure that the data has been imported, delete the original artist column from the tracks table.

-- 1. Create the artists table
CREATE TABLE artists (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE
);

-- 2. Insert unique artist names from tracks table
INSERT INTO artists (name)
SELECT DISTINCT artist
FROM tracks
WHERE artist IS NOT NULL
ORDER BY artist;

-- 3. Add artist_id foreign key column to tracks table
ALTER TABLE tracks
ADD COLUMN artist_id INTEGER;

-- Add foreign key constraint
ALTER TABLE tracks
ADD CONSTRAINT fk_tracks_artist
FOREIGN KEY (artist_id) REFERENCES artists(id);

-- 4. Reconcile tracks.artist_id with artists.id
UPDATE tracks
SET artist_id = artists.id
FROM artists
WHERE tracks.artist = artists.name;

-- Optional: Verify the reconciliation
-- SELECT t.id, t.track, t.artist, t.artist_id, a.name
-- FROM tracks t
-- LEFT JOIN artists a ON t.artist_id = a.id
-- LIMIT 10;

-- Optional: After verifying, you can drop the original artist column
-- ALTER TABLE tracks DROP COLUMN artist;

IV. First backup

Once you have normalized the artist column, backup your database as plain sql file

the filename should be: **worldhitsdb__01.sql**.
no spaces in the filename.

If you are using the terminal, run the following command to backup the database. replace the filename placeholder with your filename

pg_dump --file "worldhitsdb_<your_name>_01.sql" --format=p  "worldhitsdb"

or you can use pgadmin.

Make sure

to select the plain format
and in the data tab, check “no owner”

pgadmin_backup_worldhitsdb

Once the you have the first SQL backup of the database, upload it using the google form

Saving the query results into tables

You can store the output of any query into its own table with

CREATE TABLE tracks_analysis_results AS
<your query>

For instance, this query will create a table called artist_track_count with the number of tracks per artist and the artist name.

CREATE TABLE artist_track_count AS
select count(*) as track_count, a.name
from tracks t
join artists a on t.artist_id = a.id
group by a.name
order by a.name desc;

Run that query and check that the table has been created.

In the following questions you are asked to write queries and store the results into tables. The table name will be given.

V. Window functions

You can refer to the window functions postgres documentationor course slides (window functions).

V.1. Compare Artists popularity

Write a query that compares each track’s popularity to the average popularity of their artist.

store the query result into a table called artist_popularity.

The output should look like (this is only the first row)

track	artist_name	popularity	artist_avg_popularity
When You’re Falling	Afro Celt Sound System	36	29.00

hint : use AVG ... OVER (PARTITION BY ...) for the avg popularity
alias artist.name as artist_name
make sure to ROUND the avg popularity to 2 decimal places.

CREATE TABLE artist_popularity AS
SELECT
    t.track,
    a.name AS artist_name,
    t.popularity,
    ROUND(AVG(t.popularity) OVER (PARTITION BY t.artist_id), 2) AS artist_avg_popularity
FROM tracks t
JOIN artists a ON t.artist_id = a.id
ORDER BY a.name, t.popularity DESC;

V.2. Rank tracks by energy

Rank all tracks by energy for each artist

Store the query result into a table called artist_energy_rank

use RANK() OVER() for energy_rank
in the OVER clause, use PARTITION BY and ORDER BY

The output should be similar to

              track              |         artist_name    | energy | energy_rank
---------------------------------+------------------------+--------+-------------
To the End of the World          | Pat Metheny Group      |    0.54 |       1
Last Train Home                  | Pat Metheny Group      |   0.488 |       2
The Way Up: Opening & Part One   | Pat Metheny Group      |    0.41 |       3
Phase Dance                      | Pat Metheny Group      |   0.399 |       4

CREATE TABLE artist_energy_rank AS
SELECT
    t.track,
    a.name AS artist_name,
    t.energy,
    RANK() OVER (PARTITION BY t.artist_id ORDER BY t.energy DESC) AS energy_rank
FROM tracks t
JOIN artists a ON t.artist_id = a.id
-- where a.name != 'Al Di Meola'
ORDER BY a.name, energy_rank;

VI. CTE with clause

In the following section, you will write CTE queries.

first write the main query
then name the CTE and select from it

VI.1. Compare popularity with tempo for each artist

Write a query that calculates each artist average popularity and tempo, and ranks them on both metrics.

store the final query result into a table called artist_popularity_tempo_rank

The 1st query will calculate each artist avg_popularity and avg_tempo by grouping by artist name.

The 1st query results will look like that

 id |    artist_name     | track_count |   avg_popularity    |     avg_tempo
----+--------------------+-------------+---------------------+--------------------
 54 | Ofra Haza          |           7 | 13.0000000000000000 | 125.89957142857143
 29 | Enya               |           7 | 54.4285714285714286 |  92.07600000000001
  4 | Ali Mohammed Birra |           5 | 19.0000000000000000 |            143.159

Now, name that query (for instance artist_stats) and select from it.

WITH artist_stats as (
    <1st query>
)
SELECT ...
from artist_stats

ORDER BY popularity_rank, tempo_rank;

The main query will rank the artists using a DENSE_RANK window function ordered by avg_popularity + avg_tempo

The final results should look like

      artist       | track_count | artist_avg_popularity | artist_avg_tempo | popularity_rank | tempo_rank
-------------------+-------------+-----------------------+------------------+-----------------+------------
 Manu Chao         |           9 |                 59.78 |           129.90 |               1 |          8
 Enya              |           7 |                 54.43 |            92.08 |               2 |         52
 Ilaiyaraaja       |           5 |                 50.80 |           127.36 |               3 |         13

store the final query result into a table called artist_popularity_tempo_rank

hints:

Use GROUP BY with the artist: When calculating averages per artist, group by a.name:
Apply aggregations for averages: Use AVG() function on the tracks’ columns popularity and tempo
Don’t forget ROUND(...::numeric, 2) for clean numbers
you need to cast avg_tempo and avg_popularioty to numeric for the DENSE_RANK() window function to work
use DENSE_RANK() window functions
[optional] Filter with HAVING
- HAVING filters groups AFTER aggregation: HAVING COUNT(*) >= 3

CREATE TABLE artist_popularity_tempo_rank AS
WITH artist_stats AS (
    SELECT
        a.id,
        a.name AS artist_name,
        COUNT(*) AS track_count,
        AVG(t.popularity) AS avg_popularity,
        AVG(t.tempo) AS avg_tempo
    FROM tracks t
    JOIN artists a ON t.artist_id = a.id
    WHERE t.popularity IS NOT NULL AND t.tempo IS NOT NULL
    GROUP BY a.id, a.name
    HAVING COUNT(*) >= 3
)
SELECT
    artist_name,
    track_count,
    ROUND(avg_popularity::numeric, 2) AS avg_popularity,
    ROUND(avg_tempo::numeric, 2) AS avg_tempo,
    DENSE_RANK() OVER (ORDER BY avg_popularity DESC) AS popularity_rank,
    DENSE_RANK() OVER (ORDER BY avg_tempo DESC) AS tempo_rank
FROM artist_stats
ORDER BY popularity_rank, tempo_rank;

VI.2 Find each artist’s most popular track

store the final query result into a table called artist_most_popular_track

first query: Using ROW_NUMBER or RANK or DENSE_RANK (your choice) list the tracks per artist ranked by popularity

The result should look like that

      artist       |             track              | popularity | rank
-------------------+--------------------------------+------------+------
 Youssou N'Dour    | 7 Seconds (feat. Neneh Cherry) |         47 |    1
 Youssou N'Dour    | Set                            |         25 |    2
 Youssou N'Dour    | Salimata                       |         20 |    3
 Youssou N'Dour    | Birima                         |         18 |    4
 Youssou N'Dour    | Mame Bamba                     |         16 |    5

Now name the CTE artist_tracks_ranked and select from it to get the most popular track per artist

Save the full query in a table named most_popular_track_per_artist

The final result should look like

           artist           |             track              | popularity
----------------------------+--------------------------------+------------
 Youssou N'Dour             | 7 Seconds (feat. Neneh Cherry) |         47
 Vieux Farka Touré          | Tongo Barra                    |         52
 The Kingston Trio          | Tom Dooley                     |         41

CREATE TABLE artist_most_popular_track AS

WITH artist_tracks_ranked AS (
    SELECT
        a.name AS artist,
        t.track,
        t.popularity,
        ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY t.popularity DESC) AS rank
    FROM tracks t
    JOIN artists a ON t.artist_id = a.id
    WHERE t.popularity IS NOT NULL
)
SELECT artist, track, popularity
FROM artist_tracks_ranked
WHERE rank = 1
ORDER BY popularity DESC;

VI.3 Loudness evolution of the years

Store the query result into a table called loudness_correlation_with_year

Using a CTE you are going to calculate the correlation of loudness with year

1st write a query that groups tracks by year and calculates the average loudness for each year.
then name that query and select from it to calculate the correlation of year and avg_loudness
use CORR function: CORR(year, avg_loudness)
round the result to 3 decimal places

The result should be a single number

 loudness_correlation_with_year
--------------------------------
                          0.386

-- Correlation of loudness with year
CREATE TABLE loudness_correlation_with_year AS
WITH yearly_stats AS (
    SELECT
        year,
        AVG(loudness) AS avg_loudness
    FROM tracks
    WHERE year IS NOT NULL
    GROUP BY year
)
SELECT
    ROUND(CORR(year, avg_loudness)::numeric, 3) AS loudness_correlation_with_year
FROM yearly_stats;

VI.4 Top 3 fastest and slowest songs per artist

Store the query result into a table called fast_slow_tracks

The 1st query uses RANK to list the tracks per artist ranked by tempo both for fastest and slowest tracks.

The result shoud look like that

         artist         |              track              | tempo | fast_rank | slow_rank
------------------------+---------------------------------+-------+-----------+-----------
 Afro Celt Sound System | Saor / Free / News from Nowhere |  82.0 |         4 |         1
 Afro Celt Sound System | When You're Falling             |  99.0 |         3 |         2
 Afro Celt Sound System | Release                         | 100.1 |         2 |         3

The 2nd query uses the CTE to select the top fastest and top slowest tracks per artist.

Each artists song should be tagged as “Fastest” or “Slowest”

The result should look like :

         artist         |               track                | tempo | category
------------------------+------------------------------------+-------+----------
 Afro Celt Sound System | Whirl-Y-Reel 1                     | 132.0 | Fastest
 Afro Celt Sound System | Saor / Free / News from Nowhere    |  82.0 | Slowest
 Al Di Meola            | The Wizard                         | 157.9 | Fastest
 Al Di Meola            | Race With Devil On Spanish Highway |  88.6 | Slowest

[optional] Finally, we also want to know the number of tracks per artist

Write another named query that groups by artist and counts the number of tracks per artist.

the end result should look like

         artist         | track_count |               track                | tempo | category
------------------------+-------------+------------------------------------+-------+----------
 Afro Celt Sound System |           4 | Whirl-Y-Reel 1                     | 132.0 | Fastest
 Afro Celt Sound System |           4 | Saor / Free / News from Nowhere    |  82.0 | Slowest
 Al Di Meola            |           6 | The Wizard                         | 157.9 | Fastest
 Al Di Meola            |           6 | Race With Devil On Spanish Highway |  88.6 | Slowest

CREATE TABLE fast_slow_tracks AS
WITH ranked_tracks AS (
    SELECT
        a.name AS artist,
        t.track,
        ROUND(t.tempo::numeric, 1) AS tempo,
        ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY t.tempo DESC) AS fast_rank,
        ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY t.tempo ASC) AS slow_rank
    FROM tracks t
    JOIN artists a ON t.artist_id = a.id
    WHERE t.tempo IS NOT NULL
),
track_count AS (
    SELECT
        a.id AS artist_id,
        a.name AS artist,
        COUNT(*) AS track_count
    FROM tracks t
    JOIN artists a ON t.artist_id = a.id
    GROUP BY a.id, a.name
)
SELECT
    rt.artist,
    tc.track_count,
    rt.track,
    rt.tempo,
    CASE
        WHEN rt.fast_rank = 1 THEN 'Fastest'
        WHEN rt.slow_rank = 1 THEN 'Slowest'
    END AS category
FROM ranked_tracks rt
JOIN track_count tc ON rt.artist = tc.artist
WHERE rt.fast_rank = 1 OR rt.slow_rank = 1
ORDER BY rt.artist, rt.tempo DESC;

VII final backup

Create a new SQL backup of your database.

the filename should be: **worldhitsdb__02.sql**.
no spaces in the filename.

Note the _02 postfix in the filename.

If you are using the terminal, run the following command to backup the database. replace the filename placeholder with your filename

pg_dump --file "worldhitsdb_<your_name>_02.sql" --format=p  "worldhitsdb"

or you can use pgadmin.

make sure

to select the plain format
and in the data tab, check “no owner”

pgadmin_backup_worldhitsdb

Once the you have the first SQL backup of the database, upload it using the google form

EXTRA STUFF

More complex CTE

Imagine :

You’re a music producer. A data driven music producer. You need to understand the music trends so you can decide which artists you want to work with and what types of songs you should produce.

You’re curious about how the energy and danceability of hit songs have changed over the years. This could help you understand if there’s a trend towards more upbeat, danceable music or if the market is shifting towards more mellow tracks.

Let’s take a look at danceability and energy of tracks over the years

SELECT
    year,
    AVG(energy) AS avg_energy,
    AVG(danceability) AS avg_danceability
FROM tracks
GROUP BY year;

You want to reuse these results to see how they evolve year after year. This is a job for the LAG() function!

So let’s name the above query as yearly_stats :

-- name the data gathering query as: yearly_stats
WITH yearly_stats AS (
    SELECT
        year,
        AVG(energy) AS avg_energy,
        AVG(danceability) AS avg_danceability
    FROM tracks
    GROUP BY year
)
SELECT
    year,
    avg_energy,
    avg_danceability,
    avg_energy - LAG(avg_energy) OVER (ORDER BY year) AS energy_change,
    avg_danceability - LAG(avg_danceability) OVER (ORDER BY year) AS danceability_change
-- reuse the yearly_stats results from the named query
FROM yearly_stats
ORDER BY year DESC
LIMIT 10;

Music style versatility per artist

Next, you want to identify artists who are versatile in their musical style, as they might be open to suggestions and easy to work with. You create a versatility score based on the variance (STDDEV()) in some of their tracks’ features.

This versatility score is defined as the sum of (energy_std + danceability_std + acousticness_std + instrumentalness_std)

Let’s first get the raw data, the stats.

We only keep the artists with more than 4 tracks so that the standard deviation makes sense.

-- get the main sets of results
SELECT
    artist,
    COUNT(*) AS track_count,
    ROUND(CAST(STDDEV(energy) AS numeric), 2) AS energy_std,
    ROUND(CAST(STDDEV(danceability) AS numeric), 2) AS danceability_std,
    ROUND(CAST(STDDEV(acousticness) AS numeric), 2) AS acousticness_std,
    ROUND(CAST(STDDEV(instrumentalness) AS numeric), 2) AS instrumentalness_std
FROM tracks
GROUP BY artist
HAVING COUNT(*) >= 4;

Then calculate the versatility_score by resuing the above query as a named subquery

WITH artist_stats AS (
    SELECT
        artist,
        COUNT(*) AS track_count,
        ROUND(CAST(STDDEV(energy) AS numeric), 2) AS energy_std,
        ROUND(CAST(STDDEV(danceability) AS numeric), 2) AS danceability_std,
        ROUND(CAST(STDDEV(acousticness) AS numeric), 2) AS acousticness_std,
        ROUND(CAST(STDDEV(instrumentalness) AS numeric), 2) AS instrumentalness_std
    FROM tracks
    GROUP BY artist
    HAVING COUNT(*) >= 4
)
SELECT
    artist,
    track_count,
    ROUND(
        CAST( (energy_std + danceability_std + acousticness_std + instrumentalness_std) AS numeric), 2) AS versatility_score
FROM artist_stats;

And finally we order by the versatility_score by re-using that last query as a named query artist_versatility

The final total query is

WITH artist_stats AS (
    SELECT
        artist,
        COUNT(*) AS track_count,
        ROUND(CAST(STDDEV(energy) AS numeric), 2) AS energy_std,
        ROUND(CAST(STDDEV(danceability) AS numeric), 2) AS danceability_std,
        ROUND(CAST(STDDEV(acousticness) AS numeric), 2) AS acousticness_std,
        ROUND(CAST(STDDEV(instrumentalness) AS numeric), 2) AS instrumentalness_std
    FROM tracks
    GROUP BY artist
    HAVING COUNT(*) >= 5
),
artist_versatility AS (
    SELECT
        artist,
        track_count,
        ROUND(CAST( (energy_std + danceability_std + acousticness_std + instrumentalness_std) AS numeric), 2) AS versatility_score
    FROM artist_stats
)
SELECT
    artist,
    track_count,
    versatility_score
FROM artist_versatility
ORDER BY versatility_score DESC
LIMIT 10;

Imagine what you can do now with CTEs and windows functions !!!

What does this query do ?

WITH artist_yearly_stats AS (
    SELECT
        artist,
        year,
        AVG(popularity) AS avg_popularity,
        COUNT(*) AS track_count
    FROM tracks
    -- WHERE year >= 2010
    GROUP BY artist, year
),
artist_growth AS (
    SELECT
        artist,
        SUM(track_count) AS total_tracks,
        MIN(avg_popularity) AS min_popularity,
        MAX(avg_popularity) AS max_popularity,
        MAX(avg_popularity) - MIN(avg_popularity) AS popularity_growth
    FROM artist_yearly_stats
    GROUP BY artist
    HAVING SUM(track_count) >= 3 AND COUNT(DISTINCT year) >= 2
)
SELECT
    artist,
    total_tracks,
    min_popularity,
    max_popularity,
    popularity_growth
FROM artist_growth
-- WHERE max_popularity > 60
ORDER BY popularity_growth DESC
LIMIT 10;

More WITH … AS

We’ve seen the simple WITH … AS clause.

A useful property of WITH queries is that they are normally evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work.

The downside is that

The multiply-referenced WITH query will be evaluated as written, without suppression of rows that the parent query might discard afterwards.

The query optimizer (next week) will sometimes choose to store the results of the WITH query in a temporary table, … or not.

There are cases where we want to enforce or avoid storing the results of the WITH query into a temp table. We can do that ba adding MATERIALIZED or NOT MATERIALIZED to the WITH clause:

WITH w AS (NOT) MATERIALIZED (
    SELECT * FROM big_table
)
SELECT * FROM w AS w1 JOIN w AS w2 ON w1.key = w2.ref
WHERE w2.key = 123;

Recursive

Using RECURSIVE, a WITH query can refer to its own output. A simple example is this query to sum the integers from 1 through 100:

WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;

More on recursive WITH queries in https://www.postgresql.org/docs/current/queries-with.html#QUERIES-WITH-RECURSIVE