Normalization Lab Solutions

normalization, spotify dataset, SQL aggregation

solution to S04 practice on normalization

Part 1 : explore the data

how many artists?

select count(distinct artist) from tracks;

number of tracks per artist?

SELECT artist, COUNT(*) AS track_count
FROM tracks
GROUP BY artist
ORDER BY track_count desc;

Who are the artists with 1 track only?

SELECT artist
FROM tracks
GROUP BY artist
HAVING COUNT(*) = 1;

or re-using the query from Q2:

SELECT artist
FROM (
    SELECT artist, COUNT(*) AS track_count
    FROM tracks
    GROUP BY artist
) AS artist_counts
WHERE track_count = 1;

what is the average track duration per artist?

SELECT artist, AVG(duration) AS avg_duration
FROM tracks
GROUP BY artist;

II Why normalize?

Why is the artist column a good candidate for normalization ? Think of anomalies and list 2 reasons.

Insertion error: adding an artist with no track forces us to add null values to many rows Update error: if an artist changes name, then we have to update multiple tracks at once Deletion error: multiple artist have only one track. so if we delete the traclk, the aritst name diseapers.

So the artist column checks all 3 types of anomalies.

III Normalization

create a new artists table: id as primary key and name as text (or varchar)

CREATE table artists (
    id serial primary key,
    name VARCHAR(255)
);

import the sorted artist names from tracks to the artists table

INSERT INTO artists (name)
SELECT DISTINCT t.artist
FROM tracks t
WHERE t.artist IS NOT NULL
order by t.artist;

add a artist_id INT colun in trees

ALTER TABLE tracks ADD COLUMN artist_id INTEGER;

reconcile both tables by updating the trees.artist_id with the correct artists.id

UPDATE tracks t
SET artist_id = a.id
FROM artists a
WHERE t.artist = a.name;

make the trees.artists_id a foreign key referencing the artists table

ALTER TABLE tracks
ADD CONSTRAINT fk_tracks_artists
FOREIGN KEY (artist_id)
REFERENCES artists(id);

check that there is no gap between the 2 tables: count the number of rows where trees.artist != artists.name

select count(*)
from tracks t
join artists a on a.id = t.artist_id
where t.artist != a.name;

delete the artist column in the tracks table

ALTER TABLE tracks
DROP COLUMN artist;

VI Update the queries

how many artists?

select count() from artists;

number of tracks per artist?

SELECT a.name AS artist_name, COUNT(t.id) AS track_count
FROM artists a
JOIN tracks t ON a.id = t.artist_id
GROUP BY a.name;

Who are the artists with 1 track only?

SELECT artist_name
FROM (
    SELECT a.name AS artist_name, COUNT(t.id) AS track_count
    FROM artists a
    JOIN tracks t ON a.id = t.artist_id
    GROUP BY a.name
) AS artist_track_counts
WHERE track_count = 1;

what is the average track duration per artist?

SELECT a.name AS artist_name,
    ROUND(AVG(t.duration),2) AS avg_duration
FROM artists a
JOIN tracks t ON a.id = t.artist_id
GROUP BY a.name;