Normalization Lab Solutions
normalization, spotify dataset, SQL aggregation
solution to S04 practice on normalization
Part 1 : explore the data
- how many artists?
select count(distinct artist) from tracks;
- number of tracks per artist?
SELECT artist, COUNT(*) AS track_count
FROM tracks
GROUP BY artist
ORDER BY track_count desc;
- Who are the artists with 1 track only?
SELECT artist
FROM tracks
GROUP BY artist
HAVING COUNT(*) = 1;
or re-using the query from Q2:
SELECT artist
FROM (
SELECT artist, COUNT(*) AS track_count
FROM tracks
GROUP BY artist
) AS artist_counts
WHERE track_count = 1;
- what is the average track duration per artist?
SELECT artist, AVG(duration) AS avg_duration
FROM tracks
GROUP BY artist;
II Why normalize?
Why is the artist column a good candidate for normalization ? Think of anomalies and list 2 reasons.
Insertion error: adding an artist with no track forces us to add null values to many rows Update error: if an artist changes name, then we have to update multiple tracks at once Deletion error: multiple artist have only one track. so if we delete the traclk, the aritst name diseapers.
So the artist
column checks all 3 types of anomalies.
III Normalization
- create a new artists table: id as primary key and name as text (or varchar)
CREATE table artists (
id serial primary key,
name VARCHAR(255)
);
- import the sorted artist names from tracks to the artists table
INSERT INTO artists (name)
SELECT DISTINCT t.artist
FROM tracks t
WHERE t.artist IS NOT NULL
order by t.artist;
- add a artist_id INT colun in trees
ALTER TABLE tracks ADD COLUMN artist_id INTEGER;
- reconcile both tables by updating the trees.artist_id with the correct artists.id
UPDATE tracks t
SET artist_id = a.id
FROM artists a
WHERE t.artist = a.name;
- make the trees.artists_id a foreign key referencing the artists table
ALTER TABLE tracks
ADD CONSTRAINT fk_tracks_artists
FOREIGN KEY (artist_id)
REFERENCES artists(id);
- check that there is no gap between the 2 tables: count the number of rows where trees.artist != artists.name
select count(*)
from tracks t
join artists a on a.id = t.artist_id
where t.artist != a.name;
- delete the artist column in the tracks table
ALTER TABLE tracks
DROP COLUMN artist;
VI Update the queries
- how many artists?
select count() from artists;
- number of tracks per artist?
SELECT a.name AS artist_name, COUNT(t.id) AS track_count
FROM artists a
JOIN tracks t ON a.id = t.artist_id
GROUP BY a.name;
- Who are the artists with 1 track only?
SELECT artist_name
FROM (
SELECT a.name AS artist_name, COUNT(t.id) AS track_count
FROM artists a
JOIN tracks t ON a.id = t.artist_id
GROUP BY a.name
) AS artist_track_counts
WHERE track_count = 1;
- what is the average track duration per artist?
SELECT a.name AS artist_name,
ROUND(AVG(t.duration),2) AS avg_duration
FROM artists a
JOIN tracks t ON a.id = t.artist_id
GROUP BY a.name;