moviesdbmovies with the right columnsThe full dataset is available here
Columns:
Series_Title: Movie nameReleased_Year: When it came outRuntime: Duration in minutesGenre: Movie categoriesIMDB_Rating: Score (0-10)Director: Who directed itStar1, Star2, Star3, Star4: Main actorsNo_of_Votes: How many people ratedGross: Box office earningsReal movies, real data, real queries
we can do that with
CREATE database moviesdb;
The connect to the database
\c moviesdb
how to avoid having to specify all the columns to crete the right table
=> chatGPT!
just upload the csv file and ask for :
here is the imdb_top_1000 dataset in csv format,
write the sql query that creates a table called movies in postgresql
- only use the `TEXT` data type
- keep the table definition simple
- do not assume any preprocessing on load
keep the table definition as simple as possible
for the moment I am just goin to create the table with TEXT as the unique data type and no primary key
CREATE TABLE movies(
poster_link TEXT,
Title TEXT,
released_year TEXT,
certificate TEXT,
runtime TEXT,
genre TEXT,
imdb_rating TEXT,
overview TEXT,
meta_score TEXT,
director TEXT,
star1 TEXT,
star2 TEXT,
star3 TEXT,
star4 TEXT,
no_of_votes TEXT,
gross TEXT
);
Check that the table is created
\d movies
Several things stand out
SERIAL for the primary key. as expectedTEXT data type. The more robust.TEXT data type. for instance runtime, released_year, certificate, gross. the original values are indeed strings: “142 min”, “2000”, “PG-13”, “$28,341,469”.\copy movies FROM './imdb_top_1000.csv' DELIMITER ',' CSV HEADER;
make sure the csv file is in the right directory
SELECT COUNT(*) FROM movies;
or get a random sample
select * from movies order by random() limit 1;
We have 4 stars columns : Star1, Star2, Star3, Star4.
This is not good database design
what if we want to add a fifth actor to a certain movie?
So we need to create a new table for actors
and the relation between actors and movies is many to many
This is called normalization. we’ll define exactly what that is in the next session.
For the moment let’s explore the dataset
You can download the database dump in plain format here
and restore it with
pg_restore \
--host "localhost" \
--port "5432" \
--username "alexis" \
--dbname "moviesdb" \
--verbose \
"/Users/alexis/imdb.dump"
before restoring the database, make sure you have created the database
CREATE database moviesdb;
or
createdb -h localhost -p 5432 -U <superuser> moviesdb
The table has no primary key
Let’s add one
-- 1) Add a new serial column
ALTER TABLE movies
ADD COLUMN id SERIAL;
-- 2) Make it the primary key
ALTER TABLE movies
ADD PRIMARY KEY (id);
many columns hsould be numbers or booleans , let’s change the data type
no_of_votes, meta_score and released_year should be INTs
We need to handle null values. However none of these columns have null values
select count(*) from movies where released_year IS NULL or released_year = '';
select count(*) from movies where meta_score IS NULL or meta_score = '';
select count(*) from movies where no_of_votes IS NULL or no_of_votes = '';
Some columns returns 0 (no missing values), others returns more than 0.
let’s convert the no_of_votes column to INT
ALTER TABLE movies
ALTER COLUMN no_of_votes TYPE INT
USING no_of_votes::INT;
This query will convert the empty string to a null value:
ALTER TABLE movies
ALTER COLUMN meta_score TYPE INT
USING NULLIF(meta_score, '')::INT;
Then we can calculate the average number of votes
SELECT avg(no_of_votes) FROM movies;
do the same for released_year
now let’s handle IMDB_Rating
ALTER TABLE movies
ALTER COLUMN imdb_rating TYPE FLOAT
USING imdb_rating::FLOAT;
the movie title is called Series_Title in the dataset. It’s too long, I’ll rename it to title
ALTER TABLE movies
ALTER COLUMN Series_Title
RENAME TO Title;
Projection = Selecting which columns to display
Like choosing camera angles in filmmaking:
SELECT * (all columns)SELECT title (one column)SELECT title, director, year (specific columns)Don’t fetch data you don’t need!
Avoid
select *
-- Get everything (memory heavy, often unnecessary)
SELECT * FROM movies;
-- Just the essentials
SELECT Title, Released_Year, IMDB_Rating
FROM movies;
-- Director's filmography view
SELECT Title, Director, Released_Year
FROM movies;
-- Quick rating check
SELECT Title AS "Movie",
IMDB_Rating AS "Score"
FROM movies;
Filtering = Selecting which rows to include
Like Netflix filters:
WHERE Genre LIKE '%Action%'WHERE Released_Year > 2000WHERE IMDB_Rating >= 8.0The WHERE clause is your search bar
-- Modern classics (2000s and later)
SELECT Title, Released_Year, IMDB_Rating
FROM movies
WHERE Released_Year >= 2000;
-- Highly rated films
SELECT Title, IMDB_Rating, Director
FROM movies
WHERE IMDB_Rating > 8.5;
-- Christopher Nolan films
SELECT Title, Released_Year, IMDB_Rating
FROM movies
WHERE Director = 'Christopher Nolan';
---
# Filtering: Comparison Operators
| Operator | Meaning | Example |
|----------|---------|---------|
| `=` | Equals | `WHERE Director = 'Spielberg'` |
| `!=` or `<>` | Not equals | `WHERE Certificate != 'R'` |
| `>`, `>=` | Greater than (or equal) | `WHERE IMDB_Rating >= 8.0` |
| `<`, `<=` | Less than (or equal) | `WHERE Released_Year <= 1980` |
| `BETWEEN` | Range | `WHERE Released_Year BETWEEN 1990 AND 2000` |
| `IN` | List of values | `WHERE Director IN ('Nolan', 'Spielberg')` |
| `LIKE` | Pattern match | `WHERE Genre LIKE '%Sci-Fi%'` |
---
# Demo: Advanced Filtering
90s movies (using BETWEEN)
```sql
SELECT Title, Released_Year
FROM movies
WHERE Released_Year BETWEEN 1990 AND 1999;
Multiple directors (using IN)
SELECT Title, Director, IMDB_Rating
FROM movies
WHERE Director IN ('Quentin Tarantino', 'Martin Scorsese', 'Christopher Nolan');
Sci-Fi movies (using LIKE with wildcards)
SELECT Title, Genre, IMDB_Rating
FROM movies
WHERE Genre LIKE '%Sci-Fi%';
Movies with Tom Hanks in any role
SELECT Title, Star1, Star2, Star3, Star4
FROM movies
WHERE Star1 = 'Tom Hanks'
OR Star2 = 'Tom Hanks'
OR Star3 = 'Tom Hanks'
OR Star4 = 'Tom Hanks';
AND: All conditions must be true OR: At least one condition must be true NOT: Inverts the condition
-- Parentheses control precedence!
WHERE (condition1 AND condition2) OR condition3
-- Different from:
WHERE condition1 AND (condition2 OR condition3)
Recent AND highly rated
SELECT Title, Released_Year, IMDB_Rating
FROM movies
WHERE Released_Year >= 2010
AND IMDB_Rating >= 8.0;
Action OR Adventure movies from the 2000s
SELECT Title, Genre, Released_Year
FROM movies
WHERE (Genre LIKE '%Action%' OR Genre LIKE '%Adventure%')
AND Released_Year >= 2000;
NOT a sequel (no numbers in title)
SELECT Title
FROM movies
WHERE Title NOT LIKE '%2%'
AND Title NOT LIKE '%II%'
AND Title NOT LIKE '%Part%';
High-rated dramas that aren't too long
```sql
SELECT Title, Genre, Runtime, IMDB_Rating
FROM movies
WHERE Genre LIKE '%Drama%'
AND IMDB_Rating >= 8.0
AND Runtime < 150;
Default order is unpredictable - ORDER BY gives you control:
ASC: Ascending (A→Z, 0→9) - DefaultDESC: Descending (Z→A, 9→0)Can sort by multiple columns (like sorting Excel)
Top rated movies first
SELECT Title, IMDB_Rating
FROM movies
ORDER BY IMDB_Rating DESC;
Oldest movies first
SELECT Title, Released_Year
FROM movies
ORDER BY Released_Year ASC; -- ASC is optional (default)
Alphabetical by title
SELECT Title, Director
FROM movies
ORDER BY Title;
Multi-level sort: Year (newest first), then rating (highest first)
SELECT Title, Released_Year, IMDB_Rating
FROM movies
ORDER BY Released_Year DESC, IMDB_Rating DESC;
Director’s best work (sort by director, then their best films)
SELECT Director, Title, IMDB_Rating
FROM movies
ORDER BY Director, IMDB_Rating DESC;
LIMIT restricts how many rows are returned:
PostgreSQL: LIMIT n
Top 10 highest rated movies
SELECT Title, IMDB_Rating
FROM movies
ORDER BY IMDB_Rating DESC
LIMIT 10;
5 most recent movies
SELECT Title, Released_Year
FROM movies
ORDER BY Released_Year DESC
LIMIT 5;
Preview the data (common for exploration)
SELECT * FROM movies
LIMIT 3;
Page 2 of results (items 11-20) using OFFSET
SELECT Title, IMDB_Rating
FROM movies
ORDER BY IMDB_Rating DESC
LIMIT 10 OFFSET 10;
Bottom 5 rated movies (combine ORDER BY and LIMIT)
SELECT Title, IMDB_Rating
FROM movies
ORDER BY IMDB_Rating ASC
LIMIT 5;
|| or CONCAT() joins strings together:
-- PostgreSQL/SQLite
column1 || ' ' || column2
Like using + for strings in Python
Create a “Movie (Year)” format
SELECT Title || ' (' || Released_Year || ')' AS movie_info
FROM movies
LIMIT 10;
Build a credits line
SELECT Title,
'Directed by ' || Director AS credits
FROM movies
LIMIT 10;
Create a star list
SELECT Title,
Star1 || ', ' || Star2 || ', ' || Star3 AS main_cast
FROM movies
LIMIT 10;
Make a movie description
SELECT Title || ': A ' || Runtime || ' minute ' || Genre || ' film' AS description
FROM movies
WHERE IMDB_Rating > 8.5
LIMIT 5;
PostgreSQL also supports CONCAT function
SELECT CONCAT(Director, ' directed ', Title) AS movie_fact
FROM movies
LIMIT 10;
COALESCE(value1, value2, ...) returns the first non-NULL value and reverts to a default value if all values are NULL.
For instance
COALESCE(phone, email, 'No contact')
Essential for dealing with missing data
Replace NULL gross earnings with ‘Not Available’
SELECT Title,
COALESCE(Gross, 'Not Available') AS box_office
FROM movies
LIMIT 20;
Handle missing Metascores
SELECT Title,
IMDB_Rating,
COALESCE(Meta_score, 0) AS critic_score
FROM movies;
Fallback for missing cast members
SELECT Title,
COALESCE(Star4, Star3, Star2, Star1, 'No cast info') AS supporting_actor
FROM movies
LIMIT 20;
Calculate revenue with NULL handling
SELECT Title,
COALESCE(Gross, '0') AS revenue,
No_of_Votes
FROM movies
ORDER BY No_of_Votes DESC
LIMIT 10;
Smart NULL replacement for display
SELECT Title,
Director,
COALESCE(Certificate, 'Unrated') AS rating
FROM movies
WHERE Released_Year > 2010;
CAST(expression AS type) or ::type (PostgreSQL)
Common conversions:
CAST('123' AS INTEGER)CAST(2024 AS TEXT)CAST('2024-01-15' AS DATE)Like int() or str() in Python
PostgreSQL shorthand with ::
SELECT Title || ' (' || Released_Year::TEXT || ')' AS movie_label
FROM movies
LIMIT 10;
Convert runtime to hours (decimal)
SELECT Title,
Runtime,
CAST(Runtime AS FLOAT) / 60 AS hours
FROM movies
WHERE Runtime > 180;
Clean up Gross (remove commas, convert to numeric)
SELECT Title,
Gross,
CAST(REPLACE(Gross, ',', '') AS BIGINT) AS gross_numeric
FROM movies
WHERE Gross IS NOT NULL
LIMIT 10;
Convert rating to integer for grouping
SELECT Title,
IMDB_Rating,
CAST(IMDB_Rating AS INTEGER) AS rating_bucket
FROM movies
ORDER BY IMDB_Rating DESC
LIMIT 20;
Find top modern action movies with complete info
SELECT
Title || ' (' || Released_Year::TEXT || ')' AS movie,
'Dir: ' || Director AS director,
COALESCE(Star1, 'Unknown') || ', ' || COALESCE(Star2, '') AS stars
IMDB_Rating
FROM movies
WHERE Genre LIKE '%Action%'
AND Released_Year >= 2000
AND IMDB_Rating >= 7.0
ORDER BY IMDB_Rating DESC, Released_Year DESC
LIMIT 15;
This one query uses everything we learned!
Pattern 1: Top N by category
SELECT * FROM movies
WHERE Genre LIKE '%Horror%'
ORDER BY IMDB_Rating DESC
LIMIT 10;
Pattern 2: NULL-safe concatenation
SELECT COALESCE(Star1, 'Unknown') || ' in ' || Title
FROM movies;
Pattern 3: Year range analysis
SELECT * FROM movies
WHERE CAST(Released_Year AS INTEGER) BETWEEN 2010 AND 2020
ORDER BY Released_Year;
Pattern 4: Multi-condition filtering
SELECT * FROM movies
WHERE Runtime < 120
AND IMDB_Rating > 8.0
AND Released_Year >= 2000;
SELECT): Choose your columnsWHERE): Choose your rowsORDER BY): Control the sequenceLIMIT): Control the quantity||, CONCAT): Combine textMaster these, and you can answer 80% of data questions