SELECT Queries

Mastering Data Retrieval with IMDB Top 1000

The full dataset is available here

import the dataset

create a new database called moviesdb
create the table movies with the right columns
import the dataset with psql or in pgadmin

Our Dataset: IMDB Top 1000

The full dataset is available here

Columns:

Series_Title: Movie name
Released_Year: When it came out
Runtime: Duration in minutes
Genre: Movie categories
IMDB_Rating: Score (0-10)
Director: Who directed it
Star1, Star2, Star3, Star4: Main actors
No_of_Votes: How many people rated
Gross: Box office earnings

Real movies, real data, real queries

Create the database

we can do that with

CREATE database moviesdb;

The connect to the database

\c moviesdb

Create the table movies

how to avoid having to specify all the columns to crete the right table

=> chatGPT!

just upload the csv file and ask for :

here is the imdb_top_1000 dataset in csv format,
write the sql query that creates a table called movies in postgresql
- only use the `TEXT` data type
- keep the table definition simple
- do not assume any preprocessing on load
keep the table definition as simple as possible

for the moment I am just goin to create the table with TEXT as the unique data type and no primary key

CREATE TABLE movies(
    poster_link     TEXT,
    Title    TEXT,
    released_year   TEXT,
    certificate     TEXT,
    runtime         TEXT,
    genre           TEXT,
    imdb_rating     TEXT,
    overview        TEXT,
    meta_score      TEXT,
    director        TEXT,
    star1           TEXT,
    star2           TEXT,
    star3           TEXT,
    star4           TEXT,
    no_of_votes     TEXT,
    gross           TEXT
);

review and check the table

Check that the table is created

\d movies

Several things stand out

SERIAL for the primary key. as expected
mostly TEXT data type. The more robust.
multiple columns have TEXT data type. for instance runtime, released_year, certificate, gross. the original values are indeed strings: “142 min”, “2000”, “PG-13”, “$28,341,469”.
no constraints on the columns (Unicity, not NULL, etc )

import the dataset

\copy movies FROM './imdb_top_1000.csv' DELIMITER ',' CSV HEADER;

make sure the csv file is in the right directory

check

SELECT COUNT(*) FROM movies;

or get a random sample

select * from movies order by random() limit 1;

Actors !

We have 4 stars columns : Star1, Star2, Star3, Star4.

This is not good database design

what if we want to add a fifth actor to a certain movie?

we need a new column
all the other movies will have NULL values in this column

So we need to create a new table for actors

and the relation between actors and movies is many to many

an actor can be in many movies
a movie can have many actors

This is called normalization. we’ll define exactly what that is in the next session.

For the moment let’s explore the dataset

The database is available

You can download the database dump in plain format here

and restore it with

pg_restore \
  --host "localhost" \
  --port "5432" \
  --username "alexis" \
  --dbname "moviesdb" \
  --verbose \
  "/Users/alexis/imdb.dump"

before restoring the database, make sure you have created the database

CREATE database moviesdb;

or

createdb -h localhost -p 5432 -U <superuser> moviesdb

primary key ?

The table has no primary key

Let’s add one

-- 1) Add a new serial column
ALTER TABLE movies
ADD COLUMN id SERIAL;

-- 2) Make it the primary key
ALTER TABLE movies
ADD PRIMARY KEY (id);

less TEXT

many columns hsould be numbers or booleans , let’s change the data type

no_of_votes, meta_score and released_year should be INTs

We need to handle null values. However none of these columns have null values

select count(*) from movies where released_year IS  NULL or released_year = '';
select count(*) from movies where meta_score IS NULL or meta_score = '';
select count(*) from movies where no_of_votes IS NULL or no_of_votes = '';

Some columns returns 0 (no missing values), others returns more than 0.

let’s convert the no_of_votes column to INT

ALTER TABLE movies
ALTER COLUMN no_of_votes TYPE INT
USING no_of_votes::INT;

This query will convert the empty string to a null value:

ALTER TABLE movies
ALTER COLUMN meta_score TYPE INT
USING NULLIF(meta_score, '')::INT;

Then we can calculate the average number of votes

SELECT avg(no_of_votes) FROM movies;

do the same for released_year

now let’s handle IMDB_Rating

ALTER TABLE movies
ALTER COLUMN imdb_rating TYPE FLOAT
USING imdb_rating::FLOAT;

rename columns

the movie title is called Series_Title in the dataset. It’s too long, I’ll rename it to title

ALTER TABLE movies
ALTER COLUMN Series_Title
RENAME TO Title;

Projection: Choosing Your Columns

SELECT Specific Fields

Projection = Selecting which columns to display

Like choosing camera angles in filmmaking:

Wide shot: SELECT * (all columns)
Close-up: SELECT title (one column)
Multiple angles: SELECT title, director, year (specific columns)

Don’t fetch data you don’t need!

Avoid select *

Demo: Projection

-- Get everything (memory heavy, often unnecessary)
SELECT * FROM movies;

-- Just the essentials
SELECT Title, Released_Year, IMDB_Rating
FROM movies;

-- Director's filmography view
SELECT Title, Director, Released_Year
FROM movies;

-- Quick rating check
SELECT Title AS "Movie",
       IMDB_Rating AS "Score"
FROM movies;

Filtering: Finding the Right Rows

WHERE Clause

Filtering = Selecting which rows to include

Like Netflix filters:

By genre: WHERE Genre LIKE '%Action%'
By year: WHERE Released_Year > 2000
By rating: WHERE IMDB_Rating >= 8.0

The WHERE clause is your search bar

Demo: Filtering Basics

-- Modern classics (2000s and later)
SELECT Title, Released_Year, IMDB_Rating
FROM movies
WHERE Released_Year >= 2000;

-- Highly rated films
SELECT Title, IMDB_Rating, Director
FROM movies
WHERE IMDB_Rating > 8.5;

-- Christopher Nolan films
SELECT Title, Released_Year, IMDB_Rating
FROM movies
WHERE Director = 'Christopher Nolan';


---

# Filtering: Comparison Operators

| Operator | Meaning | Example |
|----------|---------|---------|
| `=` | Equals | `WHERE Director = 'Spielberg'` |
| `!=` or `<>` | Not equals | `WHERE Certificate != 'R'` |
| `>`, `>=` | Greater than (or equal) | `WHERE IMDB_Rating >= 8.0` |
| `<`, `<=` | Less than (or equal) | `WHERE Released_Year <= 1980` |
| `BETWEEN` | Range | `WHERE Released_Year BETWEEN 1990 AND 2000` |
| `IN` | List of values | `WHERE Director IN ('Nolan', 'Spielberg')` |
| `LIKE` | Pattern match | `WHERE Genre LIKE '%Sci-Fi%'` |

---

# Demo: Advanced Filtering

90s movies (using BETWEEN)

```sql
SELECT Title, Released_Year
FROM movies
WHERE Released_Year BETWEEN 1990 AND 1999;

Multiple directors (using IN)

SELECT Title, Director, IMDB_Rating
FROM movies
WHERE Director IN ('Quentin Tarantino', 'Martin Scorsese', 'Christopher Nolan');

Sci-Fi movies (using LIKE with wildcards)

SELECT Title, Genre, IMDB_Rating
FROM movies
WHERE Genre LIKE '%Sci-Fi%';

Movies with Tom Hanks in any role

SELECT Title, Star1, Star2, Star3, Star4
FROM movies
WHERE Star1 = 'Tom Hanks'
   OR Star2 = 'Tom Hanks'
   OR Star3 = 'Tom Hanks'
   OR Star4 = 'Tom Hanks';

Combining Filters: AND, OR, NOT

AND: All conditions must be true OR: At least one condition must be true NOT: Inverts the condition

-- Parentheses control precedence!
WHERE (condition1 AND condition2) OR condition3

-- Different from:
WHERE condition1 AND (condition2 OR condition3)

Demo: Complex Filtering

Recent AND highly rated

SELECT Title, Released_Year, IMDB_Rating
FROM movies
WHERE Released_Year >= 2010
  AND IMDB_Rating >= 8.0;

Action OR Adventure movies from the 2000s

SELECT Title, Genre, Released_Year
FROM movies
WHERE (Genre LIKE '%Action%' OR Genre LIKE '%Adventure%')
  AND Released_Year >= 2000;

NOT a sequel (no numbers in title)

SELECT Title
FROM movies
WHERE Title NOT LIKE '%2%'
  AND Title NOT LIKE '%II%'
  AND Title NOT LIKE '%Part%';

High-rated dramas that aren't too long

```sql
SELECT Title, Genre, Runtime, IMDB_Rating
FROM movies
WHERE Genre LIKE '%Drama%'
  AND IMDB_Rating >= 8.0
  AND Runtime < 150;

ORDER BY: Sorting Your Results

Control the Sequence

Default order is unpredictable - ORDER BY gives you control:

ASC: Ascending (A→Z, 0→9) - Default
DESC: Descending (Z→A, 9→0)

Can sort by multiple columns (like sorting Excel)

Demo: ORDER BY

Top rated movies first

SELECT Title, IMDB_Rating
FROM movies
ORDER BY IMDB_Rating DESC;

Oldest movies first

SELECT Title, Released_Year
FROM movies
ORDER BY Released_Year ASC;  -- ASC is optional (default)

Alphabetical by title

SELECT Title, Director
FROM movies
ORDER BY Title;

Multi-level sort: Year (newest first), then rating (highest first)

SELECT Title, Released_Year, IMDB_Rating
FROM movies
ORDER BY Released_Year DESC, IMDB_Rating DESC;

Director’s best work (sort by director, then their best films)

SELECT Director, Title, IMDB_Rating
FROM movies
ORDER BY Director, IMDB_Rating DESC;

LIMIT: Controlling Result Size

Don’t Drown in Data

LIMIT restricts how many rows are returned:

Useful for previews
Essential for performance

PostgreSQL: LIMIT n

Demo: LIMIT

Top 10 highest rated movies

SELECT Title, IMDB_Rating
FROM movies
ORDER BY IMDB_Rating DESC
LIMIT 10;

5 most recent movies

SELECT Title, Released_Year
FROM movies
ORDER BY Released_Year DESC
LIMIT 5;

Preview the data (common for exploration)

SELECT * FROM movies
LIMIT 3;

Page 2 of results (items 11-20) using OFFSET

SELECT Title, IMDB_Rating
FROM movies
ORDER BY IMDB_Rating DESC
LIMIT 10 OFFSET 10;

Bottom 5 rated movies (combine ORDER BY and LIMIT)

SELECT Title, IMDB_Rating
FROM movies
ORDER BY IMDB_Rating ASC
LIMIT 5;

String Operations: CONCAT

Combining Text Fields

|| or CONCAT() joins strings together:

-- PostgreSQL/SQLite
column1 || ' ' || column2

Like using + for strings in Python

Demo: String Concatenation

Create a “Movie (Year)” format

SELECT Title || ' (' || Released_Year || ')' AS movie_info
FROM movies
LIMIT 10;

Build a credits line

SELECT Title,
       'Directed by ' || Director AS credits
FROM movies
LIMIT 10;

Create a star list

SELECT Title,
       Star1 || ', ' || Star2 || ', ' || Star3 AS main_cast
FROM movies
LIMIT 10;

Make a movie description

SELECT Title || ': A ' || Runtime || ' minute ' || Genre || ' film' AS description
FROM movies
WHERE IMDB_Rating > 8.5
LIMIT 5;

PostgreSQL also supports CONCAT function

SELECT CONCAT(Director, ' directed ', Title) AS movie_fact
FROM movies
LIMIT 10;

COALESCE: Handling NULL Values

Your NULL Safety Net

COALESCE(value1, value2, ...) returns the first non-NULL value and reverts to a default value if all values are NULL.

For instance

Returns phone if not NULL
Returns email if phone is NULL
Returns ‘No contact’ if both phone and email are NULL

COALESCE(phone, email, 'No contact')

Essential for dealing with missing data

Demo: COALESCE

Replace NULL gross earnings with ‘Not Available’

SELECT Title,
       COALESCE(Gross, 'Not Available') AS box_office
FROM movies
LIMIT 20;

Handle missing Metascores

SELECT Title,
       IMDB_Rating,
       COALESCE(Meta_score, 0) AS critic_score
FROM movies;

Fallback for missing cast members

SELECT Title,
       COALESCE(Star4, Star3, Star2, Star1, 'No cast info') AS supporting_actor
FROM movies
LIMIT 20;

Calculate revenue with NULL handling

SELECT Title,
       COALESCE(Gross, '0') AS revenue,
       No_of_Votes
FROM movies
ORDER BY No_of_Votes DESC
LIMIT 10;

Smart NULL replacement for display

SELECT Title,
       Director,
       COALESCE(Certificate, 'Unrated') AS rating
FROM movies
WHERE Released_Year > 2010;

CAST: Converting Data Types

Shape-Shifting Your Data

CAST(expression AS type) or ::type (PostgreSQL)

Common conversions:

String to number: CAST('123' AS INTEGER)
Number to string: CAST(2024 AS TEXT)
String to date: CAST('2024-01-15' AS DATE)

Like int() or str() in Python

Demo: Type Casting

PostgreSQL shorthand with ::

SELECT Title || ' (' || Released_Year::TEXT || ')' AS movie_label
FROM movies
LIMIT 10;

Convert runtime to hours (decimal)

SELECT Title,
       Runtime,
       CAST(Runtime AS FLOAT) / 60 AS hours
FROM movies
WHERE Runtime > 180;

Clean up Gross (remove commas, convert to numeric)

Note: Gross might have commas that need cleaning first

SELECT Title,
       Gross,
       CAST(REPLACE(Gross, ',', '') AS BIGINT) AS gross_numeric
FROM movies
WHERE Gross IS NOT NULL
LIMIT 10;

Convert rating to integer for grouping

SELECT Title,
       IMDB_Rating,
       CAST(IMDB_Rating AS INTEGER) AS rating_bucket
FROM movies
ORDER BY IMDB_Rating DESC
LIMIT 20;

Combining It All Together

The Power of SELECT

Find top modern action movies with complete info

SELECT
    Title || ' (' || Released_Year::TEXT || ')' AS movie,
    'Dir: ' || Director AS director,
    COALESCE(Star1, 'Unknown') || ', ' || COALESCE(Star2, '') AS stars
    IMDB_Rating
FROM movies
WHERE Genre LIKE '%Action%'
    AND Released_Year >= 2000
    AND IMDB_Rating >= 7.0
ORDER BY IMDB_Rating DESC, Released_Year DESC
LIMIT 15;

This one query uses everything we learned!

Common Patterns

Pattern 1: Top N by category

SELECT * FROM movies
WHERE Genre LIKE '%Horror%'
ORDER BY IMDB_Rating DESC
LIMIT 10;

Pattern 2: NULL-safe concatenation

SELECT COALESCE(Star1, 'Unknown') || ' in ' || Title
FROM movies;

Pattern 3: Year range analysis

SELECT * FROM movies
WHERE CAST(Released_Year AS INTEGER) BETWEEN 2010 AND 2020
ORDER BY Released_Year;

Pattern 4: Multi-condition filtering

SELECT * FROM movies
WHERE Runtime < 120
  AND IMDB_Rating > 8.0
  AND Released_Year >= 2000;

Exercise

Find all 90s comedies over 8.0 rating
List Spielberg movies sorted by year
Top 20 longest movies with their runtime in hours
Movies with Leonardo DiCaprio (check all 4 star columns)
Create a “movie card” with title, year, director, and stars concatenated

Remember

Projection (SELECT): Choose your columns
Filtering (WHERE): Choose your rows
Ordering (ORDER BY): Control the sequence
Limiting (LIMIT): Control the quantity
String ops (||, CONCAT): Combine text
COALESCE: Handle NULLs gracefully
CAST: Convert between types

Master these, and you can answer 80% of data questions