DHUM 25A43 - Fall 2025 Course Introduction
Pandas practice on the IMDB 1000 movies dataset
- Data is available here: imdb_top_1000.csv
- Data is also available at: https://www.kaggle.com/datasets/mayankray/imdb-top-1000-movies-dataset
Data dictionary
- Movie Name: The title of the movie.
- Certificate: The certificate or rating assigned to the movie.
- Duration: The duration of the movie in minutes.
- Genre: The genre(s) to which the movie belongs.
- IMDb Rating: The IMDb rating of the movie.
- Metascore: The Metascore rating of the movie.
- Director: The director of the movie.
- Stars 1, 2, 3: The main cast members of the movie.
- Votes: The number of user votes/ratings the movie has received.
- Gross in $: The gross earnings in dollars (if available).
- Plot: A brief summary or plot description of the movie.
- Size: The dataset contains 1000 rows and 11 columns.
1. First Look at the Data
Q1. Display the first 5 rows of the dataset.
- Hint: Use .head().
Q2. What are the column names?
- Hint: Use .columns (no parentheses).
Q3. What is the dimension of the dataset?
- Hint: Use .shape.
Q4. What is the data type of each column?
- Hint: Use .dtypes or .info().
Q5. What is the number of missing values in each column?
- Hint: Use .isnull().sum().
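Q6. What is the average of each numeric column?
- Hint: Use .mean(numeric_only=True). (Note that .isnull().mean() returns the share of missing values per column, not the column average.)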
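The inspection calls from Q1 to Q6 can be tried on a small table first. This is only a sketch: the DataFrame below is a made-up stand-in for imdb_top_1000.csv, and the column names (Series_Title, Released_Year, IMDB_Rating, Gross) are assumptions taken from the later questions, so check them against df.columns on the real file.

```python
import pandas as pd

# Toy stand-in for the real data; titles and values are invented.
# With the real file you would use: df = pd.read_csv("imdb_top_1000.csv")
df = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Released_Year": [1994, 2008, 2014, 2019],
    "IMDB_Rating": [9.3, 9.0, 8.6, 8.4],
    "Gross": [28.3, None, 188.0, None],
})

print(df.head())     # first 5 rows (only 4 exist here)
print(df.columns)    # an attribute, so no parentheses
print(df.shape)      # (number of rows, number of columns)
print(df.dtypes)     # one data type per column

missing = df.isnull().sum()             # count of missing values per column
averages = df.mean(numeric_only=True)   # mean of the numeric columns only
print(missing)
print(averages)
```

Passing numeric_only=True skips text columns such as the title, which have no meaningful average.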
Q7. Who is the most common director?
- Hint: Use .value_counts().
Q8. What is the most common genre?
This is more complex: a movie can belong to several genres, so we need to gather the individual genres from every row and then count the most common ones.
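One possible way to approach Q7 and Q8 is sketched below on invented data. It assumes that the Genre column stores several genres in one comma-separated string (for example "Crime, Drama"); if the real file uses another separator, the split step has to change accordingly.

```python
import pandas as pd

# Made-up rows; director names and genres are placeholders.
df = pd.DataFrame({
    "Director": ["Director X", "Director Y", "Director X"],
    "Genre": ["Drama", "Action, Crime, Drama", "Crime, Drama"],
})

# Q7: value_counts() sorts by frequency, idxmax() returns the most frequent label.
top_director = df["Director"].value_counts().idxmax()

# Q8: split each string into a list, give every genre its own row,
# strip the leftover spaces, then count.
genres = df["Genre"].str.split(",").explode().str.strip()
genre_counts = genres.value_counts()

print(top_director)
print(genre_counts)
```

Counting the raw Genre strings directly would treat "Crime, Drama" and "Drama" as unrelated categories, which is why the split comes first.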
2. Selecting Columns (Projection)
Q3. Display only the Series_Title column.
Q4. Display the Series_Title and IMDB_Rating columns together.
- Goal: Select multiple columns.
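As a quick illustration of projection, here is a minimal sketch on a made-up table. Single brackets with one name return a Series; double brackets with a list of names return a DataFrame.

```python
import pandas as pd

# Invented rows, using the column names from the questions above.
df = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B", "Movie C"],
    "IMDB_Rating": [9.3, 9.0, 8.6],
    "Certificate": ["R", "PG-13", "PG-13"],
})

titles = df["Series_Title"]                    # one column -> Series
subset = df[["Series_Title", "IMDB_Rating"]]   # list of columns -> DataFrame

print(titles)
print(subset)
```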
3. Simple Filtering (Masks on Rows)
Q5. Show all movies with an IMDb rating higher than 9.0.
- Goal: Apply a condition on rows.
Q6. Show all movies released after 2010.
- Goal: Filter based on numeric values.
Q7. Show movies with the certificate "PG-13".
- Goal: Filter based on text values.
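The three filters above all follow the same pattern: build a boolean mask, then index the DataFrame with it. The sketch below uses invented rows. The year column is stored as text here on purpose, to show a conversion step that may or may not be needed on the real file (check df.dtypes first).

```python
import pandas as pd

df = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Released_Year": ["1994", "2008", "2014", "2019"],  # text on purpose
    "Certificate": ["R", "PG-13", "PG-13", "R"],
    "IMDB_Rating": [9.3, 9.0, 8.6, 8.4],
})

# Convert years to numbers; values that cannot be parsed become NaN.
df["Released_Year"] = pd.to_numeric(df["Released_Year"], errors="coerce")

high_rated = df[df["IMDB_Rating"] > 9.0]       # strictly greater than 9.0
recent = df[df["Released_Year"] > 2010]        # numeric comparison
pg13 = df[df["Certificate"] == "PG-13"]        # exact text match

print(high_rated)
print(recent)
print(pg13)
```

Note that > 9.0 excludes a movie rated exactly 9.0; use >= if those should be kept.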
4. Exploring the Data
Q8. What is the highest IMDb rating in the dataset?
- Goal: Use .max().
Q9. Which movie has the highest IMDb rating?
- Goal: Combine .loc[] and .max().
Q10. Count how many movies are in each Certificate category.
- Goal: Use .value_counts().
Q11. What is the average IMDb rating of all movies?
- Goal: Use .mean().
Q12. What is the most common genre?
- Goal: Use .mode() or .value_counts().
Q5. Do recent movies have better ratings?
- Goal: Apply a condition on rows.
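The exploration questions in this section can be sketched as follows on invented rows. The 2010 cut-off for "recent" is an assumption borrowed from the filtering question, and the toy numbers say nothing about the real dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Released_Year": [1994, 2008, 2014, 2019],
    "Certificate": ["R", "PG-13", "PG-13", "R"],
    "IMDB_Rating": [9.3, 9.0, 8.6, 8.4],
})

best_rating = df["IMDB_Rating"].max()

# .loc[row_mask, column] keeps the titles of every movie tied for the top rating.
best_movies = df.loc[df["IMDB_Rating"] == best_rating, "Series_Title"]

cert_counts = df["Certificate"].value_counts()   # movies per certificate
mean_rating = df["IMDB_Rating"].mean()           # overall average rating

# Compare average ratings on each side of the (assumed) 2010 cut-off.
is_recent = df["Released_Year"] > 2010
recent_mean = df.loc[is_recent, "IMDB_Rating"].mean()
older_mean = df.loc[~is_recent, "IMDB_Rating"].mean()

print(best_rating, list(best_movies))
print(cert_counts)
print(mean_rating, recent_mean, older_mean)
```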
5. Sorting Data
Q13. List the top 10 movies with the highest IMDb rating.
- Goal: Use .sort_values().
Q14. List the top 10 movies with the highest number of votes.
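A minimal sorting sketch on invented rows follows. The Votes column name is an assumption based on the data dictionary, so the real file may call it something else. The toy table only has four rows, so the example keeps the top 2; use 10 on the full dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "IMDB_Rating": [8.6, 9.3, 8.4, 9.0],
    "Votes": [500, 2000, 1500, 800],   # hypothetical column name
})

n = 2  # use 10 on the full dataset

# Sort from highest to lowest, then keep the first n rows.
top_rated = df.sort_values("IMDB_Rating", ascending=False).head(n)
top_voted = df.sort_values("Votes", ascending=False).head(n)

print(top_rated)
print(top_voted)
```

Without ascending=False, sort_values() orders from lowest to highest and head() would return the bottom of the ranking instead.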
6. Visualization
(use matplotlib or pandas built-in plotting)
Q15. Plot a histogram of IMDb ratings.
- Goal: Show how ratings are distributed.
Q16. Create a scatter plot of IMDB_Rating (y-axis) vs. Released_Year (x-axis).
- Goal: Check how ratings are spread across time.
Q17. Create a bar plot of the top 5 genres by number of movies.
- Goal: Show categories visually.
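The three plots can be drawn side by side with matplotlib, as in the sketch below. The data is invented, the Genre column is assumed to hold comma-separated genres, and the number of bins is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Released_Year": [1994, 2008, 2014, 2019],
    "IMDB_Rating": [9.3, 9.0, 8.6, 8.4],
    "Genre": ["Drama", "Action, Crime, Drama", "Crime, Drama", "Comedy"],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: how ratings are distributed.
axes[0].hist(df["IMDB_Rating"], bins=5)
axes[0].set_xlabel("IMDB_Rating")
axes[0].set_ylabel("Number of movies")

# Scatter plot: year on the x-axis, rating on the y-axis.
axes[1].scatter(df["Released_Year"], df["IMDB_Rating"])
axes[1].set_xlabel("Released_Year")
axes[1].set_ylabel("IMDB_Rating")

# Bar plot: count individual genres, keep the 5 most frequent.
top_genres = (
    df["Genre"].str.split(",").explode().str.strip().value_counts().head(5)
)
top_genres.plot(kind="bar", ax=axes[2])
axes[2].set_ylabel("Number of movies")

fig.tight_layout()
```

In a notebook, the figure is displayed automatically at the end of the cell; in a script, fig.savefig("plots.png") writes it to a file.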
7. String to minutes
Q18. Convert the Runtime column (e.g. “142 min”) into numeric minutes and find the average runtime.
- Goal: Work with text → numbers.
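One way to turn the text into numbers is to extract the digits and convert them, as sketched below on made-up values. It assumes that every entry follows the "142 min" pattern from the question; anything that does not match becomes NaN and is ignored by .mean().

```python
import pandas as pd

df = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B", "Movie C"],
    "Runtime": ["142 min", "175 min", "96 min"],
})

# Pull out the leading run of digits, then convert the text to numbers.
digits = df["Runtime"].str.extract(r"(\d+)", expand=False)
df["Runtime_min"] = pd.to_numeric(digits, errors="coerce")

average_runtime = df["Runtime_min"].mean()
print(df)
print(average_runtime)
```

Storing the result in a new column (Runtime_min, a name chosen for this example) keeps the original text available for checking.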
Q19. What are the top 5 movies with the highest gross revenue?
- Goal: Convert Gross to numeric and sort.
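A possible approach to the Gross question is sketched below. It assumes that Gross is read as text with thousands separators (and possibly a dollar sign) and that some values are missing; the rows are invented, so inspect a few real values before relying on the cleaning pattern.

```python
import pandas as pd

df = pd.DataFrame({
    "Series_Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Gross": ["1,000,000", None, "250,000,000", "$75,500,000"],
})

# Remove "$" and "," characters, then convert; unparseable values become NaN.
cleaned = df["Gross"].str.replace(r"[$,]", "", regex=True)
df["Gross_num"] = pd.to_numeric(cleaned, errors="coerce")

# Drop movies without a gross value, sort from highest to lowest, keep 5.
top_gross = (
    df.dropna(subset=["Gross_num"])
      .sort_values("Gross_num", ascending=False)
      .head(5)
)
print(top_gross[["Series_Title", "Gross_num"]])
```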