Investigating with AI

Pandas practice on the IMDB 1000 movies dataset

Data dictionary

  • Movie Name: The title of the movie.
  • Certificate: The certificate or rating assigned to the movie.
  • Duration: The duration of the movie in minutes.
  • Genre: The genre(s) to which the movie belongs.
  • IMDb Rating: The IMDb rating of the movie.
  • Metascore: The Metascore rating of the movie.
  • Director: The director of the movie.
  • Stars 1, 2, 3: The main cast members of the movie.
  • Votes: The number of user votes/ratings the movie has received.
  • Gross in $: The gross earnings in dollars (if available).
  • Plot: A brief summary or plot description of the movie.
  • Size: The dataset contains 1000 rows and 11 columns.

1. First Look at the Data

Q1. Display the first 5 rows of the dataset.

  • Hint: Use .head().

Q2. What are the column names?

  • Hint: Use .columns (an attribute, so no parentheses).

Q3. What is the dimension of the dataset?

  • Hint: Use .shape.

Q4. What is the data type of each column?

  • Hint: Use .dtypes or .info().

Q5. What is the number of missing values in each column?

  • Hint: Use .isnull().sum().
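
The first-look calls from Q1 to Q5 can be sketched on a tiny stand-in DataFrame. The rows, the file name in the comment, and the exact column names are assumptions for illustration; the real dataset is loaded from a file and has more columns.

```python
import pandas as pd

# Hypothetical stand-in data. In practice the frame would come from
# something like pd.read_csv("imdb_top_1000.csv") (assumed file name).
df = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C"],
    "Certificate": ["PG-13", "R", None],
    "IMDB_Rating": [9.3, 9.2, 8.8],
    "Gross": ["28,000,000", "134,000,000", None],
})

first_rows = df.head()            # Q1: first 5 rows (only 3 exist here)
column_names = list(df.columns)   # Q2: .columns is an attribute, no parentheses
n_rows, n_cols = df.shape         # Q3: (rows, columns), also an attribute
column_types = df.dtypes          # Q4: one dtype per column
missing = df.isnull().sum()       # Q5: number of missing values per column

print(first_rows)
print(column_names)
print(n_rows, n_cols)
print(column_types)
print(missing)
```

df.info() prints the dtypes and the non-null counts in one summary, so it can replace the last two steps during exploration.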

Q6. What is the average in each column?

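  • Hint: Use .mean(numeric_only=True). Note that .isnull().mean() gives the share of missing values in each column, not the average of the values.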
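
A small sketch of two calls that are easy to confuse here: .mean(numeric_only=True) averages the values of each numeric column, while .isnull().mean() returns the fraction of missing values per column. The data and the Meta_score column name are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in data with a few missing values.
df = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C", "Film D"],
    "IMDB_Rating": [9.0, 8.5, 8.0, None],
    "Meta_score": [80.0, None, 90.0, 70.0],
})

# Average of each numeric column; text columns are skipped and
# missing values are ignored.
column_means = df.mean(numeric_only=True)

# Share of missing values per column (True counts as 1, False as 0).
missing_share = df.isnull().mean()

print(column_means)
print(missing_share)
```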

Q7. Who is the most common director?

  • Hint: Use .value_counts().
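
As a minimal sketch, .value_counts() returns the counts sorted from most to least frequent, so the first entry is the most common value. The director names below are placeholders, and the Director column name is taken from the data dictionary.

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({
    "Director": ["Director X", "Director Y", "Director X",
                 "Director Z", "Director X", "Director Y"],
})

director_counts = df["Director"].value_counts()   # sorted, highest count first
most_common_director = director_counts.idxmax()   # label with the highest count

print(director_counts)
print(most_common_director)
```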

Q8. What is the most common genre?

This is more complex: a movie can belong to several genres, so the genre values have to be split and combined into one long list before counting the most common ones.
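
One possible approach, assuming the Genre column stores several genres as a comma-separated string such as "Crime, Drama" (an assumption about the file format): split each string into a list, give each genre its own row with .explode(), then count.

```python
import pandas as pd

# Hypothetical stand-in data; the comma-separated format is an assumption.
df = pd.DataFrame({
    "Genre": ["Crime, Drama", "Action, Crime, Drama", "Drama", "Action, Sci-Fi"],
})

genres = (
    df["Genre"]
    .str.split(",")   # "Crime, Drama" -> ["Crime", " Drama"]
    .explode()        # one row per genre
    .str.strip()      # remove the leftover spaces
)
genre_counts = genres.value_counts()
most_common_genre = genre_counts.idxmax()

print(genre_counts)
print(most_common_genre)
```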

2. Selecting Columns (Projection)

Q3. Display only the Series_Title column.

Q4. Display the Series_Title and IMDB_Rating columns together.

  • Goal: Select multiple columns.
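
A short sketch of the two selection forms on made-up rows: single brackets with one name return a Series, while a list of names inside the brackets returns a DataFrame.

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C"],
    "Released_Year": [1994, 1972, 2010],
    "IMDB_Rating": [9.3, 9.2, 8.8],
})

titles = df["Series_Title"]                             # one column -> Series
title_and_rating = df[["Series_Title", "IMDB_Rating"]]  # list of columns -> DataFrame

print(titles)
print(title_and_rating)
```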

3. Simple Filtering (Masks on Rows)

Q5. Show all movies with an IMDb rating higher than 9.0.

  • Goal: Apply a condition on rows.

Q6. Show all movies released after 2010.

  • Goal: Filter based on numeric values.

Q7. Show movies with the certificate "PG-13".

  • Goal: Filter based on text values.
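
The three filters can be sketched with boolean masks. The rows are made up, and the sketch assumes Released_Year may be stored as text, which is why it is converted with pd.to_numeric before the comparison; if the column is already numeric, the conversion is harmless.

```python
import pandas as pd

# Hypothetical stand-in data; years are stored as text on purpose.
df = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C", "Film D"],
    "Released_Year": ["1994", "2012", "2010", "2019"],
    "Certificate": ["R", "PG-13", "PG-13", "R"],
    "IMDB_Rating": [9.3, 8.4, 8.8, 9.1],
})

# Q5: the mask is a Series of True/False values, one per row.
high_rated = df[df["IMDB_Rating"] > 9.0]

# Q6: convert to numbers first; unparseable values become NaN.
years = pd.to_numeric(df["Released_Year"], errors="coerce")
recent = df[years > 2010]

# Q7: exact text match.
pg13 = df[df["Certificate"] == "PG-13"]

print(high_rated)
print(recent)
print(pg13)
```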

4. Exploring the Data

Q8. What is the highest IMDb rating in the dataset?

  • Goal: Use .max().

Q9. Which movie has the highest IMDb rating?

  • Goal: Combine .loc[] and .max().
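
A minimal sketch of Q8 and Q9 on made-up rows: .max() gives the highest rating, and .loc[] with a mask returns the title or titles that reach it. Using a mask rather than a single index keeps all movies in case of a tie.

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C"],
    "IMDB_Rating": [8.8, 9.3, 9.0],
})

highest_rating = df["IMDB_Rating"].max()

# Rows where the rating equals the maximum, restricted to the title column.
top_titles = df.loc[df["IMDB_Rating"] == highest_rating, "Series_Title"]

print(highest_rating)
print(top_titles.tolist())
```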

Q10. Count how many movies are in each Certificate category.

  • Goal: Use .value_counts().

Q11. What is the average IMDb rating of all movies?

  • Goal: Use .mean().

Q12. What is the most common genre?

  • Goal: Use .mode() or .value_counts().
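
Q10 to Q12 can be sketched as follows on made-up rows. Note that .mode() counts each Genre string as a whole, so a value like "Crime, Drama" would be treated as its own category rather than as two genres.

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({
    "Certificate": ["R", "PG-13", "R", "PG", "R"],
    "IMDB_Rating": [8.0, 8.5, 9.0, 8.5, 8.0],
    "Genre": ["Drama", "Drama", "Comedy", "Drama", "Action"],
})

certificate_counts = df["Certificate"].value_counts()  # Q10: movies per category
average_rating = df["IMDB_Rating"].mean()              # Q11: overall average
most_common_genre = df["Genre"].mode()[0]              # Q12: most frequent value

print(certificate_counts)
print(average_rating)
print(most_common_genre)
```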

Q5. Do recent movies have better ratings?

  • Goal: Apply a condition on rows, then compare the average rating of each group.
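
One way to approach this question is to split the rows with a condition on the year and compare the average rating of each group. The cutoff year of 2010 and the data are assumptions chosen for illustration; a difference in averages on its own does not establish a trend.

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({
    "Released_Year": [1990, 2000, 2012, 2018],
    "IMDB_Rating": [8.5, 8.7, 8.2, 8.4],
})

cutoff_year = 2010  # assumed definition of "recent"
is_recent = df["Released_Year"] > cutoff_year

recent_mean = df.loc[is_recent, "IMDB_Rating"].mean()
older_mean = df.loc[~is_recent, "IMDB_Rating"].mean()  # ~ inverts the mask

print(recent_mean, older_mean)
```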

5. Sorting Data

Q13. List the top 10 movies with the highest IMDb rating.

  • Goal: Use .sort_values().

Q14. List the top 10 movies with the highest number of votes.
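
Both top-10 lists follow the same pattern: sort in descending order, then keep the first 10 rows. The sketch below builds 12 made-up rows; No_of_Votes is an assumed name for the votes column.

```python
import pandas as pd

# Hypothetical stand-in data: 12 films, ratings rising, votes falling.
df = pd.DataFrame({
    "Series_Title": [f"Film {i}" for i in range(1, 13)],
    "IMDB_Rating": [round(8.0 + 0.1 * i, 1) for i in range(12)],
    "No_of_Votes": [1000 * (12 - i) for i in range(12)],
})

# Q13: highest ratings first.
top_rated = df.sort_values("IMDB_Rating", ascending=False).head(10)

# Q14: highest vote counts first.
top_voted = df.sort_values("No_of_Votes", ascending=False).head(10)

print(top_rated)
print(top_voted)
```

df.nlargest(10, "IMDB_Rating") is a shorter alternative that gives the same rows for numeric columns.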

6. Visualization

(Use matplotlib or pandas' built-in plotting.)

Q15. Plot a histogram of IMDb ratings.

  • Goal: Show how ratings are distributed.

Q16. Create a scatter plot of IMDB_Rating (y-axis) vs. Released_Year (x-axis).

  • Goal: Check how ratings are spread across time.

Q17. Create a bar plot of the top 5 genres by number of movies.

  • Goal: Show categories visually.
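
The three plots from Q15 to Q17 can be sketched with pandas' built-in plotting on top of matplotlib. The data is made up, the Genre column is assumed to hold comma-separated genres, and the "Agg" backend is selected only so the sketch runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, assumption for script use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({
    "Released_Year": [1991, 1975, 2010, 2014, 2019, 1999],
    "IMDB_Rating": [9.0, 8.9, 8.8, 8.6, 8.6, 8.7],
    "Genre": ["Drama", "Crime, Drama", "Action, Sci-Fi",
              "Adventure, Drama", "Comedy, Drama", "Action, Sci-Fi"],
})

fig, (ax_hist, ax_scatter, ax_bar) = plt.subplots(1, 3, figsize=(15, 4))

# Q15: distribution of ratings.
df["IMDB_Rating"].plot.hist(bins=10, ax=ax_hist, title="IMDb rating distribution")

# Q16: rating (y-axis) against release year (x-axis).
df.plot.scatter(x="Released_Year", y="IMDB_Rating", ax=ax_scatter,
                title="Rating vs. release year")

# Q17: split the genres, count them, keep the 5 most frequent.
top_genres = (
    df["Genre"].str.split(",").explode().str.strip().value_counts().head(5)
)
top_genres.plot.bar(ax=ax_bar, title="Top 5 genres")
ax_bar.set_ylabel("Number of movies")

fig.tight_layout()
# In a notebook, call plt.show() here to display the figure.
```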

7. Converting Strings to Numbers

Q18. Convert the Runtime column (e.g. "142 min") into numeric minutes and find the average runtime.

  • Goal: Work with text → numbers.
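
A minimal sketch of the conversion, assuming every Runtime value follows the "142 min" pattern from the question: remove the " min" suffix, convert the rest to numbers, then average. The rows are made up.

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C"],
    "Runtime": ["142 min", "175 min", "148 min"],
})

# Strip the literal suffix, then convert the remaining text to numbers.
runtime_minutes = pd.to_numeric(
    df["Runtime"].str.replace(" min", "", regex=False)
)
average_runtime = runtime_minutes.mean()

print(runtime_minutes.tolist())
print(average_runtime)
```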

Q19. What are the top 5 movies with the highest gross revenue?

  • Goal: Convert Gross to numeric and sort.
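
A sketch for Q19, assuming Gross is stored as text with thousands separators (for example "28,341,469") and may be missing for some movies. The values are made up; Gross_numeric is a helper column introduced only for this example.

```python
import pandas as pd

# Hypothetical stand-in data with one missing value.
df = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C", "Film D",
                     "Film E", "Film F", "Film G"],
    "Gross": ["1,000,000", "250,000,000", None, "75,500,000",
              "600,000,000", "3,200,000", "40,000,000"],
})

# Remove the commas, then convert; anything unparseable becomes NaN.
df["Gross_numeric"] = pd.to_numeric(
    df["Gross"].str.replace(",", "", regex=False), errors="coerce"
)

# Sort from highest to lowest; rows with NaN are placed last by default.
top5_gross = df.sort_values("Gross_numeric", ascending=False).head(5)

print(top5_gross[["Series_Title", "Gross_numeric"]])
```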